Abstract Paper Portal of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

PaperID: 1,   Oral  https://arxiv.org/pdf/2603.17662     GitHub GitHub
Authors: Rui Xiao, Sanghwan Kim, Yongqin Xian, Zeynep Akata, Stephan Alaniz
Title: FINER: MLLMs Hallucinate under Fine-grained Negative Queries
Abstract: Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and “what” questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Benchmarks, training data, code and model checkpoints will be released.
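The DPO objective that FINER-Tuning is described as leveraging can be illustrated with a minimal NumPy sketch. The function name `dpo_loss`, the `beta` default, and the idea of pairing a correct response (chosen) against a hallucinated one (rejected) are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Mean DPO loss over a batch of (chosen, rejected) response pairs.

    Each argument is an array of summed token log-probabilities, one entry
    per pair. `beta` is a hypothetical default, not a value from the paper.
    """
    # Implicit rewards: log-ratio of the policy to the frozen reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), computed stably as log(1 + exp(-margin)).
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the chosen response.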
PaperID: 2,   Oral  https://arxiv.org/pdf/2604.12502     GitHub GitHub
Authors: Junbin Su, Ziteng Xue, Shihui Zhang, Kun Chen, Weiming Hu, Zhipeng Zhang
Title: SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker
Abstract: Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT's efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives. We first prioritize cross-modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade-off. Specifically, we observe that modality-specific biases in existing two-stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG-LoRA, which seamlessly integrates Low-Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, effectively balancing expressiveness and computational efficiency in cross-modal fusion. Experiments show that AMG-LoRA alone establishes a remarkably simple yet strong baseline, outperforming SDSTrack on LasHeR by 3.3% in PR and 1.9% in SR with only 0.4% of its parameters (0.14M vs. 14.8M), while significantly boosting cross-modal fusion with negligible additional latency. Equipped with these innovations, SEATrack makes notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks. Code will be released.
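The LoRA component of AMG-LoRA can be pictured with a plain low-rank adapter on a frozen linear layer. The sketch below covers only generic LoRA; the Adaptive Mutual Guidance part is not modeled, and the class name `LoRALinear`, the rank, and the scaling are assumptions chosen for illustration.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (generic LoRA)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                      # frozen pre-trained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))        # zero init: no change at start
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x W^T + scale * x A^T B^T
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size
```

Only `A` and `B` would be trained, so the trainable parameter count is `rank * (in_dim + out_dim)` rather than `in_dim * out_dim`.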
PaperID: 3,   Oral  https://arxiv.org/pdf/2604.17078     GitHub GitHub
Authors: Shangge Liu, Yuehan Yin, Lei Wang, Qi Fan, Yinghuan Shi, Wenbin Li, Yang Gao, Dacheng Tao
Title: Understanding and Enforcing Weight Disentanglement in Task Arithmetic
Abstract: Task arithmetic provides an efficient, training-free way to edit pre-trained models, yet lacks a fundamental theoretical explanation for its success. The existing concept of "weight disentanglement" describes the ideal outcome of non-interfering task composition but does not reveal its underlying cause. Crucially, what intrinsic properties of the pre-trained model ($\theta_0$) or the task vectors ($\tau_t$) enable this disentanglement remains underexplored. In this paper, we introduce Task-Feature Specialization (TFS), a model's ability to allocate distinct internal features to different tasks, as the fundamental principle. We first prove that TFS is a sufficient condition for weight disentanglement. More importantly, we find that TFS also gives rise to an observable geometric consequence: weight vector orthogonality. This positions TFS as the common cause for both the desired functional outcome (disentanglement) and a measurable geometric property (orthogonality). This relationship provides the key insight for our method: since the abstract TFS property is intractable to enforce directly, we can instead promote weight disentanglement by shaping its concrete geometric consequence, orthogonality. Therefore, we propose OrthoReg, a simple and effective regularization method that actively enforces an internal orthogonal structure on the weight updates ($\Delta W$) that constitute $\tau_t$ during fine-tuning. We also theoretically prove that OrthoReg promotes disentanglement. Extensive experiments demonstrate that OrthoReg consistently and significantly enhances the performance of various task arithmetic methods.
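As a rough illustration of the two ingredients named above, the sketch below composes task vectors and evaluates one possible orthogonality penalty on a weight update. The penalty form (squared Frobenius distance between the Gram matrix of the update and the identity) is an assumption; the abstract does not state the exact regularizer, and both function names are hypothetical.

```python
import numpy as np

def apply_task_vectors(theta_0, task_vectors, alphas):
    """Task arithmetic: theta = theta_0 + sum_t alpha_t * tau_t."""
    theta = np.array(theta_0, dtype=float)
    for tau, alpha in zip(task_vectors, alphas):
        theta = theta + alpha * tau
    return theta

def orthogonality_penalty(delta_w):
    """Assumed regularizer: || delta_w^T delta_w - I ||_F^2.

    It is zero exactly when the columns of delta_w are orthonormal.
    """
    gram = delta_w.T @ delta_w
    return float(np.sum((gram - np.eye(gram.shape[0])) ** 2))
```

In a fine-tuning loop, such a penalty would be added to the task loss with a weighting coefficient, which is likewise a free choice here.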
PaperID: 4,   Oral  https://arxiv.org/pdf/2511.10555     GitHub GitHub
Authors: Huijie Liu, Shuhao Cui, Haoxiang Cao, Shuai Ma, Kai Wu, Guoliang Kang
Title: A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
Abstract: Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we consider the code-to-style image generation task, which aims to produce images with novel and consistent visual styles specified by only a numerical code. To date, this field has been explored primarily by industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Extensive experiments validate that CoTyle effectively converts a numerical code into a style controller, demonstrating that a style is worth one code. Compared to existing methods, the stylized images generated by our method are more diverse and consistent, unlocking a vast space of reproducible styles from minimal input.
PaperID: 5,   Oral  https://arxiv.org/pdf/2603.21229     GitHub GitHub
Authors: Jinyu Xu, Tianqi Hu, Xiaonan Hu, Letian Zhou, Songliang Cao, Meng Zhang, Hao Lu
Title: Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species
Abstract: Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants are complicated by non-rigid morphologies and physical appearance variations across growth stages and environments. To fill this gap, we present TPC-268, the first plant counting benchmark taking plant taxonomy into account. Our dataset couples instance-level point annotations with complete Linnaean labels (kingdom $\rightarrow$ species) and organ categories, enabling hierarchical reasoning and species-aware evaluation. The dataset features 10,000 images with 678,090 point annotations, includes 268 countable plant categories over 242 plant species in Plantae and Fungi, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy. We follow the problem setting of class-agnostic counting (CAC), provide taxonomy-consistent, scale-aware data splits, and benchmark state-of-the-art regression- and detection-based CAC approaches. By capturing the biodiversity, hierarchical structure, and multi-scale nature of botanical and mycological taxa, TPC-268 provides a biologically grounded testbed to advance fine-grained class-agnostic counting.
PaperID: 6,   Oral  https://arxiv.org/pdf/2512.01643     GitHub
Authors: Dongchen Han, Yining Li, Tianyu Li, Zixuan Cao, Ziming Wang, Jun Song, YuCheng YuCheng, Bo Zheng, Gao Huang
Title: ViT$^3$: Unlocking Test-Time Training in Vision
Abstract: Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates the attention operation as an online learning problem, constructing a compact inner model from key–value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code will be released.
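The idea of treating attention as online learning can be sketched with the simplest possible inner model: a linear map fitted to key–value pairs by one gradient step per token. This is a generic, sequential TTT-style layer written for illustration, not the ViT$^3$ design; the function name `ttt_linear`, the squared-error inner loss, and the learning rate are all assumptions.

```python
import numpy as np

def ttt_linear(queries, keys, values, lr=0.5):
    """Sequential TTT-style layer with a linear inner model W.

    For each token, W takes one gradient step on 0.5 * ||W k - v||^2 and is
    then queried with q. Cost grows linearly with sequence length.
    """
    dim_v, dim_k = values.shape[1], keys.shape[1]
    W = np.zeros((dim_v, dim_k))
    outputs = []
    for q, k, v in zip(queries, keys, values):
        error = W @ k - v                 # prediction error of the inner model
        W = W - lr * np.outer(error, k)   # gradient step on the inner loss
        outputs.append(W @ q)
    return np.stack(outputs)
```

Because the inner state `W` has a fixed size, memory does not grow with the number of tokens, in contrast to a full attention matrix.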
PaperID: 7,   Oral  https://arxiv.org/pdf/2512.06818     GitHub
Authors: Jan Held, Sanghyun Son, Renaud Vandeghen, Daniel Rebain, Matheus Gadelha, Yi Zhou, Anthony Cioppa, Ming Lin, Marc Van Droogenbroeck, Andrea Tagliasacchi
Title: MeshSplatting: Differentiable Rendering with Opaque Meshes
Abstract: Primitive-based splatting methods like 3D Gaussian Splatting (3DGS) have revolutionized novel view synthesis with real-time rendering. However, their point-based representations remain incompatible with mesh-based pipelines that power AR/VR and game engines. We present MeshSplatting, a mesh-based reconstruction approach that jointly optimizes geometry and appearance through differentiable rendering. By enforcing connectivity via restricted Delaunay triangulation and refining surface consistency, MeshSplatting creates end-to-end smooth, visually high-quality meshes that render efficiently in real-time 3D engines. On Mip-NeRF360, it boosts PSNR by +0.69 dB over the current state-of-the-art MiLo for mesh-based novel view synthesis, while training 2x faster and using 2x less memory, bridging neural rendering and interactive 3D graphics for seamless real-time scene interaction.
PaperID: 8,   Oral  https://arxiv.org/pdf/2604.10056     GitHub
Authors: Xunpei Sun, Wenwei Lin, Yi Chang, Gang Chen
Title: U$^{2}$Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation
Abstract: Existing unsupervised optical flow methods typically lack reliable uncertainty estimation, limiting their robustness and interpretability. We propose U$^{2}$Flow, the first recurrent unsupervised framework that jointly estimates optical flow and per-pixel uncertainty. The core innovation is a decoupled learning strategy that derives uncertainty supervision from augmentation consistency via a Laplace-based maximum likelihood objective, enabling stable training without ground truth. The predicted uncertainty is further integrated into the network to guide adaptive flow refinement and dynamically modulate the regional smoothness loss. Furthermore, we introduce an uncertainty-guided bidirectional flow fusion mechanism that enhances robustness in challenging regions. Extensive experiments on KITTI and Sintel demonstrate that U$^{2}$Flow achieves state-of-the-art performance among unsupervised methods while producing highly reliable uncertainty maps, validating the effectiveness of our joint estimation paradigm.
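A Laplace-based maximum likelihood objective of the kind mentioned above typically takes the form $|r|/b + \log(2b)$ per element, where $r$ is a residual and $b$ a predicted scale. The sketch below is a generic version under that assumption; how the paper builds its target from augmentation consistency is not reproduced, and `flow_target` simply stands in for whatever pseudo-target is available.

```python
import numpy as np

def laplace_nll(flow_pred, flow_target, log_b):
    """Mean Laplace negative log-likelihood with per-pixel scale b = exp(log_b).

    flow_pred, flow_target: arrays of shape (H, W, 2).
    log_b: array of shape (H, W, 1), shared across the two flow channels.
    """
    b = np.exp(log_b)                       # predicting log b keeps b positive
    abs_residual = np.abs(flow_pred - flow_target)
    nll = abs_residual / b + np.log(2.0) + log_b   # |r| / b + log(2b)
    return float(np.mean(nll))
```

For a fixed residual the loss is minimized at $b = |r|$, so the predicted scale is pushed toward the size of the error, which is what makes it usable as an uncertainty estimate.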
PaperID: 9,   Oral  https://arxiv.org/pdf/2602.21963     GitHub
Authors: Tong Wei, Giorgos Tolias, Jiri Matas, Daniel Barath
Title: Global-Aware Edge Prioritization for Pose Graph Initialization
Abstract: The pose graph is a core component of Structure-from-Motion (SfM), where images act as nodes and edges encode relative poses. Since geometric verification is expensive, SfM pipelines restrict the pose graph to a sparse set of candidate edges, making initialization critical. Existing methods rely on image retrieval to connect each image to its k nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. Our approach has three components: (1) a GNN trained with SfM-derived supervision to predict globally consistent edge reliability; (2) multi-minimal-spanning-tree-based pose graph construction guided by these ranks; and (3) connectivity-aware score modulation that reinforces weak regions and reduces graph diameter. This globally informed initialization yields more reliable and compact pose graphs, improving reconstruction accuracy in sparse and high-speed settings and outperforming SOTA retrieval methods on ambiguous scenes. Code and models will be released.
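Component (2) can be pictured as repeatedly extracting a spanning tree from the candidate edges that remain, so that every image stays connected while the edge budget is spent on the highest-ranked pairs. The sketch below is one plausible reading using Kruskal's algorithm; the function names, the score convention (higher means more reliable, equivalent to a minimum spanning tree on cost = -score), and the number of trees are assumptions, and the score modulation of component (3) is omitted.

```python
def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def multi_spanning_tree_edges(num_nodes, scored_edges, num_trees=2):
    """Select edges from `num_trees` successive spanning trees.

    scored_edges: list of (score, u, v) tuples, higher score = more reliable.
    Each round runs Kruskal on the edges not yet selected.
    """
    remaining = sorted(scored_edges, reverse=True)
    selected = []
    for _ in range(num_trees):
        parent = list(range(num_nodes))
        leftover = []
        for score, u, v in remaining:
            root_u, root_v = _find(parent, u), _find(parent, v)
            if root_u != root_v:
                parent[root_u] = root_v     # edge joins two components: keep it
                selected.append((score, u, v))
            else:
                leftover.append((score, u, v))
        remaining = leftover
    return selected
```

With `num_trees=1` this returns at most `num_nodes - 1` edges; each extra round adds redundancy using the best edges not already chosen.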
PaperID: 10,   Oral  https://arxiv.org/pdf/2602.18993     GitHub
Authors: Jiwoo Chung, Sangeek Hyun, MinKyu Lee, Byeongju Han, Geonho Cha, Dongyoon Wee, Youngjun Hong, Jae-Pil Heo
Title: SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models
Abstract: Diffusion models are a strong backbone for visual generation, but their inherently sequential denoising process leads to slow inference. Previous methods accelerate sampling by caching and reusing intermediate outputs based on feature distances between adjacent timesteps. However, existing caching strategies typically rely on raw feature differences that entangle content and noise. This design overlooks spectral evolution, where low-frequency structure appears early and high-frequency detail is refined later. We introduce Spectral-Evolution-Aware Cache (SeaCache), a training-free cache schedule that bases reuse decisions on a spectrally aligned representation. Through theoretical and empirical analysis, we derive a Spectral-Evolution-Aware (SEA) filter that preserves content-relevant components while suppressing noise. Employing SEA-filtered input features to estimate redundancy leads to dynamic schedules that adapt to content while respecting the spectral priors of the underlying diffusion model. Extensive experiments on diverse visual generative models and the baselines show that SeaCache achieves state-of-the-art latency-quality trade-offs.
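The general pattern of deciding cache reuse on filtered rather than raw features can be sketched as follows. A fixed low-pass mask stands in for the SEA filter, whose actual form is not given in the abstract; the function names, `keep_fraction`, and `threshold` are all hypothetical.

```python
import numpy as np

def low_pass(features, keep_fraction=0.25):
    """Keep only the lowest spatial frequencies of a 2D feature map.

    This is a stand-in filter, not the SEA filter from the paper.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(features))
    h, w = features.shape
    kh = max(1, int(h * keep_fraction / 2))
    kw = max(1, int(w * keep_fraction / 2))
    mask = np.zeros((h, w))
    mask[h // 2 - kh:h // 2 + kh + 1, w // 2 - kw:w // 2 + kw + 1] = 1.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))

def should_reuse_cache(prev_features, curr_features, threshold=0.05):
    """Reuse the cached output when the filtered features barely changed."""
    prev_f, curr_f = low_pass(prev_features), low_pass(curr_features)
    rel_change = np.linalg.norm(curr_f - prev_f) / (np.linalg.norm(prev_f) + 1e-8)
    return bool(rel_change < threshold)
```

With this stand-in, a change confined to high frequencies does not trigger recomputation, whereas a change in coarse structure does.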
PaperID: 11,   Oral  https://arxiv.org/pdf/2603.26181     GitHub
Authors: Youngju Na, Jaeseong Yun, Soohyun Ryu, Hyunsu Kim, Sung-Eui Yoon, Suyong Yeon
Title: GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport
Abstract: While 3D Gaussian splatting has emerged as a powerful paradigm, it fundamentally fails to model transparency such as glass panels, which are prevalent in everyday environments. The core challenge lies in decoupling the intertwined radiance contributions from transparent interfaces and the transmitted geometry observed through the glass. We present GLINT, a framework that models scene-scale transparency through an explicit decomposed Gaussian representation. GLINT reconstructs the primary interface and separates outgoing radiance into reflection and transmission components according to its optical properties, enabling coherent Gaussian radiance transport. During the optimization, GLINT bootstraps transparency localization by utilizing geometry separation cues that emerge from our decomposition with the geometry and material priors from a pre-trained video relighting model. Extensive experiments demonstrate that GLINT achieves state-of-the-art performance in 3D reconstruction of complex transparent scenes. Our code will be released publicly.
PaperID: 12,   Oral  https://arxiv.org/pdf/2511.20263     GitHub
Authors: Omer Belhasin, Shelly Golan, Ran El-Yaniv, Michael Elad
Title: Advancing Image Classification with Discrete Diffusion Classification Modeling
Abstract: Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are limited. Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. DiDiCM supports diffusion-based predictions either on class probabilities or on discrete class labels, providing flexibility in computation and memory trade-offs. We conduct a comprehensive empirical study demonstrating the superior performance of DiDiCM over standard classifiers, showing that a few diffusion iterations achieve higher classification accuracy on the ImageNet dataset compared to baselines, with accuracy gains increasing as the task becomes more challenging.
PaperID: 13,   Oral  https://arxiv.org/pdf/2512.14692     GitHub
Authors: Jianfeng XIANG, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, Jiaolong Yang
Title: Native and Compact Structured Latents for 3D Generation
Abstract: Recent advancements in 3D generative modeling have significantly improved generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.
PaperID: 14,   Oral  https://arxiv.org/pdf/2512.09201     GitHub
Authors: Aditya Ganeshan, Matheus Gadelha, Thibault Groueix, Zhiqin Chen, Siddhartha Chaudhuri, Vladimir G. Kim, Wang Yifan, Daniel Ritchie
Title: Residual Primitive Fitting of 3D Shapes with SuperFrusta
Abstract: We introduce a framework for converting 3D shapes into compact and editable assemblies of analytic primitives, directly addressing the persistent tradeoff between reconstruction fidelity and parsimony. Our approach combines two key contributions: a novel primitive, termed SuperFrustum, and an iterative inference algorithm, Residual Primitive Fitting (ResFit). SuperFrustum is an analytical primitive that is simultaneously (1) expressive, being able to express various common solids such as cylinders, spheres, cones, and their tapered and bent forms, (2) editable, being compactly parameterized with 8 parameters, and (3) optimizable, with a signed distance field differentiable w.r.t. its parameters almost everywhere. ResFit is an unsupervised procedure that interleaves global shape analysis with local optimization, iteratively fitting primitives to the unexplained residual of a shape to discover a parsimonious yet accurate decomposition for each input shape. On diverse 3D benchmarks, our method achieves state-of-the-art results, improving IoU by over 9 points while using nearly half as many primitives as prior work. The resulting assemblies bridge the gap between dense 3D data and human-controllable design, producing high-fidelity and editable shape programs.
PaperID: 15,   Oral  https://arxiv.org/pdf/2602.15989     GitHub
Authors: Xitong Yang, Devansh Kukreja, Don Pinkus, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jia-Wei Liu, Nicolás Ugrinovic, Anushka Sagar, Jitendra Malik, Matt Feiszli, Piotr Dollár, Kris Kitani
Title: SAM 3D Body: Robust Full-Body Human Mesh Recovery
Abstract: We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR) that demonstrates state-of-the-art performance, with strong generalization and consistent accuracy in diverse in-the-wild conditions. 3DB estimates the human pose of the body, feet, and hands. It is the first model to use a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal pose and body shape. 3DB employs an encoder–decoder architecture and supports auxiliary prompts, including 2D keypoints and masks, enabling user-guided inference similar to the SAM family of models. We derive high-quality annotations from a multi-stage annotation pipeline that uses various combinations of manual keypoint annotation, differentiable optimization, multi-view geometry, and dense keypoint detection. Our data engine efficiently selects and processes data to ensure data diversity, collecting unusual poses and rare imaging conditions. We present a new evaluation dataset organized by pose and appearance categories, enabling nuanced analysis of model behavior. Our experiments demonstrate superior generalization and substantial improvements over prior methods in both qualitative user preference studies and traditional quantitative analysis. Both 3DB and MHR are open-source.
PaperID: 16,   Oral  https://arxiv.org/pdf/2602.22667     GitHub
Authors: Changqing Zhou, Yueru Luo, Han Zhang, Zeyu Jiang, Changhao Chen
Title: Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes
Abstract: Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs. free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian–language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released.
PaperID: 17,   Oral  https://arxiv.org/pdf/2511.22357     GitHub
Authors: Zhenglin Zhou, Fan Ma, Chengzhuo Gui, Xiaobo Xia, Hehe Fan, Yi Yang, Tat-seng Chua
Title: AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows
Abstract: Training-free 3D editing aims to modify 3D shapes based on human instructions without model finetuning. It plays a crucial role in 3D content creation. However, existing approaches often struggle to produce strong or geometrically stable edits, largely due to inconsistent latent anchors introduced by timestep-dependent noise during diffusion sampling. To address these limitations, we introduce AnchorFlow, which is built upon the principle of latent anchor consistency. Specifically, AnchorFlow establishes a global latent anchor shared between the source and target trajectories, and enforces coherence using a relaxed anchor-alignment loss together with an anchor-aligned update rule. This design ensures that transformations remain stable and semantically faithful throughout the editing process. By stabilizing the latent reference space, AnchorFlow enables more pronounced semantic modifications. Moreover, AnchorFlow is mask-free. Without mask supervision, it effectively preserves geometric fidelity. Experiments on the Eval3DEdit benchmark show that AnchorFlow consistently delivers semantically aligned and structurally robust edits across diverse editing types. The code and models will be made publicly available.
PaperID: 18,   Oral  https://arxiv.org/pdf/2511.21135     GitHub
Authors: Ziyi Chen, Yingnan Guo, Zedong Chu, Minghua Luo, Yanfen Shen, Mingchao Sun, Junjun Hu, Shichao Xie, Yang Kuan, Pei Shi, Zhining Gu, Lu Liu, Honglin Han, Xiaolong Wu, Mu Xu, Yu Zhang
Title: SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation
Abstract: Embodied navigation that adheres to social norms remains an open research challenge. Our SocialNav is a foundational model for socially-aware navigation with a hierarchical "brain-action" architecture, capable of understanding high-level social norms and generating low-level, socially compliant trajectories. To enable such dual capabilities, we construct the SocNav Dataset, a large-scale collection of 7 million samples, comprising (1) a Cognitive Activation Dataset providing social reasoning signals such as chain-of-thought explanations and social traversability prediction, and (2) an Expert Trajectories Pyramid aggregating diverse navigation demonstrations from internet videos, simulated environments, and real-world robots. A multi-stage training pipeline is proposed to gradually inject and refine navigation intelligence: we first inject general navigation skills and social norms understanding into the model via imitation learning, and then refine such skills through a deliberately designed Socially-Aware Flow Exploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly rewards socially compliant behaviors. SocialNav achieves +38% success rate and +46% social compliance rate compared to the state-of-the-art method, demonstrating strong gains in both navigation performance and social compliance. Data and code will be made publicly available.
PaperID: 19,   Oral  https://arxiv.org/pdf/2512.08924     GitHub
Authors: Chuhan Zhang, Guillaume LE MOING, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie, Shuyang Sun, Rahul Sukthankar, Joëlle Barral, Raia Hadsell, Zoubin Ghahramani, Andrew Zisserman, Junlin Zhang, Mehdi S. M. Sajjadi
Title: Efficiently Reconstructing Dynamic Scenes one D4RT at a Time
Abstract: Understanding and reconstructing the complex geometry and motion of dynamic 4D scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward network designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatiotemporal correspondence, and full camera parameters from a single video. Its core innovation is a novel mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our unified decoding interface allows the model to independently and efficiently probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state-of-the-art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks.
PaperID: 20,   Oral  https://arxiv.org/pdf/2512.16564     GitHub
Authors: Kirill Mazur, Marwan Taher, Andrew J. Davison
Title: 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction
Abstract: We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps. Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity. The resulting system enables 4D spatio-temporal awareness, offering capabilities such as replayable 3D reconstructions of articulated objects through time, multi-object scanning, and object permanence. On object scanning and multi-object datasets, our system significantly outperforms existing methods both quantitatively and qualitatively.
PaperID: 21,   Oral  https://arxiv.org/pdf/2512.09583     GitHub
Authors: Alberto Rota, Mert Kiray, Mert Asim Karaoglu, Patrick Ruhkamp, Elena De Momi, Nassir Navab, Benjamin Busam
Title: UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision
Abstract: Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery. We present UnReflectAnything, an RGB-only framework that removes highlights from a single image by predicting a highlight map together with a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module that restores corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, we introduce a Virtual Highlight Synthesis pipeline that renders physically plausible specularities using monocular geometry, Fresnel-aware shading, and randomized lighting, which enables training on arbitrary RGB images with correct geometric structure. UnReflectAnything generalizes across natural and surgical domains where non-Lambertian surfaces and non-uniform lighting create severe highlights, and it achieves performance competitive with state-of-the-art results on several benchmarks.
PaperID: 22,   Oral  https://arxiv.org/pdf/2512.10840     GitHub
Authors: Jianqi Chen, Biao Zhang, Xiangjun Tang, Peter Wonka
Title: PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning
Abstract: 6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects.
PaperID: 23,   Oral  https://arxiv.org/pdf/2604.10546     GitHub
Authors: SHIYIN JIANG, Wei Long, Minghao Han, Zhenghao Chen, Ce Zhu, Shuhang Gu
Title: Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression
Abstract: The proliferation of visual data under tight storage and bandwidth budgets makes extremely low-bitrate generative image compression increasingly important. Vector quantization (VQ) is compelling in this regime because codebooks encode cross-channel correlations and dataset-level semantics, enabling perceptually faithful reconstructions when bits are scarce. We propose RDVQ, a VQ-based generative image compression method designed for extremely low bitrates. End-to-end learned image codecs rely on a differentiable rate term for rate–distortion (RD) optimization, but naïvely integrating VQ introduces non-differentiability and is not directly compatible with entropy modeling, forcing prior work to regulate bitrate only indirectly. We resolve this by defining a distance-aware soft posterior over codebook indices and training a conditional autoregressive entropy model to predict it. The cross-entropy between the approximate and predicted posteriors then yields a differentiable rate loss, restoring a gradient pathway from rate to the encoder via codeword distances. The predicted codebook index distribution also enables prefix-only transmission at inference, with the model imputing the rest of the indices, delivering retraining-free bitrate control over a practical range. Our end-to-end RD-optimized RDVQ outperforms all baseline methods in terms of DISTS and CLIPIQA on the Kodak, DIV2K and CLIC2020 datasets, reflecting superior structural restoration and better alignment with human visual perception.
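A minimal sketch of the rate term described in the abstract above, for illustration only. The abstract does not give the exact form of the soft posterior, so the softmax over negative squared Euclidean distances, the `temperature` parameter, and the function names `soft_posterior` and `rate_bits` are all assumptions rather than the authors' implementation.

```python
import math

def soft_posterior(latent, codebook, temperature=1.0):
    """Assumed form: softmax over negative squared distances to each codeword."""
    dists = [sum((x - c) ** 2 for x, c in zip(latent, code)) for code in codebook]
    logits = [-d / temperature for d in dists]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rate_bits(target, predicted, eps=1e-12):
    """Cross-entropy H(target, predicted) in bits, used here as the rate estimate."""
    return -sum(t * math.log2(p + eps) for t, p in zip(target, predicted))

# Hypothetical 2-D codebook with four codewords.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
posterior = soft_posterior((0.9, 0.1), codebook, temperature=0.5)
uniform_prediction = [0.25] * 4
print(rate_bits(posterior, uniform_prediction))
```

Because the posterior depends smoothly on the codeword distances, a rate computed this way is differentiable with respect to the encoder output, which is the property the abstract emphasizes. In an actual codec the `predicted` distribution would come from the entropy model rather than a fixed uniform guess.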
PaperID: 24,   Oral  https://arxiv.org/pdf/2603.19918     GitHub
Authors: Jizhou Han, Chenhao Ding, Yuhang He, Qiang Wang, Shaokun Wang, SongLin Dong, Yihong Gong
Title: Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery
Abstract: Generalized Category Discovery (GCD) seeks to uncover novel categories in unlabeled data while preserving recognition of known categories, yet prevailing visual-only pipelines and the loose coupling between supervised learning and discovery often yield brittle boundaries on fine-grained, look-alike categories. We introduce the Analogical Textual Concept Generator (ATCG), a plug-and-play module that analogizes from labeled knowledge to new observations, forming textual concepts for unlabeled samples. Fusing these analogical textual concepts with visual features turns discovery into a visual–textual reasoning process, transferring prior knowledge to novel data and sharpening category separation. ATCG attaches to both parametric and clustering-style GCD pipelines and requires no changes to their overall design. Across six benchmarks, ATCG consistently improves overall, known-class, and novel-class performance, with the largest gains on fine-grained data.
PaperID: 25,   Oral  https://arxiv.org/pdf/2512.10958     GitHub
Authors: Alan Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Yu Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, Ziwei Liu
Title: WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World
Abstract: Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects - Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference - jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity - standardizing how future models are judged not only by how real they look, but by how real they behave.
PaperID: 26,   Oral  https://arxiv.org/pdf/2512.08930     GitHub
Authors: Youming Deng, Songyou Peng, Junyi Zhang, Kathryn Heal, Tiancheng Sun, John Flynn, Steve Marschner, Lucy Chai
Title: Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment
Abstract: Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach -- 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically-aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.
PaperID: 27,   Oral  https://arxiv.org/pdf/2601.15408     GitHub
Authors: Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sanchez, Carlos Hinojosa, Denis Parra, Alvaro Soto, Bernard Ghanem
Title: CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
Abstract: Medical vision–language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present "CURE", an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability.
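One way to picture performance-driven sampling of the kind the abstract above describes is sketched here. The abstract does not state the actual rule, so the per-task scores, the `1 - score` weighting, the `floor` parameter, and the names `sampling_weights` and `draw_task` are hypothetical choices made for illustration.

```python
import random

def sampling_weights(task_scores, floor=0.05):
    """Turn scores in [0, 1] (higher is better) into probabilities that favor weak tasks."""
    # The floor keeps well-learned tasks from disappearing from training entirely.
    raw = {task: max(1.0 - score, floor) for task, score in task_scores.items()}
    total = sum(raw.values())
    return {task: weight / total for task, weight in raw.items()}

def draw_task(weights, rng):
    """Sample one task name according to the given probabilities."""
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=1)[0]

# Hypothetical validation scores for the three training tasks named in the abstract.
scores = {
    "phrase_grounding": 0.80,
    "grounded_report": 0.55,
    "anatomy_grounded_report": 0.40,
}
weights = sampling_weights(scores)
next_task = draw_task(weights, random.Random(0))
print(weights, next_task)
```

Recomputing the weights after each evaluation round would shift sampling toward whichever task currently performs worst, which is the general idea of an error-aware curriculum.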
PaperID: 28,   Oral  https://arxiv.org/pdf/2603.22650     GitHub
Authors: Shiyao Li, Antoine Guédon, Shizhe Chen, Vincent Lepetit
Title: MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
Abstract: Active mapping aims to determine how an agent should move to efficiently reconstruct an unknown environment. Most existing approaches rely on greedy next-best-view prediction, resulting in inefficient exploration and incomplete scene reconstruction. To address this limitation, we introduce MAGICIAN, a novel long-term planning framework that maximizes accumulated surface coverage gain through Imagined Gaussians, a predicted scene representation derived from a pre-trained occupancy network with strong structural priors. This representation enables efficient computation of coverage gain for any novel viewpoint via fast volumetric rendering. The resulting speedup allows the integration of the gain metric into a tree-search algorithm for planning long-horizon paths. We update Imagined Gaussians and refine the planned trajectory in a closed-loop manner. Our method achieves state-of-the-art performance across indoor and outdoor benchmarks with varying action spaces, demonstrating the critical advantage of long-term planning in active mapping.
PaperID: 29,   Oral  https://arxiv.org/pdf/2503.06940     GitHub
Authors: Jianxiong Gao, Yichang Liu, baofeng yang, Jianfeng Feng, Yanwei Fu
Title: CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing
Abstract: Most research decoding brain signals into images, often using them as priors for generative models, has focused only on visual content. This overlooks the brain's natural ability to integrate auditory and visual information; for instance, sound strongly influences how we perceive visual scenes. To investigate this, we propose a new task of reconstructing continuous video stimuli from multimodal brain signals recorded during audiovisual stimulation. To enable this, we introduce CineBrain, the first large-scale dataset that synchronizes fMRI and EEG during audiovisual viewing, featuring six hours of The Big Bang Theory episodes for cross-modal alignment. We also conduct the first systematic exploration of combining fMRI and EEG for video reconstruction and present CineSync, a framework for reconstructing dynamic video using a Multi-Modal Fusion Encoder and a Neural Latent Decoder. CineSync achieves state-of-the-art performance in dynamic reconstruction, leveraging the complementary strengths of fMRI and EEG to improve visual fidelity. Our analysis shows that auditory cortical activations enhance decoding accuracy, highlighting the role of auditory input in visual perception.
PaperID: 30,   Oral  https://arxiv.org/pdf/2304.02296     GitHub
Authors: Yeshwanth Kumar Adimoolam, Charalambos Poullis, Melinos Averkiou
Title: Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets
Abstract: In our study, we conducted a comprehensive analysis of three widely used datasets in the domain of building footprint extraction using deep neural networks: the INRIA Aerial Image Labelling dataset, SpaceNet 2: Building Detection v2, and the AICrowd Mapping Challenge datasets. Our experiments revealed several issues in the AICrowd Mapping Challenge dataset, where nearly 90% (about 250k) of the training split images had identical copies, indicating a high level of duplicate data. Additionally, we found that approximately 56k of the 60k images in the validation split were also present in the training split, amounting to a 93% data leakage. Furthermore, we present a data validation pipeline to address these issues of duplication and data leakage, which hinder the performance of models trained on such datasets. Employing perceptual hashing techniques, this pipeline is designed for efficient deduplication and leakage identification. It aims to thoroughly evaluate the quality of datasets before their use, thereby ensuring the reliability and robustness of the trained models.
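The leakage check described above can be illustrated with a small sketch based on an average hash, one common perceptual hash. The abstract does not say which hashing scheme the pipeline uses, so the hash choice, the `max_distance` threshold, and the function names are assumptions. Images are assumed to be already reduced to small grayscale grids (for example 8×8) and are represented as nested lists.

```python
def average_hash(pixels):
    """One bit per pixel: 1 when the pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def find_leaks(train, val, max_distance=0):
    """Names of validation images whose hash is near any training hash."""
    train_hashes = [average_hash(img) for img in train.values()]
    leaked = []
    for name, img in val.items():
        h = average_hash(img)
        if any(hamming(h, t) <= max_distance for t in train_hashes):
            leaked.append(name)
    return leaked

# Toy 2x2 "images": v1 is a slightly brightness-jittered copy of t1, v2 is different.
train = {"t1": [[10, 200], [10, 200]]}
val = {"v1": [[12, 198], [11, 201]], "v2": [[200, 10], [200, 10]]}
print(find_leaks(train, val))
```

On this toy input only `v1` is flagged, since its hash matches `t1` exactly while `v2` differs in every bit. For splits with hundreds of thousands of images, storing training hashes in a set or an index keyed on hash value would avoid the pairwise comparison used here.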
PaperID: 31,   Oral  https://arxiv.org/pdf/2601.18336     GitHub
Authors: Isaac Deutsch, Nicolas Moënne-Loccoz, Gavriel State, Žan Gojčič
Title: PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction
Abstract: Multiview 3D reconstruction methods remain highly sensitive to photometric inconsistencies arising from camera optical characteristics and variations in image signal processing (ISP). Existing mitigation strategies such as per-frame latent variables or affine color corrections lack physical grounding and generalize poorly to novel views. We propose the Physically-Plausible ISP (PPISP) correction module, which disentangles camera-intrinsic and capture-dependent effects through physically based and interpretable transformations. A dedicated PPISP controller, trained on the input views, predicts ISP parameters for novel viewpoints, analogous to auto exposure and auto white balance in real cameras. This design enables realistic and fair evaluation on novel views without access to ground-truth images. PPISP achieves SoTA performance on standard benchmarks, while providing intuitive control and supporting the integration of metadata when available.
PaperID: 32,   Oral  https://arxiv.org/pdf/2512.13192     GitHub
Authors: Zhuo Chen, Chengqun Yang, Zhuo Su, Zheng Lv, Jingnan Gao, Xiaoyuan Zhang, Xiaokang Yang, Yichao Yan
Title: POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling
Abstract: Face relighting aims to synthesize realistic portraits under novel illumination while preserving identity and geometry. However, progress remains constrained by the limited availability of large-scale, physically consistent illumination data. To address this, we introduce POLAR, a large-scale and physically calibrated One-Light-at-a-Time (OLAT) dataset containing over 200 subjects captured under 156 lighting directions, multiple views, and diverse expressions. Building upon POLAR, we develop a flow-based generative model POLARNet that predicts per-light OLAT responses from a single portrait, capturing fine-grained and direction-aware illumination effects while preserving facial identity. Unlike diffusion or background-conditioned methods that rely on statistical or contextual cues, our formulation models illumination as a continuous, physically interpretable transformation between lighting states, enabling scalable and controllable relighting. Together, POLAR and POLARNet form a unified illumination learning framework that links real data, generative synthesis, and physically grounded relighting, establishing a self-sustaining “chicken-and-egg” cycle for scalable and reproducible portrait illumination.
PaperID: 33,   Oral  https://arxiv.org/pdf/2604.08924     GitHub
Authors: Zengyi Yang, Yu Liu, Juan Cheng, Zhiqin Zhu, Yafei Zhang, Huafeng Li
Title: Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion
Abstract: Infrared-visible image fusion aims to integrate complementary information for robust visual understanding, but existing fusion methods struggle with simultaneously adapting to multiple downstream tasks. To address this issue, we propose a Closed-Loop Dynamic Network (CLDyN) that can adaptively respond to the semantic requirements of diverse downstream tasks for task-customized image fusion. Specifically, CLDyN introduces a closed-loop optimization mechanism that establishes a semantic transmission chain to achieve explicit feedback from downstream tasks to the fusion network through a Requirement-driven Semantic Compensation (RSC) module. The RSC module leverages a Basis Vector Bank (BVB) and an Architecture-Adaptive Semantic Injection (A2SI) block to customize the network architecture according to task requirements, thereby enabling task-specific semantic compensation and allowing the fusion network to actively adapt to diverse tasks without retraining. To promote accurate semantic compensation, a reward-penalty strategy is introduced to reward or penalize the RSC module based on task performance variations. Experiments on the M3FD, FMB, and VT5000 datasets demonstrate that CLDyN not only maintains high fusion quality but also exhibits strong multi-task adaptability.
PaperID: 34,   Oral  https://arxiv.org/pdf/2511.22950     GitHub
Authors: Haiyang Mei, Qiming Huang, Hai Ci, Mike Zheng Shou
Title: RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
Abstract: Accurate robot segmentation is a fundamental capability for robotic perception. It enables precise construction of digital twins and world models for robotic applications, supports robot-centric data augmentation, and provides reliable cues for extracting robot actions and poses. Despite the strong capabilities of modern segmentation models, it remains surprisingly challenging to segment robots. This is due to robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes. Embracing these challenges, we introduce RobotSeg, a foundation model for robot segmentation in image and video. RobotSeg is built upon the versatile SAM 2 foundation model but addresses its three limitations for robot segmentation, namely the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations, by introducing a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy. These innovations collectively enable a structure-aware, automatic, and label-efficient solution. We further construct the video robot segmentation (VRS) dataset comprising over 2.8k videos (138k frames) with diverse robot embodiments and environments. Extensive experiments demonstrate that RobotSeg achieves state-of-the-art performance on both images and videos, establishing a strong foundation for future advances in robot perception.
PaperID: 35,   Oral  https://arxiv.org/pdf/2511.12691     GitHub
Authors: Shuaike Shen, Ke Liu, Jiaqing Xie, Shangde Gao, Chunhua Shen, Ge Liu, Mireia Crispin-Ortuzar, Shangqi Gao
Title: R$^2$-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection
Abstract: Foundation models for medical image segmentation struggle under out-of-distribution (OOD) shifts, often producing fragmented false positives on OOD tumors. We introduce R^2-Seg, a training-free framework for robust OOD tumor segmentation that operates via a two-stage Reason-and-Reject process. First, the Reason step employs an LLM-guided anatomical reasoning planner to localize organ anchors and generate multi-scale ROIs. Second, the Reject step applies two-sample statistical testing to candidates generated by a frozen foundation model (BiomedParse) within these ROIs. This statistical rejection filter retains only candidates significantly different from normal tissue, effectively suppressing false positives. Our framework requires no parameter updates, making it compatible with zero-update test-time augmentation and avoiding catastrophic forgetting. On multi-center and multi-modal tumor segmentation benchmarks, R^2-Seg substantially improves Dice, specificity, and sensitivity over strong baselines and the original foundation models.
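The Reject step above can be pictured with a simple two-sample comparison of intensities. The abstract does not name the test that R^2-Seg uses, so the Welch t statistic, the fixed `t_threshold`, and the names `welch_t` and `keep_candidate` are assumptions chosen for a dependency-free sketch.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # Degenerate case: both samples are constant.
        return 0.0 if mean(a) == mean(b) else math.inf
    return (mean(a) - mean(b)) / se

def keep_candidate(candidate, normal, t_threshold=3.0):
    """Retain a candidate region only if it differs clearly from normal tissue."""
    return abs(welch_t(candidate, normal)) > t_threshold

# Hypothetical normalized intensities sampled from each region.
normal_tissue = [0.50, 0.52, 0.48, 0.51, 0.49]
bright_candidate = [0.80, 0.82, 0.79, 0.81, 0.78]
lookalike_candidate = [0.51, 0.49, 0.50, 0.52, 0.48]

print(keep_candidate(bright_candidate, normal_tissue))
print(keep_candidate(lookalike_candidate, normal_tissue))
```

A candidate whose intensity distribution is indistinguishable from the normal-tissue sample is dropped, which mirrors how a rejection filter of this kind suppresses false positives without any parameter updates. A production version would more likely compare p-values against a significance level than use a fixed statistic threshold.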
PaperID: 36,   Oral  https://arxiv.org/pdf/2509.24421     GitHub
Authors: Yuanyuan Gao, YUNING GONG, Yifei Liu, Li Jingfeng, Dan Xu, Yanci Zhang, Dingwen Zhang, Xiao Sun, Zhihang Zhong
Title: Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting
Abstract: 3D Gaussian Splatting (3DGS) has emerged as an efficient approach for achieving photorealistic rendering. Recent MLP-based variants further improve visual fidelity but introduce substantial decoding overhead during rendering. To alleviate computation cost, several pruning strategies and level-of-detail (LOD) techniques have been introduced, aiming to effectively reduce the number of Gaussian primitives in large-scale scenes. However, our analysis reveals that significant redundancy still remains due to the lack of occlusion awareness. In this work, we propose Proxy-GS, a novel pipeline that exploits a proxy to introduce Gaussian occlusion awareness from any view. At the core of our approach is a fast proxy system capable of producing precise occlusion depth maps at resolution 1000×1000 under 1 ms. This proxy serves two roles: first, it guides the culling of anchors and Gaussians to accelerate rendering speed. Second, it guides the densification towards surfaces during training, avoiding inconsistencies in occluded regions, and improving the rendering quality. In heavily occluded scenarios, such as the MatrixCity Streets dataset, Proxy-GS not only equips MLP-based Gaussian splatting with stronger rendering capability but also achieves faster rendering speed than the original 3DGS. Specifically, it achieves more than 2.5× speedup over Octree-GS, and consistently delivers substantially higher rendering quality.
PaperID: 37,   Oral  https://arxiv.org/pdf/2602.19134    
Authors: Lord Sen, Shyamapada Mukherjee
Title: Mapping Networks
Abstract: The escalating parameter counts in modern deep learning models pose a fundamental challenge to efficient training and resolution of overfitting. We address this by introducing Mapping Networks, which replace the high-dimensional weight space with a compact, trainable latent vector, based on the hypothesis that the trained parameters of large networks reside on smooth, low-dimensional manifolds. The Mapping Theorem, enforced by a dedicated Mapping Loss, shows the existence of a mapping from this latent space to the target weight space both theoretically and in practice. Mapping Networks significantly reduce overfitting and achieve comparable or better performance than the target network across complex vision and sequence tasks, including Image Classification, Deepfake Detection etc., with 99.5%, i.e., around 500× reduction in trainable parameters.
PaperID: 38,   Oral  https://arxiv.org/pdf/2604.13596    
Authors: Yulu Gao, Bohao Zhang, Zongheng Tang, Jitong Liao, wenjun wu, Si Liu
Title: VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
Abstract: Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, coarse point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong zero-shot generalization. On the challenging Ego–Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego→Exo and Exo→Ego tasks, respectively, significantly outperforming prior methods. Notably, our zero-shot model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach.
PaperID: 39,   Oral  https://arxiv.org/pdf/2512.08282    
Authors: Oh Hyun-Bin, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh, Yuki Mitsufuji
Title: PAVAS: Physics-Aware Video-to-Audio Synthesis
Abstract: Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object–object interactions, and introduce Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations.
PaperID: 40,   Oral  https://arxiv.org/pdf/2602.23574    
Authors: Ruxiao Duan, Alex Wong
Title: Evidential Neural Radiance Fields
Abstract: Understanding sources of uncertainty is fundamental to trustworthy three-dimensional scene modeling. While recent advances in neural radiance fields (NeRFs) achieve impressive accuracy in scene reconstruction and novel view synthesis, the lack of uncertainty estimation significantly limits their deployment in safety-critical settings. Existing uncertainty quantification methods for NeRFs fail to capture both aleatoric and epistemic uncertainty. Among those that do quantify one or the other, many of them either compromise rendering quality or incur significant computational overhead to obtain uncertainty estimates. To address these issues, we introduce Evidential Neural Radiance Fields, a probabilistic approach that seamlessly integrates with the NeRF rendering process and enables direct quantification of both aleatoric and epistemic uncertainty from a single forward pass. We compare multiple uncertainty quantification methods on three standardized benchmarks, where our approach demonstrates state-of-the-art scene reconstruction fidelity and uncertainty estimation quality.
PaperID: 41,   Oral  https://arxiv.org/pdf/2603.09285    
Authors: Yuezhi Yang, Qixing Huang, Mikaela Angelina Uy, Nicholas Sharp
Title: Learning Convex Decomposition via Feature Fields
Abstract: This work proposes a new formulation to the longstanding problem of convex decomposition through learning feature fields, enabling the first feed-forward model for open-world learning of convex decomposition. Our method produces high-quality decompositions of 3D shapes into a union of convex bodies, which are essential to accelerate collision detection in physical simulation, amongst many other applications. The key insight is to adopt a feature learning approach and learn a continuous feature field that can later be clustered to yield a good convex decomposition via our self-supervised, purely-geometric objective derived from the classical definition of convexity. Our formulation can be used for single shape optimization, but more importantly, feature prediction unlocks scalable, self-supervised learning on large datasets, resulting in the first learned open-world model for convex decomposition. Experiments show that our decompositions are higher-quality than alternatives and generalize across open-world objects as well as across representations to meshes, CAD models, and even Gaussian splats.
PaperID: 42,   Oral  https://arxiv.org/pdf/2512.01103    
Authors: Roy Velich, Arkadi Piven, David Bensaid, Daniel Cremers, Thomas Dagès, Ron Kimmel
Title: Learning Eigenstructures of Unstructured Data Manifolds
Abstract: We introduce a novel framework that directly learns a spectral basis for shape and manifold analysis from unstructured data, eliminating the need for traditional operator selection, discretization, and eigensolvers. Grounded in optimal-approximation theory, we train a network to decompose an implicit approximation operator by minimizing the reconstruction error in the learned basis over a chosen distribution of probe functions. For suitable distributions, they can be seen as an approximation of the Laplacian operator and its eigendecomposition, which are fundamental in geometry processing. Furthermore, our method recovers in a unified manner not only the spectral basis, but also the implicit metric's sampling density and the eigenvalues of the underlying operator. Notably, our unsupervised method makes no assumption on the data manifold, such as meshing or manifold dimensionality, allowing it to scale to arbitrary datasets of any dimension. On point clouds lying on surfaces in 3D and high-dimensional image manifolds, our approach yields meaningful spectral bases that can resemble those of the Laplacian, without explicit construction of an operator. By replacing the traditional operator selection, construction, and eigendecomposition with a learning-based approach, our framework offers a principled, data-driven alternative to conventional pipelines. This opens new possibilities in geometry processing for unstructured data, particularly in high-dimensional spaces.
PaperID: 43,   Oral  https://arxiv.org/pdf/2511.22039    
Authors: Jiayuan Du, Yiming Zhao, Zhenglong Guo, Yong Pan, Wenbo Hou, Zhihui Hao, Kun Zhan, Qijun Chen
Title: SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model
Abstract: This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird’s eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state-of-the-art performance on the nuScenes benchmark for 1–3 second occupancy forecasting, outperforming existing approaches by a significant margin. Furthermore, it demonstrates robust scene dynamics understanding, consistently delivering high accuracy under arbitrary future trajectory conditioning.
PaperID: 44,   Oral  https://arxiv.org/pdf/2509.14476    
Authors: Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, Yinfei Yang
Title: AToken: A Unified Tokenizer for Vision
Abstract: We present AToken, the first unified visual tokenizer that achieves both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. Unlike existing tokenizers that specialize in either reconstruction or understanding for single modalities, AToken encodes these diverse visual inputs into a shared 4D latent space, unifying both tasks and modalities in a single framework. Specifically, we introduce a pure transformer architecture with 4D rotary position embeddings to process visual inputs of arbitrary resolutions and temporal durations. To ensure stable training, we introduce an adversarial-free training objective that combines perceptual and Gram matrix losses, achieving state-of-the-art reconstruction quality. By employing a progressive training curriculum, AToken gradually expands from single images to videos and 3D, and supports both continuous and discrete latent tokens. AToken achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01 rFVD with 40.2% MSRVTT retrieval for videos, and 28.28 PSNR with 90.9% classification accuracy for 3D. In downstream applications, AToken enables both visual generation tasks (e.g., image generation with continuous and discrete tokens, text-to-video generation, image-to-3D synthesis) and understanding tasks (e.g., multimodal LLMs), achieving competitive performance across all benchmarks. These results shed light on next-generation multimodal AI systems built upon unified visual tokenization.
PaperID: 45,   Oral  https://arxiv.org/pdf/2603.14001    
Authors: Jiale Wu, Xiaoyang Bai, Zongqi He, Weiwei Xu, Yifan Peng
Title: PhyGaP: Physically-Grounded Gaussians with Polarization Cues
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated great success in modeling reflective 3D objects and their interaction with the environment via deferred rendering (DR). However, existing methods often struggle with correctly reconstructing physical attributes such as albedo and reflectance, and therefore they do not support high-fidelity relighting. Observing that this limitation stems from the lack of shape and material information in RGB images, we present PhyGaP, a physically-grounded 3DGS method that leverages polarization cues to facilitate precise reflection decomposition and visually consistent relighting of reconstructed objects. Specifically, we design a polarimetric deferred rendering (PolarDR) process to model polarization by reflection, and a self-occlusion-aware environment map building technique (GridMap) to resolve indirect lighting of non-convex objects. We validate on multiple synthetic and real-world scenes, including those featuring only partial polarization cues, that PhyGaP not only excels in reconstructing the appearance and surface normal of reflective 3D objects (~2 dB in PSNR and 45.7% in Cosine Distance better than existing RGB-based methods on average), but also achieves state-of-the-art inverse rendering and relighting capability.
PaperID: 46,   Oral  https://arxiv.org/pdf/2603.16732    
Authors: Ziquan Zhu, Gaojie Jin, Hanruo Zhu, Si-Yuan Lu, Yunxiao Zhang, Zeyu Fu, Ronghui Mu, Guoqiang Zhang, Zhao Sun, Xia Yuhang, Jiaxing Shang, Xiang Li, Lu Liu, Tianjin Huang
Title: Confusion-Aware Spectral Regularizer for Long-Tailed Recognition
Abstract: Long-tailed image classification remains a long-standing challenge, as real-world data typically follow highly imbalanced distributions where a few head classes dominate and many tail classes contain only limited samples. This imbalance biases feature learning toward head categories and leads to significant degradation on rare classes. Although recent studies have proposed re-sampling, re-weighting, and decoupled learning strategies, the improvement on the most underrepresented classes remains marginal compared with overall accuracy. In this work, we present a confusion-centric perspective for long-tailed recognition that explicitly focuses on worst-class generalization. We first establish a new theoretical framework of class-specific error analysis, which shows that the worst-class error can be tightly upper-bounded by the spectral norm of the frequency-weighted confusion matrix and a model-dependent complexity term. Guided by this insight, we propose the Confusion-Aware Spectral Regularizer (CAR) that minimizes the spectral norm of the confusion matrix during training to reduce inter-class confusion and enhance tail-class generalization. To enable stable and efficient optimization, CAR integrates a Differentiable Confusion Matrix Surrogate and an EMA-based Confusion Estimator to maintain smooth and low-variance estimates across mini-batches. Extensive experiments across multiple long-tailed benchmarks demonstrate that CAR substantially improves both worst-class accuracy and overall performance. When combined with ConCutMix augmentation, CAR consistently surpasses existing state-of-the-art long-tailed learning methods under both the training-from-scratch setting (by 2.37% ~ 4.83%) and the fine-tuning-from-pretrained setting (by 2.42% ~ 4.17%) across ImageNet-LT, CIFAR100-LT, and iNaturalist datasets.
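Illustrative note (not from the paper): the quantity the abstract regularizes, the spectral norm of a confusion matrix, can be sketched in a few lines. The snippet below is a minimal, non-differentiable sketch under stated assumptions: `confusion_rates` (row-normalized counts with the diagonal zeroed) and `spectral_norm` (power iteration) are hypothetical helper names, and the paper's frequency weighting, differentiable surrogate, and EMA estimator are not reproduced here.

```python
import math


def confusion_rates(counts):
    """Row-normalize raw confusion counts and zero the diagonal.

    Assumption for illustration only: entry (i, j) becomes the fraction of
    class-i samples predicted as class j, with correct predictions dropped.
    """
    rates = []
    for i, row in enumerate(counts):
        total = sum(row)
        rates.append([
            0.0 if (j == i or total == 0) else value / total
            for j, value in enumerate(row)
        ])
    return rates


def spectral_norm(matrix, iters=200):
    """Largest singular value of `matrix`, via power iteration on M^T M."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        # u = M v, then w = M^T u
        u = [sum(matrix[r][c] * v[c] for c in range(cols)) for r in range(rows)]
        w = [sum(matrix[r][c] * u[r] for r in range(rows)) for c in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0  # zero matrix (or v in the null space)
        v = [x / norm for x in w]
    u = [sum(matrix[r][c] * v[c] for c in range(cols)) for r in range(rows)]
    return math.sqrt(sum(x * x for x in u))


if __name__ == "__main__":
    counts = [[8, 2], [1, 9]]  # toy two-class confusion counts
    print(spectral_norm(confusion_rates(counts)))
```

A smaller value means the off-diagonal confusion mass is both smaller and less concentrated along any single direction, which is the intuition behind penalizing it during training.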
PaperID: 47,   Oral  https://arxiv.org/pdf/2511.16624    
Authors: Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jia-Wei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing "Jed" Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, Jitendra Malik
Title: SAM 3D: 3Dfy Anything in Images
Abstract: We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
PaperID: 48,   Oral  https://arxiv.org/pdf/2511.18787    
Authors: Bhuvan Sachdeva, Karan Uppal, Abhinav Java, Vineeth Balasubramanian
Title: Understanding Task Transfer in Vision-Language Models
Abstract: Vision–Language Models (VLMs) perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks like depth estimation or object counting. Finetuning on one task can unpredictably affect performance on others, making task-specific finetuning challenging. In this paper, we address this challenge through a systematic study of task transferability. We examine how finetuning a VLM on one perception task affects its zero-shot performance on others. To quantify these effects, we introduce Perfection Gap Factor (PGF), a metric that captures both the breadth and magnitude of transfer. Using three open-weight VLMs evaluated across 13 perception tasks, we construct a task-transfer graph that reveals previously unobserved relationships among perception tasks. Our analysis uncovers patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, organizes tasks into personas based on their transfer behavior, and demonstrates how PGF can guide data selection for more efficient training. These findings highlight both opportunities for positive transfer and risks of negative interference, offering actionable guidance for advancing VLMs.
PaperID: 49,   Oral  https://arxiv.org/pdf/2602.22639    
Authors: Daniel Miao, Gilad Lerman, Joe Kileel
Title: QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition
Abstract: In structure from motion, quadrifocal tensors capture more information than their pairwise counterparts (essential matrices), yet they have often been thought of as impractical and only of theoretical interest. In this work, we challenge such beliefs by providing a new framework to recover n cameras from the corresponding collection of quadrifocal tensors. We form the block quadrifocal tensor and show that it admits a Tucker decomposition whose factor matrices are the stacked camera matrices, and which thus has a multilinear rank of (4,4,4,4) independent of n. We develop the first synchronization algorithm for quadrifocal tensors, using Tucker decomposition, alternating direction method of multipliers, and iteratively reweighted least squares. We further establish relationships between the block quadrifocal, trifocal, and bifocal tensors, and introduce an algorithm that jointly synchronizes these three entities. Numerical experiments demonstrate the effectiveness of our methods on modern datasets, indicating the potential and importance of using higher-order information in synchronization.
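Illustrative note (not from the paper): the structural claim in the abstract can be written out schematically. The notation below is an assumption made for illustration: $P_i \in \mathbb{R}^{3 \times 4}$ denotes the $i$-th camera matrix, $C$ the stacked camera matrix, $\mathcal{Q}$ the block quadrifocal tensor, and $\mathcal{G}$ a core tensor. The sketch also assumes the same stacked matrix appears as the factor in every mode; the paper's exact construction may differ.

```latex
C = \begin{bmatrix} P_1 \\ P_2 \\ \vdots \\ P_n \end{bmatrix} \in \mathbb{R}^{3n \times 4},
\qquad
\mathcal{Q} = \mathcal{G} \times_1 C \times_2 C \times_3 C \times_4 C,
\qquad
\mathcal{G} \in \mathbb{R}^{4 \times 4 \times 4 \times 4},
\quad
\mathcal{Q} \in \mathbb{R}^{3n \times 3n \times 3n \times 3n}.
```

Under this form, every mode-$k$ unfolding of $\mathcal{Q}$ has its column space inside the span of the four columns of $C$, so each mode rank is at most 4 no matter how many cameras are stacked. This is why the multilinear rank does not grow with $n$.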
PaperID: 50,   Oral  https://arxiv.org/pdf/2601.08832    
Authors: Fahad Shamshad, Nils Lukas, Karthik Nandakumar
Title: Erasing Invisible Watermarks via Novel View Synthesis
Abstract: Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative "view" of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffusion-based framework that applies controlled geometric transformations in latent space, augmented with view-guided correspondence attention to maintain structural consistency during reconstruction. Operating on frozen pre-trained models without detector access or watermark knowledge, our method achieves state-of-the-art watermark suppression across 15 watermarking methods, outperforming 14 baseline attacks while maintaining superior perceptual quality across multiple datasets.
PaperID: 51,   Oral  https://arxiv.org/pdf/2511.20645    
Authors: Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo
Title: PixelDiT: Pixel Diffusion Transformers for Image Generation
Abstract: Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. PixelDiT achieves 1.61 FID on ImageNet 256 and 2.21 FID on ImageNet 512, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024^2 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.
PaperID: 52,   Oral  https://arxiv.org/pdf/2603.13783    
Authors: Xuezhen Wang, Li Ma, Yulin Shen, Zeyu Wang, Pedro V. Sander
Title: RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting
Abstract: Temporal retiming, the ability to reconstruct and render dynamic scenes at arbitrary timestamps, is crucial for applications such as slow-motion playback, temporal editing, and post-production. However, most existing 4D Gaussian Splatting (4DGS) methods overfit at discrete frame indices but struggle to represent continuous-time frames, leading to ghosting artifacts when interpolating between timestamps. We identify this limitation as a form of temporal aliasing and propose RetimeGS, a simple yet effective 4DGS representation that explicitly defines the temporal behavior of the 3D Gaussian and mitigates temporal aliasing. To achieve smooth and consistent interpolation, we incorporate optical flow–guided initialization and supervision, triple-rendering supervision, and other targeted strategies. Together, these components enable ghost-free, temporally coherent rendering even under large motions. Experiments on datasets featuring fast motion, non-rigid deformation, and severe occlusions demonstrate that RetimeGS achieves superior quality and coherence over state-of-the-art methods.
PaperID: 53,   Oral  https://arxiv.org/pdf/2512.01989    
Authors: Fengzhe Zhou, Jiannan Huang, Jialuo Li, Deva Ramanan, Humphrey Shi
Title: PAI-Bench: A Comprehensive Benchmark For Physical AI
Abstract: Physical AI aims to develop models that can perceive and predict real-world dynamics; yet, the extent to which current multi-modal large language models and video generative models support these abilities is insufficiently understood. We introduce Physical AI Bench (PAI-Bench), a unified and comprehensive benchmark that evaluates perception and prediction capabilities across video generation, conditional video generation, and video understanding, comprising 2,808 real-world cases with task-aligned metrics designed to capture physical plausibility and domain-specific reasoning. Our study provides a systematic assessment of recent models and shows that video generative models, despite strong visual fidelity, often struggle to maintain physically coherent dynamics, while multi-modal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in handling the perceptual and predictive demands of Physical AI. In summary, PAI-Bench establishes a realistic foundation for evaluating Physical AI and highlights key gaps that future systems must address.
PaperID: 54,   Oral  https://arxiv.org/pdf/2602.19083    
Authors: Liangsi Lu, Xuhang Chen, Minzhe Guo, Shichu Li, Jingchao Wang, Yang Shi
Title: ChordEdit: One-Step Low-Energy Transport for Image Editing
Abstract: The advent of one-step text-to-image (T2I) models offers unprecedented synthesis speed. However, their application to text-guided image editing remains severely hampered, as forcing existing training-free editors into a single inference step fails. This failure manifests as severe object distortion and a critical loss of consistency in non-edited regions, resulting from the high-energy, erratic trajectories produced by naive vector arithmetic on the models' structured fields. To address this problem, we introduce ChordEdit, a model-agnostic, training-free, and inversion-free method that facilitates high-fidelity one-step editing. We recast editing as a transport problem between the source and target distributions defined by the source and target text prompts. Leveraging dynamic optimal transport theory, we derive a principled, low-energy control strategy. This strategy yields a smoothed, variance-reduced editing field that is inherently stable, allowing the field to be traversed in a single, large integration step. A theoretically grounded and experimentally validated approach allows ChordEdit to deliver fast, lightweight, and precise edits, finally achieving true real-time editing on these challenging models.
PaperID: 55,   Oral  https://arxiv.org/pdf/2511.21145    
Authors: Jiaming He, Guanyu Hou, Hongwei Li, Zhicong Huang, Kangjie Chen, Yi Yu, Wenbo Jiang, Guowen Xu, Tianwei Zhang
Title: TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models
Abstract: Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods, which focus on static image and text generation, are insufficient to capture the complex temporal dynamics in video generation. To address this, we propose TEAR, a TEmporal-aware Automated Red-teaming framework designed to uncover safety risks specifically linked to the dynamic temporal sequencing of T2V models. TEAR employs a temporal-aware test generator optimized via a two-stage approach, initial generator training followed by temporal-aware online preference learning, to craft textually innocuous prompts that exploit temporal dynamics to elicit policy-violating video output. A refinement model is then applied cyclically to improve prompt stealthiness and adversarial effectiveness. Extensive experimental evaluation demonstrates the effectiveness of TEAR across open-source and commercial T2V systems with over 80% attack success rate, a significant boost over the prior best result of 57%.
PaperID: 56,   Oral  https://arxiv.org/pdf/2603.03960    
Authors: Xiaohan Lei, Min Wang, Bohong Weng, Wengang Zhou, Houqiang Li
Title: Structural Action Transformer for 3D Dexterous Manipulation
Abstract: Achieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity. This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories. This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length. To encode structural priors and resolve ambiguity, we introduce an Embodied Joint Codebook that embeds each joint's functional role and kinematic properties. Our model learns to generate these trajectories from 3D point clouds via a continuous-time flow matching objective. We validate our approach by pre-training on large-scale heterogeneous datasets and fine-tuning on simulation and real-world dexterous manipulation tasks. Our method consistently outperforms all baselines, demonstrating superior sample efficiency and effective cross-embodiment skill transfer. This structural-centric representation offers a new path toward scaling policies for high-DoF, heterogeneous manipulators.
PaperID: 57,   Oral  https://arxiv.org/pdf/2512.05564    
Authors: Zijun Wang, Panwen Hu, Jing Wang, Terry Zhang, Yuhao Cheng, Long Chen, Yiqiang Yan, Zutao Jiang, Hanhui Li, Xiaodan Liang
Title: ProPhy: Progressive Physical Alignment for Dynamic World Simulation
Abstract: Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.
PaperID: 58,   Oral  https://arxiv.org/pdf/2601.02141    
Authors: Romain Vo, Julián Tachella
Title: Efficient unrolled networks for large-scale 3D inverse problems
Abstract: Deep learning-based methods have revolutionized the field of imaging inverse problems, yielding state-of-the-art performance across various imaging domains. The best performing networks incorporate the imaging operator within the network architecture, typically in the form of deep unrolling. However, in large-scale problems, such as 3D imaging, most existing methods fail to incorporate the operator in the architecture due to the prohibitive amount of memory required by global forward operators, which hinder typical patching strategies. In this work, we present a domain partitioning strategy and normal operator approximations that enable the training of end-to-end reconstruction models incorporating forward operators of arbitrarily large problems into their architecture. The proposed method achieves state-of-the-art performance on 3D X-ray cone-beam tomography and 3D multi-coil accelerated MRI, while requiring only a single GPU for both training and inference.
PaperID: 59,   Oral  https://arxiv.org/pdf/2604.18811    
Authors: Priyam Dey, Aditya Sahdev, Sunny Bhati, Konda Reddy Mopuri, R. Venkatesh Babu
Title: Rethinking Dataset Distillation: Hard Truths about Soft Labels
Abstract: Despite the perceived success of large-scale dataset distillation (DD) methods, recent evidence \cite{qin2024a} finds that simple random image baselines perform on-par with state-of-the-art DD methods like SRe2L \cite{yin2024squeezerecoverrelabeldataset} due to the use of soft labels during downstream model training. This is in contrast with the findings in coreset literature, where high-quality coresets consistently outperform random subsets in the hard-label (HL) setting. To understand this discrepancy, we perform a detailed scalability analysis to examine the role of data quality under different label regimes, ranging from abundant soft labels (termed the SL+KD regime) to fixed soft labels (SL) and hard labels (HL). Our analysis reveals that high-quality coresets fail to convincingly outperform the random baseline in both SL and SL+KD regimes. In the SL+KD setting, performance further approaches near-optimal levels relative to the full dataset, regardless of subset size or quality, for a given compute budget. This performance saturation calls into question the widespread practice of using soft labels for model evaluation, where unlike the HL setting, subset quality has negligible influence. A subsequent systematic evaluation of five large-scale and four small-scale DD methods in the HL setting reveals that only RDED \cite{sun2024diversityrealismdistilleddataset} reliably outperforms random baselines on ImageNet-1K, but can still lag behind strong coreset methods due to its over-reliance on easy sample patches. Based on this, we introduce CAD-Prune, a compute-aware pruning metric that efficiently identifies samples of optimal difficulty for a given compute budget, and use it to develop CA2D, a compute-aligned DD method, outperforming current DD methods on ImageNet-1K at various IPC settings. Together, our findings uncover many insights into current DD research and establish useful tools to advance data-efficient learning for both coresets and DD.
PaperID: 60,   Oral  https://arxiv.org/pdf/2604.08048    
Authors: Weijia Zhang, Yuehao Liu, Shanyan Guan, Wu Ran, Yanhao Ge, Wei Li, Chao Ma
Title: Guiding a Diffusion Model by Swapping Its Tokens
Abstract: Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling toward higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar tokens in either spatial or channel dimensions. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach modifies only selected tokens, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that our Self-Swap Guidance (SSG), when applied to state-of-the-art diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.
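Illustrative note (not from the paper): the guidance idea in the abstract, swapping a pair of dissimilar tokens and steering away from the resulting perturbed prediction, can be sketched on toy vectors. Everything below is an assumption made for illustration: cosine similarity as the dissimilarity measure, a single swapped pair, the CFG-style extrapolation `clean + scale * (clean - perturbed)`, and the helper names `most_dissimilar_pair`, `swap_tokens`, and `guide`. In an actual sampler the clean and perturbed predictions would come from running the diffusion model with and without the swap.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_dissimilar_pair(tokens):
    """Indices (i, j), i < j, of the two tokens with the lowest cosine similarity."""
    return min(
        combinations(range(len(tokens)), 2),
        key=lambda pair: cosine(tokens[pair[0]], tokens[pair[1]]),
    )


def swap_tokens(tokens, i, j):
    """Return a copy of the token list with positions i and j exchanged."""
    swapped = list(tokens)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


def guide(clean, perturbed, scale):
    """Extrapolate from the perturbed prediction past the clean one."""
    return [c + scale * (c - p) for c, p in zip(clean, perturbed)]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]  # toy token features
    i, j = most_dissimilar_pair(tokens)
    print(i, j, swap_tokens(tokens, i, j))
    # Toy stand-ins for the model's clean and perturbed predictions.
    print(guide([1.0, 2.0], [0.5, 2.5], scale=2.0))
```

With `scale = 0` the guided output equals the clean prediction, so the scale plays the same role as the guidance weight in CFG.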
PaperID: 61,   Oral  https://arxiv.org/pdf/2511.09715    
Authors: Arman Zarei, Samyadeep Basu, Mobina Pournemat, Sayan Nag, Ryan A. Rossi, Soheil Feizi
Title: SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control
Abstract: Instruction-based image editing models have recently achieved impressive performance, enabling complex edits to an input image from a multi-instruction prompt. However, these models apply each instruction in the prompt with a fixed strength, limiting the user’s ability to precisely and continuously control the intensity of individual edits. We introduce SliderEdit, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider, allowing smooth adjustment of its strength. Unlike prior works that introduced slider-based attribute controls in text-to-image generation, typically requiring separate training or fine-tuning for each attribute or concept, our method learns a single set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions. This enables continuous interpolation along individual edit dimensions while preserving both spatial locality and global semantic consistency. We apply SliderEdit to state-of-the-art editing models, including FLUX-Kontext and Qwen-Image-Edit, and observe substantial improvements in edit controllability, visual consistency, and user steerability. We are the first to explore and propose a framework for continuous, fine-grained instruction control in image editing models. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.
PaperID: 62,   Oral  https://arxiv.org/pdf/2512.01629    
Authors: Yumeng He, Ying Jiang, Jiayin Lu, Yin Yang, Chenfanfu Jiang
Title: SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge
Abstract: Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling.
PaperID: 63,   Oral  https://arxiv.org/pdf/2512.11016    
Authors: Haolin Yang, Jiayuan Rao, Haoning Wu, Weidi Xie
Title: SoccerMaster: A Vision Foundation Model for Soccer Understanding
Abstract: Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. However, prior works typically rely on task-specific expert models, which are resource-intensive and hinder a holistic view of the game. This paper aims to propose a unified framework that enables a single model to handle diverse soccer visual understanding tasks, spanning both fine-grained perception (e.g., athlete detection) and semantic reasoning (e.g., event classification). Concretely, we make the following contributions in this paper: (i) we present SoccerMaster, the first soccer-specific vision foundation model that unifies comprehensive understanding tasks within a single framework via supervised multi-task pretraining; (ii) we consolidate multiple existing soccer video datasets and develop an automated data curation pipeline, termed SoccerFactory, to produce scalable multi-task training annotations; and (iii) we conduct extensive experiments demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, underscoring its breadth and superiority. The data, code, and model will be publicly available to the research community.
PaperID: 64,   Oral  https://arxiv.org/pdf/2602.24208    
Authors: Yasaman Haghighi, Alex Alahi
Title: SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
Abstract: Diffusion models achieve state-of-the-art video generation but their many sequential denoising steps create a major computational bottleneck. Existing acceleration methods reuse cached model outputs at fixed timesteps chosen through heuristics, requiring heavy tuning and failing to adapt to each sample’s complexity. We address this with a principled, sensitivity-aware caching framework. We first formalize the caching problem by analyzing the network's output sensitivity with respect to changes in its inputs, namely the noisy latent and the timestep. We demonstrate that this sensitivity is the key indicator of caching error. Building on this insight, we introduce Sensitivity-Aware Caching (SenCache), a dynamic strategy that adaptively selects which timesteps to cache on a per-sample basis. This allows for less caching on challenging samples and more aggressive acceleration on simpler ones. Our method provides a robust theoretical grounding for adaptive caching, offering an explanation for why previous empirical criteria are partially effective and extending them with a dynamic, sample-specific approach. Experiments on Wan 2.1, CogVideoX and LTX-Video models demonstrate that our method outperforms existing caching strategies in visual quality under similar computational budgets.
PaperID: 65,   Oral  https://arxiv.org/pdf/2509.00269    
Authors: Maria Parelli, Michael Oechsle, Michael Niemeyer, Federico Tombari, Andreas Geiger
Title: 3D-LATTE: Latent Space 3D Editing from Textual Instructions
Abstract: Despite the recent success of multi-view diffusion models for text/image-based 3D asset generation, instruction-based editing of 3D assets lags surprisingly far behind the quality of generation models. The main reason is that recent approaches using 2D priors suffer from view-inconsistent editing signals. Going beyond 2D prior distillation methods and multi-view editing strategies, we propose a training-free editing method that operates within the latent space of a native 3D diffusion model, allowing us to directly manipulate 3D geometry. We guide the edit synthesis by blending 3D attention maps from the generation with the source object. Coupled with geometry-aware regularization guidance, a spectral modulation strategy in the Fourier domain, and a refinement step for 3D enhancement, our method outperforms previous 3D editing methods, enabling high-fidelity and precise edits across a wide range of shapes and semantic manipulations. Code will be publicly released.
PaperID: 66,   Oral  https://arxiv.org/pdf/2603.27176    
Authors: Woohyeon Park, Jaeik Kim, Sunghwan Cho, Pa Hong, Woo Kyoung Jeong, Yoojin Nam, Nam-Joon Kim, Ginny Wong, Ka Chun Cheung, Jaeyoung Do
Title: Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence
Abstract: Lesion detection, symptom tracking, and visual explainability are central to real-world medical image analysis, yet current medical Vision-Language Models (VLMs) still lack mechanisms that translate their broad knowledge into clinically actionable outputs. To bridge this gap, we present Medic-AD, a clinically oriented VLM that strengthens these three capabilities through a stage-wise framework. First, learnable anomaly-aware tokens (Ano) encourage the model to focus on abnormal regions and build more discriminative lesion-centered representations. Second, inter-image difference tokens (Diff) explicitly encode temporal changes between studies, allowing the model to distinguish worsening, improvement, and stability in disease burden. Finally, a dedicated explainability stage trains the model to generate heatmaps that highlight lesion-related regions, offering clear visual evidence that is consistent with the model's reasoning. Through our staged design, Medic-AD steadily boosts performance across anomaly detection, symptom tracking, and anomaly segmentation, achieving state-of-the-art results compared with both closed-source and medical-specialized baselines. Evaluations on longitudinal clinical data collected from real hospital workflows further show that Medic-AD delivers stable predictions and clinically faithful explanations in practical patient-monitoring and decision-support workflows.
PaperID: 67,   Oral  https://arxiv.org/pdf/2511.23369    
Authors: Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, junli wang, Yinfeng Gao, Zhang Zhang, Liang Wang, Hangjun Ye, Long Chen, Hongyang Li
Title: Learning to Drive via Real-World Simulation at Scale
Abstract: Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in real-world corpora collected by human experts. To compensate for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive numbers of these crucial unseen states from existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by ego trajectory perturbations. Furthermore, we develop a pseudo-expert trajectory generation mechanism to provide feasible action supervision for these newly simulated states. With the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +6.8 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly by increasing simulation data only, even without extra real-world data streaming in. We further reveal crucial findings of such a sim-real paradigm, including the design of pseudo-experts and the scaling properties for different policy architectures. Simulation data and code will be released.
PaperID: 68,   Oral  https://arxiv.org/pdf/2602.19213    
Authors: Yujie Lu, Jingwen Li, Sibo Ju, Yanzhou Su, He Yao, Yisong Liu, Min Zhu, Junlong Cheng
Title: SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation
Abstract: Medical image segmentation is vital for clinical diagnosis and quantitative analysis, yet remains challenging due to the heterogeneity of imaging modalities and the high cost of pixel-level annotations. Although general interactive segmentation models like SAM have achieved remarkable progress, their transfer to medical imaging still faces two key bottlenecks: (i) the lack of adaptive mechanisms for modality- and anatomy-specific tasks, which limits generalization in out-of-distribution medical scenarios; and (ii) current medical adaptation methods fine-tune on large, heterogeneous datasets without selection, leading to noisy supervision, higher cost, and negative transfer. To address these issues, we propose SegMoTE, an efficient and adaptive framework for medical image segmentation. SegMoTE preserves SAM’s original prompt interface, efficient inference, and zero-shot generalization while introducing only a small number of learnable parameters to dynamically adapt across modalities and tasks. In addition, we design a progressive prompt tokenization mechanism that enables fully automatic segmentation, significantly reducing annotation dependence. Trained on MedSeg-HQ, a curated dataset less than 1% the size of existing large-scale datasets, SegMoTE achieves SOTA performance across diverse imaging modalities and anatomical tasks. It represents the first efficient, robust, and scalable adaptation of general segmentation models to the medical domain under extremely low annotation cost, advancing the practical deployment of foundation vision models in clinical applications.
PaperID: 69,   Oral  https://arxiv.org/pdf/2506.02387    
Authors: Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Mo Guang, Kaiwen Long, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang
Title: VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments
Abstract: Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and textual contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic abilities in multi-agent environments. VS-Bench comprises ten vision-grounded environments that cover cooperative, competitive, and mixed-motive interactions. The performance of VLM agents is evaluated across three dimensions: perception measured by element recognition accuracy; strategic reasoning measured by next-action prediction accuracy; and decision-making measured by normalized episode return. Extensive experiments on fifteen leading VLMs show that, although current models exhibit strong perception abilities, there remains a significant gap to optimal performance in reasoning and decision-making, with the best-performing model attaining 46.6% prediction accuracy and 31.4% normalized return. We further analyze the key factors influencing performance, conduct human studies, and examine failure modes to provide a deeper understanding of VLMs' strategic abilities. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents.
PaperID: 70,   Oral  https://arxiv.org/pdf/2601.09265    
Authors: Bei Huang, Yixin Chen, Ruijie Lu, Gang Zeng, Hongbin Zha, Yuru Pei, Siyuan Huang
Title: GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a prominent 3D representation for high-fidelity and real-time rendering. Prior work has coupled physics simulation with Gaussians, but predominantly targets soft, deformable materials, leaving brittle fracture largely unresolved. This stems from two key obstacles: the lack of volumetric interiors with coherent textures in the GS representation, and the absence of fracture-aware simulation methods for Gaussians. To address these challenges, we introduce GaussianFluent, a unified framework for realistic simulation and rendering of dynamic object states. First, it synthesizes photorealistic interiors by densifying internal Gaussians guided by generative models. Second, it integrates an optimized Continuum Damage Material Point Method (CD-MPM) to enable brittle fracture simulation at remarkably high speed. Our approach handles complex scenarios including mixed-material objects and multi-stage fracture propagation, achieving results infeasible with previous methods. Experiments clearly demonstrate GaussianFluent's capability for photorealistic, real-time rendering with structurally consistent interiors, highlighting its potential for downstream applications such as VR and robotics.
PaperID: 71,   Oral  https://arxiv.org/pdf/2511.04029    
Authors: Yihao Luo, Xianglong He, Chuanyu Pan, Yiwen Chen, Jiaqi Wu, Yangguang Li, Wanli Ouyang, Yuanming Hu, Guang Yang, Choon Hwai Yap
Title: FAITHFUL CONTOURING: NEAR-LOSSLESS 3D VOXEL REPRESENTATION FREE FROM ISO-SURFACE
Abstract: Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surfaces heavily rely on water-tightening or rendering optimization, which inevitably compromise geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the iso-surface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiments show that Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction. For direct representation, it achieves distance errors at the 10^-5 level; for mesh reconstruction, it yields a 93% reduction in Chamfer Distance and a 35% improvement in F-score over strong baselines, confirming superior fidelity as a representation for 3D learning tasks.
PaperID: 72,   Oral  https://arxiv.org/pdf/2604.19420    
Authors: Jaroslav Moravec, Radim Sara, Akihiro Sugimoto
Title: TESO: Online Tracking of Essential Matrix by Stochastic Optimization
Abstract: Reliable perception of autonomous systems relies on fusion of data from multiple sensors, which requires maintaining accurate geometric calibration during operation. This work aims to track the drift of the calibration parameters caused by mechanical stress, thermal effects, or minor accidents. We focus on five parameters of the essential matrix and propose TESO, whose core mechanisms are: 1) a robust loss function based on kernel correlation over tentative correspondences instead of robust matching and estimators, and 2) an adaptive online stochastic optimization on the essential manifold. Both contribute to reduced CPU and memory requirements. TESO relies on a few hyperparameters and eliminates the need for data-driven training, enabling use in resource-constrained online perception systems. We evaluated TESO based on the geometric precision of the tracked extrinsic parameters, the rectification quality, and the stereo depth consistency with respect to a 3D LiDAR. On the large-scale MAN TruckScenes dataset, TESO tracks drift with 0.12° precision in the rotation around Y, which is critical for stereo accuracy, while the other two rotation angles are tracked with five times better precision. Sequences with simulated drift are tracked with similar precision as the no-drift ones, suggesting that the tracker is unbiased. Applied to the KITTI dataset, TESO reported systematic inconsistencies in extrinsic parameters across all stereo pairs, confirming observations made by other authors. We verify that these errors were partly caused by intrinsic decalibration, which manifested in the contradictory performance of two metrics: the epipolar error and the depth estimation accuracy. With corrected calibration parameters, TESO improved its rotation precision around the hardest Y-axis by approximately twentyfold, reaching 0.025°. In the depth estimation, there was a fiftyfold improvement. Despite its lightweight nature, we show that the combination of SIFT features and the proposed TESO loss function achieves accuracy comparable to published single-frame methods that rely on neural network models.
PaperID: 73,   Oral  https://arxiv.org/pdf/2601.02427    
Authors: Loïc Magne, Anas Awadalla, Guanzhi Wang, Yinzhen Xu, Joshua Belofsky, Fengyuan Hu, Joohwan Kim, Ludwig Schmidt, Georgia Gkioxari, Jan Kautz, Yisong Yue, Yejin Choi, Yuke Zhu, Linxi Fan
Title: NitroGen: An Open Foundation Model for Generalist Gaming Agents
Abstract: We introduce NitroGen, a video-action foundation model for generalist gaming agents, trained on 40,000 hours of gameplay videos across more than 1000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in success rates over models trained from scratch. We release the dataset, benchmark, and model weights to advance research on generalist embodied agents.
PaperID: 74,   Oral  https://arxiv.org/pdf/2512.17817    
Authors: Yue Li, Qi Ma, Runyi Yang, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Theo Gevers, Luc Van Gool, Danda Paudel, Martin R. Oswald
Title: Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
Abstract: While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only Gaussians’ centers, colors, and estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point-cloud baseline while using 39.9× fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.
PaperID: 75,   Oral  https://arxiv.org/pdf/2602.17535    
Authors: Behzad Bozorgtabar, Dwarikanath Mahapatra, Sudipta Roy, Muzammal Naseer, Imran Razzak, Zongyuan Ge
Title: LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
Abstract: Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced (a high class-conditioned coverage gap, CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image–image kNN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once. Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
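Note: for readers unfamiliar with split conformal prediction (SCP), which the abstract above builds on, the following is a minimal sketch of the standard textbook procedure only. It is not LATA itself: the graph smoothing, the failure-aware score, and the ViLU integration are not shown. The function names and the choice of score (one minus the probability of the true class) are assumptions made for illustration.

```python
import math
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Return the SCP score threshold for target coverage 1 - alpha.

    cal_probs:  (n, K) array of predicted class probabilities on calibration data.
    cal_labels: (n,) array of integer true labels.
    The nonconformity score is 1 - p(true class), an illustrative choice.
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every class must be included.
        return float("inf")
    return float(np.sort(scores)[k - 1])

def prediction_sets(test_probs, qhat):
    """Boolean (m, K) mask: class k is in the set when its score is at most qhat."""
    return (1.0 - np.asarray(test_probs, dtype=float)) <= qhat
```

Smaller thresholds give smaller prediction sets, which is the efficiency the abstract refers to; the coverage guarantee relies on calibration and test points being exchangeable.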
PaperID: 76,   Oral  https://arxiv.org/pdf/2602.20630    
Authors: yepeng liu, Hao Li, Liwen Yang, Fangzhen Li, Xudi Ge, Yuliang Gu, kuang Gao, Bing Wang, Guang Chen, Hangjun Ye, Yongchao Xu
Title: From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
Abstract: Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the Track-quality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms some state-of-the-art keypoint detection and description methods.
PaperID: 77,   Oral  https://arxiv.org/pdf/2512.02715    
Authors: Peirong Zhang, Yidan Zhang, Luxiao Xu, Jinliang Lin, Zonghao Guo, Fengxiang Wang, Xue Yang, Kaiwen Wei, Lei Wang
Title: GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding
Abstract: Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects. To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness. Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.
PaperID: 78,   Oral  https://arxiv.org/pdf/2503.20184    
Authors: M. Kerem Aydin, Yi-Chun Hung, Jaclyn Pytlarz, Qi Guo, Emma Alexander
Title: Spectrum from Defocus: Fast Spectral Imaging with Chromatic Focal Stack
Abstract: Hyperspectral cameras rely on spectral filters, dispersive optics, or coded apertures, which reduce light throughput and increase hardware complexity. These systems face harsh trade-offs between spatial, spectral, and temporal resolution in inherently low-photon conditions. Computational imaging systems break through these trade-offs with compressive sensing, but have typically required complex optics and/or extensive computation. We present Spectrum from Defocus (SfD), a chromatic focal sweep method that achieves state-of-the-art hyperspectral imaging using only two off-the-shelf lenses, a grayscale sensor, and less than one second of reconstruction time. By capturing a chromatically aberrated focal stack that preserves nearly all incident light, and reconstructing it with a fast physics-based iterative algorithm, SfD delivers sharp, accurate hyperspectral images. The combination of photon efficiency, optical simplicity, and physical interpretability makes SfD a promising solution for fast, compact, and interpretable hyperspectral imaging.
PaperID: 79,   Oral  https://arxiv.org/pdf/2511.10979    
Authors: Bowen Sun, Yujun Cai, Ming-Hsuan Yang, Hang Wu, Yiwei Wang
Title: PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs
Abstract: Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames. We trace this instability to the common extension of Rotary Position Embeddings to video through multimodal RoPE. The induced inverse Fourier time kernel exhibits frame-scale ripples that multiply adjacent frames by different factors, which perturbs attention that should otherwise be governed by the raw query-key inner product. We present Phase Aggregated Smoothing (PAS), a simple, training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs. PAS preserves the per-head spectrum magnitude, while the aggregation effectively smooths the temporal kernel and reduces phase sensitivity without changing the positional encoding structure. Our analysis shows that the RoPE-rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts; and multi-phase averaging attenuates high-frequency ripples while preserving per-head spectra under Nyquist-valid sampling. Experiments on multiple video understanding benchmarks under matched token budgets show consistent improvements with negligible computational overhead. PAS provides a plug-and-play upgrade for robust temporal encoding in Video LLMs.
PaperID: 80,   Oral  https://arxiv.org/pdf/2511.02483    
Authors: Xilong Zhou, Jianchun Chen, Pramod Rao, Timo Teufel, Linjie Lyu, Tigran Minasian, Oleksandr Sotnychenko, Xiao-Xiao Long, Marc Habermann, Christian Theobalt
Title: OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
Abstract: We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis and relighting have shown promising results, most techniques still heavily rely on synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. We believe that OLATverse represents a pivotal step toward integrating the next generation of inverse rendering and relighting methods with real-world data.
PaperID: 81,   Oral  https://arxiv.org/pdf/2512.09373    
Authors: Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng
Title: FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement
Abstract: Registration of multiview point clouds typically depends on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and ill-posed without holistic geometric constraints. In this paper, we propose FUSER, the first feed-forward multi-view registration transformer that processes all scans jointly in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER employs a sparse 3D CNN to encode each scan into low-resolution superpoint features preserving absolute translation cues, followed by a Geometric Alternating Attention module for efficient intra- and inter-scan reasoning. In particular, we transfer 2D attention priors from off-the-shelf foundation models (i.e., $\pi^3$) to enhance 3D feature attention. Building upon FUSER and its estimates, we further introduce FUSER-DF, an SE(3)$^N$ diffusion refinement framework to correct FUSER's estimates through a denoising process over the joint SE(3)$^N$ space. Here, FUSER serves as a surrogate multiview register to model the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch and ScanNet confirm the superior registration accuracy and efficiency of our method.
PaperID: 82,   Oral  https://arxiv.org/pdf/2603.03907    
Authors: Zhichao Yang, Jianjie Wang, Zhixianhe Zhang, Pangu Xie, Xiangfei Sheng, Pengfei Chen, Leida Li
Title: Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks
Abstract: Image aesthetic assessment (IAA) has extensive applications in content creation, album management, and recommendation systems. In such applications, it is commonly necessary to pick out the most aesthetically pleasing image from a series of images with subtle aesthetic variations, a topic we refer to as fine-grained IAA. Unfortunately, state-of-the-art IAA models are typically designed for coarse-grained evaluation, where images with notable aesthetic differences are evaluated independently on an absolute scale. These models are inherently limited in discriminating fine-grained aesthetic differences. To address this dilemma, we contribute FGAesthetics, a fine-grained IAA database with 32,217 images organized into 10,028 series, which are sourced from diverse categories including Natural, AIGC, and Cropping. Annotations are collected via pairwise comparisons within each series. We also devise Series Refinement and Rank Calibration to ensure the reliability of data and labels. Based on FGAesthetics, we further propose FGAesQ, a novel IAA framework that learns discriminative aesthetic scores from relative ranks through Difference-preserved Tokenization (DiffToken), Comparative Text-assisted Alignment (CTAlign), and Rank-aware Regression (RankReg). FGAesQ enables accurate aesthetic assessment in fine-grained scenarios while still maintaining competitive performance in coarse-grained evaluation. Extensive experiments and comparisons demonstrate the superiority of the proposed method. Data and model will be made publicly available.
PaperID: 83,   Oral  https://arxiv.org/pdf/2603.03711    
Authors: Yuanming Cao, Chengqi Li, Wenbo He
Title: LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing
Abstract: Local Differential Privacy (LDP) is the gold standard trust model for privacy-preserving machine learning, guaranteeing privacy at the data source. However, its application to image data has long been considered impractical due to the high dimensionality of pixel space. Canonical LDP mechanisms are designed for low-dimensional data, resulting in severe utility degradation when applied to high-dimensional pixel spaces. This paper demonstrates that this utility loss is not inherent to LDP, but arises from its application to an inappropriate data representation. We introduce LDP-Slicing, a lightweight, training-free framework that resolves this domain mismatch. Our key insight is to decompose pixel values into a sequence of binary bit-planes. This transformation allows us to apply the LDP mechanism directly to the bit-level representation. To further strengthen privacy and preserve utility, we integrate a perceptual obfuscation module that mitigates human-perceivable leakage and an optimization-based privacy budget allocation strategy. This pipeline satisfies rigorous pixel-level $\varepsilon$-LDP while producing images that retain high utility for downstream tasks. Extensive experiments on face recognition and image classification demonstrate that LDP-Slicing outperforms existing DP/LDP baselines under comparable privacy budgets, with negligible computational overhead.
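Note: the bit-plane idea in the abstract above can be illustrated with a short sketch. It assumes 8-bit pixels and uses classic binary randomized response as a stand-in for the LDP mechanism, with a per-plane budget supplied by the caller. The abstract does not specify which mechanism the authors use, and the perceptual obfuscation module and optimized budget allocation are not shown. All function names here are hypothetical.

```python
import numpy as np

def to_bit_planes(img):
    """Split a uint8 image into 8 binary planes, most significant bit first."""
    img = np.asarray(img, dtype=np.uint8).astype(np.int64)
    shifts = np.arange(7, -1, -1).reshape((8,) + (1,) * img.ndim)
    return ((img[None, ...] >> shifts) & 1).astype(np.uint8)

def from_bit_planes(planes):
    """Recombine 8 binary planes (MSB first) into a uint8 image."""
    weights = (1 << np.arange(7, -1, -1)).reshape((8,) + (1,) * (planes.ndim - 1))
    return (planes.astype(np.int64) * weights).sum(axis=0).astype(np.uint8)

def randomized_response(bits, eps, rng):
    """Binary randomized response: keep each bit with probability e^eps / (1 + e^eps)."""
    keep_prob = np.exp(eps) / (1.0 + np.exp(eps))
    keep = rng.random(bits.shape) < keep_prob
    return np.where(keep, bits, 1 - bits).astype(np.uint8)

def privatize(img, eps_per_plane, rng):
    """Perturb each bit-plane with its own budget, then rebuild the image.

    By sequential composition, each pixel's total budget is sum(eps_per_plane).
    """
    planes = to_bit_planes(img)
    noisy = np.stack([
        randomized_response(plane, eps, rng)
        for plane, eps in zip(planes, eps_per_plane)
    ])
    return from_bit_planes(noisy)
```

A caller might pass `rng = np.random.default_rng(0)` and a budget list that spends more on the high-order planes, since flipping those bits changes the pixel value the most; how to choose that split is exactly the allocation problem the abstract mentions.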
PaperID: 84,   Oral  https://arxiv.org/pdf/2506.07565    
Authors: Jinlu Zhang, Zixi Kang, Libin Liu, Jianlong Chang, Qi Tian, Feng Gao, Yizhou Wang
Title: OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data
Abstract: Music-driven 3D dance generation offers significant creative potential, yet practical applications demand versatile and multimodal control. Because dance is a highly dynamic and complex form of human motion covering various styles and genres, dance generation requires satisfying diverse conditions beyond just music (e.g., spatial trajectories, keyframe gestures, or style descriptions). However, the absence of a large-scale and richly annotated dataset severely hinders progress. In this paper, we build OpenDanceSet, an extensive human dance dataset comprising over 100 hours across 14 genres and 147 subjects. Each sample has rich annotations to facilitate robust cross-modal learning: 3D motion, paired music, 2D keypoints, trajectories, and expert-annotated text descriptions. Furthermore, we propose OpenDanceNet, a unified masked modeling framework for controllable dance generation, including a disentangled auto-encoder and a multimodal joint-prediction Transformer. OpenDanceNet supports generation conditioned on music and arbitrary combinations of text, keypoints, or trajectories. Comprehensive experiments demonstrate that our work achieves high-fidelity synthesis with strong diversity and realistic physical contacts, while also offering flexible control over spatial and stylistic conditions.
PaperID: 85,   Oral  https://arxiv.org/pdf/2503.08703    
Authors: Yimeng Shan, Zhenbang Ren, Haodi Wu, Wenjie Wei, Rui-Jie Zhu, Shuai Wang, Dehao Zhang, Yichen Xiao, Jieyuan Zhang, Kexin Shi, Jingzhinan Wang, Jason Eshraghian, Haicheng Qu, Malu Zhang
Title: SDTrack: A Baseline for Event-based Tracking via Spiking Neural Networks
Abstract: Event cameras provide superior temporal resolution, dynamic range, energy efficiency, and pixel bandwidth. Spiking Neural Networks (SNNs) naturally complement event data through discrete spike signals, making them ideal for event-based tracking. However, current approaches combining Artificial Neural Networks (ANNs) and SNNs suffer from suboptimal architectures that compromise energy efficiency and limit tracking performance. To address these limitations, we propose the first Transformer-based Spike-Driven Tracking (SDTrack) pipeline. It incorporates a novel event frame aggregation method called Global Trajectory Prompt (GTP) and a Transformer-based tracker. The GTP method effectively captures global trajectory information and aggregates it with event streams into event frames to enhance spatiotemporal representation. The Transformer-based tracker comprises a fully spike-driven SNN backbone and a simple tracking head. The SDTrack pipeline operates end-to-end without data augmentation or post-processing. Extensive experiments demonstrate that our SDTrack-Tiny pipeline achieves competitive accuracy with only 19.61M parameters and 8.16 mJ energy consumption, while our Base version achieves state-of-the-art accuracy across three datasets. Our work establishes a solid foundation for future neuromorphic vision research.
PaperID: 86,   Oral  https://arxiv.org/pdf/2604.12113    
Authors: Minjae Lee, Sungwoo Hur, Soojin Hwang, Won Hwa Kim
Title: PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation
Abstract: Visual Foundation Models (VFMs) such as the Segment Anything Model (SAM) have significantly advanced broad use of image segmentation. However, SAM and its variants necessitate substantial manual effort for prompt generation and additional training for specific applications. Recent approaches address these limitations by integrating SAM into in-context (one/few-shot) segmentation, enabling auto-prompting through semantic alignment between query and support images. Despite these efforts, they still generate sub-optimal prompts that degrade segmentation quality due to visual inconsistencies between support and query images. To tackle this limitation, we introduce PR-MaGIC (Prompt Refinement via Mask Decoder Gradient Flow for In-Context Segmentation), a training-free test-time framework that refines prompts via gradient flow derived from SAM’s mask decoder. PR-MaGIC seamlessly integrates into in-context segmentation frameworks, being theoretically grounded yet practically stabilized through a simple top-1 selection strategy that ensures robust performance across samples. Extensive evaluations demonstrate that PR-MaGIC consistently improves segmentation quality across various benchmarks, effectively mitigating inadequate prompts without requiring additional training or architectural modifications.
PaperID: 87,   Oral  https://arxiv.org/pdf/2512.22501    
Authors: Edwin Vargas, Jhon Lopez, Henry Arguello, Ashok Veeraraghavan
Title: NOWA: Null-space Optical Watermark for Invisible Capture Fingerprinting and Tamper Localization
Abstract: Ensuring the authenticity and ownership of digital images is increasingly challenging as modern editing tools enable highly realistic forgeries. Existing image protection systems mainly rely on digital watermarking, which is susceptible to sophisticated digital attacks. To address this limitation, we propose a hybrid optical-digital framework that incorporates physical authentication cues during image formation and preserves them through a learned reconstruction process. At the optical level, a phase mask in the camera aperture produces a Null-space Optical Watermark (NOWA) that lies in the null space of the imaging operator and therefore remains invisible in the captured image. Then, a Null-Space Network (NSN) performs measurement-consistent reconstruction that delivers high-quality protected images while preserving the NOWA signature. The proposed design enables tamper localization by projecting the image onto the camera's null space and detecting pixel-level inconsistencies. Our design preserves perceptual quality, resists common degradations such as compression, and establishes a structural security asymmetry: without access to the optical or NSN parameters, adversaries cannot forge the NOWA signature. Experiments with simulations and a prototype camera demonstrate competitive performance in terms of image quality preservation and tamper localization accuracy compared to state-of-the-art digital watermarking and learning-based authentication methods.
PaperID: 88,   Oral  https://arxiv.org/pdf/2603.01205    
Authors: Li Jin, Weikai Chen, Yujie Wang, Yingda Yin, Zeyu HU, Runze Zhang, Keyang Luo, Shengju Qian, Xin Wang, Xueying Qin
Title: CoSMo3D: Open-World Promptable 3D Semantic Segmentation through LLM-Guided Canonical Spatial Modeling
Abstract: Open-world promptable 3D semantic segmentation remains brittle because semantics are inferred in the input sensor coordinates. Humans, in contrast, interpret parts via functional roles in a canonical space: wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To fill this gap, we propose CoSMo3D, which attains canonical space perception by inducing a latent canonical reference frame learned directly from data. By construction, we create a unified canonical dataset through LLM-guided intra- and cross-category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual-branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from input pose space to canonical representation yields far more stable and transferable part semantics. Experimental results show that CoSMo3D establishes a new state of the art in open-world promptable 3D segmentation.
PaperID: 89,   Oral  https://arxiv.org/pdf/2512.05745    
Authors: Weikai Lu, Ziqian Zeng, Kehua Zhang, Haoran Li, Huiping Zhuang, Ruidong Wang, Cen Chen, Hao Peng
Title: ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior
Abstract: Multimodal Large Language Models (MLLMs) are increasingly vulnerable to multimodal Indirect Prompt Injection (IPI) attacks, which embed malicious instructions in images, videos, or audio to hijack model behavior. Existing defenses, designed primarily for text-only LLMs, are unsuitable for countering these multimodal threats, as they are easily bypassed, modality-dependent, or generalize poorly. Inspired by activation steering research, we hypothesize that a robust, general defense independent of modality can be achieved by steering the model's behavior in the representation space. Through extensive experiments, we discover that the instruction-following behavior of MLLMs is encoded in a subspace. Steering along directions within this subspace can enforce adherence to user instructions, forming the basis of a defense. However, we also find that a naive defense direction can be coupled with a utility-degrading direction, and that excessive intervention strength harms model performance. To address this, we propose ARGUS, which searches for an optimal defense direction within the safety subspace that decouples from the utility degradation direction, further combining adaptive strength steering to achieve a better safety-utility trade-off. ARGUS also introduces a lightweight injection detection stage to activate the defense on demand, and a post-filtering stage to verify defense success. Experimental results show that ARGUS can achieve robust defense against multimodal IPI while maximally preserving the MLLM's utility.
PaperID: 90,   Oral  https://arxiv.org/pdf/2511.18200    
Authors: Haoming Wang, Qiyao Xue, Wei Gao
Title: InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity
Abstract: Modern vision-language models (VLMs) are expected to have abilities of spatial reasoning with diverse scene complexities, but evaluating such abilities is difficult due to the lack of benchmarks that are not only diverse and scalable but also fully customizable. Existing benchmarks offer limited customizability over the scene complexity and are incapable of isolating and analyzing specific VLM failure modes under distinct spatial conditions. To address this gap, instead of individually presenting benchmarks for different scene complexities, in this paper we present InfiniBench, a fully automated, customizable and user-friendly benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control on scene complexity. InfiniBench uniquely translates scene descriptions in natural language into photo-realistic videos with complex and physically plausible 3D layouts. This is achieved through three key innovations: 1) an LLM-based agentic framework that iteratively refines procedural scene constraints from scene descriptions; 2) a flexible cluster-based layout optimizer that generates dense and cluttered scenes previously intractable for procedural methods; and 3) a task-aware camera trajectory optimization method that renders scenes into videos with full object coverage as VLM input. Experiments demonstrate that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. We further showcase the usefulness of InfiniBench by generating benchmarks for representative spatial reasoning tasks including measurement, perspective-taking and spatiotemporal tracking.
PaperID: 91,   Oral  https://arxiv.org/pdf/2603.17684    
Authors: Xingxing Xie, Jiahua Dong, Junwei Han, Gong Cheng
Title: Does YOLO Really Need to See Every Training Image in Every Epoch?
Abstract: YOLO detectors are known for their fast inference speed, yet training them remains unexpectedly time-consuming due to their exhaustive pipeline that processes every training image in every epoch, even when many images have already been sufficiently learned. This stands in clear contrast to the efficiency suggested by the “You Only Look Once” philosophy. This naturally raises an important question: Does YOLO really need to see every training image in every epoch? To explore this, we propose an Anti-Forgetting Sampling Strategy (AFSS) that dynamically determines which images should be used and which can be skipped during each epoch, allowing the detector to learn more effectively and efficiently. Specifically, AFSS measures the learning sufficiency of each training image as the minimum of its detection recall and precision, and dynamically categorizes training images into easy, medium, or hard levels accordingly. Easy training images are sparsely resampled during training in a continuous review manner, with priority given to those that have not been used for a long time to reduce redundancy and prevent forgetting. Medium training images are partially selected, with priority given to recently unused ones and the remaining quota filled randomly to ensure short-term coverage and prevent forgetting. Hard training images are fully sampled in every epoch to ensure sufficient learning. The learning sufficiency of each training image is periodically updated, enabling the detector to adaptively shift its focus toward the informative training images over time while progressively discarding redundant ones. On widely used natural image detection benchmarks (MS COCO 2017 and PASCAL VOC 2007) and remote sensing detection datasets (DOTA-v1.0 and DIOR-R), AFSS achieves more than 1.43× training speedup for YOLO-series detectors (e.g., YOLOv8, YOLOv10, YOLO11, YOLO12) while also improving detection accuracy.
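The sampling rule described in this abstract can be sketched roughly as follows. Only the min(recall, precision) score and the three-level policy come from the abstract; the thresholds, the quota fractions, the half-and-half split for medium images, and all function names are assumptions made for illustration.

```python
import math
import random

def learning_sufficiency(precision, recall):
    """Per-image score: the minimum of detection precision and recall."""
    return min(precision, recall)

def difficulty(score, easy_thr=0.85, hard_thr=0.55):
    """Bucket an image by its score (both thresholds are hypothetical)."""
    if score >= easy_thr:
        return "easy"
    if score < hard_thr:
        return "hard"
    return "medium"

def stalest(ids, last_used, k):
    """Return the k ids whose last use lies furthest in the past."""
    return sorted(ids, key=lambda i: last_used[i])[:k]

def select_for_epoch(scores, last_used, epoch, easy_frac=0.1, medium_frac=0.5, seed=0):
    """Pick the image ids to train on in this epoch and record their use."""
    rng = random.Random(seed + epoch)
    groups = {"easy": [], "medium": [], "hard": []}
    for img_id, score in scores.items():
        groups[difficulty(score)].append(img_id)

    chosen = list(groups["hard"])  # hard images are used in every epoch

    # Easy images: a small quota, longest-unused first.
    easy_quota = math.ceil(easy_frac * len(groups["easy"]))
    chosen += stalest(groups["easy"], last_used, easy_quota)

    # Medium images: part of the quota by staleness, the rest at random (assumed split).
    med_quota = math.ceil(medium_frac * len(groups["medium"]))
    priority = stalest(groups["medium"], last_used, med_quota // 2)
    rest = [i for i in groups["medium"] if i not in priority]
    chosen += priority + rng.sample(rest, med_quota - len(priority))

    for i in chosen:
        last_used[i] = epoch
    return chosen
```

In a training loop, the scores would be refreshed periodically from evaluation on the training images, so an image can move between levels as the detector improves.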
PaperID: 92,   Oral  https://arxiv.org/pdf/2511.19661    
Authors: Xinhai Hou, Shaoyuan Xu, Manan Biyani, Moyan Li, Jia Liu, Todd C. Hollon, Bryan Wang
Title: CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
Abstract: Agentic vision–language models are increasingly trained to “think with images” by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool-use on visual search benchmarks. We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought tokens, making supervision easier to verify and less susceptible to reward hacking. CodeV represents visual tools as executable Python code, and TAPO assigns step-wise rewards based solely on the question and tool output, encouraging both necessary and evidence-consistent tool use. In a two-stage SFT+RL pipeline, CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on related visual search benchmarks. Beyond visual search, CodeV attains strong performance on a range of multimodal reasoning and math benchmarks, suggesting that explicitly supervising intermediate tool behavior is crucial for building trustworthy, agentic visual reasoning systems.
PaperID: 93,   Oral  https://arxiv.org/pdf/2512.00255    
Authors: Kunwar Maheep Singh, Jianchun Chen, Vladislav Golyanik, Stephan J. Garbin, Thabo Beeler, Rishabh Dabral, Marc Habermann, Christian Theobalt
Title: Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views
Abstract: We present Relightable Holoported Characters (RHC), a novel person-specific method for free-view rendering and relighting of full-body and highly dynamic humans solely observed from sparse-view RGB videos at inference. In contrast to classical one-light-at-a-time (OLAT)-based human relighting, our transformer-based RelightNet predicts relit appearance within a single network pass, avoiding costly OLAT-basis capture and generation. For training such a model, we introduce a new capture strategy and dataset recorded in a multi-view light stage, where we alternate frames lit by random environment maps with uniformly lit tracking frames, simultaneously enabling accurate motion tracking and diverse illumination as well as dynamics coverage. Inspired by the rendering equation, we derive physics-informed features that encode geometry, albedo, shading, and the virtual camera view from a coarse human mesh proxy and the input views. Our RelightNet then takes these features as input and cross-attends them with a novel lighting condition, and regresses the relit appearance in the form of texel-aligned 3D Gaussian splats attached to the coarse mesh proxy. Consequently, our RelightNet implicitly learns to efficiently compute the rendering equation for novel lighting conditions within a single feed-forward pass. Experiments demonstrate our method’s superior visual fidelity and lighting reproduction compared to state-of-the-art approaches.
PaperID: 94,   Oral  https://arxiv.org/pdf/2601.10611    
Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Rohun Tripathi, Sangho Lee, Reza Salehi, Jason Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Ali Farhadi, Ranjay Krishna
Title: Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Abstract: Today’s strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding, either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art amongst open-source models and demonstrate exceptional new capabilities in point-driven grounding in single-image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 outperforms larger proprietary models, including 32.9% (Molmo2) vs 17% (Gemini 2.5 Pro) on video pointing.
PaperID: 95,   Oral  https://arxiv.org/pdf/2512.20770    
Authors: Markus Gross, Sai Bharadhwaj Matha, Aya Fahmy, Rui Song, Daniel Cremers, Henri Meeß
Title: OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective
Abstract: Semantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR-based point clouds from elevated viewpoints. To address these limitations, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured at altitudes of 50m, 40m, and 30m during spring, summer, fall, and winter. OccuFly covers urban, industrial, and rural scenarios, provides 22 semantic classes, and the data format adheres to established conventions to facilitate seamless integration with existing research. Crucially, we propose a LiDAR-free data generation framework that is based on camera modality, which is ubiquitous on modern UAVs. By utilizing traditional 3D reconstruction, our framework automates label transfer by projecting annotated 2D masks into the reconstructed 3D point cloud, thereby minimizing manual 3D annotation effort. Finally, we benchmark several state-of-the-art SSC methods on OccuFly using standard metrics, and highlight challenges specific to aerial viewpoints, yielding a comprehensive aerial vision benchmark that fosters holistic aerial 3D scene understanding.
PaperID: 96,   Oral  https://arxiv.org/pdf/2511.15622    
Authors: Dante Wasmuht, Otto Brookes, Maximilian Schall, Pablo Palencia, Christopher Beirne, Tilo Burghardt, Majid Mirmehdi, Hjalmar Kühl, Mimi Arandjelovic, Sam Pottie, Peter Bermant, Brandon Asheim, Yi Toh, Adam Elzinga, Jason Allan Holmberg, Andrew Whitworth, Eleanor Flatt, Laura Gustafson, Chaitanya Ryali, Yuan-Ting Hu, Baishan Guo, Andrew Westbury, Kate Saenko, Dídac Surís
Title: The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification
Abstract: Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity, leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated, culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at [ANONYMIZED]
PaperID: 97,   Oral  https://arxiv.org/pdf/2508.04728    
Authors: Shuo Chen, Yijin Li, Xi Zheng, Guofeng Zhang
Title: Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy
Abstract: The 3D characterization of microstructures is crucial for understanding and designing functional materials. However, the scanning electron microscope (SEM), widely used in scientific research, captures only 2D electron intensity distributions. Existing SEM 3D reconstruction methods struggle with textureless regions, shadowing artifacts, and calibration dependencies, whereas advanced learning-based approaches fail to generalize to microscopic SEM domains due to the lack of physical priors and domain-specific data. To address these challenges, we introduce NFH-SEM, a neural field-based hybrid reconstruction framework that recovers high-fidelity 3D surfaces from multi-view, multi-detector SEM images. NFH-SEM integrates coarse multi-view geometry with photometric stereo cues from detector signals through a continuous neural field, incorporating a learnable forward model that embeds SEM imaging physics for self-calibrated, shadow-robust reconstruction. NFH-SEM achieves precise recovery across diverse specimens, revealing 478 nm layered features in two-photon lithography samples, 782 nm surface textures on pollen grains, and 1.559 μm fracture steps on silicon carbide particles, demonstrating its accuracy and broad applicability.
PaperID: 98,   Oral  https://arxiv.org/pdf/2510.10113    
Authors: Yuxi Mi, Qiuyang Yuan, Zhizhou Zhong, Xuan Zhao, Jiaogen Zhou, Fubao Zhu, Jihong Guan, Shuigeng Zhou
Title: ImmerIris: A Large-Scale Dataset and Benchmark for Off-Axis and Unconstrained Iris Recognition in Immersive Applications
Abstract: Iris recognition has recently been regaining prominence in immersive applications such as extended reality as a means of seamless user identification. This application scenario introduces unique challenges compared to traditional iris recognition under controlled setups, as the ocular images are primarily captured off-axis and under fewer constraints, causing perspective distortion, intra-subject variation, and quality degradation in iris textures. Datasets capturing these challenges remain limited. This paper fills this gap by presenting a large-scale iris dataset collected via head-mounted displays, termed ImmerIris. It contains 499,791 ocular images from 564 subjects, and is, to our knowledge, the largest public iris dataset to date and among the first dedicated to immersive applications. It is accompanied by a comprehensive set of evaluation protocols that benchmark recognition systems under various challenging conditions. This paper also draws attention to a shared obstacle of current recognition methods: the reliance on a pre-processing normalization stage, which is fallible in off-axis and unconstrained setups. To this end, this paper further proposes a normalization-free paradigm that directly learns from minimally adjusted ocular images. Despite its simplicity, it outperforms normalization-based prior art, indicating a promising direction for robust iris recognition.
PaperID: 99,   Oral  https://arxiv.org/pdf/2509.03951    
Authors: Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang
Title: ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning
Abstract: The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. Furthermore, the absence of negative labels semantically similar to ID labels constrains their capability in near-OOD detection. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we cache images likely to be OOD samples from the historical test images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we cache the subset of ID classes that are visually similar to historical test images and then leverage MLLM reasoning to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD), making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 3.1%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.
PaperID: 100,   Oral  https://arxiv.org/pdf/2507.12156    
Authors: Chen Li, Shanshan Dong, Sheng Qiu, Jianmin Han, Yibo Zhao, Zan Gao, Taku Komura, Kemeng Huang
Title: SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models
Abstract: Reconstructing dynamic fluids from sparse views is a longstanding and challenging problem, due to the severe lack of 3D information from insufficient view coverage. While several pioneering approaches have attempted to address this issue using differentiable rendering or novel view synthesis, they are often limited by time-consuming optimization under ill-posed conditions. We propose SmokeSVD, an efficient and effective framework to progressively reconstruct dynamic smoke from a single video by integrating the generative capabilities of diffusion models with physically guided consistency optimization. Specifically, we first propose a physically guided side-view synthesizer based on diffusion models, which explicitly incorporates velocity field constraints to generate spatio-temporally consistent side-view images frame by frame, significantly alleviating the ill-posedness of single-view reconstruction. Subsequently, we iteratively refine novel-view images and reconstruct 3D density fields through a progressive multi-stage process that renders and enhances images from increasing viewing angles, generating high-quality multi-view sequences. Finally, we estimate fine-grained density and velocity fields via differentiable advection by leveraging the Navier-Stokes equations. Our approach supports re-simulation and downstream applications while achieving superior reconstruction quality and computational efficiency compared to state-of-the-art methods.
PaperID: 101,   Oral
Authors: Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, Christian Rupprecht
Title: VGGT-$\Omega$
Abstract: We present VGGT-Ω, a feed-forward model for 3D reconstruction that substantially advances the state of the art in accuracy, efficiency, and capability for both static and dynamic scenes. Prior models such as VGGT have shown that feed-forward 3D reconstruction can already be competitive with traditional optimization-based methods. Here, we further demonstrate that the accuracy and robustness of these models scale predictably with model capacity and data size. To enable training 3D reconstruction models at an unprecedented scale, we introduce a high-quality data annotation pipeline that handles dynamic scenes, a self-supervised learning protocol, and architectural changes that greatly reduce memory requirements. We significantly simplify VGGT’s architecture by replacing multiple dense prediction heads with loss-driven multitask learning, removing unstable DPT blocks, and introducing more efficient global attention via scene tokens. These changes allow us to efficiently train VGGT-Ω with 20× more supervised data and 100× more unsupervised data than prior work, while requiring only 30% of VGGT’s memory and running 1.6× faster at inference. As a result, VGGT-Ω establishes a new state of the art for 3D reconstruction on both static and dynamic scenes across a wide range of benchmarks, e.g., improving the camera estimation accuracy by 77% on the Sintel dataset. Models and code will be publicly released.
PaperID: 102,   Oral
Authors: Claudia Cuttano, Gabriele Trivigno, Christoph Reich, Daniel Cremers, Carlo Masone, Stefan Roth
Title: INSID3: Training-Free In-Context Segmentation with DINOv3
Abstract: In-context segmentation (ICS) aims to segment arbitrary concepts, objects, parts, or personalized instances given a few annotated visual examples. Existing work either (i) fine-tunes vision foundation models (VFMs), which improves in-domain results but limits generalization, or (ii) combines multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities only from frozen DINOv3 features, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +6.1% mIoU, while using 3× fewer parameters and without any mask or category-level supervision.
PaperID: 103,   Oral
Authors: Can Wang, Lei Liu, Wei Jiang, Dong Xu
Title: Z-Order Transformer for Feed-Forward Gaussian Splatting
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods aim to predict Gaussian attributes directly from images, but they often struggle with the redundancy of Gaussian primitives and rendering quality. In this paper, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. Furthermore, we incorporate this Z-order strategy to adaptively suppress redundancy while preserving critical structural details. This allows the transformer to efficiently model context, compress Gaussian primitives, and predict Gaussian attributes in a single forward pass. Comprehensive experiments demonstrate that our method achieves fast and high-quality novel view synthesis with fewer Gaussian primitives.
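Editorial illustration (not from the paper): the Z-order idea mentioned in the abstract above can be sketched with a generic Morton-code sort, which maps quantized 3D positions to a single integer so that spatially nearby points tend to sit close together in the resulting sequence. The function names, the 10-bit quantization, and the assumption that coordinates lie in the unit cube are all hypothetical choices for this sketch; the paper's actual ordering and attention scheme may differ.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def morton_code(ix: int, iy: int, iz: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of three integer grid coordinates."""
    code = 0
    for i in range(bits):
        code |= ((ix >> i) & 1) << (3 * i)
        code |= ((iy >> i) & 1) << (3 * i + 1)
        code |= ((iz >> i) & 1) << (3 * i + 2)
    return code


def quantize(value: float, bits: int = 10) -> int:
    """Map a coordinate in [0, 1] to a cell index in [0, 2**bits - 1]."""
    cells = 1 << bits
    return min(max(int(value * cells), 0), cells - 1)


def z_order_indices(points: Sequence[Point], bits: int = 10) -> List[int]:
    """Return point indices sorted by the Morton code of their grid cell."""
    codes = [
        morton_code(*(quantize(c, bits) for c in p), bits=bits) for p in points
    ]
    return sorted(range(len(points)), key=codes.__getitem__)


# Hypothetical Gaussian centers, assumed normalized to the unit cube.
centers = [(0.9, 0.9, 0.9), (0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
order = z_order_indices(centers)
```

Once points are serialized this way, a windowed or otherwise sparse attention pattern over the sequence mostly connects spatial neighbors, which is the general motivation for space-filling-curve orderings.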
Paperid: 104,   Oral  
Authors: Cainan Davidson, Deva Ramanan, Neehar Peri
Title: RefAV: Towards Planning Centric Scenario Mining
Abstract: Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from 1,000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Lastly, we discuss our recently held competition and share insights from the community.
Paperid: 105,   Oral  
Authors: Hanz Cuevas Velasquez, Anastasios Yiannakidis, Soyong Shin, Giorgio Becherini, Markus Höschle, Joachim Tesch, Taylor Obersat, Tsvetelina Alexiadis, Eni Halilaj, Michael J. Black
Title: MAMMA: Markerless Accurate Multi-person Motion Acquisition
Abstract: We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video. Traditional motion-capture systems rely on physical markers. Although they offer high accuracy, their requirements of specialised hardware, manual marker placement, and extensive post-processing make them costly and time-consuming. Recent learning-based methods attempt to overcome these limitations, but most are designed for single-person capture, rely on sparse keypoints, or struggle with occlusions and physical interactions. In this work, we introduce a method that predicts dense 2D surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. We employ a novel architecture that exploits learnable queries for each landmark. We demonstrate that our approach can handle complex person-person interaction and offers greater accuracy than existing methods. To train our network, we construct a large, synthetic multi-view dataset combining human motions from diverse sources, including extreme poses, hand motions, and close interactions. Our dataset yields high-variability synthetic sequences with rich body contact and occlusion, and includes SMPL-X ground-truth annotations with dense 2D landmarks. The result is a system capable of accurately capturing human motion without the need for markers. Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions, without the extensive manual cleanup. Finally, we address the absence of common benchmarks for dense-landmark prediction and markerless motion capture by introducing two evaluation settings built from real multi-view sequences. We will release our dataset, method, code, and model weights for research purposes.
Paperid: 106,   Oral  
Authors: Jeremy Juybari, Joshua Hamilton, Shuvra Das, Chaofan Chen, Andre Khalil, Yifeng Zhu
Title: Differentiable Laplacian Matrix Guided Superpixel Segmentation
Abstract: Superpixels partition an image into perceptually coherent regions, reducing the cost of downstream vision tasks. Modern deep learning methods excel at superpixel generation but often yield irregular boundaries and isolated pixels, necessitating non-differentiable post-processing to enforce connectivity. This undermines the end-to-end learning capabilities. We propose a simple, fully differentiable graph-Laplacian loss that encourages spatial regularity and connectivity during training. The loss is model-agnostic and can be seamlessly integrated into the training of existing architectures to improve the quality of superpixels. In addition, we introduce two novel metrics, the average stray pixel count and excess component count, to measure the quality of superpixels. We demonstrate both qualitative and quantitative improvements over state-of-the-art methods with and without enforced connectivity. Our approach represents a significant step toward eliminating non-differentiable post-processing.
Paperid: 107,   Oral  
Authors: Ting Peng, Junhao Dong, Yew-Soon Ong
Title: Learning Latent Concepts for Detecting Out-of-Distribution Objects
Abstract: Detecting out-of-distribution (OOD) objects is indispensable for safely deploying object detectors in the wild. Current approaches enable the unknown-aware ability by regularizing the instance-level feature space, such as outlier synthesis. Despite the general efficacy, it is challenging to truly learn the concept of 'unknown' in the absence of real unknown data. In this paper, we propose UNO-Adapter, a simple yet highly effective framework tailored for OOD object detection. Our key insight is that in object detection, where in-distribution (ID) and OOD objects may coexist within the same context, we need global abstraction and reasoning to help the detector learn their differences, i.e., unknown injection. UNO-Adapter consists of two key steps: unsupervised concept discovery and neural concept binder. The former introduces an object-centric learning paradigm to abstract and model the holistic image, including both ID and OOD, obtaining sparse and compressed slot-based representations with relational constraints. The latter dynamically combines slots with object candidates extracted by the detector, binding the concept of unknown to the de facto detector. During inference, we introduce an image-guided OOD object score to reinforce the distinction between ID and OOD. Experiments on standard benchmarks demonstrate the superiority of the proposed method. In particular, UNO-Adapter reduces the FPR95 by up to 11.96% compared to the previous best OOD object detection method.
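Editorial illustration (not from the paper): FPR95, the metric quoted in the abstract above, is commonly computed as the false positive rate on OOD samples at the score threshold that retains 95% of ID samples. The sketch below assumes that higher scores mean "more in-distribution" and that ID is treated as the positive class; the function name and the toy scores are hypothetical, and the paper's exact evaluation protocol may differ.

```python
from typing import Sequence


def fpr_at_95_tpr(id_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """Fraction of OOD samples accepted when 95% of ID samples are accepted."""
    ranked = sorted(id_scores, reverse=True)
    # Smallest number of ID samples covering at least 95% (integer ceiling).
    keep = (95 * len(ranked) + 99) // 100
    threshold = ranked[keep - 1]
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)


# Toy scores: 20 ID objects and 4 OOD objects.
toy_id = list(range(1, 21))
toy_ood = [0, 1, 2, 10]
toy_fpr95 = fpr_at_95_tpr(toy_id, toy_ood)
```

Lower values are better: a detector with a low FPR95 rarely mistakes an unknown object for a known one while still accepting almost all known objects.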
Paperid: 108,   Oral  
Authors: Kai Zhu, Li Chen, Jun Cheng
Title: Dual-level Adapter Boosting Prompt-free Curvilinear Structure Segmentation
Abstract: Curvilinear structure segmentation is essential in domains such as medical imaging, remote sensing, and materials science. Existing methods often require extensive domain-specific training and lack generalization to novel domains. To overcome these limitations, we propose the Segment Anything Curve Model (SACM), a universal curvilinear segmentation framework built upon the pretrained Segment Anything Model (SAM). SACM introduces a dual-level adapter architecture that enables both fine-grained and domain-adaptive enhancement: block-level internal adapters refine local structural representations, while external adapters facilitate cross-domain feature alignment. Specifically, the internal adapters are embedded within each Transformer block to locally adapt and refine features for thin and intricate curvilinear patterns, while the external adapters operate across blocks to capture global, multi-layer contextual information and facilitate domain adaptation. Furthermore, SACM introduces a feature fusion mechanism that aggregates multi-layer features from all external adapters and fuses them via a Feed-Forward Network (FFN) module, and a dual-stage refinement process in the mask decoder to enhance topology and connectivity. This design enables prompt-free, data-efficient fine-tuning and achieves robust cross-domain generalization when trained with only 18 annotated images. Extensive experiments across twelve diverse curvilinear datasets validate that SACM achieves state-of-the-art performance.
Paperid: 109,   Oral  
Authors: Jinbo Xing, Zeyinzi Jiang, Yuxiang Tuo, Chaojie Mao, Xiaotang Gai, Xi Chen, Jingfeng Zhang, Yulin Pan, Zhen Han, Jie Xiao, Keyu Yan, Chen-Wei Xie, Chongyang Zhong, Kai Zhu, Shen Tong, Lianghua Huang, Yu Liu, Yujiu Yang
Title: Weaver: Decoupled Training for Interleaved Multi-modal Generation
Abstract: Recent unified multimodal models have made unprecedented progress in understanding and generation, yet they largely support multi-modal inputs with single-modality outputs, struggling to produce complex interleaved text–image content due to data scarcity and the difficulty of modeling long-range cross-modal context. We introduce Weaver, which frames interleaved generation as an autoregressive planning–visualization process within a unified multi-modal architecture. A planner, i.e., understanding expert, digests rich text–image context to produce visualization triggers and their dense textual guidance except for plain text, while a visualizer, i.e., generation expert, produces images conditioned on the planner’s textual guidance and visual references. This design enables decoupled learning: we train the two experts on large collections of textual planning and reference-guided image data in parallel, yielding powerful interleaved multi-modal generation capability at inference. Moreover, training the planner with datasets from diverse understanding and generation tasks equips the model with automatic task inference. To analyze and evaluate the model from multiple dimensions, we further introduce a benchmark that covers a range of everyday use cases. Extensive experiments show that, even with little or no training on real interleaved data, Weaver achieves superior performance on interleaved multi-modal generation.
Paperid: 110,   Oral  
Authors: Han Su, Tianyu Huang, Zichen Wan, Xiaohe Wu, Wangmeng Zuo
Title: S$^2$AM3D: Scale-controllable Part Segmentation of 3D Point Clouds
Abstract: Part-level point cloud segmentation has recently attracted significant attention in 3D computer vision. Nevertheless, existing research is constrained by two major challenges: native 3D models lack generalization due to data scarcity, while introducing 2D pre-trained knowledge often leads to inconsistent segmentation results across different views. To address these challenges, we propose S^2AM3D, which incorporates 2D segmentation priors with 3D consistent supervision. We design a point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, producing globally consistent point features. A scale-aware prompt decoder is then proposed to enable real-time adjustment of segmentation granularity via continuous scale signals. Simultaneously, we introduce a large-scale, high-quality part-level point cloud dataset with more than 100k samples, providing ample supervision signals for model training. Extensive experiments demonstrate that S^2AM3D achieves leading performance across multiple evaluation settings, exhibiting exceptional robustness and controllability when handling complex structures and parts with significant size variations.
Paperid: 111,   Oral  
Authors: Miao Jia, Xingchen Hu, Jiyuan Liu, Siwei Wang, Min Wang, Zijian Chen
Title: Scalable Multi-View Subspace Clustering with Tensorized Anchor Guidance
Abstract: Anchor-based multi-view clustering methods have gained significant attention in recent years for their effectiveness in handling large-scale datasets. The performance of these methods is highly dependent on anchor quality. However, current methods neglect the interactive relationships among cross-view anchors, failing to effectively discover and exploit consistent and complementary information, leading to noisy or suboptimal anchor representations. In this paper, we propose a novel scalable tensorized anchor guidance for multi-view subspace clustering (SMVS-TAG), which directly couples anchors across views to improve clustering performance. Specifically, we construct a third-order anchor tensor from view-specific anchors in a low-dimensional latent space. By imposing a tensor Schatten p-norm constraint on the anchor tensor, we can explicitly capture cross-view low-rank structure and jointly exploit consistent and complementary information among anchors. Moreover, the tensorized anchor regularizer is independent of the number of samples, which reduces both time and space complexity. Experimental results on seven datasets demonstrate that SMVS-TAG achieves superior effectiveness and stability compared to state-of-the-art large-scale MVC methods.
Paperid: 112,   Oral  
Authors: Claudia Cuttano, Gabriele Trivigno, Carlo Masone, Stefan Roth
Title: MARCO: Navigating the Unseen Space of Semantic Correspondence
Abstract: Recent advances in semantic correspondence rely on dual-encoder architectures, combining DINOv2 with diffusion backbones. While accurate, these billion-parameter models generalize poorly beyond training keypoints, revealing a gap between benchmark performance and real-world usability, where queried points rarely match those seen during training. Building upon DINOv2, we introduce MARCO, a unified model for generalizable correspondence driven by a novel training framework that enhances both fine-grained localization and semantic generalization. By coupling a coarse-to-fine objective that refines spatial precision with a self-distillation framework, which extends sparse supervision beyond annotated regions, our approach transforms a handful of keypoints into dense, semantically coherent correspondences. MARCO sets a new state of the art on SPair-71k, AP-10K and PF-PASCAL, with gains that amplify at fine-grained localization thresholds (+10.3 PCK@0.01), strongest generalization to unseen keypoints (+3.8, SPair-U) and categories (+5.6, MP-100), while remaining 3× smaller and 10× faster than diffusion-based approaches.
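Editorial illustration (not from the paper): the PCK@α numbers quoted in the abstract above refer to the Percentage of Correct Keypoints, where a predicted keypoint counts as correct if it lies within α times a reference size of the ground-truth location. The sketch below assumes the reference size is the longer side of the object bounding box, which is one common convention; some benchmarks normalize by image size instead. The function name and the toy coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Keypoint = Tuple[float, float]


def pck(
    predicted: Sequence[Keypoint],
    ground_truth: Sequence[Keypoint],
    bbox_size: Tuple[float, float],
    alpha: float = 0.1,
) -> float:
    """Fraction of predictions within alpha * max(bbox height, bbox width)."""
    limit = alpha * max(bbox_size)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
        if math.hypot(px - gx, py - gy) <= limit
    )
    return correct / len(ground_truth)


# Toy example: one prediction is 1 px off, the other is 30 px off.
toy_pred = [(10.0, 10.0), (50.0, 50.0)]
toy_gt = [(10.0, 11.0), (80.0, 50.0)]
strict = pck(toy_pred, toy_gt, bbox_size=(100.0, 200.0), alpha=0.01)
loose = pck(toy_pred, toy_gt, bbox_size=(100.0, 200.0), alpha=0.2)
```

A threshold such as α = 0.01 only rewards near-exact localization, which is why gains at that setting indicate fine-grained precision rather than coarse matching.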
Paperid: 113,   Oral  
Authors: Louis Martinez, Maks Ovsjanikov
Title: FILTR: Extracting Topological Features from Pretrained 3D Models
Abstract: Recent advances in pretraining 3D point cloud encoders (e.g., Point-BERT, Point-MAE) have produced powerful models, whose abilities are typically evaluated on geometric or semantic tasks. At the same time, topological descriptors have been shown to provide informative summaries of a shape's multiscale structure. In this paper, we ask whether topological information can be derived from features produced by 3D encoders. To address this question, we first introduce DONUT, a synthetic benchmark with controlled topological complexity, and propose FILTR (Filtration Transformer), a learnable framework to predict persistence diagrams directly from frozen encoders. FILTR adapts a transformer decoder to treat diagram generation as a set prediction task. Our analysis on DONUT reveals that existing encoders retain only limited global topological signals, yet FILTR successfully leverages information produced by these encoders to approximate persistence diagrams. Our approach enables, for the first time, data-driven extraction of persistence diagrams from raw point clouds through an efficient learnable feed-forward mechanism.
Paperid: 114,   Oral  
Authors: Ruofei Wang, Peiqi Duan, Ka Chun Cheung, Simon See, Boxin Shi, Renjie Wan
Title: Texvent: Asynchronous Event Data Simulation via Text Prompt
Abstract: Current event simulation methods focus on employing videos to synthesize new event data, suffering from costly video capture and limited scalability across viewpoints, motions, and lighting. To this end, we propose a Text-to-event simulation framework (Texvent) that can directly generate asynchronous event data from simple text prompts. Texvent first renders prompt-driven videos via multimodal large language models and subsequently applies a new physical simulator to generate event streams. Specifically, an adaptive brightness-aware frame interpolation approach is proposed to enhance the temporal resolution of the rendered videos. A balanced logarithmic intensity comparison strategy and a cache-based voltage refreshment mechanism are introduced into the simulator to generate event data. To narrow the sim-to-real gap, we also introduce background activity noise injection and dense time stamp reconstruction operations. Extensive experiments demonstrate Texvent’s superior computational efficiency and its ability to generate more realistic event data than existing simulators.
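Editorial illustration (not from the paper): video-to-event simulators generally build on the idealized event-camera model in which a pixel fires an event each time its log intensity moves by a contrast threshold relative to a stored reference level. The sketch below implements only that textbook model for a single pixel; it is not Texvent's balanced comparison strategy or voltage refreshment mechanism, and the function name, threshold value, and `eps` offset are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

Event = Tuple[float, int]  # (timestamp, polarity)


def simulate_pixel_events(
    intensities: Sequence[float],
    timestamps: Sequence[float],
    threshold: float = 0.2,
    eps: float = 1e-3,
) -> List[Event]:
    """Emit +1/-1 events whenever log intensity drifts by `threshold`."""
    events: List[Event] = []
    reference = math.log(intensities[0] + eps)
    for t, value in zip(timestamps[1:], intensities[1:]):
        current = math.log(value + eps)
        # A large brightness change can trigger several events in one frame.
        while abs(current - reference) >= threshold:
            polarity = 1 if current > reference else -1
            reference += polarity * threshold
            events.append((t, polarity))
    return events


# Toy pixel: brightens sharply, holds, then dims partway.
toy_events = simulate_pixel_events(
    [1.0, 10.0, 10.0, 2.0], [0.0, 1.0, 2.0, 3.0], threshold=1.0
)
```

In this simplified form all events from one frame interval share the frame timestamp; practical simulators interpolate timestamps and add sensor noise, which is the kind of gap the abstract's noise injection and timestamp reconstruction steps aim to close.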
Paperid: 115,   Oral  
Authors: Ziteng Wei, Qiang He, Bing Li, Feifei Chen, Hai Jin, Yun Yang
Title: NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices
Abstract: Vision Transformers (ViTs) often need to be compressed for deployment on resource-constrained edge devices like drones and smart vehicles. However, existing model compression methods ignore that many edge devices only require the knowledge of specific classes for their applications. As a result, the derived all-class ViTs retain redundant knowledge and perform suboptimally on these classes. We discovered that simply replacing the calibration dataset with class-specific data does not suffice to address this issue, as these methods face two fundamental limitations. First, they overlook the existence of class-detrimental weights, which interfere with specialization, while removing them can improve class-specific performance. Second, the diversity of target classes and resource constraints on edge devices demand numerous customized models. Existing methods are time-consuming and computationally expensive, thus unscalable. In this work, we present NuWa, a cost-efficient method that addresses these challenges by deriving small ViTs from base ViTs for edge devices with specific class requirements. NuWa performs self-knowledge purification to prune class-detrimental weights and efficiently derives compact ViTs through closed-form optimization. Without post-pruning retraining, the derived edge ViTs surpass the base ViT in class-specific accuracy and accelerate inference. Comprehensive experiments demonstrate that NuWa outperforms state-of-the-art training-free pruning methods on class-specific tasks by up to 29.00% in accuracy. Compared with the best-performing training-dependent pruning method, NuWa achieves a 33.69× pruning speedup and reduces pruning cost by up to 99.83%, with only a 0.61% average accuracy loss.
Paperid: 116,   Oral  
Authors: Wenjie Hou, Tianxiang Chen, Feng Wang, Tiantong Wu, Zhiming Zheng, Shaoting Tang, Wei Yang Bryan Lim
Title: FedAdamom: Adaptive Momentum for Improved Generalization in Federated Optimization
Abstract: Federated learning (FL) has emerged as a widely adopted training paradigm for privacy-preserving machine learning. Despite the past success of SGD-based methods, they still suffer from severe data heterogeneity and the lack of adaptivity in practical applications. While several adaptive federated optimization methods (such as FedAdam) have been proposed and demonstrated to achieve faster convergence, they fail to show significant improvements in generalization performance under highly heterogeneous data distributions, and their optimization and generalization mechanisms remain insufficiently understood. To fill this gap, we introduce diffusion theory into the adaptive federated optimization framework and analyze the distinct effects of adaptive learning rate and global momentum from the perspectives of saddle-point escaping and flat-minima selection. Theoretical results show that although FedAdam outperforms FedAvg/FedAvgM in escaping saddle points, the latter escapes sharp minima more efficiently. The root cause is that adaptive learning rates, while enhancing saddle-point escape, weaken the preference for flat minima. Motivated by these insights, we propose FedAdamom, a new adaptive federated optimization algorithm that adapts the momentum hyperparameter rather than the learning rate. FedAdamom maintains strong saddle-point escaping capability while enhancing flat-minima selection. We further establish its convergence guarantees under non-convex objectives. Extensive experiments demonstrate that FedAdamom significantly outperforms existing adaptive federated optimization methods in terms of convergence speed, generalization performance, and preference for flat minima.
Paperid: 117,   Oral  
Authors: Fu Feng, Yucheng Xie, Ruixiao Shi, Xu Yang, Jing Wang, Xin Geng
Title: Breaking Semantic Boundaries: Distribution-Guided Semantic Exploration for Creative Generation
Abstract: Text-to-image (T2I) diffusion models effectively produce semantically aligned images, but their reliance on training distributions constrains their capacity for synthesizing truly novel, out-of-distribution concepts. Existing methods attempt to enhance creativity through semantic exploration, such as fusing known concept pairs, but the resulting images remain linguistically describable and confined to familiar semantic spaces. Inspired by the soft probabilistic outputs of classifiers on novel or out-of-distribution inputs, we propose Distribution-Conditional Generation, a paradigm that models novel concepts as image synthesis conditioned on class distributions, enabling controllable yet semantically unconstrained creative generation. Building on this, we propose DisTok, an encoder–decoder framework that unifies conditional and unconditional creative generation by decoding latent representations, either randomly sampled or mapped from conditions (e.g., class distributions), into tokens representing novel concepts. DisTok is trained by iteratively sampling and fusing concept pairs from a dynamic pool to model progressively complex distributions, while enforcing semantic consistency through a vision-language model that aligns the class distributions of generated images with the input distributions. Extensive experiments demonstrate that DisTok enables efficient and flexible semantic exploration for token-level creative synthesis, achieving state-of-the-art text–image alignment and human preference.
Paperid: 118,   Oral  
Authors: Taci Ata Kucukpinar, Juan Mogollon, Joshua Fraser, Timothy Duff, Kannappan Palaniappan
Title: Linear Fundamental Matrix Estimation from 7 or 5 Points
Abstract: We revisit the problem of estimating the fundamental matrix of a pair of perspective cameras, a cornerstone of geometric computer vision. As is well known, linear solvers require at least 8 point correspondences, whereas nonlinear minimal solvers require just 7 in the uncalibrated case or 5 in the calibrated case. In this paper, we consider a special case of the 7-point problem where 5 of the points are configured to lie on two lines, which has previously been shown to have a unique solution. As a theoretical contribution, we offer an analysis of how this uniqueness manifests in the standard 7-point algorithm. On a practical level, we provide the first practical linear solver for the minimal problem associated with this special configuration. Additionally, we evaluate a heuristic 5-point fundamental matrix solver based on the construction of virtual midpoints. When combined with early non-minimal fitting, the runtime and accuracy of our solver are competitive with the state of the art (SoTA) on multiple benchmarks.
Paperid: 119,   Oral  
Authors: Linxiao Shi, Siming Zheng, Zerong Wang, Hao Zhang, Jinwei Chen, Bo Li, Shifeng Chen, Peng-Tao Jiang
Title: Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework
Abstract: Existing mobile devices are constrained by compact optical designs, such as small apertures, which make it difficult to produce natural, optically realistic bokeh effects. Although recent learning-based methods have shown promising results, they still struggle with photos captured under high digital zoom levels, which often suffer from reduced resolution and loss of fine details. A naive solution is to enhance image quality before applying bokeh rendering, yet this two-stage pipeline reduces efficiency and introduces unnecessary error accumulation. To overcome these limitations, we propose MagicBokeh, a unified diffusion-based framework designed for high-quality and efficient bokeh rendering. Through an alternative training strategy and a focus-aware masked attention mechanism, our method jointly optimizes bokeh rendering and super-resolution, substantially improving both controllability and visual fidelity. Furthermore, we introduce a degradation-aware depth module to enable more accurate depth estimation from low-quality inputs. Experimental results demonstrate that MagicBokeh efficiently produces photorealistic bokeh effects, particularly on real-world low-resolution images, paving the way for future advancements in bokeh rendering. The code will be released publicly.
Paperid: 120,   Oral  
Authors: Takumi Kawano, Kohei Miura, Daisuke Iwai
Title: Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras
Abstract: Conventional multi-projector calibration requires projecting and capturing structured light patterns for each projector sequentially, causing calibration time and effort to increase linearly with the number of projectors. This scalability bottleneck has long limited the deployment of large-scale projection mapping systems. We present a new calibration framework that breaks this limitation by embedding cameras into the surface of the calibration target. The embedded cameras directly capture the incoming projection light, enabling the separation of simultaneously projected structured light patterns from multiple projectors according to their incident directions. Our method establishes correspondences between the optical centers of the embedded cameras and the projector pixels, allowing the intrinsic and extrinsic parameters of all projectors to be simultaneously estimated. We further introduce a correction technique for small misalignments between the calibration board and camera optical centers. As a result, our system achieves calibration accuracy comparable to conventional methods while reducing the required number of projection-capture cycles from linear to nearly constant with respect to the number of projectors, dramatically improving scalability for large multi-projector environments.
Paperid: 121,   Oral  
Authors: Pengfei Hu, Meng Cao, Yingyao Wang, Yi Wang, Jiahua Dong, Jun Song, YuCheng YuCheng, Bo Zheng, Xiaodan Liang
Title: Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long-Video Understanding
Abstract: Long video understanding is essential for human-like intelligence, enabling coherent perception and reasoning over extended temporal contexts. While the emerging thinking-with-frames paradigm, which alternates between global temporal reasoning and local frame examination, has advanced the reasoning capabilities of video multi-modal large language models (MLLMs), it suffers from a significant efficiency bottleneck due to the progressively growing and redundant multi-modal context. To address this, we propose SpecTemp, a reinforcement learning-based Speculative Temporal reasoning framework that decouples temporal perception from reasoning via a cooperative dual-model design. In SpecTemp, a lightweight draft MLLM rapidly explores and proposes salient frames from densely sampled temporal regions, while a powerful target MLLM focuses on temporal reasoning and verifies the draft’s proposals, iteratively refining its attention until convergence. This design mirrors the collaborative pathways of the human brain, balancing efficiency with accuracy. To support training, we construct the SpecTemp-80K dataset, featuring synchronized dual-level annotations for coarse evidence spans and fine-grained frame-level evidence. Experiments across multiple video understanding benchmarks demonstrate that SpecTemp not only maintains competitive accuracy but also significantly accelerates inference compared with existing thinking-with-frames methods.
Paperid: 122,   Oral  
Authors: Jiacong Zhou, Jiaxu Miao, Yourun Lin, Xianyun Wang, Jun Xiao, Jun Yu
Title: Memory-Augmented Scene Understanding and Exploration for Open-World Aerial Object-Goal Navigation
Abstract: Aerial object-goal navigation (Aerial ObjectNav) requires an Unmanned Aerial Vehicle (UAV) to navigate to target objects in large-scale outdoor environments using only visual observations and high-level object descriptions, without detailed step-by-step instructions. Existing approaches rely on local observations or short-term history, lacking comprehensive scene understanding and efficient spatial exploration strategies, which constrains their navigation capability in complex aerial scenarios. To address these challenges, we propose OctMem-Agent, an octree memory-augmented framework for aerial object-goal navigation. Specifically, we introduce an Adaptive Octree Memory that incrementally aggregates RGB-D observations into a hierarchical 3D representation, capturing both explored regions and unexplored frontiers across large-scale aerial environments. We further propose an Instruction-Guided Memory Query module that extracts task-relevant scene and exploration tokens through instruction-modulated queries. By integrating these tokens with visual observations and language instructions, OctMem-Agent achieves comprehensive scene understanding and effective spatial exploration for target localization. Extensive experiments on the Aerial ObjectNav benchmark UAV-ON demonstrate that our method achieves a significant 7.5% improvement in success rate over existing methods, validating the effectiveness of our design.
Paperid: 123,   Oral  
Authors: Ao Luo, XIN LI, Fan Yang, Yuezun Li, Zhaoquan Yuan, SHAN ZHAO, Bing Su, Xiao WU
Title: Optical Flow Matching: Reframing Optical Flow as Continuous Transport Dynamics
Abstract: Modern optical flow estimation, though empowered by deep neural architectures, remains rooted in the discrete correspondence paradigm inherited from classical vision. Most networks infer frame-to-frame displacements or correlation volumes, capturing where pixels move but not how motion evolves continuously through time. Yet physical motion in the real world follows smooth dynamics governed by underlying velocity fields, as long established in fluid mechanics and transport theory. To bridge this gap, we introduce Optical Flow Matching (OFM), a continuous formulation that learns a time-dependent velocity field to transport pixel coordinates along motion distribution coherent trajectories. A key component of our OFM is Triangle Velocities Synergy (TVS), a lightweight geometric mechanism that provides a stable and physically meaningful velocity construction, ensuring that continuous transport remains well-defined. Combined with an Euler-based ODE solver, OFM yields flow fields that are temporally smooth, geometrically consistent, and process-interpretable. Experiments on Sintel, KITTI, and Spring demonstrate that OFM achieves state-of-the-art accuracy, enhanced temporal stability, and notably stronger cross-dataset generalization, advancing optical flow estimation from correspondence inference to continuous dynamical reasoning. All code and trained models will be released upon acceptance to facilitate further research.
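Editorial illustration (not from the paper): the abstract above describes transporting pixel coordinates along a time-dependent velocity field with an Euler-based ODE solver. The sketch below shows generic forward-Euler integration of dx/dt = v(x, t) from t = 0 to t = 1 for a single pixel, with the flow read off as the end position minus the start position. The velocity function here is a hand-written stand-in for a learned network, and all names and the step count are assumptions for this example.

```python
from typing import Callable, Tuple

Vec2 = Tuple[float, float]
VelocityFn = Callable[[float, float, float], Vec2]


def euler_transport(start: Vec2, velocity_fn: VelocityFn, steps: int = 8) -> Vec2:
    """Integrate dx/dt = v(x, t) over t in [0, 1] with forward Euler."""
    x, y = start
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        vx, vy = velocity_fn(x, y, t)
        x += dt * vx
        y += dt * vy
    return x, y


def constant_velocity(x: float, y: float, t: float) -> Vec2:
    """Stand-in field: every pixel moves 2 px right and 1 px up per unit time."""
    return 2.0, -1.0


start_pixel = (3.0, 4.0)
end_pixel = euler_transport(start_pixel, constant_velocity)
flow = (end_pixel[0] - start_pixel[0], end_pixel[1] - start_pixel[1])
```

For a constant field the Euler result is exact; for a spatially or temporally varying field, more steps trade compute for a closer approximation of the continuous trajectory.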
Paperid: 124,   Oral  
Authors: Hongyu Wen, Jia Deng
Title: SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping
Abstract: Transparent objects are common in daily life, and understanding their multi-layer depth information, including both the transparent surface and the objects behind it, is crucial for real-world applications that interact with transparent materials. However, existing depth methods produce only a single depth map, which is inherently ambiguous for transparent surfaces. In this work, we propose a multi-layer depth estimation method, SeeGroup, consisting of a novel recurrent decomposition module design and an intensity-based formulation for multi-layer depth. Experiments demonstrate that our method significantly improves the state of the art of multi-layer depth estimation, improving quadruplet relative depth accuracy on the LayeredDepth benchmark from 61.34% to 70.67%.
Paperid: 125,   Oral  
Authors: Nikita Araslanov, Martin Sundermeyer, Hidenobu Matsuki, David Joseph Tan, Federico Tombari
Title: Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
Abstract: One of the most exciting applications of vision models involves pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction. We present a framework that learns pixel-accurate feature descriptors from videos, LILA. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps -- depth and motion -- estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner. We demonstrate compelling empirical benefits of the learned representation across a diverse suite of vision tasks: video object segmentation, surface normal estimation and semantic segmentation.
Paperid: 126,   Oral  
Authors: Yuxuan Liu, Wei Xu, Qi Guo
Title: MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging
Abstract: We present MetaSpectra+, a compact multifunctional camera that supports two operating modes: (1) snapshot HDR + hyperspectral or (2) snapshot polarization + hyperspectral imaging. It utilizes a novel metasurface-refractive assembly that splits the incident beam into multiple channels and independently controls each channel’s dispersion, exposure, and polarization. Unlike prior multifunctional metasurface imagers restricted to narrow (10–100 nm) bands, MetaSpectra+ operates over nearly the entire visible spectrum (250 nm). Relative to snapshot hyperspectral imagers, it achieves the shortest total track length and the highest reconstruction accuracy on benchmark datasets. The demonstrated prototype reconstructs high-quality hyperspectral datacubes and either an HDR image or two orthogonal polarization channels from a snapshot measurement.
Paperid: 127,   Oral  
Authors: Jinyuan Liu, Ludan Sun, Tengyu Ma, Chunyan Yang, Zhiying Jiang, Long Ma, Risheng Liu, Xin Fan
Title: Streaming Diffusion Model for Fast Infrared and Visible Video Fusion
Abstract: Infrared and visible video fusion is pivotal for robust perceptual systems, aiming to synthesize a comprehensive video stream that leverages both thermal resilience and textured details. However, prevailing methods, by treating video as independent frames, inherently introduce temporal incoherence, such as flickering and ghosting artifacts. While diffusion models possess strong generative priors to remedy this, their iterative nature is prohibitively slow for video. To resolve this fundamental dilemma, we propose a streaming diffusion model for efficient infrared and visible video fusion, termed SDMFusion. Our key insight is to distill the generative prior of a pretrained diffusion model into a one-step sampling framework, while explicitly modeling temporal dynamics. We design a memory-augmented latent pipeline where a temporal aggregation adapter aligns and propagates cross-frame features to ensure coherence, supported by a dedicated temporal consistency loss. This approach effectively decouples the challenge of achieving high fidelity from maintaining temporal stability. Extensive experiments on four benchmarks demonstrate that our method establishes a new state-of-the-art, generating fused videos with exceptional spatio-temporal consistency at a speed suitable for real-time application.
Paperid: 128,   Oral  
Authors: Yuxi Ma, Sujie Liu, Jing Yang, Jiacheng Wang, Yiping Chen, Baptiste Magnier, Liansheng Wang
Title: DK-DDIL: Adaptive Knowledge Retention for Dynamic Domain-Incremental Learning in Medical Imaging
Abstract: Large-scale foundation models pretrained on massive datasets have demonstrated strong generalization capabilities in medical image analysis. However, they are typically trained on static datasets and struggle to cope with the continuously evolving nature of clinical data, where new imaging devices, institutions, and disease subtypes constantly emerge. While domain-incremental learning (DIL) provides a solution for sequential adaptation without revisiting historical data, existing methods typically assume fixed label spaces and limited domain heterogeneity, restricting their applicability to real-world clinical scenarios. To address these challenges, we propose DK-DDIL, a rehearsal-free framework for dynamic DIL that integrates two synergistic modules: a Dynamic Adaptation Module (DAM) employing dynamic rank selection and adaptive regularization to flexibly allocate model capacity under domain shifts, and a Knowledge Inheritance and Refinement (KIR) module that stabilizes cross-domain knowledge transfer through selective adapter fusion and prototype-level contrastive refinement. Experiments on the Skin Pathology Diagnosis dataset, the Cyst-X 3D MRI cohort, and the OfficeHome benchmark demonstrate that DK-DDIL consistently outperforms state-of-the-art DIL approaches, highlighting its effectiveness and versatility across dynamic 2D medical, 3D medical, and natural image domains.
Paperid: 129,   Oral  
Authors: Bingjun Luo, Jialin Guo, Yue Yao, Xinpeng Ding
Title: Adversarial Style Optimization: Enhancing VLM Jailbreaks by GRPO-based Stylistic Triggers Optimization
Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive performance, but their safety alignment remains vulnerable to jailbreak attacks. Existing content-based jailbreaks are often inconsistent and show low attack success rates (ASR) against commercial closed-source MLLMs, failing to exploit non-content-based vulnerabilities. Unlike previous research, we empirically find that MLLMs exhibit a Stylistic Inconsistency between their comprehension ability and safety ability. That is, from the perspective of comprehension, MLLMs can robustly understand content regardless of visual style (e.g., "pencil sketch"). However, from the perspective of safety ability, their defense mechanisms can be easily bypassed by these specific stylistic triggers, leading to harmful responses. Based on this finding, we propose Adversarial Style Optimization (ASO), a plug-and-play enhancement module to amplify existing visual jailbreaks. ASO fine-tunes an image-editing model to superimpose an optimized stylistic modification onto a given adversarial image. We apply a Group Relative Policy Optimization (GRPO) agent, guided by a Structurally-Tiered Reward Function. This function uniquely combines a logit-based signal for detecting explicit refusals with a high-fidelity semantic evaluation from a powerful judge model, mapping outcomes to distinct, non-overlapping reward tiers to select the most potent stylistic parameters. Extensive experiments show that ASO significantly enhances the ASR of SOTA attacks. The GRPO agent automatically discovers optimal, non-intuitive parameters, demonstrating that stylistic biases are a scalable and modular vector for red-teaming MLLMs.
Paperid: 130,   Oral  
Authors: Tao Qi, Huili Wang, Yuanhong Huang, Wendan Wang, Lianchao Zhao, Jinrui Wang, Zichen Qin, Shangguang Wang, Yongfeng Huang
Title: Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models
Abstract: The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infringements involving human-created data. Membership inference attacks (MIAs) have emerged as a promising tool for identifying unauthorized data usage during model training. Existing methods typically assess the ability of a model to denoise perturbed suspect images as an indicator of membership status. However, the discriminative power of such features is highly dependent on the degree of model memorization and deteriorates significantly when applied to less exposed data (e.g., pre-training data). Although several methods attempt to enhance detection by leveraging internal model features, these features are generally inaccessible in mainstream closed-source image generation platforms, limiting their practicality. In this paper, we demonstrate that analyzing how a black-box diffusion model denoises a target image and corresponding perturbed textual instructions can reveal more distinctive membership cues. Based on this insight, we propose a black-box membership inference attack framework (named SD-MIA) that leverages a cross-modal data perturbation mechanism to detect pre-training data in diffusion models. We conduct extensive experiments on both a public benchmark dataset and a newly constructed dataset, each comprising pre-training membership and non-membership samples with identical distributions. Experimental results demonstrate that SD-MIA achieves superior performance compared to existing baselines, including those with the unfair advantage of accessing internal model features.
Paperid: 131,   Oral  
Authors: Xiaoqi Li, Muhe Cai, Jiadong Xu, Juan Zhu, Hongwei Fan, Yan Shen, Guanghui Ren, Hao Dong
Title: AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models
Abstract: Vision-Language-Action (VLA) models have significantly advanced robotic agents capable of executing diverse tasks; however, they remain limited in contact-rich manipulation scenarios that require precise physical interactions. To address this limitation, recent studies have attempted to incorporate tactile signals during downstream tasks, enabling pretrained VLAs to interpret tactile feedback. Nevertheless, introducing new modalities during finetuning, which are rarely present in the pretraining stage, may disrupt the pretrained capabilities of VLAs. In addition, the inherently slow inference speed of VLAs hampers real-time responsiveness and limits the effective utilization of tactile feedback for action adjustment. To overcome these challenges, we propose Adaptive Tactile Vision-Language-Action (AT-VLA), which introduces a novel Adaptive Tactile Injection mechanism. This mechanism dynamically determines the appropriate timing and locations for tactile injection, incorporating tactile input only when it significantly contributes to action generation, thereby minimizing interference with pretrained representations. Furthermore, to enable rapid and accurate tactile responses, we propose a Tactile Reaction Dual-Stream mechanism, which decouples sensory processing into a slow visual-language stream for low-frequency perceptual reasoning and a fast tactile control stream for high-frequency physical interaction understanding, achieving real-time closed-loop responses within 0.04 s. Real-world experiments thoroughly validate the effectiveness of AT-VLA in contact-rich manipulation tasks.
Paperid: 132,   Oral  
Authors: Changzhou Han, Wanlun Ma, XI TANG, Kun Hu, Sheng Wen, Yang Xiang
Title: BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer
Abstract: Sign Language Translation (SLT) converts continuous sign videos into spoken language text, yet current models, whether gloss-based or gloss-free, struggle with long or discourse-level inputs. Recent architectures such as TwoStreamNetwork and CV-SLT have nearly saturated short-sentence accuracy, but their performance degrades on long sentences and multi-sentence paragraphs. In real scenarios such as news, interviews or daily conversations, signers naturally produce extended signing sequences with complex contextual dependencies. Moreover, identifying precise gloss boundaries remains a key obstacle, while gloss-based methods, though often superior, incur heavy annotation costs. The community therefore needs a solution that mitigates gloss dependency while preserving translation quality. We present BoostSLT, a context-aware framework enhancing semantic consistency over long sign sequences without gloss supervision. Instead of requiring explicit gloss segmentation, BoostSLT introduces an Energy-Aware Temporal Segmentation (EAT-Seg) module that dynamically partitions videos into semantically coherent fragments, followed by a Diffusion-based Semantic Reconstruction (DSR) module that stitches and refines fragment-level translations into globally fluent paragraphs. The framework is plug-and-play and model-agnostic, seamlessly integrating with existing gloss-based or gloss-free pipelines across languages. Experiments on PHOENIX-2014T, CSL-Daily, and Auslan-Daily show consistent BLEU and ROUGE-L gains, confirming that diffusion-driven semantic reconstruction effectively bridges local accuracy and global coherence in long-form SLT.
Paperid: 133,   Oral  
Authors: Huan Ren, Yihan Chen, Chuxin Wang, Nailong Liu, Wenfei Yang, Tianzhu Zhang
Title: ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation
Abstract: Category-level object pose estimation aims to predict the pose and size of arbitrary objects in specific categories. Existing methods struggle with the inherent incompleteness of observed point clouds, which limits their ability to capture complete object shapes for robust pose reasoning. While point cloud completion offers a promising solution, naively treating it as a separate preprocessing step for partial observations introduces compounding errors and additional computational overhead, ultimately hindering both accuracy and efficiency. To address these challenges, we propose ComPose, a novel unified framework that tightly integrates shape completion to provide complete geometric cues for enhanced pose estimation. At the core of ComPose is a keypoint-based progressive completion module, which recovers full shape representations by progressively predicting a sparse set of keypoints and their surrounding dense point sets, empowering the keypoints to capture holistic object geometries. A geometric relation encoding module further enriches keypoint features with both local and global geometric context. In addition, we introduce a novel geometric relation consistency loss to enforce structural alignment between observed keypoints and their predicted NOCS coordinates, ensuring globally coherent coordinate transformations. Extensive experiments on standard benchmarks demonstrate that our method outperforms state-of-the-art approaches without relying on category-level shape priors. Our method pioneers a new direction for future research by effectively and efficiently integrating shape completion into category-level object pose estimation. Code will be open.
Paperid: 134,   Oral  
Authors: Yu Gao, Su Lutong, Ruixiang Huang, Tianji Jiang, Jiadong Tang, Yufeng Yue, Yi Yang
Title: Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow
Abstract: High-quality 3D scene representation in radiance fields relies on accurate camera poses which are often difficult to acquire in real-world scenarios. An effective solution is to use RGB images for the joint optimization of radiance fields and camera poses, an approach that has been well explored in NeRF series methods. However, unlike NeRF, joint optimization in 3D Gaussian Splatting (3DGS) often requires additional regularization or prior spatial knowledge to reach comparable performance. To eliminate these dependencies, we introduce Energy-GS, a pose-aware Gaussian splatting framework that jointly optimizes scene representation and camera poses using only RGB images. We observe that pose gradients in joint optimization are unstable due to the point-based rendering mechanism. Furthermore, unlike NeRF’s spatial sampling framework that enables coarse-to-fine pose alignment, rasterization-based 3DGS lacks controllable sampling and thus cannot support progressive pose refinement. To address these challenges, we redesign the optimization strategy of Gaussian primitives and introduce an image-energy-guided constraint that encourages progressive alignment of camera poses. Experiments on both synthetic and real-world datasets show that Energy-GS can effectively optimize the scene reconstruction and resolve camera pose misalignment at the same time. Benefiting from reliance on only RGB images, we believe this work provides promising insights for visual localization and dense mapping applications such as SLAM.
Paperid: 135,   Oral  
Authors: Dingkun Wei, Zehong Shen, Yan Xia, Yujun Shen, Georgios Pavlakos, Xiaowei Zhou
Title: Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos
Abstract: Human motion recovered from monocular videos often appears overly smooth or dynamically inconsistent, even when joint positions are numerically accurate. We observe that this limitation stems from the absence of reliable high-order temporal cues (velocity and acceleration), which are essential for reconstructing motion that exhibits realistic momentum, timing, and high-frequency detail. We introduce HTD-Refine, a post-processing framework that augments existing Human Motion Recovery (HMR) pipelines using explicitly estimated high-order temporal dynamics. At the core of our system is PVA-Net, a temporal transformer that infers per-joint 2D positions, velocities, and accelerations directly from a monocular video. These predicted dynamics serve as soft yet informative constraints in a global optimization procedure that refines camera-space and world-space trajectories, significantly reducing jitter, suppressing oversmoothing, and restoring physically plausible motion profiles. Extensive experiments on challenging in-the-wild benchmarks show that HTD-Refine consistently improves state-of-the-art HMR methods, yielding more accurate global trajectories and substantially more natural motion dynamics. Our results highlight the critical role of high-order temporal modeling in advancing monocular human motion recovery.
Paperid: 136,   Oral  
Authors: Mohammadjavad Matinkia, Nilanjan Ray
Title: Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization
Abstract: Diffeomorphic image registration (DIR) seeks topology-preserving transformations and is fundamental in medical imaging. Existing DIR methods rely on integration schemes (e.g., scaling-and-squaring) and multiple regularizers to enforce invertibility. We introduce SGDIR, a continuous-time registration framework, parameterized by known time-embedded backbones, that models diffeomorphisms using only a single semigroup-based regularization, eliminating explicit integration and auxiliary constraints. We mathematically prove that this formulation directly learns the flow of an underlying ODE, inherently enforcing inverse and cycle consistencies. We evaluate on eight 2D and 3D MR and CT datasets. Under strict semigroup enforcement, our model achieves near-perfect diffeomorphism (near-zero folding) and significantly outperforms existing diffeomorphic methods, while remaining competitive with leading non-diffeomorphic deformable models. When the regularization is relaxed, the same architecture functions as a deformable method and substantially surpasses state-of-the-art non-diffeomorphic approaches in registration accuracy. These results demonstrate that continuous-time deformation modeling, guided solely by our semigroup-based regularization, yields a unified framework capable of both rigorously diffeomorphic mapping and state-of-the-art deformable registration.
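Editorial illustration (not taken from the paper): the abstract refers to a semigroup-based regularization. The flow of an autonomous ODE satisfies phi(phi(x, s), t) = phi(x, s + t), so one plausible way to picture such a penalty is as the mean squared gap between the two sides. The 1D sketch below assumes exactly that form; `semigroup_residual`, `exact_flow` and `linear_guess` are hypothetical names, and the actual SGDIR loss may differ.

```python
import math
from typing import Callable, Sequence

# Hypothetical time-embedded map: phi(x, t) returns the position of x after time t.
Flow = Callable[[float, float], float]


def semigroup_residual(phi: Flow, xs: Sequence[float], s: float, t: float) -> float:
    """Mean squared gap between phi(phi(x, s), t) and phi(x, s + t)."""
    gaps = [(phi(phi(x, s), t) - phi(x, s + t)) ** 2 for x in xs]
    return sum(gaps) / len(gaps)


def exact_flow(x: float, t: float) -> float:
    """Closed-form flow of dx/dt = 0.5 * x; satisfies the semigroup property."""
    return x * math.exp(0.5 * t)


def linear_guess(x: float, t: float) -> float:
    """First-order approximation of the same flow; violates the property."""
    return x * (1.0 + 0.5 * t)


if __name__ == "__main__":
    samples = [-1.0, 0.5, 2.0]
    print(semigroup_residual(exact_flow, samples, 0.3, 0.7))    # numerically zero
    print(semigroup_residual(linear_guess, samples, 0.3, 0.7))  # clearly positive
```

When the residual vanishes and phi(x, 0) = x, choosing s = -t gives phi(phi(x, t), -t) = x, which is the inverse consistency the abstract mentions.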
Paperid: 137,   Oral  
Authors: Sriram Narayanan, Mani Ramanagopal, Srinivasa G. Narasimhan
Title: Dual Band Video Thermography: Separating Time-Varying Reflection and Emission Near Ambient Conditions
Abstract: Long-wave infrared radiation captured by a thermal camera includes (a) emission from an object governed by its temperature and emissivity, and (b) reflected radiation from the surrounding environment. Separating these components is a long-standing challenge in thermography. Even when using multiple bands, the problem is under-determined without priors on emissivity. This difficulty is amplified in near ambient conditions, where emitted and reflected signals are of comparable magnitude. We present a dual-band video thermography framework that reduces this ambiguity by combining two complementary ideas at a per-pixel level: (i) spectral cues (ratio of emissivity between bands is unknown but fixed), and (ii) temporal cues (object radiation changes smoothly while background radiation changes rapidly). We derive an image formation model and an algorithm to jointly estimate the object's emissivity at each band, and the time-varying object and background temperatures. Experiments with calibrated and uncalibrated emissivities in everyday scenes (e.g., coffee pot heating up, palm print on mirrors dissipating, reflections of moving people) demonstrate robust separation and recovery of temperature fields. We will release code and data upon acceptance.
Paperid: 138,   Oral  
Authors: Zhicheng Liang, Haoyi Yu, Boyan Li, Dayou Zhang, Zijian Cao, Tianyi Gong, Junhua Liu, Shuguang Cui, Fangxin Wang
Title: 3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects
Abstract: Accurate 3D reconstruction of objects with reflective, transparent, or low-texture surfaces remains a significant challenge. Such materials often violate key assumptions in multi-view reconstruction pipelines, such as photometric consistency and the reliance on distinct geometric texture cues. Existing datasets primarily focus on diffuse, textured objects, thereby offering limited insight into performance under real-world material complexities. In this paper, we introduce 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB that is specifically designed to benchmark and advance 3D vision methods for these challenging materials. 3DReflecNet combines two types of data: over 100,000 synthetic instances generated via physically-based rendering of more than 10,000 shapes, and over 1,000 real-world objects scanned using consumer RGB-D devices. Together, these data consist of more than 7 million multi-view frames. It encompasses diverse materials, complex lighting conditions, and a wide range of geometric forms, including shapes generated from both real and LLM-synthesized 2D images using diffusion-based methods. To support robust evaluation, we design benchmarks for four core tasks: image matching, reflection removal, structure-from-motion, and novel view synthesis. Through extensive experiments, we show that state-of-the-art methods struggle to maintain accuracy across these settings, highlighting the need for more resilient 3D vision models. We release the dataset, baselines, and evaluation suite to facilitate progress in this direction; they can be accessed in the supplementary materials.
Paperid: 139,   Oral  
Authors: Jeonggon Kim, Heejoon Moon, Je Hyeong Hong
Title: Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization
Abstract: Privacy-Preserving Image Queries (PPIQ) are an emerging mechanism for cloud-based visual localization, enabling pose estimation from obfuscated features instead of private images or raw keypoints. However, the main approaches for PPIQ, primarily geometry-based and segmentation-based obfuscation, both suffer from vulnerabilities to recent privacy attacks. In particular, a fundamental limitation of geometry-based obfuscation is that the spatial distribution of obfuscated neighboring lines still effectively surrounds the original keypoint location, providing exploitable cues for recovering the original points. We revisit this geometric paradigm and introduce Dual Convergent Lines (DCL), a novel keypoint obfuscation method demonstrating strong resilience against such attacks. DCL places two fixed anchors on a central partition line and lifts each keypoint to a line originating from one of them, with the active anchor determined by the keypoint's location. This arrangement invalidates the geometry-recovery attack by making its optimization ill-posed: neighboring lines either misleadingly converge to one anchor, yielding a trivial solution, or become near-parallel at the partition boundary, yielding an unstable high-variance solution. Both outcomes thwart point recovery. DCL is also compatible with an existing line-based solver, enabling deployment in traditional localization pipelines. Experiments on both indoor and large-scale outdoor datasets demonstrate DCL's robustness against privacy attacks, efficiency, and scalability, while achieving practical localization performance.
Paperid: 140,   Oral  
Authors: Linjie Qu, Jin Xiao, Xiangrong Liu, Changming Sun, Hui Cui, Yuqi Fang, Ran Su, Qiangguo Jin, leyi wei
Title: MDCS-MoAME: Multi-directional Composite Scanning with Mixture of Attention and Mamba Experts for Cancer Survival Prediction
Abstract: Multimodal learning approaches that integrate pathological images with genomic profiles have significantly enhanced the accuracy of survival prediction tasks. However, previous methods often struggle to effectively process long-range gigapixel whole slide images (WSIs) and sparse genomic profiles due to the limitations of conventional scanning strategies to serialize data and the complex and heterogeneous nature of the modalities. Inspired by recent advancements in Mamba and mixture of experts (MoE), we propose a novel multi-directional composite scanning strategy with mixture of attention and Mamba experts (MDCS-MoAME) for cancer survival prediction. Specifically, we introduce a multi-directional composite scanning (MDCS) strategy to both WSIs and genomic profiles, and use the Mamba encoder to process intra-modal representations at the region, patch, and gene level, ensuring sufficient utilization of the intrinsic information within each modality. To further capture heterogeneous inter-modal representations, we introduce mixture of attention and Mamba experts (MoAME), which dynamically selects tailored experts to model complex inter-modal correlations, flexibly focusing on the interactions between modalities. Finally, we introduce alignment constraints to recalibrate inter-modal interactions and reduce intra- and inter-modal representation redundancy, enhancing its discriminative power for comprehensive survival analysis. Experimental results on five publicly available datasets demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance. Our code is included in the supplementary material.
Paperid: 141,   Oral  
Authors: Shai Bagon, Matan Kichler, Mark Sheinin
Title: Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations
Abstract: Optical vibration sensing enables recovering the scene sound directly from the surface vibration of nearby objects, turning everyday objects into "visual microphones". However, most prior methods have focused on capturing the vibrations of specific objects with highly favorable vibration responses. These include objects where the surface vibrations are generated by the object itself (e.g., speaker membrane or guitar body) or objects consisting of a thin membrane which is highly reactive to sound (e.g., a chip bag or the leaf of a plant). In this paper, we tackle sound recovery for a more challenging class of solid objects whose vibration responses are poor or highly resonant. We simultaneously capture vibrations for multiple surface points on the object using a speckle-based vibrometry imaging system. Then, we derive a novel physics-guided vibration formation model that relates the scene sound source to the captured multi-point multi-axis vibrations via the object's vibrational modes. The model is then used to reverse the resonant transfer function of the vibrating object, fusing the plurality of vibration signals to estimate the original sound source of the scene. We evaluate our approach by recovering sound from a variety of everyday objects, demonstrating that it significantly outperforms traditional single-point speckle vibrometry in challenging scenarios where the latter performs poorly.
Authors: Yuxuan Han, Xin Ming, Tianxiao Li, Zhuofan Shen, Qixuan Zhang, Lan Xu, Feng Xu
Title: WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering
Abstract: Existing methods achieve high-quality facial appearance capture under controllable lighting, which increases capture cost and limits usability. We propose WildCap, a novel method for high-quality facial appearance capture from a smartphone video recorded in the wild. To disentangle high-quality reflectance from complex lighting effects in in-the-wild captures, we propose a novel hybrid inverse rendering framework. Specifically, we first apply a data-driven method, i.e., SwitchLight, to convert the captured images into more constrained conditions and then adopt model-based inverse rendering. However, unavoidable local artifacts in network predictions, such as shadow-baking, are non-physical and thus hinder accurate inverse rendering of lighting and material. To address this, we propose a novel texel grid lighting model to explain non-physical effects as clean albedo illuminated by local physical lighting. During optimization, we jointly sample a diffusion prior for reflectance maps and optimize the lighting, effectively resolving scale ambiguity between local lights and albedo. Our method achieves significantly better results than prior art in the same capture setup, closing the quality gap between in-the-wild and controllable recordings by a large margin.
PaperID: 143,   Poster  https://arxiv.org/pdf/2506.03079     GitHub GitHub GitHub
Authors: Xiuyu Yang, Bohan Li, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Zheng Zhu, Xin Jin, Hang Zhao, Hao Zhao
Title: ORV: 4D Occupancy-centric Robot Video Generation
Abstract: Recent embodied intelligence suffers from data scarcity, while conventional simulators lack visual realism. Controllable video generation is emerging as a promising data engine, yet current action-conditioned methods still fall short: generated videos are limited in fidelity and temporal consistency, poorly aligned with controls, and often constrained to single-view settings. We attribute these issues to the representational gap between sparse control inputs and dense pixel outputs. Thus, we introduce ORV, a 4D occupancy-centric framework for robot video generation that couples action priors with occupancy-derived visual priors. Concretely, we align chunked 7-DoF actions with video latents via an Action-Expert AdaLN modulation, and inject 2D renderings of 4D semantic occupancy into the generation process as soft guidance. Meanwhile, a central obstacle is the lack of occupancy data for embodied scenarios; we therefore curate ORV-Data, a large-scale, high-quality 4D semantic occupancy dataset of robot manipulation. Across BridgeV2, DROID, and RT-1, ORV improves video generation quality and controllability, achieving 18.8% lower FVD than state of the art, +3.5% success rate on visual planning, and +6.4% success rate on policy learning. Beyond single-view generation, ORV natively supports multiview consistent synthesis and enables simulation-to-real transfer despite significant domain gaps. Code, models, and data will be released upon acceptance.
PaperID: 144,   Poster  https://arxiv.org/pdf/2511.19365     GitHub GitHub GitHub
Authors: Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, Qi Tian
Title: DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Abstract: Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAE in the two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256×256) and 2.22 (512×512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison.
PaperID: 145,   Poster  https://arxiv.org/pdf/2512.16913     GitHub GitHub GitHub
Authors: Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, Lu Qi
Title: Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Abstract: In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. Code and models will be publicly released.
PaperID: 146,   Poster  https://arxiv.org/pdf/2603.12382     GitHub GitHub GitHub
Authors: Mohamad Alansari, Naufal Suryanto, Divya Velayudhan, Sajid Javed, Naoufel Werghi, Muzammal Naseer
Title: SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
Abstract: Multimodal large language models (MLLMs) have advanced from image-level reasoning to pixel-level grounding, but extending these capabilities to videos remains challenging as models must achieve spatial precision and temporally consistent reference tracking. Existing video MLLMs often rely on a static segmentation token ([SEG]) for frame-wise grounding, which provides semantics but lacks temporal context, causing spatial drift, identity switches, and unstable initialization when objects move or reappear. We introduce SPARROW, a pixel-grounded video MLLM that unifies spatial accuracy and temporal stability through two key components: (i) Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and (ii) a dual-prompt design that decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding. SPARROW is supported by a curated referential video dataset of 30,646 videos and 45,231 Q&A pairs and operates end-to-end without external detectors via a class-agnostic SAM2-based proposer. Integrated into three recent open-source video MLLMs (UniPixel, GLUS, and VideoGLaMM), SPARROW delivers consistent gains across six benchmarks, improving up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG. These results demonstrate that SPARROW substantially improves referential stability, spatial precision, and temporal coherence in pixel-grounded video understanding. Code, datasets, and models will be released.
PaperID: 147,   Poster  https://arxiv.org/pdf/2602.19611     GitHub GitHub
Authors: Mingxiu Cai, Zhe Zhang, Gaochang Wu, Tianyou Chai, Xiatian Zhu
Title: RAID: Retrieval-Augmented Anomaly Detection
Abstract: Unsupervised Anomaly Detection (UAD) aims to identify abnormal regions by establishing correspondences between test images and normal templates. Existing methods primarily rely on image reconstruction or template retrieval but face a fundamental challenge: matching between test images and normal templates inevitably introduces noise due to intra-class variations, imperfect correspondences, and limited templates. Observing that Retrieval-Augmented Generation (RAG) leverages retrieved samples directly in the generation process, we reinterpret UAD through this lens and introduce RAID, a retrieval-augmented UAD framework designed for noise-resilient anomaly detection and localization. Unlike standard RAG that enriches context or knowledge, we focus on using retrieved normal samples to guide noise suppression in anomaly map generation. RAID retrieves class-, semantic-, and instance-level representations from a hierarchical vector database, forming a coarse-to-fine pipeline. A matching cost volume correlates the input with retrieved exemplars, followed by a guided Mixture-of-Experts (MoE) network that leverages the retrieved samples to adaptively suppress matching noise and produce fine-grained anomaly maps. RAID achieves state-of-the-art performance across full-shot, few-shot, and multi-dataset settings on MVTec, VisA, MPDD, and BTAD benchmarks. Code and models will be released.
PaperID: 148,   Poster  https://arxiv.org/pdf/2603.21295     GitHub GitHub
Authors: Jiazhong Cen, Jiemin Fang, Sikuang Li, Guanjun Wu, Chen Yang, Taoran Yi, Zanwei Zhou, Zhikuan Bao, Lingxi Xie, Wei Shen, Qi Tian
Title: Text-Image Conditioned 3D Generation
Abstract: High-quality 3D assets are critical for VR/AR, industrial design, and entertainment, driving growing interest in generative models that can create 3D content from user-provided prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models deliver high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, whereas text-conditioned models benefit from broad semantic guidance yet lack low-level visual detail. This restricts how users can express their intent and raises a natural question: can the two modalities be combined to yield more flexible and faithful 3D generation? Our diagnostic study shows that even a simple late fusion of text- and image-conditioned predictions improves over single-modality models, evidencing strong cross-modal complementarity. Building on this finding, we formalize the task of Text–Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification during generation. To address this task, we introduce TIGON, a minimalist dual-branch baseline that maintains separate image- and text-conditioned backbones with lightweight cross-modal fusion. Extensive experiments demonstrate that text–image conditioning yields consistent gains over single-modality methods, suggesting complementary vision–language guidance as a promising direction for future 3D generation research.
PaperID: 149,   Poster  https://arxiv.org/pdf/2603.19076     GitHub GitHub
Authors: Moyang Li, Zihan Zhu, Marc Pollefeys, Daniel Barath
Title: DROID-SLAM in the Wild
Abstract: We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 8 FPS. The source code will be publicly released.
PaperID: 150,   Poster  https://arxiv.org/pdf/2603.00947     GitHub GitHub
Authors: Zhenchen Wan, Ce Chen, Runqi Lin, Jiaxin Huang, Tianxi Chen, Yanwu Xu, Tongliang Liu, Mingming Gong
Title: Mobile-VTON: High-Fidelity On-Device Virtual Try-On
Abstract: Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present Mobile-VTON, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. Mobile-VTON introduces a modular TeacherNet–GarmentNet–TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, Mobile-VTON achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at 1024×768 show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.
PaperID: 151,   Poster  https://arxiv.org/pdf/2602.20417     GitHub GitHub
Authors: Aryan Garg, Sizhuo Ma, Mohit Gupta
Title: gQIR: Generative Quanta Image Reconstruction
Abstract: Capturing high-quality images from only a few detected photons is a fundamental challenge in computational imaging. Single-photon avalanche diode (SPAD) sensors promise high-quality imaging in regimes where conventional cameras fail, but raw quanta frames contain only sparse, noisy, binary photon detections. Recovering a coherent image from a burst of such frames requires handling alignment, denoising, and demosaicing (for color) under noise statistics far outside those assumed by standard restoration pipelines or modern generative models. We present an approach that adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging. Our method leverages the structural and semantic priors of internet-scale diffusion models while introducing mechanisms to handle Bernoulli photon statistics. By integrating latent-space restoration with burst-level spatio-temporal reasoning, our approach produces reconstructions that are both photometrically faithful and perceptually pleasing, even under high-speed motion. We evaluate the method on synthetic benchmarks and new real-world datasets, including the first color SPAD burst dataset and a challenging Deforming (XD) video benchmark. Across all settings, the approach substantially improves perceptual quality over classical and modern learning-based baselines, demonstrating the promise of adapting large generative priors to extreme photon-limited sensing.
PaperID: 152,   Poster  https://arxiv.org/pdf/2511.13019     GitHub GitHub
Authors: Zheyuan Hu, Chieh-Hsin Lai, Ge Wu, Yuki Mitsufuji, Stefano Ermon
Title: MeanFlow Transformers with Representation Autoencoders
Abstract: MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. In practice, it is often used as a latent MF by leveraging the pre-trained Stable Diffusion variational autoencoder (SD-VAE) for high-dimensional data modeling. However, MF training remains computationally demanding and is often unstable. During inference, the SD-VAE decoder dominates the generation cost, and MF depends on complex guidance hyperparameters for class-conditional generation. In this work, we develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE), where a pre-trained vision encoder (e.g., DINO) provides semantically rich latents paired with a lightweight decoder. We observe that naive MF training in the RAE latent space suffers from severe gradient explosion. To stabilize and accelerate training, we adopt Consistency Mid-Training for trajectory-aware initialization and use a two-stage scheme: distillation from a pre-trained flow matching teacher to speed convergence and reduce variance, followed by an optional bootstrapping stage with a one-point velocity estimator to further reduce deviation from the oracle mean flow. This design removes the need for guidance, simplifies training configurations, and reduces computation in both training and sampling. Empirically, our method achieves a 1-step FID of 2.03, outperforming vanilla MF’s 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256. We further scale our approach to ImageNet 512, achieving a competitive one-step FID of 3.23 with the lowest GFLOPS among all baselines. Code and proofs are available in the supplementary material.
PaperID: 153,   Poster  https://arxiv.org/pdf/2506.18871     GitHub GitHub
Authors: Chenyuan Wu, Jiahao Wang, PengFei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ziyi Xia, Ze Liu, Chaofan Li, Haoge Deng, Kun Luo, Bo Zhang, Jiajun Zhang, Dong Liu, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu
Title: OmniGen2: Towards Instruction-Aligned Multimodal Generation
Abstract: Multimodal generative models can process instructions in various modalities and demonstrate outstanding performance across a wide range of image generation tasks. However, their robustness in complex real-world scenarios remains limited due to insufficient generalized instruction alignment. We introduce OmniGen2, a unified multimodal generator designed to follow complex, fine-grained instructions. Our core contribution is a two-stage design that first builds a strong, world-knowledge-grounded foundation model and then aligns it using a progressive, multi-task instruction tuning strategy. The foundation model features a streamlined architecture with decoupled decoding for versatile multimodal generation and a novel positional encoding scheme to improve learning efficiency. We ground this model in real-world knowledge using large-scale data construction pipelines. Building on this foundation, we propose a progressive, reinforcement-based alignment process. This phase carefully schedules training tasks and reward signals to foster cross-task knowledge transfer, significantly improving the model's instruction-following capabilities. Our models demonstrate competitive performance on standard benchmarks and our dedicated in-context generation benchmark, OmniContext. We will release our models, code, benchmark, and training datasets to catalyze future research in building more capable and instruction-aligned generative models.
PaperID: 154,   Poster  https://arxiv.org/pdf/2510.14255     GitHub GitHub
Authors: Liao Shen, Wentao Jiang, Yiran Zhu, Jiahe Li, Tiezheng Ge, Zhiguo Cao, Bo Zheng
Title: Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
Abstract: Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation accounts for a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on the Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Code will be released.
PaperID: 155,   Poster  https://arxiv.org/pdf/2604.01081     GitHub GitHub
Authors: Yuheng Zhang, Mengfei Duan, Kunyu Peng, Yuhang Wang, Di Wen, Danda Paudel, Luc Van Gool, Kailun Yang
Title: ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction
Abstract: 3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug-and-play method that couples prototype-guided refinement with training-free OOD scoring. ProOOD comprises (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features, (ii) prototype-guided tail mining that strengthens rare-class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel-level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state-of-the-art performance on both in-distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it surpasses baselines by +3.57% mIoU overall and +24.80% tail-class mIoU; on VAA-KITTI, it improves AuPRC_r by +19.34 points, with consistent gains across benchmarks. These improvements yield more calibrated occupancy estimates and more reliable OOD detection in safety-critical urban driving. The source code will be made publicly available.
PaperID: 156,   Poster  https://arxiv.org/pdf/2604.04063     GitHub GitHub
Authors: Junsheng Zhou, Zhifan Yang, Liang Han, Wenyuan Zhang, Kanle Shi, Shenkun Xu, Yu-Shen Liu
Title: 4C4D: 4 Camera 4D Gaussian Splatting
Abstract: This paper tackles the challenge of recovering 4D dynamic scenes from videos captured by as few as four portable cameras. Learning to model scene dynamics for temporally consistent novel-view rendering is a foundational task in computer graphics, where previous works often require dense multi-view captures using camera arrays of dozens or even hundreds of views. We propose 4C4D, a novel framework that enables high-fidelity 4D Gaussian Splatting from video captures of extremely sparse cameras. Our key insight is that geometric learning under sparse settings is substantially more difficult than modeling appearance. Driven by this observation, we introduce a Neural Decaying Function on Gaussian opacities for enhancing the geometric modeling capability of 4D Gaussians. This design mitigates the inherent imbalance between geometry and appearance modeling in 4DGS by encouraging the 4DGS gradients to focus more on geometric learning. Extensive experiments across sparse-view datasets with varying camera overlaps show that 4C4D achieves superior performance over prior art.
PaperID: 157,   Poster  https://arxiv.org/pdf/2604.00267     GitHub GitHub
Authors: Xinpeng Li, Bolin Lai, Hardy Chen, Shijian Deng, Cihang Xie, Yuyin Zhou, James Rehg, Yapeng Tian
Title: Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
Abstract: We introduce Omni-MMSI, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech. The task involves two tightly coupled goals: extracting identity-attributed social cues (e.g., who is speaking what) and reasoning about the social interaction (e.g., whom the speaker refers to). This task is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that assume identity-attributed social cues are perfectly provided, Omni-MMSI reflects realistic scenarios where AI assistants must perceive from raw multi-modal streams and reason over extracted social cues. However, existing pipelines and multi-modal LLMs perform poorly in this setting because they lack reliable identity attribution ability, which leads to inaccurate social cues and weak interaction reasoning. To address this challenge, we propose Omni-MMSI-R, a reference-based pipeline that uses reference audio-vision pairs to produce identity-attributed social cues and leverages curated chain-of-thought supervision for reasoning on reference-based inputs. To enable reference-based research, we construct participant-level reference pairs and curated reasoning annotations on top of the existing datasets. Extensive experiments demonstrate that Omni-MMSI-R consistently outperforms advanced multi-modal LLMs and counterparts in Omni-MMSI.
PaperID: 158,   Poster  https://arxiv.org/pdf/2602.10113     GitHub GitHub
Authors: Mingyang Wu, Ashirbad Mishra, Soumik Dey, Shuo Xing, Naveen Ravipati, Hansi Wu, Binbin Li, Zhengzhong Tu
Title: ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
Abstract: Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual–geometric encoder as well as a text–visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms in multiple metrics, with the best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios.
PaperID: 159,   Poster  https://arxiv.org/pdf/2503.14505     GitHub GitHub
Authors: Susung Hong, Ira Kemelmacher-Shlizerman, Brian Curless, Steve Seitz
Title: MusicInfuser: Making Video Diffusion Listen and Dance
Abstract: We introduce MusicInfuser, an approach that aligns pretrained text-to-video diffusion models to generate high-quality dance videos synchronized with specified music tracks. Rather than training a multimodal audio-video or audio-motion model from scratch, our method demonstrates how existing video diffusion models can be efficiently adapted to align with musical inputs. We propose a novel layer-wise adaptability criterion based on a guidance-inspired constructive influence function to select adaptable layers, significantly reducing training costs while preserving rich prior knowledge, even with limited, specialized datasets. Experiments show that MusicInfuser effectively bridges the gap between music and video, generating novel and diverse dance movements that respond dynamically to music. Furthermore, our framework generalizes well to unseen music tracks, longer video sequences, and unconventional subjects, outperforming baseline models in consistency and synchronization. All of this is achieved without requiring motion data, with training completed on a single GPU within a day.
PaperID: 160,   Poster  https://arxiv.org/pdf/2512.10284     GitHub GitHub
Authors: Yixin Wan, Lei Ke, Wenhao Yu, Kai-Wei Chang, Dong Yu
Title: MotionEdit: Benchmarking and Learning Motion-Centric Image Editing
Abstract: We introduce MotionEdit, a novel dataset for motion-centric image editing, the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.
PaperID: 161,   Poster  https://arxiv.org/pdf/2510.27234     GitHub GitHub
Authors: Jingnan Gao, Zhe Wang, Xianze Fang, Xingyu Ren, Zhuo Chen, Shengqi Liu, Yuhao Cheng, Jiangjing Lyu, Xiaokang Yang, Yichao Yan
Title: MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts
Abstract: Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks. In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations. However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary data aspects and enhance both scalability and adaptability. Aiming to improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation. In addition, it integrates dense semantic features with globally aligned 3D backbone representations for high-fidelity surface normal prediction. MoRE is further optimized with tailored loss functions to ensure robust learning across diverse inputs and multiple geometric tasks. Extensive experiments demonstrate that MoRE achieves state-of-the-art performance across multiple benchmarks and supports effective downstream applications without extra computation.
PaperID: 162,   Poster  https://arxiv.org/pdf/2603.12760     GitHub GitHub
Authors: Xiaoyu Li, Yuhang Liu, Xuanshuo Kang, Zheng Luo, Fangqi Lou, 吴晓华, Zihan Xiong
Title: HiFICL: High-Fidelity In-Context Learning for Multimodal Tasks
Abstract: In-Context Learning (ICL) is a significant paradigm for Large Multimodal Models (LMMs), using a few in-context demonstrations (ICDs) for new task adaptation. However, its performance is sensitive to demonstration configurations and computationally expensive. Mathematically, the influence of these demonstrations can be decomposed into a dynamic mixture of the standard attention output and the context values. Current approximation methods simplify this process by learning a "shift vector". Inspired by the exact decomposition, we introduce High-Fidelity In-Context Learning (HiFICL) to more faithfully model the ICL mechanism. HiFICL consists of three key components: 1) a set of "virtual key-value pairs" injected into each attention head to act as a learnable context, 2) a low-rank factorization for stable and regularized training, and 3) a simple end-to-end training objective. From another perspective, this mechanism constitutes a form of context-aware Parameter-Efficient Fine-Tuning (PEFT). Extensive experiments show that HiFICL consistently outperforms existing approximation methods on several multimodal benchmarks.
PaperID: 163,   Poster  https://arxiv.org/pdf/2511.16671     GitHub GitHub
Authors: Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, Pheng-Ann Heng
Title: Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
Abstract: Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies, zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation.
PaperID: 164,   Poster  https://arxiv.org/pdf/2512.16635     GitHub GitHub
Authors: Danxu Liu, Di Wang, Hebaixu Wang, Haoyang Chen, Wentao Jiang, Yilin Cheng, Haonan Guo, Wei Cui, Jing Zhang
Title: SARMAE: Masked Autoencoder for SAR Representation Learning
Abstract: Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first million-scale SAR dataset, with additional paired optical images, to enable large-scale pre-training. Building upon this, we design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise into masked autoencoders to facilitate noise-aware and robust representation learning. Furthermore, we introduce Semantic Anchor Representation Constraint (SARC), which leverages paired optical priors to align SAR features and ensure semantic consistency. Extensive experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks. Code and models will be available.
PaperID: 165,   Poster  https://arxiv.org/pdf/2603.27758     GitHub GitHub
Authors: Junwei Zheng, Ruize Dai, Ruiping Liu, Zichao Zeng, Yufan Chen, Fangjinhua Wang, Kunyu Peng, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen
Title: RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization
Abstract: Metric Cross-View Geo-Localization (MCVGL) aims to estimate the 3-DoF camera pose (position and heading) by matching ground and satellite images. In this work, instead of pinhole and satellite images, we study robust MCVGL using holistic panoramas and OpenStreetMap (OSM). To this end, we establish a large-scale MCVGL benchmark dataset, CV-RHO, with over 2.7M images under different weather and lighting conditions, as well as sensor noise. Furthermore, we propose a model termed RHO with a two-branch Pin-Pan architecture for accurate visual localization. A Split-Undistort-Merge (SUM) module is introduced to address the panoramic distortion, and a Position-Orientation Fusion (POF) mechanism is designed to enhance the localization accuracy. Extensive experiments prove the value of our CV-RHO dataset and the effectiveness of the RHO model, with a significant performance gain up to 20% compared with the state-of-the-art baselines. The dataset, model, and code will be made publicly available.
PaperID: 166,   Poster  https://arxiv.org/pdf/2512.15603     GitHub GitHub
Authors: Shengming Yin, Zekai Zhang, Zecheng Tang, Kaiyuan Gao, Xiao Xu, Kun Yan, Jiahao Li, Yilei Chen, Yuxiang Chen, Heung-Yeung Shum, Lionel Ni, Junyang Lin, Chenfei Wu
Title: Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition
Abstract: Recent visual generative models often struggle with consistency during image editing due to the entangled nature of raster images, where all visual content is fused into a single canvas. In contrast, professional design tools employ layered representations, allowing isolated edits while preserving consistency. Motivated by this, we propose Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling inherent editability, where each RGBA layer can be independently manipulated without affecting other content. To support variable-length decomposition, we introduce three key components: (1) an RGBA-VAE to unify the latent representations of RGB and RGBA images; (2) a VLD-MMDiT (Variable Layers Decomposition MMDiT) architecture capable of decomposing a variable number of image layers; and (3) a Multi-stage Training strategy to adapt a pretrained image generation model into a multilayer image decomposer. Furthermore, to address the scarcity of high-quality multilayer training images, we build a pipeline to extract and annotate multilayer images from Photoshop documents (PSD). Experiments demonstrate that our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing.
PaperID: 167,   Poster  https://arxiv.org/pdf/2511.18452     GitHub GitHub
Authors: Loick Chambon, Paul Couairon, Éloi Zablocki, Alexandre Boulch, Nicolas Thome, Matthieu Cord
Title: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
Abstract: Vision Foundation Models (VFMs) extract spatially downsampled representations, which poses challenges for pixel-level tasks that require fine-grained details. Existing approaches face a trade-off: classical filters are fast and broadly applicable but use fixed forms and feature-independent guidance, while modern upsamplers achieve stronger accuracy with learnable, VFM-specific guidance but require retraining per VFM. We introduce Neighborhood Attention Filtering (NAF), bridging classical filtering with modern upsamplers. Guided solely by the high-resolution input image, NAF learns adaptive content and spatial weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE). NAF is VFM-agnostic and zero-shot: once trained, it upsamples features from any VFM without retraining, being the first VFM-agnostic architecture to outperform VFM-specific upsamplers by achieving state-of-the-art scores on multiple downstream tasks. It remains highly efficient, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, showing its versatility. We open-source our code and checkpoints.
PaperID: 168,   Poster  https://arxiv.org/pdf/2505.24862     GitHub GitHub
Authors: Cailin Zhuang, Ailin Huang, Hu Yaoqi, Jingwei Wu, Wei Cheng, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Zhewei Huang, Gang Yu, Chi Zhang
Title: ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
Abstract: Story visualization aims to generate coherent image sequences that faithfully depict a narrative and align with character references. Despite progress in generative models, existing benchmarks are narrow in scope, often limited to short prompts, lacking character reference, or single-image cases, and fail to capture real-world storytelling complexity. This hinders a nuanced understanding of model capabilities and limitations. We present ViStoryBench, a comprehensive benchmark designed to evaluate story visualization models across diverse narrative structures, visual styles, and character settings. The benchmark features richly annotated multi-shot scripts derived from curated stories spanning literature, film, and folklore. Large language models assist in story summarization and script generation, with all outputs verified by humans to ensure coherence and fidelity. Character references are carefully curated to maintain intra-story consistency across varying artistic styles. To enable thorough evaluation, ViStoryBench introduces a set of automated metrics that assess character consistency, style similarity, prompt alignment, aesthetic quality, and generation artifacts such as copy-paste behavior. These metrics are validated through human studies, and are used to benchmark a broad range of open-source and commercial models. ViStoryBench offers a multi-dimensional evaluation suite that facilitates systematic analysis and fosters future progress in visual storytelling.
PaperID: 169,   Poster  https://arxiv.org/pdf/2506.07998     GitHub GitHub
Authors: Boya Zeng, Yida Yin, Zhiqiu Xu, Zhuang Liu
Title: Generative Modeling of Weights: Generalization or Memorization?
Abstract: Generative models, with their success in image and video generation, have recently been explored for synthesizing effective neural network weights. These approaches take trained neural network checkpoints as training data, and aim to generate high-performing neural network weights during inference. In this work, we examine four representative, well-known methods in this emerging area on their ability to generate novel model weights, i.e., weights that are different from the checkpoints seen during training. Contrary to claims in prior work, we find that these methods synthesize weights largely by memorization: they produce either replicas, or at best simple interpolations, of the training checkpoints. Current methods fail to outperform simple baselines, such as adding noise to the weights or taking a simple weight ensemble, in obtaining different and simultaneously high-performing models. Our further results suggest that the memorization potentially resulted from limited data, overparameterized models, and the underuse of structural priors specific to weight data. Our findings highlight the need for more careful design and evaluation of generative models in new domains.
PaperID: 170,   Poster  https://arxiv.org/pdf/2601.17468     GitHub GitHub
Authors: Chia-Ming Lee, Yu-Fan Lin, Jin-Hui Jiang, Yu-Jou Hsiao, Chih-Chung Hsu, Yu-Lun Liu
Title: ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation
Abstract: Single Image Reflection Separation (SIRS) disentangles mixed images into transmission and reflection layers. Existing methods suffer from transmission-reflection confusion under nonlinear mixing, particularly in deep decoder layers, due to implicit fusion mechanisms and inadequate multi-scale coordination. We propose ReflexSplit, a dual-stream framework with three key innovations. (1) Cross-scale Gated Fusion (CrGF) adaptively aggregates semantic priors, texture details, and decoder context across hierarchical depths, stabilizing gradient flow and maintaining feature consistency. (2) Layer Fusion-Separation Blocks (LFSB) alternate between fusion for shared structure extraction and differential separation for layer-specific disentanglement. Inspired by Differential Transformer, we extend attention cancellation to dual-stream separation via cross-stream subtraction. (3) Curriculum training progressively strengthens differential separation through depth-dependent initialization and epoch-wise warmup. Extensive experiments on synthetic and real-world benchmarks demonstrate state-of-the-art performance with superior perceptual quality and robust generalization.
PaperID: 171,   Poster  https://arxiv.org/pdf/2603.25209     GitHub GitHub
Authors: Jiahao Tian, Chenxi Song, Wei Cheng, Chi Zhang
Title: Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction
Abstract: Generating long videos using pretrained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and context-length O.O.D. To address these challenges, we propose a novel training-free, layer-adaptive framework. The core of our approach is the observation that different layers within the model exhibit varying sensitivities to these two O.O.D issues. We first introduce a systematic probing procedure to quantify each layer's sensitivity. Based on the results, we apply a tailored, layer-wise strategy. For layers sensitive to relative positions, we propose a novel multi-granularity video-based relative position re-encoding (VRPR) scheme. For layers sensitive to context length, we utilize a tiered sparse attention (TSA) mechanism combined with an attention sink. Extensive experiments show that our method achieves state-of-the-art performance in long video generation. Importantly, our framework can be seamlessly integrated into various leading video diffusion models without any additional training.
PaperID: 172,   Poster  https://arxiv.org/pdf/2512.07237     GitHub GitHub
Authors: Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, Jianfei Cai
Title: Unified Camera Positional Encoding for Controlled Video Generation
Abstract: Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three-dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real-world cameras. We introduce Relative Ray Encoding, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera-controlled text-to-video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for Absolute Orientation Encoding, enabling full control over the initial camera orientation. Together, these designs form UCPE (Unified Camera Positional Encoding), which integrates into a pretrained video Diffusion Transformer through a lightweight spatial attention adapter, adding less than 1% trainable parameters while achieving state-of-the-art camera controllability and visual fidelity. To facilitate systematic training and evaluation, we construct a large video dataset covering a wide range of camera motions and lens types. Extensive experiments validate the effectiveness of UCPE in camera-controllable video generation and highlight its potential as a general camera representation for Transformers across future multi-view, video, and 3D tasks.
PaperID: 173,   Poster  https://arxiv.org/pdf/2511.23332     GitHub GitHub
Authors: Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, Jing Zhang
Title: UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes
Abstract: Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image–mask–instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. The datasets and source code will be publicly released.
PaperID: 174,   Poster  https://arxiv.org/pdf/2604.09527     GitHub GitHub
Authors: Stefan Andreas Baumann, Jannik Wiese, Tommaso Martorella, Mahdi Kalayeh, Björn Ommer
Title: Envisioning the Future, One Step at a Time
Abstract: Accurately anticipating how complex, open-world scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence. We further introduce OWM, a benchmark for open-world motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-world future prediction both scalable and practical.
PaperID: 175,   Poster  https://arxiv.org/pdf/2603.00412     GitHub GitHub
Authors: Yuanhao Su, Shaofeng Zhang, Xiaosong Jia, Qi Fan
Title: PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models
Abstract: The development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to a significant degradation and loss of valuable geometric information in intermediate representations. To address these limitations, we propose PointAlign, a novel feature-level alignment regularization method. PointAlign explicitly supervises intermediate point cloud representations to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with visual input tokens via a consistency loss. By training only a lightweight alignment projector and LoRA adapters, PointAlign achieves explicit feature-level supervision with minimal computational overhead, effectively preventing geometric degradation. Extensive experiments on ModelNet40 and Objaverse datasets demonstrate that our method achieves 2.08 pp improvement on average for classification tasks, with a substantial 7.50 pp gain on the challenging open-vocabulary Objaverse classification task and 4.88 pp improvement on 3D object captioning evaluated by Qwen2-72B-Instruct, validating the effectiveness of PointAlign.
PaperID: 176,   Poster  https://arxiv.org/pdf/2512.02009     GitHub GitHub
Authors: Xian Ge, Yuling Pan, Yuhang Zhang, Xiang Li, Weijun Zhang, Dizhe Zhang, Zhaoliang Wan, Xin Lin, Xiangkai Zhang, Juntao Liang, Xiangtai Li, Jerett Jiang, Bo Du, Ming-Hsuan Yang, Lu Qi
Title: AirSim360: A Panoramic Simulation Platform within Drone View
Abstract: The field of 360-degree omnidirectional understanding has been receiving increasing attention for advancing spatial intelligence. However, the lack of large-scale and diverse data remains a major limitation. In this work, we propose AirSim360, a simulation platform for omnidirectional data from aerial viewpoints, enabling wide-ranging scene sampling with drones. Specifically, AirSim360 focuses on three key aspects: a render-aligned data and labeling paradigm for pixel-level geometric, semantic, and instance-level understanding; an interactive pedestrian-aware system for modeling human behavior; and an automated trajectory generation paradigm to support navigation tasks. Furthermore, we collect more than 60K panoramic samples and conduct extensive experiments across various tasks to demonstrate the effectiveness of our simulator. Unlike existing simulators, our work is the first to systematically model the 4D real world under an omnidirectional setting. The entire platform, including the toolkit, plugins, and collected datasets, will be made publicly available.
PaperID: 177,   Poster  https://arxiv.org/pdf/2603.26481     GitHub GitHub
Authors: Weihong Pan, XiaoYu Zhang, Zhuang Zhang, Zhichao Ye, Nan Wang, Haomin Liu, Guofeng Zhang
Title: SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
Abstract: High-quality 4D reconstruction enables photorealistic and immersive rendering of the dynamic real world. However, unlike static scenes that can be fully captured with a single camera, high-quality dynamic scenes typically require dense arrays of tens or even hundreds of synchronized cameras. The reliance on such costly lab setups severely limits practical scalability. To this end, we propose a sparse-camera dynamic reconstruction framework that exploits abundant yet inconsistent generative observations. Our key innovation is the Spatio-Temporal Distortion Field, which provides a unified mechanism for modeling inconsistencies in generative observations across both spatial and temporal dimensions. Building on this, we develop a complete pipeline that enables 4D reconstruction from sparse and uncalibrated camera inputs. We evaluate our method on multi-camera dynamic scene benchmarks, achieving spatio-temporally consistent high-fidelity renderings and significantly outperforming existing approaches.
PaperID: 178,   Poster  https://arxiv.org/pdf/2510.11712     GitHub GitHub
Authors: Haoran Feng, Dizhe Zhang, Xiangtai Li, Bo Du, Lu Qi
Title: DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training
Abstract: In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. We attribute the main challenges in preserving geometric fidelity and photorealism to the scarcity of large-scale, high-quality real-world panoramic data, in contrast to prior methods that emphasize model design. DiT360 comprises several key modules for inter-domain transformation and intra-domain augmentation, applied at both the pre-VAE image level and the post-VAE token level. At the image level, we incorporate cross-domain knowledge through perspective image guidance and panoramic refinement, which enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision is applied across multiple modules, which include circular padding for boundary continuity, yaw loss for rotational robustness, and cube loss for distortion awareness. Extensive experiments on text-to-panorama, inpainting, and outpainting tasks demonstrate that our method achieves better boundary consistency and image fidelity across eleven quantitative metrics. Our code will be released publicly.
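Circular padding, named in this abstract as the mechanism for boundary continuity, can be illustrated with a minimal sketch. It assumes a panorama stored as a list of rows whose left and right edges are adjacent on the sphere; the function name `circular_pad_width` is hypothetical and this is not the authors' token-level implementation.

```python
def circular_pad_width(rows, pad):
    """Pad every row on both sides with values wrapped around from the opposite edge."""
    if pad < 0:
        raise ValueError("pad must be non-negative")
    padded = []
    for row in rows:
        if pad > len(row):
            raise ValueError("pad must not exceed the row width")
        left = row[len(row) - pad:]   # rightmost columns wrap to the left side
        right = row[:pad]             # leftmost columns wrap to the right side
        padded.append(left + row + right)
    return padded

# A one-row "panorama" with four columns, padded by one column on each side.
example = circular_pad_width([[1, 2, 3, 4]], 1)
```

After padding, a local operator applied near the seam sees the pixels that are physically adjacent across the wrap, instead of zeros or a reflected copy.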
PaperID: 179,   Poster  https://arxiv.org/pdf/2511.20635     GitHub GitHub
Authors: Zhoujie Fu, Xianfang Zeng, Jinghong Lan, Xinyao Liao, Chen Cheng, Junyi Chen, Jiacheng Wei, Wei Cheng, Shiyu Liu, Yunuo Chen, Gang Yu, Guosheng Lin
Title: iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
Abstract: Pretrained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Our code and model weights will be made publicly available.
PaperID: 180,   Poster  https://arxiv.org/pdf/2603.12083     GitHub GitHub
Authors: Xiaolong Qian, Qi Jiang, Yao Gao, Lei Sun, Zhonghua Yi, Kailun Yang, Luc Van Gool, Kaiwei Wang
Title: Universal Computational Aberration Correction: A Comprehensive Benchmark Analysis
Abstract: Prevalent Computational Aberration Correction (CAC) methods are typically tailored to specific optical systems, leading to poor generalization and labor-intensive re-training for new lenses. Universal CAC paradigms trained on datasets encompassing diverse aberrations offer a promising solution to these challenges. However, efforts to develop universal CAC are still in their early stages due to the lack of a comprehensive benchmark that encompasses a sufficiently wide range of optical aberrations. Furthermore, it remains unclear which specific factors influence existing CAC methods and how these factors affect their performance. In this paper, we present comprehensive experiments and evaluations involving 24 image restoration and CAC algorithms, utilizing our newly proposed large-scale benchmark constructed via automatic optical design. The Optical Degradation Evaluator (ODE) is introduced as a novel framework to objectively assess the difficulty of CAC tasks, offering credible quantification of optical aberrations and enabling reliable evaluation. Drawing on our comparative analysis, we identify three key factors (prior utilization, network architecture, and training strategy) that most significantly influence CAC performance, and further investigate their respective effects. We believe that our benchmark, dataset, and observations contribute foundational insights to related areas and lay the groundwork for future investigations. Benchmarks, codes, and Zemax files will be available upon acceptance of the paper.
PaperID: 181,   Poster  https://arxiv.org/pdf/2603.22883     GitHub GitHub
Authors: Yue Ma, Xinyu Wang, Qianli Ma, Qinghe Wang, Mingzhe Zheng, Xiangpeng Yang, Hao Li, Chongbo Zhao, Jixuan Ying, Harry Yang, Hongyu Liu, Qifeng Chen
Title: Group Editing: Edit Multiple Images in One Go
Abstract: In this paper, we tackle the problem of performing consistent and unified modifications across a set of related images. This task is particularly challenging because these images may vary significantly in pose, viewpoint, and spatial layout. Achieving coherent edits requires establishing reliable correspondences across the images, so that modifications can be applied accurately to semantically aligned regions. To address this, we propose GroupEditing, a novel framework that builds both explicit and implicit relationships among images within a group. On the explicit side, we extract geometric correspondences using VGGT, which provides spatial alignment based on visual features. On the implicit side, we reformulate the image group as a pseudo-video and leverage the temporal coherence priors learned by pre-trained video models to capture latent relationships. To effectively fuse these two types of correspondences, we inject the explicit geometric cues from VGGT into the video model through a novel fusion mechanism. To support large-scale training, we construct GroupEditData, a new dataset containing high-quality masks and detailed captions for numerous image groups. Furthermore, to ensure identity preservation during editing, we introduce an alignment-enhanced RoPE module, which improves the model’s ability to maintain consistent appearance across multiple images. Finally, we present GroupEditBench, a dedicated benchmark designed to evaluate the effectiveness of group-level image editing. Extensive experiments demonstrate that GroupEditing significantly outperforms existing methods in terms of visual quality, cross-view consistency, and semantic alignment.
PaperID: 182,   Poster  https://arxiv.org/pdf/2603.23408     GitHub GitHub
Authors: Joëlle Hanna, Damian Falk, Stella X. Yu, Damian Borth
Title: GeoSANE: Learning Geospatial Representations From Models, Not Data
Abstract: Recent advances in remote sensing have led to an increase in the number of available foundation models, each trained on different modalities, datasets, and objectives, yet capturing only part of the vast geospatial knowledge landscape. While these models show strong results within their respective domains, their capabilities remain complementary rather than unified. Therefore, instead of choosing one model over another, we aim to combine their strengths into a single shared representation. We introduce GeoSANE, a geospatial model foundry that learns a unified neural representation from the weights of existing foundation models and task-specific models, able to generate novel neural network weights on demand. Given a target architecture, GeoSANE generates weights ready for finetuning for classification, segmentation, and detection tasks across multiple modalities. Models generated by GeoSANE consistently outperform their counterparts trained from scratch, match or surpass state-of-the-art remote sensing foundation models, and outperform models obtained through pruning or knowledge distillation when generating lightweight networks. Evaluations across ten diverse datasets and on GEO-Bench confirm its strong generalization capabilities. By shifting from pre-training to weight generation, GeoSANE introduces a new framework for unifying and transferring geospatial knowledge across models and tasks. Code is available (URL anonymized).
PaperID: 183,   Poster  https://arxiv.org/pdf/2604.16299     GitHub GitHub
Authors: Haoran Feng, Yifan Niu, Zehuan Huang, Yangtian Sun, Chunchao Guo, Yuxin Peng, Lu Sheng
Title: Repurposing 3D Generative Model for Autoregressive Layout Generation
Abstract: We introduce LaviGen, a framework that repurposes 3D generative models for 3D layout generation. Unlike previous methods that infer object layouts from textual descriptions, LaviGen operates directly in the native 3D space, formulating layout generation as an autoregressive process that explicitly models geometric relations and physical constraints among objects, producing coherent and physically plausible 3D scenes. To further enhance this process, we propose an adapted 3D diffusion model to integrate scene, object, and instruction information and employ a dual-guidance self-rollout distillation mechanism to improve efficiency and spatial accuracy. Extensive experiments on the LayoutVLM benchmark show LaviGen achieves superior 3D layout generation performance, with 19% higher physical plausibility than the state of the art and 65% faster computation. We will release our code publicly.
PaperID: 184,   Poster  https://arxiv.org/pdf/2601.17470     GitHub GitHub
Authors: Chia-Ming Lee, Yu-Fan Lin, Yu-Jou Hsiao, Jin-Hui Jiang, Yu-Lun Liu, Chih-Chung Hsu
Title: PhaSR: Generalized Image Shadow Removal with Physically Aligned Priors
Abstract: Shadow removal under diverse lighting conditions requires disentangling illumination from intrinsic reflectance, a challenge compounded when physical priors are not properly aligned. We propose PhaSR (Physically Aligned Shadow Removal), addressing this through dual-level prior alignment to enable robust performance from single-light shadows to multi-source ambient lighting. First, Physically Aligned Normalization (PAN) performs closed-form illumination correction via Gray-world normalization, log-domain Retinex decomposition, and dynamic range recombination, suppressing chromatic bias. Second, Geometric-Semantic Rectification Attention (GSRA) extends differential attention to cross-modal alignment, harmonizing depth-derived geometry with DINO-v2 semantic embeddings to resolve modal conflicts under varying illumination. Experiments show competitive performance in shadow removal with lower complexity and generalization to ambient lighting where traditional methods fail under multi-source illumination.
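Of the three closed-form steps listed for PAN, Gray-world normalization is a classic color-constancy technique and can be sketched on its own. The following is a minimal illustration of the textbook gray-world assumption only, not the authors' PAN module: it assumes an image given as a flat list of RGB tuples, and the function name `gray_world_normalize` is hypothetical.

```python
def gray_world_normalize(pixels):
    """Scale each RGB channel so that the three channel means become equal."""
    if not pixels:
        return []
    n = len(pixels)
    # Per-channel mean over the whole image.
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    # Gray-world assumption: the average scene color is achromatic.
    gray = sum(means) / 3.0
    # Leave a channel untouched if its mean is zero, to avoid dividing by zero.
    gains = [gray / m if m > 0 else 1.0 for m in means]
    return [tuple(p[c] * gains[c] for c in range(3)) for p in pixels]

# A bluish cast: channel means are 0.3, 0.6 and 0.9 before correction.
balanced = gray_world_normalize([(0.2, 0.4, 0.6), (0.4, 0.8, 1.2)])
```

The per-channel gains are global, so this step removes a uniform color cast but does nothing about spatially varying illumination, which is what the later steps in the abstract address.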
PaperID: 185,   Poster  https://arxiv.org/pdf/2511.16156     GitHub GitHub
Authors: jian ma, Qirong Peng, Xujie Zhu, Peixing Xie, Chen Chen, Haonan Lu
Title: Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers
Abstract: Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50% reduction in parameter count compared to the full model, with less than 3% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments.
PaperID: 186,   Poster  https://arxiv.org/pdf/2512.10416     GitHub GitHub
Authors: wenfei guan, Jilin Mei, Tong Shen, Xumin Wu, Shuo Wang, Chen Min, Yu Hu
Title: Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction
Abstract: Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: the lack of large-scale vectorized datasets and structural weaknesses in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors. This work addresses these limitations in two complementary ways. First, we release WildRoad, a global off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly. Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. A streamlined pipeline also yields roughly 2.5x faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild.
PaperID: 187,   Poster  https://arxiv.org/pdf/2504.18594     GitHub GitHub
Authors: Tongrui Su, Qingbin Li, Shengyu Zhu, Wei Chen, Xueqi Cheng
Title: RaPA: Enhancing Transferable Targeted Attacks via Random Parameter Pruning
Abstract: Compared to untargeted attacks, targeted transfer-based attacks still suffer from much lower Attack Success Rates (ASRs), although significant improvements have been achieved by various methods, such as diversifying inputs, stabilizing the gradient, and re-training surrogate models. In this paper, we find that adversarial examples generated by existing methods rely heavily on a small subset of surrogate model parameters, which in turn limits their transferability to unseen target models. Inspired by this, we propose the Random Parameter Pruning Attack (RaPA), which introduces parameter-level randomization during the attack process. At each optimization step, RaPA randomly prunes model parameters to generate diverse yet semantically consistent surrogate variants. We show this parameter-level randomization is equivalent to adding an importance-equalization regularizer, thereby alleviating the over-reliance issue. Extensive experiments across both CNN and Transformer architectures demonstrate that RaPA substantially enhances transferability. In the challenging case of transferring from CNN-based to Transformer-based models, RaPA achieves up to 11.7% higher average ASRs than state-of-the-art baselines (with 33.3% ASRs), while being training-free, cross-architecture efficient, and easily integrated into existing attack frameworks.
PaperID: 188,   Poster  https://arxiv.org/pdf/2512.10943     GitHub GitHub
Authors: Sharath Girish, Viacheslav Ivanov, Tsai-Shien Chen, Hao Chen, Aliaksandr Siarohin, Sergey Tulyakov
Title: AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation
Abstract: Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which are essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the pretrained video generation model's positional embeddings. Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multi-subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos.
PaperID: 189,   Poster  https://arxiv.org/pdf/2507.02803     GitHub GitHub
Authors: Gent Serifi, Marcel Buehler
Title: HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars
Abstract: We introduce HyperGaussians, a novel extension of 3D Gaussian Splatting for high-quality animatable face avatars. While tremendous successes have been achieved for static faces, animatable avatars from dynamic videos still fall in the uncanny valley. The de facto standard, 3D Gaussian Splatting (3DGS), represents a face through a collection of 3D Gaussian primitives. 3DGS excels at rendering static faces, but the state of the art still struggles with nonlinear deformations, complex lighting effects, and fine details. While most related works focus on predicting better Gaussian parameters from expression codes, we rethink the 3D Gaussian representation itself and how to make it more expressive. Our insights lead to a novel extension of 3D Gaussians to high-dimensional multivariate Gaussians, dubbed 'HyperGaussians'. The higher dimensionality increases expressivity through conditioning on a learnable local embedding. However, splatting HyperGaussians is computationally expensive because it requires inverting a high-dimensional covariance matrix. We solve this by reparameterizing the covariance matrix, dubbed the 'inverse covariance trick'. This trick boosts the efficiency so that HyperGaussians can be seamlessly integrated into existing models. To demonstrate this, we plug HyperGaussians into two state-of-the-art methods for face avatars: FlashAvatar and GaussianHeadAvatar. Our evaluation on 29 subjects from 6 face datasets shows that HyperGaussians outperform 3DGS numerically and visually, particularly for high-frequency details like eyes, teeth, wrinkles, and specular reflections.
PaperID: 190,   Poster  https://arxiv.org/pdf/2512.13680     GitHub GitHub
Authors: Tianye Ding, Yiming Xie, Yiqing Liang, Moitreya Chatterjee, Pedro Miraldo, Huaizu Jiang
Title: LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction
Abstract: Recent feed-forward reconstruction models like VGGT and \pi^3 achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation (Sim(3)) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction while operating at 14 FPS with 6 GB peak memory on an RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Our code will be released publicly.
PaperID: 191,   Poster  https://arxiv.org/pdf/2510.15742     GitHub GitHub
Authors: Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen
Title: Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Abstract: Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state of the art in instruction-based video editing. We will release our dataset and models for reproducibility.
PaperID: 192,   Poster  https://arxiv.org/pdf/2603.09573     GitHub GitHub
Authors: Weijia Fan, Ruiping Liu, Jiale Wei, Yufan Chen, Junwei Zheng, Zichao Zeng, Jiaming Zhang, Qiufu Li, Linlin Shen, Rainer Stiefelhagen
Title: More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
Abstract: Most existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we propose that panoramic vision-language understanding is more than the sum of its pinhole counterparts. We introduce Panorama-Language Modeling (PLM), a unified framework for 360° vision-language reasoning. In addition, we present PanoVQA, a large-scale panoramic VQA dataset that integrates diverse and adverse omni-scenes, enabling comprehensive reasoning under occlusion, accidents, and challenging conditions. To establish a foundation for PLM, we develop a plug-and-play panoramic adaptation module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under adverse omni-scenes, revealing that a full panorama yields understanding greater than the sum of its parts. All datasets and code will be publicly released.
PaperID: 193,   Poster  https://arxiv.org/pdf/2511.22940     GitHub GitHub
Authors: Shijun Shi, Jing Xu, Zhihang Li, Chunli Peng, Xiaoda Yang, Lijing Lu, Kai Hu, Jiangning Zhang
Title: One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer
Abstract: Recent advances in diffusion models have greatly improved pose-driven character animation. However, existing methods are limited to spatially aligned reference-pose pairs with matched skeletal structures. Handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer for references with arbitrary layouts. First, to handle spatially misaligned references, we reformulate training as a self-supervised outpainting task that transforms diverse-layout references into a unified occluded-input format. Second, to process partially visible references, we design a reference extractor for comprehensive identity feature extraction. Further, we integrate hybrid reference fusion attention to handle varying resolutions and dynamic sequence lengths. Finally, from the perspective of generation quality, we introduce identity-robust pose control that decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy for coherent long-video generation. Extensive experiments show that our method outperforms existing approaches. The code and model will be available.
PaperID: 194,   Poster  https://arxiv.org/pdf/2603.29029     GitHub GitHub
Authors: Bharath Krishnamurthy, Ajita Rattani
Title: MMFace-DiT: A Dual-Stream Diffusion Transformer for Multimodal Face Generation
Abstract: Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared Rotary Position-Embedded (RoPE) Attention mechanism. This design prevents modal dominance and ensures strong adherence to both text and structural priors to achieve unprecedented spatial–semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder enables a single cohesive model to dynamically adapt to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over five state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling.
PaperID: 195,   Poster  https://arxiv.org/pdf/2512.14654     GitHub GitHub
Authors: Lihong Wang, Liangqi Li, Weiwei Feng, Jiamin Wu, Changtao Miao, Tieru Wu, Rui Ma, Bo Zhang, Zhe Li
Title: ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
Abstract: CoT has significantly enhanced the reasoning ability of LLMs, but it faces challenges when extended to multimodal domains, particularly in mathematical tasks. Existing MLLMs typically perform textual reasoning solely from a single static mathematical image, overlooking dynamic visual acquisition during reasoning. In contrast, humans repeatedly examine the image and employ step-by-step reasoning to prove intermediate propositions. This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller's Law in cognitive science. Inspired by this insight, we propose the ViRC framework for multimodal mathematical tasks, introducing a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to simulate human expert problem-solving patterns. CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning. To this end, we present the CRUX dataset, built using three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem. Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, which includes Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the Reason Chunking ability of the model. The resulting ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks. The code will be made publicly available.
PaperID: 196,   Poster  https://arxiv.org/pdf/2603.09826     GitHub GitHub
Authors: Shuhao Kang, Youqi Liao, Peijie Wang, Wenlong Liao, Qilin Zhang, Benjamin Busam, Xieyuanli Chen, Yun Liu
Title: VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models
Abstract: Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird’s-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mechanism that explicitly associates textual cues with scene graph nodes, enabling interpretable spatial reasoning for accurate localization. To facilitate systematic evaluation across diverse scenes, we present CityLoc, a benchmark built from multi-source point clouds for fine-grained T2P localization. Experiments on CityLoc demonstrate VLM-Loc achieves superior accuracy and robustness compared to state-of-the-art methods. Our code, model, and dataset will be publicly released.
PaperID: 197,   Poster  https://arxiv.org/pdf/2505.16933     GitHub GitHub
Authors: Zebin You, Shen Nie, Xiaolu Zhang, JUN ZHOU, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li
Title: LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Abstract: In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, leveraging diffusion language models' bidirectional attention to capture spatial relationships in visual data more effectively than causal, sequential processing. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation. To facilitate such research, we will open-source the LLaDA-V model along with its training and evaluation code.
PaperID: 198,   Poster  https://arxiv.org/pdf/2603.03195     GitHub GitHub
Authors: Fuxiang Yang, Donglin Di, Lulu Tang, Xuancheng Zhang, Lei Fan, Hao Li, Chen Wei, Tonghua Su, Baorui Ma
Title: Chain of World: World Model Thinking in Latent Motion
Abstract: Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment's terminal frame. Finally, during co-fine-tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm.
PaperID: 199,   Poster  https://arxiv.org/pdf/2404.15254     GitHub GitHub
Authors: Zhuangcheng Gu, Guang Liang, Bin Wang, Zhiyuan Zhao, Qintong Zhang, Weijia Li, Chao Xu, Bo Zhang, Botian Shi, Jiang Wu, Wentao Zhang, Conghui He
Title: UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition
Abstract: This paper introduces UniMERNet, a high-accuracy, computation-efficient algorithm for Mathematical Expression Recognition (MER) across diverse real-world scenarios. To facilitate UniMERNet's training, we constructed UniMER-1M, a million-scale dataset whose unprecedented diversity endows the model with robust generalization ability. Through in-depth analysis, we discover a distinctive raster-scan pattern (left-to-right, top-to-bottom) in the attention distribution of Transformer models for MER tasks, which closely aligns with human reading habits. Based on this key finding, we design an innovative Raster-Scan Attention mechanism that employs a "horizontal-first, vertical-second" sequential attention computation strategy. This approach not only reduces computational complexity from \mathcal{O}(NH^2W^2D) to \mathcal{O}(NHWD(H + W)), but also enables the model to capture long-range dependencies more efficiently, achieving recognition performance comparable to global attention. Leveraging both UniMER-1M and our innovative attention mechanism, UniMERNet achieves state-of-the-art performance across four real-world scenarios while significantly reducing computational resources compared to global attention: over 1.2× memory savings during training, approximately 10× memory reduction during inference, and a 5× speedup with slightly improved accuracy. All resources will be publicly released to advance MER research further.
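As a rough worked illustration of the complexity claim in the abstract above, the sketch below compares the two operation-count expressions for a hypothetical feature map. The function names and the example sizes for N, H, W, and D are assumptions chosen for illustration, not values from the paper, and constant factors are ignored.

```python
def global_attention_cost(n, h, w, d):
    """Operation count proportional to N * H^2 * W^2 * D (global attention)."""
    return n * (h ** 2) * (w ** 2) * d


def raster_scan_cost(n, h, w, d):
    """Operation count proportional to N * H * W * D * (H + W)."""
    return n * h * w * d * (h + w)


def cost_ratio(h, w):
    """Ratio of the two expressions; N and D cancel, leaving H*W / (H + W)."""
    return (h * w) / (h + w)


if __name__ == "__main__":
    n, h, w, d = 1, 32, 128, 64  # hypothetical sizes
    print(global_attention_cost(n, h, w, d) / raster_scan_cost(n, h, w, d))
    print(cost_ratio(h, w))
```

Because the ratio simplifies to HW / (H + W), the asymptotic saving grows with the feature-map size; the measured memory and speed figures quoted in the abstract depend on implementation details that this count does not capture.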
PaperID: 200,   Poster  https://arxiv.org/pdf/2602.10116     GitHub GitHub
Authors: Hongchi Xia, Xuan Li, Max Li, Qianli Ma, Jiashu Xu, Ming-Yu Liu, Yin Cui, Tsung-Yi Lin, Wei-Chiu Ma, Shenlong Wang, Shuran Song, Fangyin Wei
Title: SAGE: Scalable Agentic 3D Scene Generation for Embodied AI
Abstract: Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., “pick up a bowl and place it on the table”), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for Embodied AI. We will release both 3D scene and action generation code to foster further research.
PaperID: 201,   Poster  https://arxiv.org/pdf/2512.03534     GitHub GitHub
Authors: Subin Kim, Sangwoo Mo, Mamshad Nayeem Rizve, Yiran Xu, Difan Liu, Jinwoo Shin, Tobias Hinz
Title: Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
Abstract: Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference time.
PaperID: 202,   Poster  https://arxiv.org/pdf/2602.24233     GitHub GitHub
Authors: Zhenyu Tang, Chaoran Feng, Yufan Deng, Jie Wu, Xiaojie Li, Rui Wang, Yunpeng Chen, Daquan Zhou
Title: Enhancing Spatial Understanding in Image Generation via Reward Modeling
Abstract: Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we build SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation. All models and datasets will be released.
PaperID: 203,   Poster  https://arxiv.org/pdf/2602.06226     GitHub GitHub
Authors: Yuantao Chen, Jiahao Chang, Chongjie Ye, Chaoran Zhang, Zhaojie Fang, Chenghong Li, Xiaoguang Han
Title: ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos
Abstract: The ubiquity of monocular videos capturing daily hand-object interactions presents a valuable resource for embodied intelligence. While 3D hand reconstruction from in-the-wild videos has seen significant progress, reconstructing the involved objects remains challenging due to severe occlusions and the complex, coupled motion of the camera, hands, and object. In this paper, we introduce ForeHOI, a novel feed-forward model that directly reconstructs 3D object geometry from monocular hand-object interaction videos within one minute of inference time, eliminating the need for any pre-processing steps. Our key insight is that the joint prediction of 2D mask inpainting and 3D shape completion in a feed-forward framework can effectively address the problem of severe occlusion in monocular hand-held object videos, thereby outperforming optimization-based methods. The information exchange between 2D mask inpainting and 3D shape completion boosts the overall reconstruction quality, enabling the framework to effectively handle severe hand-object occlusion. Furthermore, to support the training of our model, we contribute the first large-scale, high-fidelity synthetic dataset of hand-object interactions with comprehensive annotations. Extensive experiments demonstrate that ForeHOI achieves state-of-the-art performance in object reconstruction, significantly outperforming previous methods with around a 100x speedup.
PaperID: 204,   Poster  https://arxiv.org/pdf/2604.00381     GitHub GitHub
Authors: DAEHYUN KIM, Youngmin Kim, Yoon Ju Oh, Tae Hyun Kim
Title: UCMNet: Uncertainty-Aware Context Memory Network for Under-Display Camera Image Restoration
Abstract: Under-display cameras (UDCs) allow for full-screen designs by positioning the imaging sensor underneath the display. Nonetheless, light diffraction and scattering through the various display layers result in spatially varying and complex degradations, which significantly reduce high-frequency details. Current PSF-based physical modeling techniques and frequency-separation networks are effective at reconstructing low-frequency structures and maintaining overall color consistency. However, they still face challenges in recovering fine details when dealing with complex, spatially varying degradation. To solve this problem, we propose a lightweight Uncertainty-aware Context-Memory Network (UCMNet) for UDC image restoration. Unlike previous methods that apply uniform restoration, UCMNet performs uncertainty-aware adaptive processing to restore high-frequency details in regions with varying degradations. The estimated uncertainty maps, learned through an uncertainty-driven loss, quantify spatial uncertainty induced by diffraction and scattering, and guide the Memory Bank to retrieve region-adaptive context from the Context Bank. This process enables effective modeling of the non-uniform degradation characteristics inherent to UDC imaging. Leveraging this uncertainty as a prior, UCMNet achieves state-of-the-art performance on multiple benchmarks with 30% fewer parameters than previous models.
PaperID: 205,   Poster  https://arxiv.org/pdf/2512.25075     GitHub GitHub
Authors: Zhening Huang, Hyeonho Jeong, Xuelin Chen, Yulia Gryaditskaya, Tuanfeng Y. Wang, Joan Lasenby, Chun-Hao P. Huang
Title: SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
Abstract: We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter both the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video’s motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This simple yet crucial strategy enables the model to learn temporal control, directly producing the observed space–time disentanglement effects. To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic Space and Time full-coverage rendering dataset that provides fully free space–time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space–time disentanglement and strong results compared to prior art.
PaperID: 206,   Poster  https://arxiv.org/pdf/2511.18346     GitHub GitHub
Authors: Wenshuo Gao, Junyi Fan, Jiangyue Zeng, Shuai Yang
Title: FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement
Abstract: Video relighting with background replacement is a challenging task critical for applications in film production and creative media. Existing methods struggle to balance temporal consistency, spatial fidelity, and illumination naturalness. To address these issues, we introduce FlowPortal, a novel training-free flow-based video relighting framework. Our core innovation is a Residual-Corrected Flow mechanism that transforms a standard flow-based model into an editing model, guaranteeing perfect reconstruction when input conditions are identical and enabling faithful relighting when they differ, resulting in high structural consistency. This is further enhanced by a Decoupled Condition Design for precise lighting control and a High-Frequency Transfer mechanism for detail preservation. Additionally, a masking strategy isolates foreground relighting from the pure generation process of the background. Experiments demonstrate that FlowPortal achieves superior performance in temporal coherence, structural preservation, and lighting realism, while maintaining high efficiency.
PaperID: 207,   Poster  https://arxiv.org/pdf/2512.07831     GitHub GitHub
Authors: Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen, Bin Xia, Yan Zhou, Xin Tao, Pengfei Wan, Jiaya Jia
Title: UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
Abstract: Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and models will be released. More results can be viewed in the supplementary.
PaperID: 208,   Poster  https://arxiv.org/pdf/2604.02829     GitHub GitHub
Authors: Hao Ren, Zetong Bi, Yiming Zeng, Zhaoliang Wan, Lu Qi, Hui Cheng
Title: STRNet: Visual Navigation with Spatio-Temporal Representation through Dynamic Graph Aggregation
Abstract: Visual navigation requires the robot to reach a specified goal, such as an image, based on a sequence of first-person visual observations. While recent learning-based approaches have made significant progress, they often focus on improving policy heads or decision strategies while relying on simplistic feature encoders and temporal pooling to represent visual input. This leads to the loss of fine-grained spatial and temporal structure, ultimately limiting accurate action prediction and progress estimation. In this paper, we propose a unified spatio-temporal representation framework that enhances visual encoding for robotic navigation. Our approach extracts features from both image sequences and goal observations, and fuses them using the designed spatio-temporal fusion module. This module performs spatial graph reasoning within each frame and models temporal dynamics using a hybrid temporal shift module combined with multi-resolution difference-aware convolution. Experimental results demonstrate that our approach consistently improves navigation performance and offers a generalizable visual backbone for goal-conditioned control. The code will be released to the public.
PaperID: 209,   Poster  https://arxiv.org/pdf/2603.19232     GitHub GitHub
Authors: Yuqing Wang, Chuofan Ma, Zhijie Lin, Yao Teng, Lijun Yu, Shuai Wang, Jiaming Han, Jiashi Feng, Yi Jiang, Xihui Liu
Title: Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens
Abstract: Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional VAE tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. Instead of treating spatial positions atomically, CubiD performs fine-grained masking throughout the high-dimensional discrete representation: any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions through attention, transforming an intractable O(hwd) sequential generation problem into O(T) parallel iterations where T ≪ hwd. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures.
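The step-count claim in this abstract (O(hwd) sequential predictions versus O(T) parallel iterations) can be illustrated with a generic masked parallel-decoding loop. This is only a sketch under stated assumptions, not CubiD's actual sampler: the cosine reveal schedule, the random choice of which entries to reveal, and the `predict` callback are all hypothetical stand-ins for whatever the paper uses.

```python
import math
import random


def remaining_masked(total, steps):
    """Entries still masked after each step (cosine schedule, an assumption)."""
    return [int(total * math.cos(math.pi / 2 * (t + 1) / steps)) for t in range(steps)]


def parallel_decode(h, w, d, steps, predict, seed=0):
    """Fill an h*w*d grid of discrete tokens in `steps` rounds.

    `predict(tokens, i)` is a hypothetical model call that returns the token
    for flat index i given the partially filled grid. None marks a masked entry.
    """
    rng = random.Random(seed)
    total = h * w * d
    tokens = [None] * total
    for remaining in remaining_masked(total, steps):
        masked = [i for i, v in enumerate(tokens) if v is None]
        chosen = rng.sample(masked, len(masked) - remaining)
        # All entries revealed in this round are predicted from the same
        # partial state, which is what makes the round parallel.
        preds = {i: predict(tokens, i) for i in chosen}
        for i, v in preds.items():
            tokens[i] = v
    return tokens
```

With h = w = 16 and d = 768, a purely sequential scheme would need 196,608 prediction steps, while the loop above runs exactly `steps` rounds regardless of grid size.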
PaperID: 210,   Poster  https://arxiv.org/pdf/2601.03256     GitHub GitHub
Authors: Hexiao Lu, Xiaokun Sun, Zeyu Cai, Hao Guo, Ying Tai, Jian Yang, Zhenyu Zhang
Title: Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training
Abstract: We present Muses, the first training-free method for fantastic 3D creature generation in a feed-forward paradigm. Previous methods, which rely on part-aware optimization, manual assembly, or 2D image generation, often produce unrealistic or incoherent 3D assets due to the challenges of intricate part-level manipulation and limited out-of-domain generation. In contrast, Muses leverages the 3D skeleton, a fundamental representation of biological forms, to explicitly and rationally compose diverse elements. This skeletal foundation formalizes 3D content creation as a structure-aware pipeline of design, composition, and generation. Muses begins by constructing a creatively composed 3D skeleton with coherent layout and scale through graph-constrained reasoning. This skeleton then guides a voxel-based assembly process within a structured latent space, integrating regions from different objects. Finally, image-guided appearance modeling under skeletal conditions is applied to generate a style-consistent and harmonious texture for the assembled shape. Extensive experiments establish Muses' state-of-the-art performance in terms of visual fidelity and alignment with textual descriptions, as well as its potential for flexible 3D object editing.
PaperID: 211,   Poster  https://arxiv.org/pdf/2602.22073     GitHub GitHub
Authors: Artur Xarles i Esparraguera, Sergio Escalera, Thomas B. Moeslund, Albert Clapés
Title: AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting
Abstract: Precise Event Spotting aims to localize fast-paced actions or events in videos with high temporal precision, a key task for applications in sports analytics, robotics, and autonomous systems. Existing methods typically process all frames uniformly, overlooking the inherent spatio-temporal redundancy in video data. This leads to redundant computation on non-informative regions while limiting overall efficiency. To remain tractable, they often spatially downsample inputs, losing fine-grained details crucial for precise localization. To address these limitations, we propose AdaSpot, a simple yet effective framework that processes low-resolution videos to extract global task-relevant features while adaptively selecting the most informative region-of-interest in each frame for high-resolution processing. The selection is performed via an unsupervised, task-aware strategy that maintains spatio-temporal consistency across frames and avoids the training instability of learnable alternatives. This design preserves essential fine-grained visual cues with a marginal computational overhead compared to low-resolution-only baselines, while remaining far more efficient than uniform high-resolution processing. Experiments on standard PES benchmarks demonstrate that AdaSpot achieves state-of-the-art performance under strict evaluation metrics (e.g., +3.96 and +2.26 mAP@0 frames on Tennis and FineDiving), while also maintaining strong results under looser metrics. Code will be released upon publication.
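The idea of picking one region-of-interest per frame without a learned selector can be sketched as follows. The abstract does not specify AdaSpot's selection rule, so everything here is an assumption for illustration: the per-frame `saliency` grids (imagined as derived from low-resolution features), the max-sum window criterion, and the `momentum` smoothing used to keep the ROI stable across frames.

```python
def best_window(saliency, win):
    """Top-left (row, col) of the win x win window with the largest saliency sum."""
    rows, cols = len(saliency), len(saliency[0])
    best, best_pos = float("-inf"), (0, 0)
    for r in range(rows - win + 1):
        for c in range(cols - win + 1):
            s = sum(saliency[r + i][c + j] for i in range(win) for j in range(win))
            if s > best:
                best, best_pos = s, (r, c)
    return best_pos


def select_rois(saliency_per_frame, win, momentum=0.5):
    """One ROI per frame, smoothed over time so the crop does not jump around."""
    rois, prev = [], None
    for sal in saliency_per_frame:
        r, c = best_window(sal, win)
        if prev is not None:
            # Blend with the previous ROI; a convex combination of two valid
            # positions stays inside the frame.
            r = round(momentum * prev[0] + (1 - momentum) * r)
            c = round(momentum * prev[1] + (1 - momentum) * c)
        prev = (r, c)
        rois.append(prev)
    return rois
```

Only the selected window would then be cropped from the full-resolution frame, which is where the compute saving over uniform high-resolution processing comes from.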
PaperID: 212,   Poster  https://arxiv.org/pdf/2603.04254     GitHub GitHub
Authors: Seungjun Lee, Zihan Wang, Yunsong Wang, Gim Hee Lee
Title: EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding
Abstract: Understanding a 3D scene immediately with its exploration is essential for embodied tasks, where an agent must construct and comprehend the 3D representation in an online and nearly real-time manner. In this study, we propose EmbodiedSplat, an online feed-forward 3DGS for open-vocabulary scene understanding that enables simultaneous online 3D reconstruction and 3D semantic understanding from the streaming images. Unlike existing open-vocabulary 3DGS methods, our objectives are two-fold: 1) reconstruct the semantic-embedded 3DGS of the entire scene from over 300 streaming images in an online manner; and 2) generalize well to novel scenes through a feed-forward design while supporting nearly real-time 3D semantic reconstruction when combined with real-time 2D models. To achieve these objectives, we propose an Online Sparse Coefficients Field with a CLIP Global Codebook that binds the 2D CLIP embeddings to each 3D Gaussian while minimizing memory consumption and preserving the full semantic generalizability of CLIP. Furthermore, we generate 3D geometric-aware CLIP features by aggregating the partial point cloud of 3DGS through a 3D U-Net to supplement the 2D-oriented language embeddings with a 3D geometric prior. Our code will be publicly available upon paper acceptance.
PaperID: 213,   Poster  https://arxiv.org/pdf/2601.00204     GitHub GitHub
Authors: Xiaokun Sun, Zeyu Cai, Hao Tang, Ying Tai, Jian Yang, Zhenyu Zhang
Title: MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
Abstract: 3D morphing remains challenging due to the difficulty of generating semantically consistent and temporally smooth deformations, especially across categories. We present MorphAny3D, a training-free framework that leverages Structured Latent (SLAT) representations for high-quality 3D morphing. Our key insight is that intelligently blending source and target SLAT features within the attention mechanisms of 3D generators naturally produces plausible morphing sequences. To this end, we introduce Morphing Cross-Attention (MCA), which fuses source and target information for structural coherence, and Temporal-Fused Self-Attention (TFSA), which enhances temporal consistency by incorporating features from preceding frames. An orientation correction strategy further mitigates the pose ambiguity within the morphing steps. Extensive experiments show that our method generates state-of-the-art morphing sequences, even for challenging cross-category cases. MorphAny3D further supports advanced applications such as decoupled morphing and 3D style transfer, and can be generalized to other SLAT-based generative models.
PaperID: 214,   Poster  https://arxiv.org/pdf/2603.16139     GitHub GitHub
Authors: Peng Sun, Jun XIE, Tao Lin
Title: Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
Abstract: Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art performance. For example, our IOMM-B (3.6B) model was trained from scratch using only ~1050 H800 GPU hours (with the vast majority, 1000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50). Code will be released publicly.
PaperID: 215,   Poster  https://arxiv.org/pdf/2603.05908     GitHub GitHub
Authors: Zidian Qiu, Ancong Wu
Title: Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image
Abstract: Current compositional image-to-3D scene generation approaches construct 3D scenes by time-consuming iterative layout optimization or inflexible joint object-layout generation. Moreover, most methods rely on limited field-of-view perspective images, hindering the creation of complete 360° environments. To address these limitations, we design Pano3DComposer, an efficient feed-forward framework for panoramic images. To decouple object generation from layout estimation, we propose a plug-and-play Object-World Transformation Predictor. This module converts the 3D objects generated by off-the-shelf image-to-3D models from local to world coordinates. To achieve this, we adapt the VGGT architecture to Alignment-VGGT by using target object crop, multi-view object renderings and camera parameters to predict the transformation. The predictor is trained using pseudo-geometric supervision to address the shape discrepancy between generated and ground-truth objects. For input images from unseen domains, we further introduce a Coarse-to-Fine (C2F) alignment mechanism for Pano3DComposer that iteratively refines geometric consistency with feedback from scene rendering. Our method achieves superior geometric accuracy for image/text-to-3D tasks on synthetic and real-world datasets. It can generate a high-fidelity 3D scene in approximately 20 seconds on an RTX 4090 GPU. The code will be released if accepted.
PaperID: 216,   Poster  https://arxiv.org/pdf/2603.19731     GitHub GitHub
Authors: Jiadong Liang, Bojun Xiong, Jie Tian, Hua Li, Xiao Long, Yong Zheng, Huan Fu
Title: PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing
Abstract: This paper primarily investigates the task of editing facial expression in an input portrait video based on a driving video, which plays a crucial role in the animation and film industries. Most existing research focuses on portrait animation, which aims to animate a static portrait image according to the facial motion from the driving video. As a consequence, these methods struggle to disentangle the facial expression from head pose rotation and thus lack the ability to edit facial expression independently. In this paper, we propose PerformRecast, a versatile portrait video expression editing method dedicated to recasting the performance in existing film and animation. The key insight of our method comes from the characteristics of the 3D Morphable Face Model (3DMM), which models the face identity, facial expression and head pose of a 3D face mesh with separate parameters. Therefore, we modify the keypoint transformation formula in previous methods to make it more consistent with 3DMM, which achieves better disentanglement and provides users with much more fine-grained control. Furthermore, to avoid misalignment around the face boundary in generated results, we decouple the facial and non-facial regions of input portrait images and pre-train a teacher model to provide separate supervision for them. Extensive experiments show that our method produces high-quality results which are more faithful to the driving video, outperforming existing methods in both controllability and efficiency. We will release our code and trained models to facilitate future research.
PaperID: 217,   Poster  https://arxiv.org/pdf/2603.07648     GitHub GitHub
Authors: Likui Zhang, Tao Tang, Zhihao Zhan, Xiuwei Chen, Zisheng Chen, Jianhua Han, Jiangtong Zhu, Pei Xu, Hang Xu, Hefeng Wu, Liang Lin, Xiaodan Liang
Title: AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots
Abstract: Recent advances in Visual-Language-Action (VLA) models have shown promising potential for robotic manipulation tasks. However, real-world robotic tasks often involve long-horizon, multi-step problem-solving and require generalization for continual skill acquisition, extending beyond single actions or skills. These challenges present significant barriers for existing VLA models, which use monolithic action decoders trained on aggregated data, resulting in poor scalability. To address these challenges, we propose AtomicVLA, a unified planning-and-execution framework that jointly generates task-level plans, atomic skill abstractions, and fine-grained actions. AtomicVLA constructs a scalable atomic skill library through a Skill-Guided Mixture-of-Experts (SG-MoE), where each expert specializes in mastering generic yet precise atomic skills. Furthermore, we introduce a flexible routing encoder that automatically assigns dedicated atomic experts to new skills, enabling continual learning. We validate our approach through extensive experiments. In simulation, AtomicVLA outperforms π0 by 2.4% on LIBERO, 10% on LIBERO-LONG, and outperforms π0 and π0.5 by 0.22 and 0.25 in average task length on CALVIN. Additionally, our AtomicVLA consistently surpasses baselines by 18.3% and 21% in real-world long-horizon tasks and continual learning. These results highlight the effectiveness of atomic skill abstraction and dynamic expert composition for long-horizon and lifelong robotic tasks.
PaperID: 218,   Poster  https://arxiv.org/pdf/2604.04576     GitHub GitHub
Authors: Inseong Choi, Siwoo Lee, Seung-Hun Nam, Soohwan Song
Title: PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis
Abstract: Diffusion models are promising for sparse-view novel view synthesis (NVS), as they can generate pseudo-ground-truth views to aid 3D reconstruction pipelines like 3D Gaussian Splatting (3DGS). However, these synthesized images often contain photometric and geometric inconsistencies, and their direct use for supervision can impair reconstruction. To address this, we propose Partial-Reference Image Quality Assessment (PR-IQA), a framework that evaluates diffusion-generated views using reference images from different poses, eliminating the need for ground truth. PR-IQA first computes a geometrically consistent partial quality map in overlapping regions. It then performs quality completion to inpaint this partial map into a dense, full-image map. This completion is achieved via a cross-attention mechanism that incorporates reference-view context, ensuring cross-view consistency and enabling thorough quality assessment. When integrated into a diffusion-augmented 3DGS pipeline, PR-IQA restricts supervision to high-confidence regions identified by its quality maps. Experiments demonstrate that PR-IQA outperforms existing IQA methods, achieving full-reference-level accuracy without ground-truth supervision. Thus, our quality-aware 3DGS approach more effectively filters inconsistencies, producing superior 3D reconstructions and NVS results.
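Restricting supervision to high-confidence regions, as described above, amounts to masking the photometric loss with the quality map. The following is a minimal sketch of that idea only: the L1 loss, the hard `threshold`, and the nested-list image layout are assumptions for illustration, and the abstract does not say how PR-IQA's pipeline actually weights or thresholds its quality maps.

```python
def masked_l1_loss(rendered, pseudo_gt, quality, threshold=0.5):
    """Mean absolute error over pixels whose quality score passes the threshold.

    rendered, pseudo_gt, quality: same-shaped 2D lists of floats.
    Pixels judged unreliable (quality < threshold) contribute nothing,
    so artifacts in the generated view do not supervise the reconstruction.
    """
    total, count = 0.0, 0
    for r_row, g_row, q_row in zip(rendered, pseudo_gt, quality):
        for r, g, q in zip(r_row, g_row, q_row):
            if q >= threshold:
                total += abs(r - g)
                count += 1
    # With no reliable pixel, return zero loss rather than dividing by zero.
    return total / count if count else 0.0
```

A soft variant would multiply each pixel's error by its quality score instead of applying a hard cut; which of the two suits a given pipeline is a design choice this sketch does not settle.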
PaperID: 219,   Poster  https://arxiv.org/pdf/2602.00551     GitHub GitHub
Authors: Daoxuan Zhang, Ping Chen, Xiaobo Xia, Xiu Su, Ruichen Zhen, Jianqiang Xiao, Shuo Yang
Title: APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation
Abstract: Aerial Object Goal Navigation, a challenging frontier in Embodied AI, requires an Unmanned Aerial Vehicle (UAV) agent to autonomously explore, reason, and identify a specific target using only visual perception and language description. However, existing methods struggle with the memorization of complex spatial representations in aerial environments, reliable and interpretable action decision-making, and inefficient exploration and information gathering. To address these challenges, we introduce APEX (Aerial Parallel Explorer), a novel hierarchical agent designed for efficient exploration and target acquisition in complex aerial settings. APEX is built upon a modular, three-part architecture: 1) Dynamic Spatio-Semantic Mapping Memory, which leverages the zero-shot capability of a Vision-Language Model (VLM) to dynamically construct high-resolution 3D Attraction, Exploration, and Obstacle maps, serving as an interpretable memory mechanism. 2) Action Decision Module, trained with reinforcement learning, which translates this rich spatial understanding into a fine-grained and robust control policy. 3) Target Grounding Module, which employs an open-vocabulary detector to achieve definitive and generalizable target identification. All these components are integrated into a hierarchical, asynchronous, and parallel framework, effectively bypassing the VLM's inference latency and boosting the agent's proactivity in exploration. Extensive experiments show that APEX outperforms the previous state of the art by +4.2% SR and +2.8% SPL on challenging UAV-ON benchmarks, demonstrating its superior efficiency and the effectiveness of its hierarchical asynchronous design.
PaperID: 220,   Poster  https://arxiv.org/pdf/2511.02777     GitHub GitHub
Authors: Antonio Oroz, Matthias Nießner, Tobias Kirschstein
Title: PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
Abstract: We present PercHead, a model for single-image 3D head reconstruction and disentangled 3D editing - two tasks that are inherently challenging due to ambiguity in plausible explanations for the same input. At the heart of our approach lies our novel perceptual loss based on DINOv2 and SAM 2.1. Unlike widely-adopted low-level losses like LPIPS, SSIM or L1, we rely on deep visual understanding of images and the resulting generalized supervision signals. We show that our new loss can be a drop-in replacement for standard losses and used to improve visual quality in high-frequency areas. We base our model architecture on Vision Transformers (ViTs), allowing us to decouple the 3D representation from the 2D input. We train our method on multi-view images for view-consistency and in-the-wild images for strong transferability to new environments. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles. We also extend our base model to disentangled 3D editing by swapping the encoder and fine-tuning the network. A segmentation map controls geometry and either a text prompt or a reference image specifies appearance. We highlight the intuitive and powerful 3D editing capabilities through an interactive GUI.
PaperID: 221,   Poster  https://arxiv.org/pdf/2511.17353     GitHub GitHub
Authors: Xiaolong Qian, Qi Jiang, Lei Sun, Zongxi Yu, Kailun Yang, Peixuan Wu, Jiacheng Zhou, Yao Gao, Yaoguang Ma, Ming-Hsuan Yang, Kaiwei Wang
Title: Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal
Abstract: Beyond the commonly recognized optical aberrations, the imaging performance of compact optical systems, including single-lens and metalens designs, is often further degraded by veiling glare caused by stray-light scattering from non-ideal optical surfaces and coatings, particularly in complex real-world environments. This compound degradation undermines traditional lens aberration correction yet remains underexplored. A major challenge is that conventional scattering models (e.g., for dehazing) fail to fit veiling glare due to its spatially varying and depth-independent nature. Consequently, paired high-quality data are difficult to prepare via simulation, hindering application of data-driven veiling glare removal models. To this end, we propose VeilGen, a generative model that learns to simulate veiling glare by estimating its underlying optical transmission and glare maps in an unsupervised manner from target images, regularized by Stable Diffusion (SD)-based priors. VeilGen enables paired dataset generation with realistic compound degradation of optical aberrations and veiling glare, while also providing the estimated latent optical transmission and glare maps to guide the veiling glare removal process. We further introduce DeVeiler, a restoration network trained with a reversibility constraint, which utilizes the predicted latent maps to guide an inverse process of the learned scattering model. Extensive experiments on challenging compact optical systems demonstrate that our approach delivers superior restoration quality and physical fidelity compared with existing methods. These results suggest that VeilGen reliably synthesizes realistic veiling glare, and its learned latent maps effectively guide the restoration process in DeVeiler. All code and datasets will be publicly released.
PaperID: 222,   Poster  https://arxiv.org/pdf/2604.04406     GitHub GitHub
Authors: Ze-Xin Yin, Liu Liu, Xinjie Wang, Wei Sui, Zhizhong Su, Jian Yang, Jin Xie
Title: 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
Abstract: We introduce 3D-Fixer, a novel generalizable and efficient scheme for single-image to compositional 3D scene generation. Unlike existing feed-forward frameworks that lack generalization ability in open-set scenarios due to limited datasets, or divide-and-conquer frameworks that suffer from slow inference or accumulated registration errors during layout alignment, 3D-Fixer extends pre-trained object-level 3D generation priors to perform in-place completion on the single-view estimated geometry, eliminating the need for pose alignment while preserving feed-forward efficiency. At its core, 3D-Fixer introduces a coarse-to-fine scheme to accurately determine the completion boundary and generate high-quality completed 3D assets based on the fragmented geometry estimated from a single view. We also design a dual-branch conditioning network that integrates 2D and 3D contextual information to guide the pre-trained object generation priors for in-place completion. Furthermore, we introduce the Occlusion-Robust Feature Alignment strategy, which employs feature distillation to stabilize the training of the generative priors under occlusion scenarios. Existing scene-level datasets either suffer from limited scale or lack accurate per-instance ground truth, severely restricting the development of scene generation approaches. Therefore, we construct a large-scale scene-level dataset, featuring over 110K diverse scenes and 3M images with complete 3D asset ground truth and accurate placement annotation. Experiments demonstrate that 3D-Fixer achieves state-of-the-art geometric accuracy while maintaining an inference speed comparable to feed-forward estimation methods, vastly outperforming iterative optimization approaches. Our dataset and trained models will be publicly available upon acceptance.
PaperID: 223,   Poster  https://arxiv.org/pdf/2603.26908     GitHub GitHub
Authors: Jie Zhu, Xiao Guo, Yiyang Su, Anil Kumar Jain, Xiaoming Liu
Title: FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition
Abstract: Systematic human recognition requires integrating multiple biometric traits, such as face, gait, and body shape, through specialized models to achieve robustness in unconstrained scenarios. However, existing score-fusion strategies typically adopt a static design, combining all models for every test sample regardless of sample quality. This not only increases unnecessary computation but can degrade performance by incorporating noisy or unreliable modalities. To overcome these limitations, we propose FusionAgent, a novel agentic framework that leverages a Multimodal Large Language Model (MLLM) to perform dynamic, sample-specific model selection. Each model is treated as a tool, and through Reinforcement Fine-Tuning (RFT) with a metric-based reward, the agent learns to adaptively determine the optimal model combination for each test input. To address the model score misalignment and embedding heterogeneity, we introduce Anchor-based Confidence Top-k (ACT) score-fusion, which anchors on the most confident model and integrates complementary predictions in a confidence-aware manner. Extensive experiments on multiple whole-body biometric benchmarks demonstrate that FusionAgent significantly outperforms SoTA methods, underscoring the critical role of dynamic, explainable, and robust model fusion in real-world recognition systems. The proposed framework is scalable and adaptable to a wide range of multi-modal and multi-model tasks, such as vision-language retrieval, indicating its potential relevance to broader application scenarios. The code and model will be publicly released upon publication.
PaperID: 224,   Poster  https://arxiv.org/pdf/2506.19117     GitHub GitHub
Authors: Christina Ourania Tze, Daniel Dauner, Yiyi Liao, Dzmitry Tsishkou, Andreas Geiger
Title: PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes
Abstract: Existing approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality with lower memory requirements, faster inference, and greater editability than voxel-based methods. Beyond generation, PrITTI supports a range of downstream applications, including scene editing, inpainting, outpainting, and photo-realistic street-view synthesis.
PaperID: 225,   Poster  https://arxiv.org/pdf/2604.05183     GitHub GitHub
Authors: Ali Aliev, Kamil Garifullin, Nikolay Yudin, Vera Soboleva, Alexander Molozhavenko, Ivan Oseledets, Aibek Alanov, Maxim Rakhuba
Title: OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models
Abstract: In the rapidly growing field of model training, there is constant practical interest in parameter-efficient fine-tuning and in techniques that use a small amount of training data to adapt a model to a narrow task. Despite the efficiency of LoRA, one of the most popular fine-tuning methods today, an open question remains: how can several adapters tuned for different tasks be combined into one that yields adequate results on all of them? Specifically, merging subject and style adapters for generative models remains unresolved. In this paper we show that in the case of orthogonal fine-tuning (OFT), we can use a structured orthogonal parametrization and, utilizing manifold theory, obtain formulas for training-free adapter merging. In particular, we derive the structure of the manifold formed by $\mathcal{GS}$ orthogonal matrices and obtain efficient formulas for approximating geodesics between two points. We identify that naive geodesic merging compresses spectral distributions, reducing expressiveness; our Cayley transform correction restores spectral properties for higher-quality fusion. We conduct experiments on subject-driven generation tasks showing that our technique for merging two $\mathcal{GS}$ orthogonal matrices is capable of uniting the concept and style features of different adapters. To our knowledge, this is the first training-free method for merging multiplicative orthogonal adapters.
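For readers unfamiliar with the Cayley transform mentioned in the abstract above, the following minimal sketch shows only the standard textbook map from a skew-symmetric matrix to an orthogonal one. It is not the paper's spectral correction or its merging formula; the function name `cayley` and the random test matrix are illustrative assumptions.

```python
import numpy as np


def cayley(skew: np.ndarray) -> np.ndarray:
    """Standard Cayley map: Q = (I + A)^{-1} (I - A) for skew-symmetric A.

    For real skew-symmetric A, I + A is always invertible and Q is orthogonal.
    """
    eye = np.eye(skew.shape[0])
    return np.linalg.solve(eye + skew, eye - skew)


# Hypothetical example: build a skew-symmetric matrix from random entries.
rng = np.random.default_rng(0)
m = rng.standard_normal((4, 4))
a = m - m.T            # a.T == -a, so a is skew-symmetric
q = cayley(a)

# Orthogonality check: Q^T Q should equal the identity up to round-off.
print(np.allclose(q.T @ q, np.eye(4)))
```

Because the map is defined for every real skew-symmetric matrix, it gives an unconstrained way to parametrize orthogonal matrices, which is one common reason it appears in orthogonal fine-tuning.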
PaperID: 226,   Poster  https://arxiv.org/pdf/2604.02331     GitHub GitHub
Authors: Luca Bartolomei, Fabio Tosi, Matteo Poggi, Stefano Mattoccia, Guillermo Gallego
Title: EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors
Abstract: We propose EventHub, a novel framework for training deep event stereo networks without ground truth annotations from costly active sensors, relying instead on standard color images. From these images, we derive either both proxy annotations and proxy events through state-of-the-art novel view synthesis techniques, or only proxy annotations when images are already paired with event data. Using the training set generated by our data factory, we repurpose state-of-the-art stereo models from the RGB literature to process event data, obtaining new event stereo models with unprecedented generalization capabilities. Experiments on widely used event stereo datasets support the effectiveness of EventHub and show how the same data distillation mechanism can improve the accuracy of RGB stereo foundation models in challenging conditions such as nighttime scenes.
PaperID: 227,   Poster  https://arxiv.org/pdf/2604.15809     GitHub GitHub
Authors: Chengxin Liu, Wonseok Choi, Chenshuang Zhang, Tae-Hyun Oh
Title: Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
Abstract: Vision-Language Models (VLMs) have demonstrated strong capability in a wide range of tasks such as visual recognition, document parsing, and visual grounding. Nevertheless, recent works show that while VLMs often manage to capture the correct image region corresponding to the question, they do not necessarily produce the correct answers. In this work, we demonstrate that this misalignment could be attributed to suboptimal information flow within VLMs, where text tokens distribute too much attention to irrelevant visual tokens, leading to incorrect answers. Based on this observation, we show that modulating the information flow during inference can improve the perception capability of VLMs. The idea is that text tokens should only be associated with important visual tokens during decoding, eliminating the interference of irrelevant regions. To achieve this, we propose a token dynamics-based method to determine the importance of visual tokens, where visual tokens that exhibit distinct activation patterns during different decoding stages are viewed as important. We apply our approach to representative open-source VLMs and evaluate on various datasets including visual question answering, visual grounding and counting, optical character recognition, and object hallucination. The results show that our approach significantly improves the performance of baselines. The code will be made available.
PaperID: 228,   Poster  https://arxiv.org/pdf/2511.16595     GitHub GitHub
Authors: Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Qin Jin
Title: TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
Abstract: We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing hour-long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from visual tokens to textual tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending input length. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures. All code and model weights will be released.
PaperID: 229,   Poster  https://arxiv.org/pdf/2512.10949     GitHub GitHub
Authors: Yiwen Tang, Ziyu Guo, Kaixin Zhu, Renrui Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao
Title: Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
Abstract: Reinforcement learning (RL), earlier proven effective in large language and multimodal models, has recently been extended successfully to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide a robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation.
PaperID: 230,   Poster  https://arxiv.org/pdf/2512.02341     GitHub GitHub
Authors: Fengyi Zhang, Tianjun Zhang, Kasra Khosoussi, Zheng Zhang, Zi Huang, Yadan Luo
Title: TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction
Abstract: 3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online settings such as driving scenarios, predictions are made over temporal windows, making it non-trivial to maintain consistency across time. Recent strategies align consecutive predictions by solving for a global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. In this work, we propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies. In addition, we adopt a point-agnostic submap registration design that is inherently robust to noisy geometry predictions. The proposed framework is fully plug-and-play, compatible with diverse 3D foundation models and camera configurations (e.g., monocular or surround-view). Extensive experiments demonstrate that our method consistently yields more coherent geometry and lower trajectory errors across multiple datasets, backbone models, and camera setups, highlighting its robustness and generality. Code is provided in the supplementary material and will be publicly released.
PaperID: 231,   Poster  https://arxiv.org/pdf/2603.21511     GitHub GitHub
Authors: Kaiqiang Li, Gang Li, Mingle Zhou, Min Li, Delong Han, Jin Wan
Title: Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection
Abstract: Zero-shot (ZS) 3D anomaly detection is crucial for reliable industrial inspection, as it enables detecting and localizing defects without requiring anomalous training data. Existing approaches render 3D point clouds into 2D images and leverage pre-trained Vision-Language Models (VLMs) for anomaly detection. However, such strategies inevitably discard geometric details and exhibit limited sensitivity to local anomalies. In this paper, we revisit intrinsic 3D representations and explore the potential of pre-trained Point-Language Models (PLMs) for ZS 3D anomaly detection. We propose BTP (Back To Point), a novel framework that effectively aligns 3D point cloud and textual embeddings. Specifically, BTP aligns multi-granularity patch features with textual representations for localized anomaly detection, while incorporating geometric descriptors to enhance sensitivity to structural anomalies. Furthermore, we introduce a joint representation learning strategy that leverages auxiliary point cloud data to improve robustness and enrich anomaly semantics. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that BTP achieves superior performance in ZS 3D anomaly detection.
PaperID: 232,   Poster  https://arxiv.org/pdf/2603.23067     GitHub GitHub
Authors: Basit Alawode, Arif Mahmood, Muaz Radi, Shahad Albastaki, Asim Khan, Muhammad Bilal, Moshira Abdalla, Mohammed Bennamoun, Sajid Javed
Title: MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding
Abstract: Whole Slide Images (WSIs) exhibit hierarchical structure, where diagnostic cues arise from cellular morphology, regional tissue organization, and global context. Existing Computational Pathology (CPath) Multimodal Large Language Models (MLLMs) typically compress an entire WSI into a single embedding, which hinders fine-grained grounding and ignores how pathologists synthesize evidence across different scales. We introduce MLLM-HWSI, a Hierarchical WSI-level MLLM that aligns visual features with pathology language at four distinct scales (cell as word, patch as phrase, region as sentence, and WSI as paragraph) to support interpretable, evidence-grounded reasoning. MLLM-HWSI decomposes each WSI into multi-scale embeddings with scale-specific V→L projectors and jointly enforces (i) a hierarchical contrastive objective and (ii) a cross-scale consistency loss, preserving semantic coherence from cells to the WSI. To make gigapixel processing tractable and clinically meaningful, we compute diagnostically relevant tokens and aggregate segmented cell embeddings into a compact cellular token per patch using a lightweight Cell–Cell Attention Fusion (CCAF) transformer. The projected multi-scale tokens are fused with text tokens and fed to an instruction-tuned LLM for open-ended reasoning, VQA, report, and caption generation tasks. Trained in three stages, MLLM-HWSI achieves new SOTA results on 13 WSI-level benchmarks across six CPath tasks. By grounding language in calibrated, multi-scale visual evidence, MLLM-HWSI provides accurate, interpretable outputs that mirror expert diagnostic workflows and advance holistic WSI understanding. Code will be released upon publication.
PaperID: 233,   Poster  https://arxiv.org/pdf/2508.05186     GitHub GitHub
Authors: Yongjie Bai, Zhouxia Wang, Yang Liu, Kaijun Luo, Yifan Wen, Mingtong Dai, Weixing Chen, Ziliang Chen, Lingbo Liu, Guanbin Li, Liang Lin
Title: Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation
Abstract: Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-aware Virtual View Exploration (TVVE), a framework designed to overcome these challenges by integrating virtual view exploration with task-specific representation learning. TVVE employs an efficient exploration policy, accelerated by a novel pseudo-environment, to acquire informative views. Furthermore, we introduce a Task-aware Mixture-of-Experts (TaskMoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization. By learning to see the world in a task-aware way, TVVE generates more complete and discriminative visual representations, demonstrating significantly enhanced action prediction across a wide array of manipulation challenges. To further validate the robustness and generalization capability of TVVE under out-of-distribution (OOD) settings, we construct a challenging benchmark, RLBench-OG, covering various visual perturbations and camera pose variations. Extensive experiments on RLBench and RLBench-OG show that TVVE achieves superior performance over state-of-the-art approaches. In real-robot experiments, TVVE demonstrates exceptional performance and generalizes robustly in multiple OOD settings, including visual disturbances and unseen instructions.
PaperID: 234,   Poster  https://arxiv.org/pdf/2603.25250     GitHub GitHub
Authors: Yabin Zhang, Maya Varma, Yunhe Gao, Jean-Benoit Delbrouck, Jiaming Liu, Chong Wang, Curtis Langlotz
Title: Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models
Abstract: Out-of-distribution (OOD) detection aims to identify samples that deviate from the in-distribution (ID) data. One popular pipeline addresses this by introducing negative labels distant from ID classes and detecting OOD samples based on their distance to these labels. However, such labels may exhibit poor activation on OOD samples, failing to capture the OOD characteristics. To address this, we propose Test-time Activated Negative Labels (TANL), which dynamically evaluates activation levels across the corpus dataset and mines candidate labels with high activation responses during testing. Specifically, TANL identifies high-confidence test images online and accumulates their assignment probabilities over the corpus to construct a label activation metric. This metric leverages historical test samples to adaptively align with the test distribution, enabling the selection of distribution-adaptive activated negative labels. By further exploring the activation information within the current testing batch, we introduce a more fine-grained, batch-adaptive variant. To fully utilize label activation knowledge, we propose an activation-aware score function that emphasizes negative labels with stronger activations, boosting performance and enhancing robustness to the number of labels. TANL is training-free, test-efficient, and grounded in theoretical justification. Experiments on diverse backbones and a wide range of task settings validate its effectiveness. Notably, on the large-scale ImageNet benchmark, TANL significantly reduces the FPR95 from 17.5% to 9.8%. Code will be released.
PaperID: 235,   Poster  https://arxiv.org/pdf/2512.19433     GitHub GitHub
Authors: Yi Xin, Siqi Luo, Tianxiang Xu, Qi Qin, Haoxing Chen, Kaiwen Zhu, Zhiwei Zhang, Yangfan He, Rongchao Zhang, Jinbin Bai, Shuo Cao, Bin Fu, Junjun He, Yihao Liu, Yuewen Cao, Xiaohong Liu
Title: dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
Abstract: Diffusion Multimodal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring substantial computational costs of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs' intrinsic image understanding capabilities to assess text–image alignment, eliminating the need for an external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (e.g., Lumina-DiMOO, MMaDA, Muddit) show that our framework substantially improves generation quality while achieving up to 6× greater efficiency than linear search.
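To make the O(NT) versus O(N+T) contrast in the abstract above concrete, the short sketch below only counts model evaluations under the two stated complexity classes, ignoring constant factors. It does not implement the paper's hierarchical search, and it does not reproduce the reported "up to 6×" figure, which is an empirical result. The function names and the sample values of N (trajectories) and T (refinement steps) are hypothetical.

```python
def linear_search_cost(n_trajectories: int, n_steps: int) -> int:
    """Every trajectory is refined for every step: N * T evaluations."""
    return n_trajectories * n_steps


def additive_search_cost(n_trajectories: int, n_steps: int) -> int:
    """Idealized additive budget: N + T evaluations (constants ignored)."""
    return n_trajectories + n_steps


# Hypothetical budgets chosen only to show how the gap widens.
for n, t in [(4, 16), (8, 32), (16, 64)]:
    lin = linear_search_cost(n, t)
    add = additive_search_cost(n, t)
    print(f"N={n:2d} T={t:2d}  N*T={lin:4d}  N+T={add:3d}  ratio={lin / add:.1f}")
```

The ratio grows with both N and T, which is why an additive search budget matters most when many trajectories and many refinement steps are used together.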
PaperID: 236,   Poster  https://arxiv.org/pdf/2511.03571     GitHub GitHub
Authors: Hao Shi, Ze Wang, Shangwei Guo, Mengfei Duan, Song Wang, Teng Chen, Kailun Yang, Lin Wang, Kaiwei Wang
Title: OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera
Abstract: Robust 3D semantic occupancy is essential for legged and humanoid robots, yet most Semantic Scene Completion (SSC) systems are built for wheeled platforms with forward-facing sensors. We present OneOcc, a vision-only panoramic SSC framework tailored to severe body jitter and 360° continuity. OneOcc integrates four complementary modules: (i) Dual-Projection fusion (DP-ER), which jointly exploits the raw annular panorama and its equirectangular unfolding to preserve true 360° continuity while enabling grid-aligned feature extraction and seam-aware context; (ii) Bi-Grid Voxelization (BGV), which reasons in Cartesian and polar/cylindrical voxel spaces to reduce discretization bias and better align with panoramic geometry, yielding sharper free/occupied boundaries; (iii) a lightweight decoder with Hierarchical AMoE-3D fusion that dynamically routes multi-scale 3D features to specialized experts, improving long-range context and occlusion handling; and (iv) a plug-and-play Gait Displacement Compensation (GDC) module that learns feature-level motion correction from gait, stabilizing representations without extra sensors. We also release two panoramic occupancy benchmarks: QuadOcc (real quadruped, first-person 360°) and Human360Occ (H3O) (CARLA human-ego 360° with RGB/Depth/semantic-occupancy and standardized within-/cross-city splits). OneOcc sets new SOTA: on QuadOcc it exceeds strong vision baselines and even popular LiDAR methods, and on H3O it improves within-city by +3.83 mIoU and cross-city by +8.08. The modules are lightweight, enabling deployable full-surround semantic perception for legged and humanoid robots. Datasets and code will be released upon publication.
PaperID: 237,   Poster  https://arxiv.org/pdf/2512.10957     GitHub GitHub
Authors: Yukai Shi, Weiyu Li, Zihao Wang, Hongyang Li, Xingyu Chen, Ping Tan, Lei Zhang
Title: SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
Abstract: We propose SceneMaker, a decoupled 3D scene generation framework. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. In addition, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets will be released if the paper is accepted.
PaperID: 238,   Poster  https://arxiv.org/pdf/2601.21181     GitHub GitHub
Authors: Sang Yun Chung, Se Kim, Youngchae Chee, Yong Man Ro
Title: MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
Abstract: Multimodal Large Language Models (MLLMs) suffer from cross-modal hallucinations, where one modality inappropriately influences generation about another, leading to fabricated output. This exposes a more fundamental deficiency in modality-interaction control. To address this, we propose Modality-Adaptive Decoding (MAD), a training-free method that adaptively weights modality-specific decoding branches based on task requirements. MAD leverages the model's inherent ability to self-assess modality relevance by querying which modalities are needed for each task. The extracted modality probabilities are then used to adaptively weight contrastive decoding branches, enabling the model to focus on relevant information while suppressing cross-modal interference. Extensive experiments on CMM and AVHBench demonstrate that MAD significantly reduces cross-modal hallucinations across multiple audio-visual language models (7.8% and 2.0% improvements for VideoLLaMA2-AV, 8.7% and 4.7% improvements for Qwen2.5-Omni). Our approach demonstrates that explicit modality awareness through self-assessment is crucial for robust multimodal reasoning, offering a principled extension to existing contrastive decoding methods.
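The idea of weighting contrastive decoding branches by modality probabilities, as described in the abstract above, can be sketched as follows. This is one plausible form only, assuming a generic contrastive-decoding rule in which each branch contrasts the full-input logits against logits computed with one modality removed. The exact formula, the function `modality_weighted_logits`, the parameter `alpha`, and all numbers are assumptions, not the paper's definition.

```python
import numpy as np


def modality_weighted_logits(full, without_audio, without_video,
                             w_audio, w_video, alpha=1.0):
    """Hypothetical weighted contrastive decoding over two modalities.

    full:          logits with both audio and video present
    without_audio: logits with the audio input removed
    without_video: logits with the video input removed
    w_audio/w_video: self-assessed relevance of each modality, in [0, 1]
    """
    # Each contrast isolates what one modality contributes to the logits.
    contrast_audio = full - without_audio
    contrast_video = full - without_video
    # Relevant modalities get their contribution amplified; irrelevant ones do not.
    return full + alpha * (w_audio * contrast_audio + w_video * contrast_video)


# Toy 3-token vocabulary; all values are made up for illustration.
full = np.array([2.0, 1.0, 0.5])
no_audio = np.array([2.0, 0.2, 0.5])   # audio mainly supports token 1
no_video = np.array([0.5, 1.0, 0.5])   # video mainly supports token 0

# A question judged to be about the video only: audio weight 0, video weight 1.
adjusted = modality_weighted_logits(full, no_audio, no_video, 0.0, 1.0)
print(adjusted)
```

With both weights set to zero the rule falls back to ordinary decoding on the full input, so in this sketch the weights act as a per-task switch on each contrastive branch.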
PaperID: 239,   Poster  https://arxiv.org/pdf/2602.18846     GitHub GitHub
Authors: Aditya Kumar Singh, Hitesh Kandala, Pratik Brahma, Zicheng Liu, Emad Barsoum
Title: DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
Abstract: Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in the language backbone, often trading accuracy for speed. In this work, we propose DUET-VLM, a versatile plug-and-play dual compression framework that consists of (a) vision-only, redundancy-aware compression of the vision encoder's output into information-preserving tokens, followed by (b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA-1.5-7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains >97% even at an 89% reduction. With this dual-stage compression during training, it achieves 99.7% accuracy at a 67% reduction and 97.6% at an 89% reduction, surpassing prior SoTA visual token reduction methods across multiple benchmarks. When integrated into Video-LLaVA-7B, it even surpasses the baseline, achieving >100% of baseline accuracy with a substantial 53.1% token reduction and retaining 97.6% accuracy under an extreme 93.4% reduction. These results highlight that end-to-end training with DUET-VLM enables robust adaptation to reduced visual (image/video) input without sacrificing accuracy, producing compact yet semantically rich representations within the same computational budget.
PaperID: 240,   Poster  https://arxiv.org/pdf/2505.20897     GitHub GitHub
Authors: Pingrui Zhang, Yifei Su, Pengyuan Wu, Dong An, Li Zhang, Zhigang Wang, Dong Wang, Bin Zhao
Title: Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation
Abstract: Vision-and-Language Navigation (VLN) requires the agent to navigate based on natural instructions. This task is challenging due to partial observability, which makes it difficult to align perception with language. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, leading to high computational cost and redundant details. To this end, we propose to adaptively imagine key environmental semantics via language form, enabling a more reliable and efficient strategy. Specifically, we introduce Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD is designed with a human-like left-right brain architecture, where the left brain focuses on logical integration, and the right brain is responsible for imaginative prediction of future scenes. To achieve this, we fine-tune only the Q-former within both brains to efficiently activate domain-specific knowledge in the LLM, enabling dynamic updates of logical reasoning and imagination during navigation. Furthermore, we propose a cross-interaction mechanism that regularizes the imagined latent-space outputs and integrates them with the navigation expert module via a decoder-free latent interface, thereby enabling ATD to jointly harness the reasoning ability of the LLM and the task-specific knowledge of the navigation model. We conduct extensive experiments across the R2R, REVERIE, and R4R benchmarks, demonstrating that ATD achieves competitive performance with significantly fewer parameters. The code will be publicly available.
PaperID: 241,   Poster  https://arxiv.org/pdf/2511.16301     GitHub GitHub
Authors: Minseok Seo, Mark Hamilton, Changick Kim
Title: Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling
Abstract: We present Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14×/16× (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only ≈0.419 s per 224×224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.
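As background for the Joint Bilateral Upsampling that the abstract above builds on, here is a minimal sketch of the classic technique with a fixed isotropic Gaussian kernel. It is not the paper's method, which learns an anisotropic kernel per image; the function name, the default `sigma` values, and the toy inputs are all assumptions made for illustration.

```python
import numpy as np


def joint_bilateral_upsample(feat_lr, guide_hr, radius=1,
                             sigma_spatial=1.0, sigma_range=0.1):
    """Classic joint bilateral upsampling with a fixed isotropic kernel.

    feat_lr:  (h, w, c) low-resolution features
    guide_hr: (H, W) high-resolution guidance image with values in [0, 1]
    """
    h, w, c = feat_lr.shape
    H, W = guide_hr.shape
    sy, sx = H / h, W / w
    # Guidance value at the centre of each low-resolution cell.
    ys = np.minimum(((np.arange(h) + 0.5) * sy).astype(int), H - 1)
    xs = np.minimum(((np.arange(w) + 0.5) * sx).astype(int), W - 1)
    guide_lr = guide_hr[np.ix_(ys, xs)]
    out = np.zeros((H, W, c))
    for y in range(H):
        for x in range(W):
            # Position of this high-res pixel in low-res coordinates.
            ly = (y + 0.5) / sy - 0.5
            lx = (x + 0.5) / sx - 0.5
            cy = min(max(int(round(ly)), 0), h - 1)
            cx = min(max(int(round(lx)), 0), w - 1)
            acc = np.zeros(c)
            wsum = 0.0
            for qy in range(max(cy - radius, 0), min(cy + radius, h - 1) + 1):
                for qx in range(max(cx - radius, 0), min(cx + radius, w - 1) + 1):
                    d2 = (qy - ly) ** 2 + (qx - lx) ** 2              # spatial cue
                    r2 = (guide_hr[y, x] - guide_lr[qy, qx]) ** 2     # range cue
                    wgt = np.exp(-d2 / (2 * sigma_spatial ** 2)
                                 - r2 / (2 * sigma_range ** 2))
                    acc += wgt * feat_lr[qy, qx]
                    wsum += wgt
            out[y, x] = acc / wsum   # normalized weighted average
    return out


# Toy example: a vertical edge in the guide and matching low-res features.
guide = np.zeros((4, 4))
guide[:, 2:] = 1.0
feat = np.zeros((2, 2, 1))
feat[:, 1, 0] = 1.0
up = joint_bilateral_upsample(feat, guide)
print(up[..., 0].round(3))
```

The range term suppresses low-resolution neighbours whose guidance value differs from the target pixel, which is what keeps the upsampled features aligned with edges in the high-resolution image.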
PaperID: 242,   Poster  https://arxiv.org/pdf/2603.21820     GitHub GitHub
Authors: Yanglin Deng, Tianyang Xu, Chunyang Cheng, Hui Li, Xiaojun Wu, Josef Kittler
Title: Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion
Abstract: Infrared and visible image fusion (IVIF) aims to synthesise complementary information from the two source modalities while preserving natural textures and salient thermal signatures simultaneously. Existing solutions predominantly rely on extensive sets of rigidly aligned image pairs for training. However, acquiring such data is often impractical due to the costly and labour-intensive alignment process. Moreover, maintaining a rigid pairing setting during training restricts the volume of cross-modal relationships, thereby limiting the generalisation performance. To this end, this work challenges the necessity of the Strictly Paired Training Paradigm (SPTP) by systematically investigating UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP) for high-performance IVIF. We establish a theoretical objective of APTP, reflecting the complementary nature between UPTP and SPTP. More importantly, we develop a practical framework capable of significantly enriching cross-modal relationships even with severely limited and unaligned training data. To validate our propositions, three end-to-end lightweight baselines, alongside a set of innovative loss functions, are designed to cover three classic frameworks (CNN, Transformer, GAN). Comprehensive experiments demonstrate that the proposed APTP and UPTP are feasible and capable of training models on a severely limited and content-inconsistent infrared and visible dataset, achieving performance comparable to that of a dataset 100× larger in SPTP. This finding fundamentally alleviates the cost and difficulty of data collection while enhancing model robustness from the data perspective, delivering a feasible solution for IVIF studies.
PaperID: 243,   Poster  https://arxiv.org/pdf/2512.12675     GitHub GitHub
Authors: Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang
Title: Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
Abstract: Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in realistic and complex visual settings. We propose Scone, a unified understanding-generation framework that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge that conveys semantic information and guides the generation expert to preserve subject identity while reducing interference. A two-stage training scheme first learns composition and then strengthens distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark designed to evaluate composition, distinction, and their combination across diverse scenarios. Experiments show that Scone outperforms existing open-source models in both composition and distinction tasks. Our model, benchmark, and training data will be open-sourced.
PaperID: 244,   Poster  https://arxiv.org/pdf/2604.14259     GitHub GitHub
Authors: Qianyu Chen, Shujian Yu
Title: Continual Learning for fMRI-Based Brain Disorder Diagnosis via Functional Connectivity Matrices Generative Replay
Abstract: Functional magnetic resonance imaging (fMRI) is widely used for studying and diagnosing brain disorders, with functional connectivity (FC) matrices providing powerful representations of large-scale neural interactions. However, existing diagnostic models are trained either on a single site or under full multi-site access, making them unsuitable for real-world scenarios where clinical data arrive sequentially from different institutions. This results in limited generalization and severe catastrophic forgetting. This paper presents the first continual learning framework specifically designed for fMRI-based diagnosis across heterogeneous clinical sites. Our framework introduces a structure-aware variational autoencoder that synthesizes realistic FC matrices for both patient and control groups. Built on this generative backbone, we develop a multi-level knowledge distillation strategy that aligns predictions and graph representations between new-site data and replayed samples. To further enhance efficiency, we incorporate a hierarchical contextual bandit scheme for adaptive replay sampling. Experiments on multi-site datasets for major depressive disorder (MDD), schizophrenia (SZ), and autism spectrum disorder (ASD) show that the proposed generative model enhances data augmentation quality, and the overall continual learning framework substantially outperforms existing methods in mitigating catastrophic forgetting.
PaperID: 245,   Poster  https://arxiv.org/pdf/2603.07988     GitHub GitHub
Authors: Stefan Lionar, Gim Hee Lee
Title: TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size
Abstract: Physics-based humanoid control has achieved remarkable progress in enabling realistic and high-performing single-agent behaviors, yet extending these capabilities to cooperative human-object interaction (HOI) remains challenging. We present TeamHOI, a framework that enables a single decentralized policy to handle cooperative HOIs across any number of cooperating agents. Each agent operates using local observations while attending to other teammates through a Transformer-based policy network with teammate tokens, allowing scalable coordination across variable team sizes. To enforce motion realism while addressing the scarcity of cooperative HOI data, we further introduce a masked Adversarial Motion Prior (AMP) strategy that uses single-human reference motions while masking object-interacting body parts during training. The masked regions are then guided through task rewards to produce diverse and physically plausible cooperative behaviors. We evaluate TeamHOI on a challenging cooperative carrying task involving two to eight humanoid agents and varied object geometries. Finally, to promote stable carrying, we design a team-size- and shape-agnostic formation reward. TeamHOI achieves high success rates and demonstrates coherent cooperation across diverse configurations with a single policy.
PaperID: 246,   Poster  https://arxiv.org/pdf/2604.08546     GitHub GitHub
Authors: Zhengyang Sun, Yu Chen, Xin Zhou, Xiaofan Li, Xiwu Chen, Dingkang Liang, Xiang Bai
Title: When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
Abstract: Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt–layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code will be made available.
PaperID: 247,   Poster  https://arxiv.org/pdf/2603.09798     GitHub GitHub
Authors: Zhaofeng Shi, Heqian Qiu, Lanxiao Wang, Qingbo Wu, Fanman Meng, Lili Pan, Hongliang Li
Title: Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency
Abstract: Efficient adaptation between Egocentric (Ego) and Exocentric (Exo) views is crucial for applications such as human-robot cooperation. However, the success of most existing Ego-Exo adaptation methods relies heavily on target-view data for training, thereby increasing computational and data collection costs. In this paper, we make the first exploration of a Test-time Ego-Exo Adaptation for Action Anticipation (TE^2A^3) task, which aims to adjust the source-view-trained model online during test time to anticipate target-view actions. It is challenging for existing Test-Time Adaptation (TTA) methods to address this task due to the multi-action candidates and significant temporal-spatial inter-view gap. Hence, we propose a novel Dual-Clue enhanced Prototype Growing Network (DCPGN), which accumulates multi-label knowledge and integrates cross-modality clues for effective test-time Ego-Exo adaptation and action anticipation. Specifically, we propose a Multi-Label Prototype Growing Module (ML-PGM) to balance multiple positive classes via multi-label assignment and confidence-based reweighting for class-wise memory banks, which are updated by an entropy priority queue strategy. Then, the Dual-Clue Consistency Module (DCCM) introduces a lightweight narrator to generate textual clues indicating action progressions, which complement the visual clues containing various objects. Moreover, we constrain the inferred textual and visual logits to construct dual-clue consistency for temporally and spatially bridging Ego and Exo views. Extensive experiments on the newly proposed EgoMe-anti and the existing EgoExoLearn benchmarks show the effectiveness of our method, which outperforms related state-of-the-art methods by a large margin. The code will be released.
PaperID: 248,   Poster  https://arxiv.org/pdf/2604.01777     GitHub GitHub
Authors: Mengtian Li, Fan Yang, Ruixue Xiong, Yiyan Fan, Zhifeng Xie, Zeyu Wang
Title: GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents
Abstract: Jiangnan gardens, a prominent style of Chinese classical gardens, hold great potential as digital assets for film and game production and digital tourism. However, manual modeling of Jiangnan gardens heavily relies on expert experience for layout design and asset creation, making the process time-consuming. To address this gap, we propose GardenDesigner, a novel framework that encodes aesthetic principles for Jiangnan garden construction and integrates a chain of agents based on procedural modeling. The water-centric terrain and explorative pathway rules are applied by terrain distribution and road generation agents. Selection and spatial layout of garden assets follow the aesthetic and cultural constraints. Consequently, we propose asset selection and layout optimization agents to select and arrange objects for each area in the garden. Additionally, we introduce GardenVerse for Jiangnan garden construction, including expert-annotated garden knowledge to enhance the asset arrangement process. To enable interaction and editing, we develop an interactive interface and tools in Unity, in which non-expert users can construct Jiangnan gardens via text input within one minute. Experiments and human evaluations demonstrate that GardenDesigner can generate diverse and aesthetically pleasing Jiangnan gardens.
PaperID: 249,   Poster  https://arxiv.org/pdf/2603.01400     GitHub GitHub
Authors: Jinlong Li, Liyuan Jiang, Haonan Zhang, Nicu Sebe
Title: Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
Abstract: Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning primarily targets intra-frame spatial redundancy or prunes inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. These methods often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that elaborates token Anchors at the intra-frame and inter-frame levels to comprehensively aggregate the informative contexts via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance; optimal transport then aggregates the informative contexts from pruned tokens, constructing intra-frame token anchors. Then, building on the temporal frame clips, the first frame within each clip is treated as the keyframe anchor to ensemble similar information from consecutive frames through optimal transport, while keeping distinct tokens to represent temporal dynamics, leading to efficient token reduction in a training-free manner. Extensive evaluations show that our proposed AOT obtains competitive performance across various short- and long-video benchmarks on leading video LLMs, obtaining substantial computational efficiency while preserving temporal and visual fidelity.
PaperID: 250,   Poster  https://arxiv.org/pdf/2511.19024     GitHub GitHub
Authors: Tang Long, Huiyu Duan, Guoquan Zheng, Jianbo Zhang, Jie Hao, Liang Yuan
Title: Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling
Abstract: Blind image quality assessment (BIQA) plays a crucial role in evaluating and optimizing visual experience. Most existing BIQA approaches fuse shallow and deep features extracted from backbone networks, while overlooking the unequal contributions to quality prediction. Moreover, while various vision encoder backbones are widely adopted in BIQA, the effective quality decoding architectures remain underexplored. To address these limitations, this paper investigates the contributions of shallow and deep features to BIQA, and proposes an effective quality feature decoding framework via GCN-enhanced layer interaction and MoE-based feature decoupling, termed Life-IQA. Specifically, the GCN-enhanced layer interaction module utilizes the GCN-enhanced deepest-layer features as query and the penultimate-layer features as key and value, then performs cross-attention to achieve feature interaction. Moreover, a MoE-based feature decoupling module is proposed to decouple fused representations through different experts specialized for specific distortion types or quality dimensions. Extensive experiments demonstrate that Life-IQA shows a more favorable balance between accuracy and cost than a vanilla Transformer decoder and achieves state-of-the-art performance on multiple BIQA benchmarks. The code will be released upon publication.
PaperID: 251,   Poster  https://arxiv.org/pdf/2508.03643     GitHub GitHub
Authors: Xiangyu Sun, Haoyi Jiang, Liu Liu, Seungtae Nam, Gyeongjin Kang, Xinjie Wang, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang, Eunbyung Park
Title: Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images
Abstract: Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce a novel feed-forward framework that reconstructs 3D scenes from unposed multi-view images. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction, all within a single feed-forward pass. Extensive experiments demonstrate that this method establishes a new state-of-the-art across multiple benchmarks, including RE10K and ScanNet. Our work signifies a novel paradigm towards generalizable 3D scene reconstruction.
PaperID: 252,   Poster  https://arxiv.org/pdf/2601.10744     GitHub GitHub
Authors: Sen Wang, Bangwei Liu, Zhenkun Gao, Lizhuang Ma, Xuhong Wang, Yuan Xie, Xin Tan
Title: Explore with Longterm Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration
Abstract: An ideal embodied agent should possess lifelong learning capabilities to handle long-horizon and complex tasks, enabling continuous operation in general environments. This not only requires the agent to accurately accomplish given tasks but also to leverage long-term episodic memory to optimize decision-making. However, existing mainstream one-shot embodied tasks primarily focus on task completion results, neglecting the crucial process of exploration and memory utilization. To address this, we propose Long-term Memory Embodied Exploration (LMEE), which aims to unify the agent’s exploratory cognition and decision-making behaviors to promote lifelong learning. We further construct a corresponding dataset and benchmark, LMEE-Bench, incorporating multi-goal navigation and memory-based question answering to comprehensively evaluate both the process and outcome of embodied exploration. To enhance the agent’s memory recall and proactive exploration capabilities, we propose MemoryExplorer, a novel method that fine-tunes a multimodal large language model through reinforcement learning to encourage active memory querying. By incorporating a multi-task reward function that includes action prediction, frontier selection, and question answering, our model achieves proactive exploration. Extensive experiments against state-of-the-art embodied exploration models demonstrate that our approach achieves significant advantages in long-horizon embodied tasks.
PaperID: 253,   Poster  https://arxiv.org/pdf/2602.19715     GitHub GitHub
Authors: Kartik Kuckreja, Parul Gupta, Muhammad Haris Khan, Abhinav Dhall
Title: Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision
Abstract: Deepfake detection models often generate natural-language explanations, yet their reasoning is frequently ungrounded in visual evidence, limiting reliability. Existing evaluations measure classification accuracy but overlook reasoning fidelity. We propose DeepfakeJudge, a framework for scalable reasoning supervision and evaluation that integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models that specialize in evaluating reasoning rationales without the need for explicit ground-truth reasoning rationales. The Judge is optimized through a bootstrapped generator–evaluator process that scales human feedback into structured reasoning supervision and supports both pointwise and pairwise evaluation. On the proposed meta-evaluation benchmark, our reasoning-bootstrapped model achieves an accuracy of 96.2%, outperforming 30x larger baselines. The reasoning judge attains very high correlation with human ratings and 98.9% pairwise agreement on the human-annotated meta-evaluation subset. These results establish reasoning fidelity as a quantifiable dimension of deepfake detection and demonstrate scalable supervision for interpretable deepfake reasoning. Our user study indicates that humans prefer reasonings generated by our framework 70% of the time in faithfulness, groundedness, and usefulness compared to other models and datasets. All of our datasets, models, and codebase will be open-sourced.
PaperID: 254,   Poster  https://arxiv.org/pdf/2601.05149     GitHub
Authors: Elia Peruzzo, Guillaume Sautiere, Amirhossein Habibian
Title: Multi-Scale Speculative Decoding
Abstract: Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups of up to 1.7×, outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
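For readers unfamiliar with the verification step that speculative decoding relies on, here is a minimal sketch of the standard per-token accept/resample rule from generic speculative sampling. It is not MuLo-SD's local rejection and resampling mechanism; the function name and arguments are illustrative, and `p_target` and `q_draft` are assumed to be probability vectors over the same vocabulary.

```python
import numpy as np

def verify_draft_token(token, p_target, q_draft, rng):
    """Return (accepted, final_token) for one drafted token.

    Hypothetical helper implementing the standard rule: accept with
    probability min(1, p/q); otherwise resample from the normalized
    residual max(p - q, 0), which keeps the output distributed as p_target.
    """
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return True, token
    residual = np.maximum(p_target - q_draft, 0.0)
    residual = residual / residual.sum()
    return False, int(rng.choice(len(residual), p=residual))
```

When the draft and target distributions agree, every drafted token is accepted; the further they diverge, the more often the target model has to resample.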
PaperID: 255,   Poster  https://arxiv.org/pdf/2512.07833     GitHub
Authors: Thao Nguyen, Sicheng Mo, Krishna Kumar Singh, Yilin Wang, Jing Shi, Nicholas Kolkin, Eli Shechtman, Yong Jae Lee, Yuheng Li
Title: Relational Visual Similarity
Abstract: Humans do not just see attribute similarity; we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach’s skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate a 114k image–caption dataset in which the captions are anonymized, describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision Language Model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has a lot of real-world applications, existing image similarity models fail to capture it, revealing a critical gap in visual computing.
PaperID: 256,   Poster  https://arxiv.org/pdf/2603.17375     GitHub
Authors: Yangtian Sun, Zehuan Huang, Yifan Niu, Lin Ma, Yan-Pei Cao, Yuewen Ma, Xiaojuan Qi
Title: Stereo World Model
Abstract: We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, achieving more than 3x faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.
PaperID: 257,   Poster  https://arxiv.org/pdf/2512.03216     GitHub
Authors: Alex Bocchieri, John Mamish, David Appleyard, Andreas Velten
Title: Kaleidoscopic Scintillation Event Imaging
Abstract: Scintillators are transparent materials that interact with high-energy particles and emit visible light as a result. They are used in state-of-the-art methods of measuring high-energy particles and radiation sources. Most existing methods use fast single-pixel detectors to detect and time scintillation events. Cameras provide spatial resolution but can only capture an average over many events, making it difficult to image the events associated with an individual particle. Emerging single-photon avalanche diode cameras combine speed and spatial resolution to enable capturing images of individual events. This allows us to use machine vision techniques to analyze events, enabling new types of detectors. The main challenge is the very low brightness of the events. Techniques have to work with a very limited number of photons. We propose a kaleidoscopic scintillator to increase light collection in a single-photon camera while preserving the event's spatial information. The kaleidoscopic geometry creates mirror reflections of the event in known locations for a given event location that are captured by the camera. We introduce theory for imaging an event in a kaleidoscopic scintillator and an algorithm to estimate the event's 3D position. We find that the kaleidoscopic scintillator design provides sufficient light collection to perform high-resolution event measurements for advanced radiation imaging techniques using a commercial CMOS single-photon camera.
PaperID: 258,   Poster  https://arxiv.org/pdf/2603.29387     GitHub
Authors: Seungwoo Yoon, Jinmo Kim, Jaesik Park
Title: Extend3D: Town-scale 3D Generation
Abstract: In this paper, we propose Extend3D, a novel training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces of object-centric models in representing wide scenes, we extend the latent space in x and y directions. Then, by dividing the extended latent into overlapping patches, we use the object-centric 3D generative model on each patch and couple them at each time step. Since object-centric models are sub-optimal for sub-scene generation, we use the input image and point cloud extracted from a depth estimator as priors to enable this process. Using the point cloud prior, we initialize the scene structure and refine the occluded region iteratively with under-noised SDEdit. Also, both priors are used to optimize the extended latent during the denoising process so that the denoising paths do not deviate from the sub-scene dynamics. We demonstrate that our method produces better results than previous methods, as evidenced by human preferences.
PaperID: 259,   Poster  https://arxiv.org/pdf/2512.06905     GitHub
Authors: Zijian Zhou, Shikun Liu, Haozhe Liu, Haonan Qiu, Zhaochong An, Weiming Ren, Zhiheng Liu, Xiaoke Huang, Kam Woh Ng, Tian Xie, Xiao Han, Yuren Cong, Hang Li, Chuyan Zhu, Aditya Patel, Tao Xiang, Sen He
Title: Scaling Zero-Shot Reference-to-Video Generation
Abstract: Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.
PaperID: 260,   Poster  https://arxiv.org/pdf/2601.22680     GitHub
Authors: Rameen Abdal, James Burgess, Sergey Tulyakov, Kuan-Chieh Wang
Title: Visual Personalization Turing Test
Abstract: We introduce the Visual Personalization Turing Test (VPTT), a new paradigm for evaluating contextual visual personalization based on perceptual indistinguishability, rather than identity replication. A model passes the VPTT if its output (image, video, 3D asset, etc.) is indistinguishable to a human or calibrated VLM judge from content a given person might plausibly create or share. To operationalize VPTT, we present the VPTT Framework, integrating a 10k-persona benchmark (VPTT-Bench), a visual retrieval-augmented generator (VPRAG), and the VPTT Score, a text-only metric calibrated against human and VLM judgments. We show high correlation across human, VLM, and VPTT evaluations, validating the VPTT Score as a reliable perceptual proxy. Experiments demonstrate that VPRAG achieves the best alignment–originality balance, offering a scalable and privacy-safe foundation for personalized generative AI.
PaperID: 261,   Poster  https://arxiv.org/pdf/2512.14671     GitHub
Authors: Zizhang Li, Cheng Zhang, Zhengqin Li, Henry Howard-Jenkins, Zhaoyang Lv, Chen Geng, Jiajun Wu, Richard Newcombe, Jakob Engel, Zhao Dong
Title: ART: Articulated Reconstruction Transformer
Abstract: We introduce ART (Articulated Reconstruction Transformer), a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as a part-based prediction problem. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable to standard simulation formats. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.
PaperID: 262,   Poster  https://arxiv.org/pdf/2603.11992     GitHub
Authors: Ping Guo, ZHANG Tiantian, Xi Lin, Xiang Li, Zhi-Ri Tang, Qingfu Zhang
Title: Few-for-Many Personalized Federated Learning
Abstract: Personalized Federated Learning (PFL) aims to train customized models for clients with highly heterogeneous data distributions while preserving data privacy. Existing approaches often rely on heuristics like clustering or model interpolation, which lack principled mechanisms for balancing heterogeneous client objectives. Serving M clients with distinct data distributions is inherently a multi-objective optimization problem, where achieving optimal personalization ideally requires M distinct models on the Pareto front. However, maintaining M separate models poses significant scalability challenges in federated settings with hundreds or thousands of clients. To address this challenge, we reformulate PFL as a few-for-many optimization problem that maintains only K shared server models (K ≪ M) to collectively serve all M clients. We prove that this framework achieves near-optimal personalization: the approximation error diminishes as K increases and converges to each client's optimum as data grows. Building on this reformulation, we propose FedFew, a practical algorithm that jointly optimizes the K server models through efficient gradient-based updates. Unlike clustering-based approaches that require manual client partitioning or interpolation-based methods that demand careful hyperparameter tuning, FedFew automatically discovers the optimal model diversity through its optimization process. Experiments across vision, NLP, and real-world medical imaging datasets demonstrate that FedFew, with just 3 models, consistently outperforms other state-of-the-art approaches.
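The few-for-many idea of serving M clients with K shared models can be illustrated with a small sketch in which each client is served by whichever server model gives it the lowest loss. This is only one plausible reading of the abstract, not FedFew itself: the function name is hypothetical, the loss matrix is assumed to be precomputed, and aggregating the per-client losses by their mean is an assumption (the paper may use a different aggregation).

```python
import numpy as np

def few_for_many_objective(losses):
    """losses[k, m] is the loss of server model k on client m (K x M array).

    Hypothetical helper: each client picks its best model; the mean over
    clients is an assumed aggregation, not taken from the paper.
    """
    losses = np.asarray(losses, dtype=float)
    assignment = losses.argmin(axis=0)   # index of the best model per client
    per_client = losses.min(axis=0)      # loss each client actually incurs
    return per_client.mean(), assignment

# Example with K = 2 server models and M = 3 clients.
objective, assignment = few_for_many_objective([[0.9, 0.2, 0.5],
                                                [0.1, 0.8, 0.4]])
```

In this toy example clients 0 and 2 are served by model 1 and client 1 by model 0, so no manual client partitioning is needed to obtain the grouping.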
PaperID: 263,   Poster  https://arxiv.org/pdf/2602.18858     GitHub
Authors: Ziheng Chen, Bernhard Schölkopf, Nicu Sebe
Title: Hyperbolic Busemann Neural Networks
Abstract: Hyperbolic spaces provide a natural geometry for representing hierarchical and tree-structured data due to their exponential volume growth. To leverage these benefits, neural networks require intrinsic and efficient components that operate directly in hyperbolic space. In this work, we lift two core components of neural networks, Multinomial Logistic Regression (MLR) and Fully Connected (FC) layers, into hyperbolic space via Busemann functions, resulting in Busemann MLR (BMLR) and Busemann FC (BFC) layers with a unified mathematical interpretation. BMLR provides compact parameters, a point-to-horosphere distance interpretation, batch-efficient computation, and a Euclidean limit, while BFC generalizes FC and activation layers with comparable complexity. Experiments on image classification, genome sequence learning, node classification, and link prediction demonstrate improvements in effectiveness and efficiency over prior hyperbolic layers.
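As background for the Busemann functions mentioned above, the sketch below evaluates the standard closed form on the unit Poincaré ball with curvature -1, B_ω(x) = log(‖ω − x‖² / (1 − ‖x‖²)), where ω is an ideal point on the unit sphere. It only illustrates the function itself; how BMLR and BFC parameterize and combine such terms is not specified in the abstract, and the function name is hypothetical.

```python
import numpy as np

def busemann_poincare(x, omega):
    """Busemann function on the unit Poincare ball (curvature -1).

    x: point(s) with norm < 1, shape (..., d); omega: unit vector, shape (d,).
    Level sets of this function are horospheres centered at omega.
    """
    x = np.asarray(x, dtype=float)
    omega = np.asarray(omega, dtype=float)
    num = np.sum((omega - x) ** 2, axis=-1)
    den = 1.0 - np.sum(x ** 2, axis=-1)
    return np.log(num / den)

omega = np.array([1.0, 0.0])
value_at_origin = busemann_poincare(np.zeros(2), omega)   # equals 0
value_toward_omega = busemann_poincare(0.5 * omega, omega)  # equals -log(3)
```

Moving from the origin toward ω along the geodesic decreases the value by exactly the hyperbolic distance travelled, which is what gives the point-to-horosphere distance interpretation.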
PaperID: 264,   Poster  https://arxiv.org/pdf/2603.21786     GitHub
Authors: Chen Tasker, Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa
Title: The Universal Normal Embedding
Abstract: Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation.
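Illustrative sketch: the abstract mentions "simple orthogonalization" of attribute directions without giving the procedure. One common reading, assumed here, is to remove from an edit direction its projection onto a nuisance direction, so that moving along the result leaves the nuisance attribute's linear probe unchanged. The vectors and names below are hypothetical toy values, not taken from the paper.

```python
def orthogonalize(direction, nuisance):
    """Remove from `direction` its component along `nuisance` (one Gram-Schmidt step)."""
    dot = sum(d * n for d, n in zip(direction, nuisance))
    nuisance_sq_norm = sum(n * n for n in nuisance)
    scale = dot / nuisance_sq_norm
    return [d - scale * n for d, n in zip(direction, nuisance)]


# Toy 3-D latent directions (made up): the "smile" probe leaks into "gender".
smile_direction = [1.0, 1.0, 0.0]
gender_direction = [0.0, 1.0, 0.0]
clean_smile = orthogonalize(smile_direction, gender_direction)

# Inner product with the nuisance direction after orthogonalization.
leakage = sum(c * g for c, g in zip(clean_smile, gender_direction))
```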
PaperID: 265,   Poster  https://arxiv.org/pdf/2511.15661     GitHub
Authors: Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang
Title: VisPlay: Self-Evolving Vision-Language Models
Abstract: Reinforcement learning (RL) provides a principled framework for improving vision-language models (VLMs) on complex reasoning tasks. However, existing RL approaches often depend on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and limited in scalability. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning capabilities from massive unlabeled image data. Starting from a single base VLM, VisPlay assigns the model to two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained using Group Relative Policy Optimization (GRPO), which uses diversity and difficulty rewards to balance the difficulty of generated questions with the quality of silver answers. VisPlay scales efficiently across two model families. Trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks including MM-Vet and MMMU, and establishes a scalable path toward self-evolving multimodal intelligence.
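Illustrative sketch: GRPO, as commonly described, scores each sampled response relative to the other responses drawn for the same prompt by standardizing rewards within the group. The snippet below shows only that normalization step. The reward values are made up, the use of the population standard deviation and the `eps` term are assumptions, and VisPlay's actual diversity and difficulty rewards are not reproduced.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one question, with hypothetical scalar rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and those below receive negative ones, so no separately learned value function is needed for the baseline.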
PaperID: 266,   Poster  https://arxiv.org/pdf/2503.07507     GitHub
Authors: Jie Hu, Shizun Wang, Xinchao Wang
Title: PE3R: Perception-Efficient 3D Reconstruction
Abstract: Recent advances in 2D-to-3D perception have enabled the recovery of 3D scene semantics from unposed images. However, prevailing methods often suffer from limited generalization, reliance on per-scene optimization, and semantic inconsistencies across viewpoints. To address these limitations, we introduce PE3R, a tuning-free framework for efficient and generalizable 3D semantic reconstruction. By integrating multi-view geometry with 2D semantic priors in a feed-forward pipeline, PE3R achieves zero-shot generalization across diverse scenes and object categories without any scene-specific fine-tuning. Extensive evaluations on open-vocabulary segmentation and multi-view depth estimation show that PE3R not only achieves up to 9× faster inference but also sets new state-of-the-art accuracy in both semantic and geometric metrics. Our approach paves the way for scalable, language-driven 3D scene understanding. Code is available in supplementary material for reproducibility.
PaperID: 267,   Poster  https://arxiv.org/pdf/2602.23058     GitHub
Authors: Zeyu Zhang, Danning Li, Ian Reid, Richard Hartley
Title: GeoWorld: Geometric World Models
Abstract: Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA-2.
PaperID: 268,   Poster  https://arxiv.org/pdf/2603.07694     GitHub
Authors: Yuhang Wang, Hai Li, Shujuan Hou, zhetao dong, yangxiaoyao yangxiaoyao
Title: Compressed-Domain-Aware Online Video Super-Resolution
Abstract: In bandwidth-limited online video streaming, videos are usually downsampled and compressed. Although recent online video super-resolution (online VSR) approaches achieve promising results, they are still compute-intensive and fall short of real-time processing at higher resolutions, due to complex motion estimation for alignment and redundant processing of consecutive frames. To address these issues, we propose a compressed-domain-aware network (CDA-VSR) for online VSR, which utilizes compressed-domain information, including motion vectors, residual maps, and frame types, to balance quality and efficiency. Specifically, we propose a motion-vector-guided deformable alignment module that uses motion vectors for coarse warping and learns only local residual offsets for fine-tuned adjustments, thereby maintaining accuracy while reducing computation. Then, we utilize a residual-map-guided gated fusion module to derive spatial weights from residual maps, suppressing mismatched regions and emphasizing reliable details. Further, we design a frame-type-aware reconstruction module for adaptive compute allocation across frame types, balancing accuracy and efficiency. On the REDS4 dataset, our CDA-VSR surpasses the state-of-the-art method TMP, with a maximum PSNR improvement of 0.13 dB while delivering more than double the inference speed.
PaperID: 269,   Poster  https://arxiv.org/pdf/2512.09363     GitHub
Authors: Ke Xing, longfei li, Yuyang Yin, Hanwen Liang, Guixun Luo, Chen Fang, Jue Wang, Konstantinos N. Plataniotis, Xiaojie Jin, Yao Zhao, Yunchao Wei
Title: StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
Abstract: The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency.
PaperID: 270,   Poster  https://arxiv.org/pdf/2604.03296     GitHub
Authors: Chushan Zhang, Ruihan Lu, Jinguang Tong, Yikai Wang, Hongdong Li
Title: 3D-IDE: 3D Implicit Depth Emergent
Abstract: Leveraging 3D information within Multimodal Large Language Models (MLLMs) has recently shown significant advantages for indoor scene understanding. However, existing methods, including those using explicit ground-truth 3D positional encoding and those grafting external 3D foundation models for implicit geometry, struggle with the trade-off in 2D-3D representation fusion, leading to suboptimal deployment. To this end, we propose 3D-Implicit Depth Emergence, a method that reframes 3D perception as an emergent property derived from geometric self-supervision rather than explicit encoding. Our core insight is the Implicit Geometric Emergence Principle: by strategically leveraging privileged geometric supervision through mechanisms like a fine-grained geometry validator and global representation constraints, we construct an information bottleneck. This bottleneck forces the model to maximize the mutual information between visual features and 3D structures, allowing 3D awareness to emerge naturally within a unified visual representation. Unlike existing approaches, our method enables 3D perception to emerge implicitly, disentangling features in dense regions and, crucially, eliminating depth and pose dependencies during inference with zero latency overhead. This paradigm shift from external grafting to implicit emergence represents a fundamental rethinking of 3D knowledge integration in visual-language models. Extensive experiments demonstrate that our method surpasses SOTA on multiple 3D scene understanding benchmarks. Our approach achieves a 55% reduction in inference latency while maintaining strong performance across diverse downstream tasks, underscoring the effectiveness of meticulously designed auxiliary objectives for dependency-free 3D understanding. Source code will be made publicly available upon acceptance.
PaperID: 271,   Poster  https://arxiv.org/pdf/2603.24571     GitHub
Authors: Yubo Li, Xugong Qin, peng zhang, Hailun Lin, Gangyan Zeng, Kexin Zhang
Title: Towards Training-free Scene Text Editing
Abstract: Scene text editing seeks to modify textual content in natural images while maintaining visual realism and semantic consistency. Existing methods often require task-specific training or paired data, limiting their scalability and adaptability. In this paper, we propose TextFlow, a training-free scene text editing framework that integrates the strengths of Attention Boost (AttnBoost) and Flow Manifold Steering (FMS) to enable flexible, high-fidelity text manipulation without additional training. Specifically, FMS preserves the structural and style consistency by modeling the visual flow of characters and background regions, while AttnBoost enhances the rendering of textual content through attention-based guidance. By jointly leveraging these complementary modules, our approach performs end-to-end text editing through semantic alignment and spatial refinement in a plug-and-play manner. Extensive experiments demonstrate that our framework achieves visual quality and text accuracy comparable to or superior to those of training-based counterparts, generalizing well across diverse scenes and languages. This study advances scene text editing toward a more efficient, generalizable, and training-free paradigm.
PaperID: 272,   Poster  https://arxiv.org/pdf/2512.00993     GitHub
Authors: Zhiyuan You, Ke Wang, He Zhang, Xin Cai, Jinjin Gu, Tianfan Xue, Chao Dong, Zhoutong Zhang
Title: PhotoFramer: Multi-modal Image Composition Instruction
Abstract: Composition matters during the photo-taking process, yet many casual users struggle to frame well-composed images. To provide composition guidance, we introduce PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates a well-composed example image. To train such a model, we curate a large-scale dataset. Inspired by how humans take photos, we organize composition guidance into a hierarchy of sub-tasks: shift, zoom-in, and view-change tasks. Shift and zoom-in data are sampled from existing cropping datasets, while view-change data are obtained via a two-stage pipeline. First, we sample pairs with varying viewpoints from multi-view datasets, and train a degradation model to transform well-composed photos into poorly composed ones. Second, we apply this degradation model to expert-taken photos to synthesize poor images to form training pairs. Using this dataset, we finetune a model that jointly processes and generates both text and images, enabling actionable textual guidance with illustrative examples. Extensive experiments demonstrate that textual instructions effectively steer image composition, and coupling them with exemplars yields consistent improvements over exemplar-only baselines. PhotoFramer offers a practical step toward composition assistants that make expert photographic priors accessible to everyday users.
PaperID: 273,   Poster  https://arxiv.org/pdf/2511.14761     GitHub
Authors: Keya Hu, Ali Cy, Linlu Qiu, Delores(Xiaoman) Ding, Runqian Wang, Yeyin Zhu, Jacob Andreas, Kaiming He
Title: ARC Is a Vision Problem!
Abstract: The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a “canvas” that can be processed like natural images. It is then straightforward for us to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.
PaperID: 274,   Poster  https://arxiv.org/pdf/2603.17554     GitHub
Authors: Qihong Tang, Changhan Liu, Shaofeng Zhang, Wenbin Li, Qi Fan, Yang Gao
Title: Prompt-Free Universal Region Proposal Network
Abstract: Identifying potential objects is critical for object recognition and analysis across various computer vision applications. Existing methods typically localize potential objects by relying on exemplar images, predefined categories, or textual descriptions. However, their reliance on image and text prompts often limits flexibility, restricting adaptability in real-world scenarios. In this paper, we introduce a novel Prompt-Free Universal Region Proposal Network, which identifies potential objects without relying on external prompts. First, the Sparse Image-Aware Adapter (SIA) module performs initial localization of potential objects using a learnable query embedding dynamically updated with visual features. Next, the Cascade Self-Prompt (CSP) module identifies the remaining potential objects by leveraging the self-prompted learnable embedding, autonomously aggregating informative visual features in a cascading manner. Finally, the Centerness-Guided Query Selection (CG-QS) module facilitates the selection of high-quality query embeddings using a centerness scoring network. Our method can be optimized with limited data (e.g., 5% of MS COCO data) and applied directly to various object detection application domains for identifying potential objects without fine-tuning, such as underwater object detection, industrial defect detection, and remote sensing image object detection. Experimental results across 19 datasets validate the effectiveness of our method.
PaperID: 275,   Poster  https://arxiv.org/pdf/2604.05039     GitHub
Authors: Julia Chae, Nicholas Kolkin, Jui-Hsien Wang, Richard Zhang, Sara Beery, Cusuh Ham
Title: ID-Sim: An Identity-Focused Similarity Metric
Abstract: Humans have remarkable selective sensitivity to identities: they easily distinguish between highly similar identities, even across significantly different contexts such as diverse viewpoints or lighting. Vision models have struggled to match this capability, and progress towards identity-focused tasks such as personalized image generation is slowed by a lack of identity-focused evaluation metrics. To help facilitate progress, we propose ID-Sim, a feed-forward metric designed to faithfully reflect human selective sensitivity. To build ID-Sim, we curate a high-quality training set of images spanning diverse real-world domains, augmented with generative synthetic data that provides controlled, fine-grained identity and contextual variations. We evaluate our metric on a new unified evaluation benchmark for assessing consistency with human annotations across identity-focused recognition, retrieval, and generative tasks.
PaperID: 276,   Poster  https://arxiv.org/pdf/2508.08605     GitHub
Authors: Honglei Xu, Zhilu Zhang, Junjie Fan, Xiaohe Wu, Wangmeng Zuo
Title: SelfHVD: Self-Supervised Handheld Video Deblurring
Abstract: Shooting video with handheld devices often results in blurry frames due to shaking hands and other instability factors. Although previous video deblurring methods have achieved impressive progress, they still struggle to perform satisfactorily on real-world handheld video due to the blur domain gap between training and testing data. To address the issue, we propose a self-supervised method for handheld video deblurring, which is driven by sharp clues in the video. First, to train the deblurring model, we extract the sharp clues from the video and take them as misalignment labels of neighboring blurry frames. Second, to improve the deblurring ability of the model, we propose a novel Self-Enhanced Video Deblurring (SEVD) method to create higher-quality paired video data. Third, we propose a Self-Constrained Spatial Consistency Maintenance (SCSCM) method to regularize the model, preventing position shifts between the output and input frames. Moreover, we construct synthetic and real-world handheld video datasets for handheld video deblurring. Extensive experiments on these and other common real-world datasets demonstrate that our method significantly outperforms existing self-supervised ones. The code and datasets will be publicly available.
PaperID: 277,   Poster  https://arxiv.org/pdf/2603.27500     GitHub
Authors: Chang Sun, LiaoDongliang LiaoDongliang, Changxing Ding
Title: Streamlined Open-Vocabulary Human-Object Interaction Detection
Abstract: Open-vocabulary human-object interaction (HOI) detection aims to localize and recognize all human-object interactions in an image, including those unseen during training. Existing approaches usually rely on the collaboration between a conventional HOI detector and a Vision-Language Model (VLM) to recognize unseen HOI categories. However, feature fusion in this paradigm is challenging due to significant gaps in cross-model representations. To address this issue, we introduce SL-HOI, a StreamLined open-vocabulary HOI detection framework based solely on the powerful DINOv3 model. Our design leverages the complementary strengths of DINOv3's components: its backbone for fine-grained localization and its text-aligned vision head for open-vocabulary interaction classification. Moreover, to facilitate smooth cross-attention between the interaction queries and the vision head's output, we propose first feeding both the interaction queries and the backbone patch tokens into the vision head, effectively bridging their representation gaps. All DINOv3 parameters in our approach are frozen, with only a small number of learnable parameters added, allowing a fast adaptation to the HOI detection task. Extensive experiments show that SL-HOI achieves state-of-the-art performance on both the SWiG-HOI and HICO-DET benchmarks, demonstrating the effectiveness of our streamlined model architecture. The code of this work will be released soon.
PaperID: 278,   Poster  https://arxiv.org/pdf/2604.06467     GitHub
Authors: Berna Kabadayi, Vanessa Sklyarova, Wojciech Zielonka, Justus Thies, Gerard Pons-Moll
Title: PhysHead: Simulation-Ready Gaussian Head Avatars
Abstract: Realistic digital avatars require expressive and dynamic hair motion, yet most existing head avatar methods assume rigid hair movement. These methods often fail to disentangle hair from the head, representing it as a simple outer shell and failing to capture its natural volumetric behavior. In this paper, we address these limitations by introducing PhysHead, a hybrid representation for animatable head avatars with realistic hair dynamics learned from multi-view video. Our approach combines a 3D parametric mesh for the head with strand-based hair, which can be directly simulated using physics engines. For the appearance model, we employ Gaussian primitives attached to both the head mesh and hair segments. This representation enables the creation of photorealistic head avatars with dynamic hair behavior, such as wind-blown motion, overcoming the constraints of rigid hair in existing methods. However, these animation capabilities also require new training schemes. In particular, we propose the use of VLM-based models to generate appearance information for regions that are occluded in the dynamic training sequences. In quantitative and qualitative studies, we demonstrate the capabilities of the proposed model and compare it with existing baselines. We show that our method is able to synthesize physically plausible hair motion in addition to expression and camera control. The code will be released for research purposes.
PaperID: 279,   Poster  https://arxiv.org/pdf/2411.16758     GitHub
Authors: Muyao Niu, Yifan Zhan, Qingtian Zhu, Zhuoxiao Li, Wei Wang, Zhihang Zhong, Xiao Sun, Yinqiang Zheng
Title: Motion-Aware Animatable Gaussian Avatars Deblurring
Abstract: The creation of 3D human avatars from multi-view videos is a significant yet challenging task in computer vision. However, existing techniques rely on high-quality, sharp images as input, which are often impractical to obtain in real-world scenarios due to variations in human motion speed and intensity. This paper introduces a novel method for directly reconstructing sharp 3D human Gaussian avatars from blurry videos. The proposed approach incorporates a 3D-aware, physics-based model of blur formation caused by human motion, together with a 3D human motion model designed to resolve ambiguities in motion-induced blur. This framework enables the joint optimization of the avatar representation and motion parameters from a coarse initialization. Comprehensive benchmarks are established using both a synthetic dataset and a real-world dataset captured with a 360-degree synchronous hybrid-exposure camera system. Extensive evaluations demonstrate the effectiveness and robustness of the model across diverse conditions.
PaperID: 280,   Poster  https://arxiv.org/pdf/2506.01929     GitHub
Authors: Saar Huberman, Or Patashnik, Omer Dahary, Ron Mokady, Daniel Cohen-Or
Title: Image Generation from Contextually-Contradictory Prompts
Abstract: Text-to-image diffusion models excel at generating high-quality, diverse images from natural language prompts. However, they often fail to produce semantically accurate results when the prompt contains concept combinations that contradict their learned priors. We define this failure mode as contextual contradiction, where one concept implicitly negates another due to entangled associations learned during training. To address this, we propose a stage-aware prompt decomposition framework that guides the denoising process using a sequence of proxy prompts. Each proxy prompt is constructed to match the semantic content expected to emerge at a specific stage of denoising, while ensuring contextual coherence. To construct these proxy prompts, we leverage a large language model (LLM) to analyze the target prompt, identify contradictions, and generate alternative expressions that preserve the original intent while resolving contextual conflicts. By aligning prompt information with the denoising progression, our method enables fine-grained semantic control and accurate image generation in the presence of contextual contradictions. Experiments across a variety of challenging prompts show substantial improvements in alignment to the textual prompt.
PaperID: 281,   Poster  https://arxiv.org/pdf/2603.26584     GitHub
Authors: Tamir Cohen, Leo Segre, Shay Shomer-Chai, Shai Avidan, Hadar Averbuch-Elor
Title: Scene Grounding in the Wild
Abstract: Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models. All code, data, and trained models will be released.
PaperID: 282,   Poster  https://arxiv.org/pdf/2512.11130     GitHub
Authors: Bowen Wen, Shaurya Dewan, Stan Birchfield
Title: Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
Abstract: Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10× faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods.
PaperID: 283,   Poster  https://arxiv.org/pdf/2603.21002     GitHub
Authors: Kaixin Ding, Xi Chen, Sihui Ji, Yuan Gao, Liang Hou, Xin Tao, Hengshuang Zhao
Title: SURF: Signature-retained Fast Video Generation
Abstract: The demand for high-resolution video generation is growing rapidly. However, the generation resolution is severely constrained by slow inference speeds. For instance, Wan2.1 requires over 50 minutes to generate a single 720p video. While previous works explore accelerating video generation from various aspects, most of them compromise the distinctive signatures (e.g., layout, semantics, motion) of the original model. In this work, we propose SURF, an efficient framework for generating high-resolution videos while preserving these signatures as much as possible. Specifically, SURF divides video generation into two stages: first, we leverage the pretrained model to infer at optimal resolution and downsample the latent to generate low-resolution previews quickly; then we design a Refiner to upscale the preview. In the preview stage, we find that directly running a model trained at higher resolution on lower resolution causes severe losses in signatures. We therefore introduce noise reshifting, a training-free technique that mitigates this issue by conducting initial denoising steps at the original resolution and switching to low resolution in later steps. In the refine stage, we establish a mapping relationship between the preview and the high-resolution target, which significantly reduces the denoising steps. We further integrate shifting windows and carefully design the training paradigm to obtain a powerful and efficient Refiner. In this way, SURF generates high-resolution videos efficiently while staying as close as possible to the signatures of the given pretrained model. SURF is conceptually simple and could serve as a plug-in that is compatible with various base models and acceleration methods. For example, it achieves a 12.5× speedup for generating 5-second, 16fps, 720p Wan 2.1 videos and an 8.7× speedup for generating 5-second, 24fps, 720p HunyuanVideo videos.
PaperID: 284,   Poster  https://arxiv.org/pdf/2512.22374     GitHub
Authors: Xin Yu, Xiaojuan Qi, Zhengqi Li, Kai Zhang, Richard Zhang, Zhe Lin, Eli Shechtman, Tianyu Wang, Yotam Nitzan
Title: Self-Evaluation Unlocks Any-Step Text-to-Image Generation
Abstract: We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.
PaperID: 285,   Poster  https://arxiv.org/pdf/2507.02751     GitHub
Authors: Mingxin Liu, Peiyuan Zhang, Yuan Liu, Wei Zhang, Yue Zhou, Ning Liao, Ziyang Gong, Junwei Luo, Zhirui Wang, Yi Yu, Xue Yang
Title: Partial Weakly-Supervised Oriented Object Detection
Abstract: The growing demand for oriented object detection (OOD) across various domains has driven significant research in this area. However, the high cost of dataset annotation remains a major concern. Current mainstream OOD algorithms can be mainly categorized into three types: (1) fully supervised methods using complete oriented bounding box (OBB) annotations, (2) semi-supervised methods using partial OBB annotations, and (3) weakly supervised methods using weak annotations such as horizontal boxes or points. However, these algorithms inevitably increase the cost of models in terms of annotation speed or annotation cost. To address this issue, we propose: (1) the first Partial Weakly-Supervised Oriented Object Detection (PWOOD) framework based on partially weak annotations (horizontal boxes or single points), which efficiently leverages large amounts of unlabeled data, significantly outperforms weakly supervised algorithms trained with partially weak annotations, and offers a lower-cost solution; (2) an Orientation-and-Scale-aware Student (OS-Student) model capable of learning orientation and scale information with only a small amount of orientation-agnostic or scale-agnostic weak annotations; and (3) a Class-Agnostic Pseudo-Label Filtering strategy (CPF) to reduce the model's sensitivity to static filtering thresholds. Comprehensive experiments on the DOTA-v1.0/v1.5/v2.0 and DIOR datasets demonstrate that our PWOOD framework performs comparably to, or even surpasses, traditional semi-supervised algorithms. Our code will be made publicly available.
PaperID: 286,   Poster  https://arxiv.org/pdf/2604.00519     GitHub
Authors: Jeffrey A. Chan-Santiago, Mubarak Shah
Title: Learnability-Guided Diffusion for Dataset Distillation
Abstract: Training machine learning models on massive datasets is expensive and time-consuming. Dataset distillation addresses this by creating a small synthetic dataset that achieves the same performance as the full dataset. Recent methods use diffusion models to generate distilled datasets, either by producing diverse samples or by matching the training gradients of the original data. However, existing distilled datasets contain redundant training signals: samples provide overlapping information. Empirically, disjoint subsets of existing distilled datasets capture 70–80% overlapping training signals. This redundancy arises because existing methods optimize for visual diversity or average training trajectories without accounting for training signal similarity across samples. This produces datasets where multiple samples teach the model similar information rather than providing complementary knowledge across training stages. We propose learnability-driven dataset distillation, which constructs synthetic datasets incrementally through successive stages. Starting from a small distilled dataset, we train a model and generate new samples guided by learnability scores that identify what the current model can learn from, creating an adaptive curriculum. We introduce learnability-guided diffusion that balances current-model informativeness with reference-model validity, automatically generating curriculum-aligned samples. Our approach reduces redundancy by 39.1%, enables specialization across training phases, and achieves state-of-the-art results on ImageNet-1K (60.1%), ImageNette (87.2%), and ImageWoof (72.9%).
PaperID: 287,   Poster  https://arxiv.org/pdf/2603.03281     GitHub
Authors: Hanyang Wang, Yiyang Liu, Jiawei Chi, Fangfu Liu, Ran Xue, Yueqi Duan
Title: CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance
Abstract: Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called CFG-Ctrl, which reinterprets CFG as a control applied to the first-order continuous-time generative flow, using the conditional–unconditional discrepancy as an error signal to adjust the velocity field. From this perspective, we summarize vanilla CFG as a proportional controller (P-control) with fixed gain, and typical follow-up variants develop extended control-law designs derived from it. However, existing methods mainly rely on linear control, inherently leading to instability, overshooting, and degraded semantic fidelity, especially at large guidance scales. To address this, we introduce Sliding Mode Control CFG (SMC-CFG), which drives the generative flow toward a rapidly convergent sliding manifold. Specifically, we define an exponential sliding mode surface over the semantic prediction error and introduce a switching control term to establish nonlinear feedback-guided correction. Moreover, we provide a Lyapunov stability analysis to theoretically support finite-time convergence. Experiments across text-to-image generation models including Stable Diffusion 3.5, Flux, and Qwen-Image demonstrate that SMC-CFG outperforms standard CFG in semantic alignment and enhances robustness across a wide range of guidance scales.
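The proportional-control reading of vanilla CFG described in the abstract above can be sketched in a few lines. This is only an illustration, not the authors' code: the function name `cfg_p_control`, the list-based velocity vectors, and the toy values are assumptions, and the sketch uses the commonly cited CFG form v = v_uncond + w * (v_cond - v_uncond), where the guidance scale w plays the role of the fixed gain. The sliding-mode variant (SMC-CFG) is not reproduced here because the abstract does not give its exact control law.

```python
def cfg_p_control(v_cond, v_uncond, w):
    """Vanilla CFG viewed as a proportional controller (illustrative sketch).

    error  = v_cond - v_uncond     (conditional-unconditional discrepancy)
    output = v_uncond + w * error  (fixed gain w, i.e. the guidance scale)
    """
    error = [c - u for c, u in zip(v_cond, v_uncond)]
    return [u + w * e for u, e in zip(v_uncond, error)]


# Toy 2-D velocities; the numbers are made up for illustration only.
v_c = [1.0, 0.5]
v_u = [0.0, 0.5]
guided = cfg_p_control(v_c, v_u, w=3.0)
print(guided)  # [3.0, 0.5]
```

With w = 1 the function returns the conditional velocity unchanged; larger gains push the output further along the error direction, which is the regime where the abstract reports instability and overshooting for linear control.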
PaperID: 288,   Poster  https://arxiv.org/pdf/2505.02242     GitHub
Authors: Qian Zeng, Jie Song, Yuanyu Wan, Huiqiong Wang, Mingli Song
Title: Sampling-Aware Quantization for Diffusion Models
Abstract: Diffusion models have recently emerged as the dominant approach in visual generation tasks. However, the lengthy denoising chains and the computationally intensive noise estimation networks hinder their applicability in low-latency and resource-limited environments. Previous research has endeavored to address these limitations in a decoupled manner, utilizing either advanced samplers or efficient model quantization techniques. In this study, we uncover that quantization-induced noise disrupts directional estimation at each sampling step, further distorting the precise directional estimations of higher-order samplers when solving the sampling equations through discretized numerical methods, thereby altering the optimal sampling trajectory. To attain dual acceleration with high fidelity, we propose a sampling-aware quantization strategy, wherein a Mixed-Order Trajectory Alignment technique is devised to impose a more stringent constraint on the error bounds at each sampling step, facilitating a more linear probability flow. Extensive experiments on sparse-step fast sampling across multiple datasets demonstrate that our approach preserves the rapid convergence characteristics of high-speed samplers while maintaining superior generation quality. Code will be made publicly available soon.
PaperID: 289,   Poster  https://arxiv.org/pdf/2511.23186     GitHub
Authors: Runyu Jiao, Matteo Bortolon, Francesco Giuliari, Alice Fasoli, Sergio Povoli, Guofeng Mei, Yiming Wang, Fabio Poiesi
Title: Obstruction reasoning for robotic grasping
Abstract: Successful robotic grasping in cluttered environments requires a model not only to visually ground a target object but also to reason about obstructions that must be cleared beforehand. While current vision-language embodied reasoning models show emergent spatial understanding, they remain limited in terms of obstruction reasoning and accessibility planning. To bridge this gap, we present UNOGrasp, a learning-based vision-language model capable of performing visually-grounded obstruction reasoning to infer the sequence of actions needed to unobstruct the path and grasp the target object. We devise a novel multi-step reasoning process based on obstruction paths originating from the target object. We anchor each reasoning step with obstruction-aware visual cues to incentivize reasoning capability. UNOGrasp combines supervised and reinforcement finetuning through verifiable reasoning rewards. Moreover, we construct UNOBench, a large-scale dataset for both training and benchmarking, based on MetaGraspNetV2, with over 100k obstruction paths annotated by humans with obstruction ratios, contact points, and natural-language instructions. Extensive experiments and real-robot evaluations show that UNOGrasp significantly improves obstruction reasoning and grasp success across both synthetic and real-world environments, outperforming generalist and proprietary alternatives.
PaperID: 290,   Poster  https://arxiv.org/pdf/2604.15979     GitHub
Authors: Chenye Wang, Qingyuan Cai, Saihui Hou, Aoqi Li, Yongzhen Huang
Title: MMGait: Towards Multi-Modal Gait Recognition
Abstract: Gait recognition has emerged as a powerful biometric technique for identifying individuals at a distance without requiring user cooperation. Most existing methods focus primarily on RGB-derived modalities, which fall short in real-world scenarios requiring multi-modal collaboration and cross-modal retrieval. To overcome these challenges, we present MMGait, a comprehensive multi-modal gait benchmark integrating data from five heterogeneous sensors, including an RGB camera, a depth camera, an infrared camera, a LiDAR scanner, and a 4D Radar system. MMGait contains twelve modalities and 334,060 sequences from 725 subjects, enabling systematic exploration across geometric, photometric, and motion domains. Based on MMGait, we conduct extensive evaluations on single-modal, cross-modal, and multi-modal paradigms to analyze modality robustness and complementarity. Furthermore, we introduce a new task, Omni Multi-Modal Gait Recognition, which aims to unify the above three gait recognition paradigms within a single model. We also propose a simple yet powerful baseline, OmniGait, which learns a shared embedding space across diverse modalities and achieves promising recognition performance. The MMGait benchmark, complete codebase, and pretrained checkpoints will be publicly released upon acceptance to promote future research in multi-modal gait recognition.
PaperID: 291,   Poster  https://arxiv.org/pdf/2603.17605     GitHub
Authors: Yaxu Xie, Abdalla Arafa, Alireza Javanmardi, Christen Millerdurai, Jia Hu, Shaoxiang Wang, Alain Pagani, Didier Stricker
Title: ReLaGS: Relational Language Gaussian Splatting
Abstract: Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning. We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with Vision Language-derived annotations and Graph Neural Network-based relational reasoning. Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter/intra-object relationships, validated across tasks including open-vocabulary segmentation, scene graph generation, and relation-guided retrieval.
PaperID: 292,   Poster  https://arxiv.org/pdf/2507.03745     GitHub
Authors: Akio Kodaira, Tingbo Hou, Ji Hou, Markos Georgopoulos, Felix Juefei-Xu, Masayoshi Tomizuka, Yue Zhao
Title: StreamDiT: Real-Time Streaming Text-to-Video Generation
Abstract: Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching with an added moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To put the proposed method into practice, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, generating video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g., streaming generation, interactive generation, and video-to-video.
PaperID: 293,   Poster  https://arxiv.org/pdf/2511.06281     GitHub
Authors: Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, Yu Cheng
Title: VideoSSR: Video Self-Supervised Reinforcement Learning
Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, the rapid progress of MLLMs is outpacing the complexity of existing video datasets, while the manual annotation of new, high-quality data remains prohibitively expensive. This work investigates a pivotal question: Can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data? To investigate this problem, we first introduce three self-supervised pretext tasks for video understanding: Anomaly Grounding, Object Counting, and Temporal Jigsaw. To validate the difficulty of these tasks, we construct the Video Intrinsic Understanding Benchmark (VIUBench), revealing that current state-of-the-art MLLMs struggle significantly on these tasks. Building upon these pretext tasks, we develop the VideoSSR-30K dataset and propose VideoSSR, a novel video self-supervised reinforcement learning framework for RLVR. Extensive experiments across 17 benchmarks, spanning four major video domains (General Video QA, Long Video QA, Temporal Grounding, and Complex Reasoning), demonstrate that our VideoSSR consistently enhances model performance, yielding an average improvement of over 5%. These results establish VideoSSR as a potent foundational framework for developing more advanced video understanding in MLLMs.
PaperID: 294,   Poster  https://arxiv.org/pdf/2604.09199     GitHub
Authors: Agniva Sengupta, Dilara Kus, Jianning Li, Stefan Zachow
Title: Globally Optimal Pose from Silhouettes
Abstract: We solve the problem of determining the pose of known shapes in $\mathbb{R}^3$ from their unoccluded silhouettes. The pose is determined up to global optimality using a simple yet underexplored property of the area-of-silhouette: its continuity w.r.t. trajectories in the rotation space. The proposed method utilises pre-computed silhouette-signatures, modelled as a response surface of the area-of-silhouettes. Querying this silhouette-signature response surface for pose estimation leads to a strong branching of the rotation search space, making resolution-guided candidate search feasible. Additionally, we utilise the aspect ratio of 2D ellipses fitted to projected silhouettes as an auxiliary global shape signature to accelerate the pose search. This combined strategy forms the first method to efficiently estimate globally optimal pose from just the silhouettes, without being guided by correspondences, for any shape, irrespective of its convexity and genus. We validate our method on synthetic and real examples, demonstrating significantly improved accuracy against comparable approaches.
PaperID: 295,   Poster  https://arxiv.org/pdf/2411.10794     GitHub
Authors: Sudarshan Regmi
Title: Image-Based Outlier Synthesis With Training Data
Abstract: Out-of-distribution (OOD) detection is critical to ensure the safe deployment of deep learning models in critical applications. Deep learning models can often misidentify OOD samples as in-distribution (ID) samples. This vulnerability worsens in the presence of spurious correlation in the training set. Likewise, in fine-grained classification settings, detection of fine-grained OOD samples becomes inherently challenging due to their high similarity to ID samples. However, current research on OOD detection has instead focused largely on relatively easier (conventional) cases. Even the few recent works addressing these challenging cases rely on carefully curated or synthesized outliers, ultimately requiring external data. This motivates our central research question: "Can we innovate an OOD detection training framework for fine-grained and spurious settings without requiring any external data at all?" In this work, we present a unified Approach to Spurious, fine-grained, and Conventional OOD Detection (ASCOOD) that eliminates the reliance on external data. First, we synthesize virtual outliers from ID data by approximating the destruction of invariant features. Specifically, we propose to add gradient attribution values to ID inputs to disrupt invariant features while amplifying the true-class logit, thereby synthesizing challenging near-manifold virtual outliers. Then, we simultaneously incentivize ID classification and predictive uncertainty towards virtual outliers. For this, we further propose to leverage standardized features with z-score normalization. ASCOOD effectively mitigates the impact of spurious correlations and encourages capturing fine-grained attributes. Extensive experiments across 7 datasets and comparisons with 30+ methods demonstrate the merit of ASCOOD in spurious, fine-grained and conventional settings.
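The z-score normalization mentioned in the abstract above is the standard standardization (x - mean) / std. The sketch below shows that operation on a single feature vector. It is a generic illustration under stated assumptions, not ASCOOD's implementation: the abstract does not say along which axis features are standardized, so normalizing across the dimensions of one vector, the function name `z_score`, and the stabilizer `eps` are all assumptions.

```python
import math


def z_score(features, eps=1e-8):
    """Standardize one feature vector to zero mean and (near) unit variance.

    `eps` is a small constant (an assumption of this sketch) that avoids
    division by zero when all entries are identical.
    """
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    std = math.sqrt(var)
    return [(f - mean) / (std + eps) for f in features]


# Toy feature vector; the values are made up for illustration only.
z = z_score([2.0, 4.0, 6.0])
print(z)
```

After standardization the entries sum to approximately zero and their mean square is approximately one, so the features are compared on a common scale regardless of their original magnitude.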
PaperID: 296,   Poster  https://arxiv.org/pdf/2603.02098     GitHub
Authors: Chuong Huynh, Manh Luong, Abhinav Shrivastava
Title: Efficient and High-Fidelity Omni Modality Retrieval
Abstract: Multimodal retrieval is the task of aggregating information from queries across heterogeneous modalities to retrieve desired targets. State-of-the-art multimodal retrieval models can understand complex queries, yet they are typically limited to two modalities: text and vision. This limitation impedes the development of universal retrieval systems capable of comprehending queries that combine more than two modalities. To advance toward this goal, we present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our OmniRet model addresses two critical challenges for universal retrieval: computational efficiency and representation fidelity. First, feeding massive token sequences from modality-specific encoders to Large Language Models (LLMs) is computationally inefficient. We therefore introduce an attention-based resampling mechanism to generate compact, fixed-size representations from these sequences. This shared module is designed to maintain representational diversity and generalization capabilities while remaining sensitive to modality-specific information. Second, compressing rich omni-modal data into a single embedding vector inevitably causes information loss and discards fine-grained details. We propose Attention Sliced Wasserstein Pooling to preserve these fine-grained details, leading to improved omni-modal representations. OmniRet is trained on an aggregation of approximately 6 million query-target pairs spanning 30 datasets. We benchmark our model on 13 retrieval tasks and an MMEBv2 subset. Our model demonstrates significant improvements on composed query, audio and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others. Furthermore, we curate a new Audio-Centric Multimodal Benchmark (ACM). This new benchmark introduces two critical, previously missing tasks (composed audio retrieval and audio-visual retrieval) to more comprehensively evaluate a model's omni-modal embedding capacity. We believe our benchmark will facilitate the development of universal retrieval systems.
PaperID: 297,   Poster  https://arxiv.org/pdf/2603.12217     GitHub
Authors: Görkay Aydemir, Fatma Güney, Weidi Xie
Title: Real-World Point Tracking with Verifier-Guided Pseudo-Labeling
Abstract: Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends on the reliability of teacher predictions, which vary across frames and scenes. In this paper, we address the problem of real-world fine-tuning and introduce Verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation. Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions to construct refined pseudo-label trajectories. When applied during fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos. Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods.
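The per-frame selection step described in the abstract above can be illustrated with a small sketch. It assumes that a verifier has already produced one reliability score per tracker per frame; the function name `select_per_frame`, the nested-list data layout, and the toy numbers are all hypothetical, and how the actual Verifier computes its scores is not specified in the abstract.

```python
def select_per_frame(candidates, scores):
    """Build one pseudo-label trajectory from several candidate trajectories.

    candidates[k][t] is tracker k's (x, y) prediction at frame t.
    scores[k][t] is a hypothetical reliability score for that prediction.
    For every frame, the prediction of the highest-scoring tracker is kept.
    """
    num_trackers = len(candidates)
    num_frames = len(candidates[0])
    trajectory = []
    for t in range(num_frames):
        best = max(range(num_trackers), key=lambda k: scores[k][t])
        trajectory.append(candidates[best][t])
    return trajectory


# Two trackers, three frames; positions and scores are made up.
candidates = [
    [(0, 0), (1, 1), (2, 2)],  # tracker 0
    [(0, 1), (1, 2), (2, 3)],  # tracker 1
]
scores = [
    [0.9, 0.2, 0.6],
    [0.1, 0.8, 0.7],
]
pseudo_labels = select_per_frame(candidates, scores)
print(pseudo_labels)  # [(0, 0), (1, 2), (2, 3)]
```

Here the refined trajectory takes frame 0 from tracker 0 and frames 1 and 2 from tracker 1, so no single teacher has to be reliable across the whole video.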
PaperID: 298,   Poster  https://arxiv.org/pdf/2512.23351     GitHub
Authors: Niki Amini-Naieni, Andrew Zisserman
Title: CountGD++: Generalized Prompting for Open-World Counting
Abstract: The flexibility and accuracy of methods for automatically counting objects in images and videos are limited by the way the object can be specified. While existing methods allow users to describe the target object with text and visual examples, the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. Specifically, we extend the prompt to enable what not to count to be described with text and/or visual examples, introduce the concept of 'pseudo-exemplars' that automate the annotation of visual examples at inference, and extend counting models to accept visual examples from both natural and synthetic external images. We also use our new counting model, CountGD++, as a vision expert agent for an LLM. Together, these contributions expand the prompt flexibility of multi-modal open-world counting and lead to significant improvements in accuracy, efficiency, and generalization across multiple datasets.
PaperID: 299,   Poster  https://arxiv.org/pdf/2512.03045     GitHub
Authors: Minkyung Kwon, Jinhyeok Choi, Jiho Park, Seonghu Jeon, Jinhyuk Jang, Junyoung Seo, Min-Seop Kwak, Jin-Hwa Kim, Seungryong Kim
Title: Correspondence-Attention Alignment for Multi-view Diffusion Models
Abstract: Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view consistency remains unclear. In this work, we first verify that the attention maps of these models acquire geometric correspondence throughout training, attending to the geometrically corresponding regions across reference and target views for view-consistent generation. However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. Notably, supervising a single attention layer is sufficient to guide the model toward learning precise correspondences, thereby preserving the geometry and structure of reference images, accelerating convergence, and improving novel view synthesis performance. CAMEO reduces the number of training iterations required for convergence by half while achieving superior performance at the same iteration counts. We further demonstrate that CAMEO is model-agnostic and can be applied to any multi-view diffusion model. Code will be publicly released.
PaperID: 300,   Poster  https://arxiv.org/pdf/2603.01506     GitHub
Authors: Jianqiang Ren, Lin Liu, Steven Hoi
Title: OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
Abstract: We propose OMG-Avatar, a novel One-shot method that leverages a Multi-LOD (Level-of-Detail) Gaussian representation for animatable 3D head reconstruction from a single image in 0.2s. Our method enables LOD head avatar modeling using a unified model that accommodates diverse hardware capabilities and inference speed requirements. To capture both global and local facial characteristics, we employ a transformer-based architecture for global feature extraction and projection-based sampling for local feature acquisition. These features are effectively fused under the guidance of a depth buffer, ensuring occlusion plausibility. We further introduce a coarse-to-fine learning paradigm to support Level-of-Detail functionality and enhance the perception of hierarchical details. To address the limitations of 3DMMs in modeling non-head regions such as the shoulders, we introduce a multi-region decomposition scheme in which the head and shoulders are predicted separately and then integrated through cross-region combination. Extensive experiments demonstrate that OMG-Avatar outperforms state-of-the-art methods in reconstruction quality, reenactment performance, and computational efficiency.
PaperID: 301,   Poster  https://arxiv.org/pdf/2603.17671     GitHub
Authors: Liangyu Yuan, Ruoyu Wang, Tong Zhao, Dingwen Fu, Mingkun Lei, Beier Zhu, Chi Zhang
Title: Few-Step Diffusion Sampling Through Instance-Aware Discretizations
Abstract: Diffusion and flow matching models generate high-fidelity data by simulating paths defined by Ordinary or Stochastic Differential Equations (ODEs/SDEs), starting from a tractable prior distribution. The probability flow ODE formulation enables the use of advanced numerical solvers to accelerate sampling. Orthogonal yet vital to solver design is the discretization strategy. While early approaches employed handcrafted heuristics and recent methods adopt optimization-based techniques, most existing strategies enforce a globally shared timestep schedule across all samples. This uniform treatment fails to account for instance-specific complexity in the generative process, potentially limiting performance. Motivated by controlled experiments on synthetic data, which reveal the suboptimality of global schedules under instance-specific dynamics, we propose an instance-aware discretization framework. Our method learns to adapt timestep allocations based on input-dependent priors, extending gradient-based discretization search to the conditional generative setting. Empirical results across diverse settings, including synthetic data, pixel-space diffusion, and latent-space image and video flow matching models, demonstrate that our method consistently improves generation quality with marginal tuning cost compared to training and negligible inference overhead.
PaperID: 302,   Poster  https://arxiv.org/pdf/2512.13421     GitHub
Authors: Qingyu Shi, Size Wu, Jinbin Bai, Kaidong Yu, Yujing Wang, Yunhai Tong, Xiangtai Li, Xuelong Li
Title: RecTok: Reconstruction Distillation along Rectified Flow
Abstract: Visual tokenizers play a crucial role in diffusion models. The dimensionality of latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models (VFMs) to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction–alignment distillation. Our key insight is to make the forward flow in flow matching semantically rich, which serves as the training space of diffusion transformers, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information in VFMs into the forward flow trajectories in flow matching. We further enhance the semantics by introducing a masked feature reconstruction loss. Our RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art gFID-50K results both with and without classifier-free guidance, while maintaining a semantically rich latent space structure. Furthermore, as the latent dimensionality increases, we observe consistent improvements. Code and model will be publicly available.
PaperID: 303,   Poster  https://arxiv.org/pdf/2604.13508     GitHub
Authors: Sanghyeok Chu, Pyunghwan Ahn, Gwangmo Song, Seung Hwan Kim, Honglak Lee, Bohyung Han
Title: Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
Abstract: Sparse upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from a pretrained dense checkpoint instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster-aware Upcycling, a strategy that embeds semantic structure into MoE initialization. The method clusters the dense model’s input activations to identify latent subspaces, initializes each expert using a data-aware truncated SVD of the dense weights within its cluster, and initializes the router with the corresponding cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data structure. In addition, we introduce an Expert-Ensemble Self-Distillation loss that regularizes training by guiding uncertain routing with stable predictions from an ensemble teacher. Applied to CLIP ViT-B/16 and ViT-B/32 models, Cluster-aware Upcycling achieves consistent improvements over standard upcycling across zero-shot and few-shot benchmarks, and produces more diverse and disentangled expert representations.
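The router initialization idea in the abstract above (router weights set to cluster centroids) can be sketched as follows. This is a simplified illustration under several assumptions, not the paper's procedure: cluster assignments are taken as given rather than computed, every cluster is assumed non-empty, routing is top-1 by dot product, and the function names `centroids_from_clusters` and `route` are hypothetical. The truncated-SVD expert initialization is not shown because the abstract does not detail it.

```python
def centroids_from_clusters(activations, assignments, num_experts):
    """Average the activations of each cluster (assumes no cluster is empty)."""
    dim = len(activations[0])
    sums = [[0.0] * dim for _ in range(num_experts)]
    counts = [0] * num_experts
    for x, k in zip(activations, assignments):
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += x[d]
    return [[s / counts[k] for s in sums[k]] for k in range(num_experts)]


def route(x, router_weights):
    """Top-1 routing: pick the expert whose router row has the largest dot product."""
    logits = [sum(w_d * x_d for w_d, x_d in zip(w, x)) for w in router_weights]
    return max(range(len(logits)), key=lambda k: logits[k])


# Toy 2-D activations already grouped into two clusters (made-up values).
activations = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignments = [0, 0, 1, 1]

# One router row per expert, initialized from its cluster centroid.
router = centroids_from_clusters(activations, assignments, num_experts=2)

print(route([1.0, 0.2], router))  # 0
print(route([0.1, 1.0], router))  # 1
```

Because each router row starts aligned with a different region of activation space, inputs from different clusters are sent to different experts from the first step, which is one way to read the abstract's claim that the initialization breaks expert symmetry.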
PaperID: 304,   Poster  https://arxiv.org/pdf/2603.16447     GitHub
Authors: Kaiwen Song, Jinkai Cui, Juyong Zhang
Title: ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars
Abstract: In practical real-time XR and telepresence applications, network and computing resources fluctuate frequently. Therefore, a progressive, streamable 3D representation method is needed that can be immediately deployed and continuously optimized as resources increase. To this end, we propose ProgressiveAvatars, a progressive avatar representation built on a hierarchy of 3D Gaussians grown by adaptive implicit subdivision on a template mesh. 3D Gaussians are defined in face-local coordinates to remain animatable under varying expressions and head motion across multiple detail levels. The hierarchy expands when screen-space signals indicate a lack of detail, allocating resources to important areas. ProgressiveAvatars supports incremental loading and rendering, adding new Gaussians as they arrive while preserving previous content, thus achieving smooth quality improvements across varying bandwidths. Thanks to our progressive representation method with an inherited tree structure, ProgressiveAvatars enables progressive delivery and progressive rendering under fluctuating network bandwidth and varying compute and memory resources.
PaperID: 305,   Poster  https://arxiv.org/pdf/2604.02956     GitHub
Authors: Zimeng Wu, Yunhong Wang, Donghao Wang, Jiaxin Chen
Title: Collaborative Multi-Mode Pruning for Vision-Language Models
Abstract: Vision-Language Models (VLMs) have advanced rapidly within the unified Transformer architecture, yet their deployment on resource-constrained devices remains challenging due to high computational complexity. While pruning has emerged as an effective technique for compressing VLMs, existing approaches predominantly focus on a single mode by pruning either parameters or tokens, and fail to fully exploit the redundancy inherent in each mode, which leads to substantial performance degradation at high pruning ratios. To address these limitations, we propose Collaborative Multi-Mode Pruning (CoMP), a novel framework tailored for VLMs that performs joint parameter and token pruning. Specifically, we first design a Collaborative Importance Metric (CIM) that investigates the mutual interference between the coupled parameters and tokens. It incorporates the distinct significance of tokens into the computation of parameter importance scores, while simultaneously mitigating the effect of pruned parameters on token importance scores. Moreover, we develop a Multi-Mode Pruning Strategy (MPS) that decomposes the overall pruning process into a sequence of pruning stages; in each stage, we estimate the priority of the different pruning modes based on their pruning cost and adaptively shift to the optimal one. Additionally, MPS integrates historical cost and random exploration in order to achieve a stable pruning process and avoid local optima. Extensive experiments across various vision-language tasks and models demonstrate that our method improves performance under high pruning ratios compared with state-of-the-art approaches. The source code will be released upon acceptance.
PaperID: 306,   Poster  https://arxiv.org/pdf/2512.07834     GitHub
Authors: Yichuan Huang, Jiewen Chan, Hao-Jen Chien, Yu-Lun Liu
Title: Voxify3D: Pixel Art Meets Volumetric Rendering
Abstract: Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either oversimplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20³-50³ resolutions).
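As a rough illustration of component (3), the sketch below applies the standard Gumbel-Softmax relaxation to pick a color from a fixed palette. It is a minimal stand-in under stated assumptions, not the paper's pipeline: the palette, the logits, the temperature `tau`, and the helper names are all hypothetical, and no gradients are computed here.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def mix_color(weights, palette):
    """Soft palette lookup: convex combination of palette colors."""
    return tuple(sum(w * c[ch] for w, c in zip(weights, palette)) for ch in range(3))

# Hypothetical 3-color palette (RGB in [0, 1]) and per-voxel logits.
palette = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0), (1.0, 1.0, 1.0)]
voxel_logits = [0.0, 50.0, 0.0]

rng = random.Random(0)
weights = gumbel_softmax(voxel_logits, tau=0.1, rng=rng)
color = mix_color(weights, palette)
```

At a low temperature the weights approach a one-hot vector, so the mixed color snaps to a single palette entry while the operation itself remains a smooth function of the logits.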
PaperID: 307,   Poster  https://arxiv.org/pdf/2604.14706     GitHub
Authors: Yi He, Tao Wang, Yi Jin, Congyan Lang, Yidong Li, Haibin Ling
Title: NG-GS: NeRF-guided 3D Gaussian Splatting Segmentation
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic novel view synthesis. However, segmenting objects accurately in 3DGS remains challenging due to the discrete nature of Gaussian representations, which often leads to aliasing and artifacts at object boundaries. In this paper, we introduce NG-GS, a novel framework for high-quality object segmentation in 3DGS that explicitly addresses boundary discretization. Our approach begins by automatically identifying ambiguous Gaussians at object boundaries using mask variance analysis. We then apply radial basis function (RBF) interpolation to construct a spatially continuous feature field, enhanced by multi-resolution hash encoding for efficient multi-scale representation. A joint optimization strategy aligns 3DGS with a lightweight NeRF module through alignment and spatial continuity losses, ensuring smooth and consistent segmentation boundaries. Extensive experiments on NVOS and LERF-OVS benchmarks demonstrate that our method achieves state-of-the-art performance, with significant gains in boundary mIoU.
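To make the RBF interpolation step concrete, the following sketch fits a Gaussian-kernel interpolant to scalar features attached to a handful of 3D centers and evaluates it at arbitrary positions. The Gaussian kernel, the shape parameter `eps`, the toy centers, and all function names are assumptions for illustration; the abstract does not specify the kernel, and the hash encoding is omitted.

```python
import math

def gaussian_rbf(p, q, eps=1.0):
    """Gaussian kernel exp(-eps * ||p - q||^2)."""
    r2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-eps * r2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_rbf(centers, values, eps=1.0):
    """Find weights so the interpolant matches `values` at `centers`."""
    A = [[gaussian_rbf(p, q, eps) for q in centers] for p in centers]
    return solve(A, values)

def evaluate(x, centers, weights, eps=1.0):
    """Continuous field value at an arbitrary 3D position x."""
    return sum(w * gaussian_rbf(x, c, eps) for w, c in zip(weights, centers))

# Toy Gaussian centers with a scalar segmentation feature each (hypothetical data).
centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
features = [0.0, 1.0, 1.0, 0.0]
rbf_weights = fit_rbf(centers, features)
midpoint_value = evaluate((0.5, 0.0, 0.0), centers, rbf_weights)
```

The fitted field reproduces the features exactly at the centers and varies smoothly in between, which is the property that lets a discrete set of Gaussians yield a continuous boundary.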
PaperID: 308,   Poster  https://arxiv.org/pdf/2602.19285     GitHub
Authors: Jindi Kong, Yuting He, Cong Xia, YWUSO YWUSO, Shuo Li
Title: MRI Contrast Enhancement Kinetics World Model
Abstract: Clinical MRI contrast acquisition suffers from inefficient information yield, which presents as a mismatch between the risky and costly acquisition protocol and the fixed, sparse acquisition sequence. Applying world models to simulate the contrast enhancement kinetics in the human body enables continuous contrast-free dynamics. However, the low temporal resolution of MRI acquisition restricts the training of world models, leading to sparsely sampled datasets. Directly training a generative model to capture the kinetics leads to two limitations: (a) Due to the absence of data at the missing time points, the model tends to overfit to irrelevant features, leading to content distortion. (b) Due to the lack of continuous temporal supervision, the model fails to learn the continuous kinetics law over time, causing temporal discontinuities. For the first time, we propose the MRI Contrast Enhancement Kinetics World model (MRI CEKWorld) with SpatioTemporal Consistency Learning (STCL). For (a), guided by the spatial law that patient-level structures remain consistent during enhancement, we propose Latent Alignment Learning (LAL), which constructs a patient-specific template and constrains contents to align with this template. For (b), guided by the temporal law that the kinetics follow a consistent smooth trend, we propose Latent Difference Learning (LDL), which extends the unobserved intervals by interpolation and constrains smooth variations in the latent space across the interpolated sequence. Extensive experiments on two datasets show that MRI CEKWorld produces more realistic content and kinetics. Code will be made available.
PaperID: 309,   Poster  https://arxiv.org/pdf/2603.09611     GitHub
Authors: KunHo Heo, SuYeon Kim, Yonghyun Gwon, Youngbin Kim, MyeongAh Cho
Title: ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis
Abstract: Text-to-motion synthesis aims to generate natural and expressive human motions from textual descriptions. While existing approaches primarily focus on generating holistic motions from text descriptions, they struggle to accurately reflect actions involving specific body parts. Recent part-wise motion generation methods attempt to resolve this but face two critical limitations: (i) they lack explicit mechanisms for aligning textual semantics with individual body parts, and (ii) they often generate incoherent full-body motions due to integrating independently generated part motions. To overcome these issues and resolve the fundamental trade-off in existing methods, we propose ParTY, a novel framework that enhances part expressiveness while generating coherent full-body motions. ParTY comprises: (1) Part-Guided Network, which first generates part motions to obtain part guidance, then uses it to generate holistic motions; (2) Part-aware Text Grounding, which diversely transforms text embeddings and appropriately aligns them with each body part; and (3) Holistic-Part Fusion, which adaptively fuses holistic motions and part motions. Extensive experiments, including part-level and coherence-level evaluations, demonstrate that ParTY achieves substantial improvements over previous methods.
PaperID: 310,   Poster  https://arxiv.org/pdf/2604.01251     GitHub
Authors: Yao Jiang, Zhongkuan Mao, xuan wu, Keren Fu, Qijun Zhao
Title: Camouflage-aware Image-Text Retrieval via Expert Collaboration
Abstract: Camouflaged scene understanding (CSU) has attracted significant attention due to its broad practical implications. However, in this field, robust image-text cross-modal alignment remains under-explored, hindering deeper understanding of camouflaged scenarios and their related applications. To this end, we focus on the typical image-text retrieval task, and formulate a new task dubbed "camouflage-aware image-text retrieval" (CA-ITR). We first construct a dedicated camouflage image-text retrieval dataset (CamoIT), comprising ~10.5K samples with multi-granularity textual annotations. Benchmark results on CamoIT reveal the underlying challenges of CA-ITR for existing cutting-edge retrieval techniques, which are mainly caused by objects' camouflage properties as well as complex image content. As a solution, we propose a camouflage-expert collaborative network (CECNet), which features a dual-branch visual encoder: one branch captures holistic image representations, while the other incorporates a dedicated model to inject representations of camouflaged objects. A novel confidence-conditioned graph attention (C²GA) mechanism is incorporated to exploit the complementarity across branches. Comparative experiments show that CECNet achieves a ~29% CA-ITR accuracy boost, surpassing seven representative retrieval models. Our dataset and code will be made publicly available.
PaperID: 311,   Poster  https://arxiv.org/pdf/2511.22553     GitHub
Authors: Jiawei Zhang, Lei Chu, Jiahao Li, Zhenyu Zang, Chong Li, Xiao Li, Xun Cao, Hao Zhu, Yan Lu
Title: Bringing Your Portrait to 3D Presence
Abstract: We present a unified framework for reconstructing animatable 3D human avatars from a single portrait across head, half-body, and full-body inputs. Our method tackles three bottlenecks: pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation. We introduce a Dual-UV representation that maps image features to a canonical UV space via Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts. We also build a factorized synthetic data manifold combining 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency. A robust proxy-mesh tracker maintains stability under partial visibility. Together, these components enable strong in-the-wild generalization. Trained only on half-body synthetic data, our model achieves state-of-the-art head and upper-body reconstruction and competitive full-body results. Extensive experiments and analyses further validate the effectiveness of our approach. Code will be released upon acceptance.
PaperID: 312,   Poster  https://arxiv.org/pdf/2602.04876     GitHub
Authors: Jiahao Zhan, Zizhang Li, Hong-Xing Yu, Jiajun Wu
Title: PerpetualWonder: Long-horizon Action-conditioned 4D Scene Generation
Abstract: We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements from updating the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from long-horizon actions, maintaining physical plausibility and visual consistency.
PaperID: 313,   Poster  https://arxiv.org/pdf/2512.13689     GitHub
Authors: Yuanwen Yue, Damien Robert, Jianyuan Wang, Sunghwan Hong, Jan D. Wegner, Christian Rupprecht, Konrad Schindler
Title: LitePT: Lighter Yet Stronger Point Transformer
Abstract: Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has 3.6× fewer parameters, runs 2× faster, and uses 2× less memory than the state-of-the-art Point Transformer V3, but nonetheless matches or even outperforms it on a range of tasks and datasets.
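The abstract does not spell out how PointROPE works. Going only by its name, one plausible reading is a rotary positional encoding applied per coordinate axis, and the sketch below illustrates that generic idea. Everything here is an assumption: the channel split into three axis groups, the frequency schedule `base`, and the function names are hypothetical and may differ from the paper's formulation.

```python
import math

def rotate_pairs(vec, pos, base=100.0):
    """1D rotary encoding: rotate consecutive channel pairs by pos * freq_i."""
    n_pairs = len(vec) // 2
    out = []
    for i in range(n_pairs):
        theta = pos * base ** (-i / n_pairs)  # assumed geometric frequency schedule
        a, b = vec[2 * i], vec[2 * i + 1]
        out.append(a * math.cos(theta) - b * math.sin(theta))
        out.append(a * math.sin(theta) + b * math.cos(theta))
    return out

def rope3d(vec, xyz):
    """Split channels into three equal groups and rotate each by one coordinate."""
    g = len(vec) // 3  # len(vec) is assumed to be a multiple of 6
    out = []
    for axis in range(3):
        out += rotate_pairs(vec[axis * g:(axis + 1) * g], xyz[axis])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 12-channel query/key features for two points.
q = [0.3, -1.2, 0.7, 0.5, 1.1, -0.4, 0.9, 0.2, -0.6, 0.8, 0.1, -0.3]
k = [1.0, 0.4, -0.5, 0.6, 0.2, 0.7, -0.8, 0.3, 0.5, -0.1, 0.9, 0.4]
p_q, p_k = (0.2, 1.0, -0.5), (1.5, -0.3, 0.8)
score = dot(rope3d(q, p_q), rope3d(k, p_k))
```

Because each pair is only rotated, the encoding has no learned parameters, preserves feature norms, and makes the attention score depend on the relative offset between the two points rather than their absolute positions.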
PaperID: 314,   Poster  https://arxiv.org/pdf/2604.05259     GitHub
Authors: Timothy Chen, Adam Dai, Maximilian Adang, Grace Gao, Mac Schwager
Title: Coverage Optimization for Camera View Selection
Abstract: What makes a good viewpoint? The quality of the data used to learn 3D reconstructions is crucial for enabling efficient and accurate scene modeling. We study the active view selection problem and develop a principled analysis that yields a simple and interpretable criterion for selecting informative camera poses. Our key insight is that informative views can be obtained by minimizing a tractable approximation of the Fisher Information Gain, which reduces to favoring viewpoints that cover geometry that has been insufficiently observed by past cameras. This leads to a lightweight coverage-based view selection metric that avoids expensive transmittance estimation and is robust to noise and training dynamics. We integrate our method into the Nerfstudio framework and evaluate it on synthetic and real scenes. Across multiple datasets and radiance-field baselines, our method achieves consistently improved reconstruction quality compared to state-of-the-art active view selection methods.
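The intuition of favoring under-observed geometry can be illustrated with a toy greedy selector. This is a stand-in rather than the paper's criterion: the score `1 / (1 + count)` per visible point, the point-id visibility sets, and the function names are all hypothetical, and the Fisher-information derivation is not reproduced.

```python
def coverage_score(visible_points, observation_counts):
    """Points observed fewer times contribute more; unseen points contribute 1."""
    return sum(1.0 / (1 + observation_counts.get(p, 0)) for p in visible_points)

def select_next_view(candidates, observation_counts):
    """Greedy choice: the candidate view with the highest coverage score."""
    return max(
        candidates,
        key=lambda name: coverage_score(candidates[name], observation_counts),
    )

def register_view(visible_points, observation_counts):
    """Record that a captured view has observed these points."""
    for p in visible_points:
        observation_counts[p] = observation_counts.get(p, 0) + 1

# Hypothetical scene: point ids visible from each candidate camera.
candidates = {
    "front": {0, 1, 2, 3},
    "side": {2, 3, 4, 5},
    "back": {6, 7, 8, 9},
}
counts = {}
register_view(candidates["front"], counts)  # one view has already been captured

first_pick = select_next_view(candidates, counts)
register_view(candidates[first_pick], counts)
second_pick = select_next_view(candidates, counts)
```

In this toy scene the selector first picks the view that sees only unobserved points and then the one that is half new, which matches the qualitative behavior the abstract describes.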
PaperID: 315,   Poster  https://arxiv.org/pdf/2604.07795     GitHub
Authors: Changwoon Choi, Hyunsoo Lee, Clément Jambon, Yael Vinker, Young Min Kim
Title: Image-Guided Geometric Stylization of 3D Meshes
Abstract: Recent generative models can create visually plausible 3D representations of objects. However, the generation process often allows for implicit control signals, such as contextual descriptions, and rarely supports bold geometric distortions beyond existing data distributions. We propose a geometric stylization framework that deforms a 3D mesh, allowing it to express the style of an image. While style is inherently ambiguous, we utilize pretrained diffusion models to extract an abstract representation of the provided image. Our coarse-to-fine stylization pipeline can drastically deform the input 3D model to express a diverse range of geometric variations while retaining the valid topology of the original mesh and part-level semantics. We also propose an approximate VAE encoder that provides efficient and reliable gradients from mesh renderings. Extensive experiments demonstrate that our method can create stylized 3D meshes that reflect unique geometric features of the pictured assets, such as expressive poses and silhouettes, thereby supporting the creation of distinctive artistic 3D creations.
PaperID: 316,   Poster  https://arxiv.org/pdf/2601.04194     GitHub
Authors: Yanzhe Lyu, Chen Geng, Karthik Dharmarajan, Yunzhi Zhang, Hadi Alzayer, Shangzhe Wu, Jiajun Wu
Title: Choreographing a World of Dynamic Objects
Abstract: Dynamic objects in our physical 4D (3D + time) world are constantly evolving, deforming, and interacting with other objects, leading to diverse 4D scene dynamics. In this paper, we study a universal generative pipeline for synthesizing this type of phenomenon. Traditional rule-based graphics pipelines for creating these dynamics rely on category-specific heuristics, and are labor-intensive and not scalable. Recent learning-based methods typically demand large-scale datasets, which may not cover all object categories of interest. Our approach instead inherits the universality of video generative models by proposing a distillation-based pipeline to extract the rich Lagrangian motion information hidden in the Eulerian representations of 2D videos. Our method is universal, versatile, and category-agnostic. We demonstrate its effectiveness by conducting experiments to generate a diverse range of multi-body 4D dynamics, show its advantage compared to existing methods, and demonstrate its applicability in generating robotics manipulation policies.
PaperID: 317,   Poster  https://arxiv.org/pdf/2512.02781     GitHub
Authors: Xu Han, Biao Zhang, Xiangjun Tang, Xianzhi Li, Peter Wonka
Title: LumiX: Structured and Coherent Text-to-Intrinsic Generation
Abstract: We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention, a mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block. 2) Tensor LoRA, a tensor-based adaptation that parameter-efficiently models cross-map relations for efficient joint training. Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art, and it can also perform image-conditioned intrinsic decomposition within the same framework.
PaperID: 318,   Poster  https://arxiv.org/pdf/2603.19660     GitHub
Authors: Yichen Zeng, Hebaixu Wang, Meng Liu, Yu ZHOU, Chen Gao, Kehan Chen, Gongping Huang
Title: Semantic Audio-Visual Navigation in Continuous Environments
Abstract: Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches depend on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios.
PaperID: 319,   Poster  https://arxiv.org/pdf/2507.01938     GitHub
Authors: Yiming Ju, Jijin Hu, Zhengxiong Luo, Haoge Deng, Hanyu Zhao, Li Du, Wenbo Xiao, Chengwei Wu, Donglin Hao, Xinlong Wang, Tengfei Pan
Title: CI-VID: A Coherent Interleaved Text-Video Dataset
Abstract: Text-to-video (T2V) generation has recently attracted considerable attention, resulting in the development of numerous high-quality datasets that have propelled progress in this area. However, existing public datasets are primarily composed of isolated text–video (T–V) pairs and thus fail to model inter-clip relationships. To address this limitation, we introduce CI-VID, a dataset that moves beyond isolated T2V generation toward text-and-video-to-video (T&V2V) generation. CI-VID contains over 340,000 samples, each comprising a semantically coherent video sequence with interleaved text captions that capture both clip-level content and inter-clip relationships. To validate its effectiveness, we design a comprehensive, multi-dimensional benchmark incorporating human evaluation, VLM-based assessment, and similarity-based metrics. Experimental results demonstrate that models trained on CI-VID significantly improve both accuracy and content consistency in multi-clip video generation. This enables the creation of story-driven content with smooth transitions and strong semantic coherence, underscoring the value of CI-VID as a foundation for advancing controllable and coherent video generation.
PaperID: 320,   Poster  https://arxiv.org/pdf/2503.09750     GitHub
Authors: Haoan Feng, Diana Aldana Moreno, Tiago Novello, Leila Floriani
Title: SASNet: Spatially-Adaptive Sinusoidal Networks for INRs
Abstract: Sinusoidal neural networks (SIRENs) are powerful implicit neural representations (INRs) for low-dimensional signals in vision and graphics. By encoding input coordinates with sinusoidal functions, they enable high-frequency image and surface reconstruction. However, training SIRENs is often unstable and highly sensitive to frequency initialization: small frequencies produce overly smooth reconstructions in detailed regions, whereas large ones introduce spurious high-frequency components that manifest as noise in smooth areas such as image backgrounds. To address these challenges, we propose SASNet, a Spatially-Adaptive Sinusoidal Network that couples a frozen frequency embedding layer, which explicitly fixes the network’s frequency support, with jointly learned spatial masks that localize neuron influence across the domain. This pairing stabilizes optimization, sharpens edges, and suppresses noise in smooth areas. Experiments on 2D image and 3D volumetric data fitting as well as signed distance field (SDF) reconstruction benchmarks demonstrate that SASNet achieves faster convergence, superior reconstruction quality, and robust frequency localization--assigning low- and high-frequency neurons to smooth and detailed regions respectively--while maintaining parameter efficiency.
PaperID: 321,   Poster  https://arxiv.org/pdf/2505.12200     GitHub
Authors: Bohan Jia, Wenxuan Huang, Yuntian Tang, Junbo Qiao, Jincheng Liao, Shaosheng Cao, Fei Zhao, Zhaopeng Feng, Zhouhong Gu, Zhenfei Yin, Lei Bai, Wanli Ouyang, Lin Chen, Fei Zhao, Zihan Wang, Yuan Xie, Shaohui Lin
Title: CompBench: Benchmarking Complex Instruction-guided Image Editing
Abstract: While real-world applications increasingly demand intricate scene manipulation, existing instruction-guided image editing benchmarks often oversimplify task complexity and lack comprehensive, fine-grained instructions. To bridge this gap, we introduce CompBench, a large-scale benchmark specifically designed for complex instruction-guided image editing. CompBench features challenging editing scenarios that incorporate fine-grained instruction following, spatial and contextual reasoning, thereby enabling comprehensive evaluation of image editing models' precise manipulation capabilities. To construct CompBench, we propose an MLLM-human collaborative framework with tailored task pipelines. Furthermore, we propose an instruction decoupling strategy that disentangles editing intents into four key dimensions: location, appearance, dynamics, and objects, ensuring closer alignment between instructions and complex editing requirements. Extensive evaluations reveal that CompBench exposes fundamental limitations of current image editing models and provides critical insights for the development of next-generation instruction-guided image editing systems.
PaperID: 322,   Poster  https://arxiv.org/pdf/2512.04267     GitHub
Authors: Zitian Zhang, Iliyan Georgiev, Michael Fischer, Yannick Hold-Geoffroy, Jean-François Lalonde, Valentin Deschaintre
Title: UniLight: A Unified Representation for Lighting
Abstract: Lighting has a strong influence on visual appearance, yet understanding and representing lighting in images remains notoriously difficult. Various lighting representations exist, such as environment maps, irradiance, spherical harmonics, or text, but they are incompatible, which limits cross-modal transfer. We thus propose UniLight, a joint latent space as lighting representation, that unifies multiple modalities within a shared embedding. Modality-specific encoders for text, images, irradiance, and environment maps are trained contrastively to align their representations, with an auxiliary spherical-harmonics prediction task reinforcing directional understanding. Our multi-modal data pipeline enables large-scale training and evaluation across three tasks: lighting-based retrieval, environment-map generation, and lighting control in diffusion-based image synthesis. Experiments show that our representation captures consistent and transferable lighting features, enabling flexible manipulation across modalities.
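Contrastive alignment between two modality encoders is commonly implemented with a symmetric InfoNCE loss, and the sketch below shows that standard objective on toy embeddings. It is an assumption that the paper uses this exact form; the temperature `tau`, the two-dimensional embeddings, and the function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(anchors, positives, tau=0.1):
    """Mean cross-entropy of matching anchor i to positive i among all positives."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / tau for pj in p]
        peak = max(logits)  # log-sum-exp with max subtraction for stability
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(a)

def symmetric_loss(x, y, tau=0.1):
    """Average the two matching directions (x to y and y to x)."""
    return 0.5 * (info_nce(x, y, tau) + info_nce(y, x, tau))

# Hypothetical embeddings of the same two lighting conditions in two modalities.
text_emb = [[1.0, 0.1], [0.1, 1.0]]
envmap_emb = [[0.9, 0.0], [0.0, 0.8]]

aligned = symmetric_loss(text_emb, envmap_emb)
shuffled = symmetric_loss(text_emb, envmap_emb[::-1])  # deliberately mismatched pairs
```

The loss is small when paired embeddings point in the same direction and large when pairs are mismatched, which is what pulls the modality-specific encoders into a shared space.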
PaperID: 323,   Poster  https://arxiv.org/pdf/2512.22581     GitHub
Authors: Marwan Taher, Ignacio Alzugaray, Kirill Mazur, Xin Kong, Andrew J. Davison
Title: KV-Tracker: Real-Time Pose Tracking with Transformers
Abstract: Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. Our method rapidly selects and manages a set of images as keyframes to map a scene or object via π³ [wang2025pi3] with full bidirectional attention. We then cache the global self-attention block's key-value (KV) pairs and use them as the sole scene representation for online tracking. This allows for up to 15× speedup during inference without the fear of drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining. We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-fly object tracking and reconstruction without depth measurements or object priors. Experiments on the TUM RGB-D, 7-Scenes, Arctic and OnePose datasets show the strong performance of our system while maintaining high frame rates of up to ~27 FPS.
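The caching idea can be sketched with a single attention head: keys and values for keyframe tokens are computed once and stored, and each new frame only issues queries against the cache. This is a generic KV-cache illustration under simplifying assumptions, not the KV-Tracker implementation; the `KVCache` class, the identity projection matrices, and the two-dimensional tokens are hypothetical.

```python
import math

def project(token, weight):
    """Matrix-vector product: each output is a weight row dotted with the token."""
    return [sum(w * t for w, t in zip(row, token)) for row in weight]

class KVCache:
    """Stores projected keys and values of keyframe tokens for later reuse."""

    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self.keys, self.values = [], []

    def add_keyframe(self, tokens):
        # Keys/values are computed once here and never recomputed at tracking time.
        for tok in tokens:
            self.keys.append(project(tok, self.w_k))
            self.values.append(project(tok, self.w_v))

def attend(query, cache):
    """Scaled dot-product attention of one query over the cached keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in cache.keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim_v = len(cache.values[0])
    return [sum(e / total * v[i] for e, v in zip(exps, cache.values)) for i in range(dim_v)]

# Hypothetical identity projections and two keyframe tokens.
identity = [[1.0, 0.0], [0.0, 1.0]]
cache = KVCache(identity, identity)
cache.add_keyframe([[1.0, 0.0], [0.0, 1.0]])

# A token from a new frame queries the cached scene representation.
out = attend([10.0, 0.0], cache)
```

Per-frame cost then scales with the number of query tokens times the cache size, instead of re-running full bidirectional attention over all keyframes for every incoming frame.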
PaperID: 324,   Poster  https://arxiv.org/pdf/2603.28211     GitHub
Authors: Onat Ozdemir, Anders Christensen, Stephan Alaniz, Zeynep Akata, Emre Akbas
Title: Explaining CLIP Zero-shot Predictions Through Concepts
Abstract: Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes. We introduce a framework that bridges these two paradigms by explaining CLIP’s zero-shot predictions through human-understandable concepts. Our method projects CLIP’s joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP’s semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets, CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k, demonstrate that our approach maintains CLIP’s strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models.
PaperID: 325,   Poster  https://arxiv.org/pdf/2506.00721     GitHub
Authors: Tianze Yang, Tyson Jordan, Ruitong Sun, Ninghao Liu, Jin Sun
Title: Common Inpainted Objects In-N-Out of Context
Abstract: We present Common Inpainted Objects In-N-Out of Context (COinCO), a novel dataset addressing the scarcity of out-of-context examples in existing vision datasets. By systematically replacing objects in COCO images through diffusion-based inpainting, we create 97,722 unique images featuring both contextually coherent and inconsistent scenes, enabling effective context learning. Each inpainted object is meticulously verified and categorized as in- or out-of-context through Large Vision Language Model assessments. Our analysis reveals significant patterns in semantic priors that influence inpainting success across object categories. We demonstrate three key tasks enabled by COinCO: (1) developing a fine-grained context reasoning approach that classifies objects as in- or out-of-context based on three criteria; (2) a novel Objects-from-Context prediction task that determines which new objects naturally belong in given scenes at both instance and clique levels, and (3) context-enhanced fake detection on state-of-the-art methods without fine-tuning. COinCO provides a controlled testbed with contextual variations, establishing a foundation for advancing context-aware visual understanding in computer vision and image forensics.
PaperID: 326,   Poster  https://arxiv.org/pdf/2511.22625     GitHub
Authors: Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xianfang Zeng, Gang Yu, Daxin Jiang
Title: ReasonEdit: Towards Reasoning-Enhanced Image Editing Models
Abstract: Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of the MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Based on that, our proposed framework enables image editing in a thinking–editing–reflection loop: the thinking mechanism leverages the world knowledge of the MLLM to interpret abstract instructions, while the reflection mechanism reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements on ImgEdit (+4.4%), GEdit (+3.1%), and Kris (+11.5%) when initializing our DiT from Step1X-Edit (ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q). Code and checkpoints will be open-sourced.
PaperID: 327,   Poster  https://arxiv.org/pdf/2603.16737     GitHub
Authors: Guangzhi Xiong, Sanchit Sinha, Zhenghao He, Aidong Zhang
Title: Retrieving Counterfactuals Improves Visual In-Context Learning
Abstract: Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and causally grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning.
PaperID: 328,   Poster  https://arxiv.org/pdf/2506.02356     GitHub
Authors: Woojeong Jin, Seongchan Kim, Jae Lee, Seungryong Kim
Title: InterRVOS: Interaction-Aware Referring Video Object Segmentation
Abstract: Referring video object segmentation (RVOS) aims to segment objects in a video described by a natural language expression. However, most existing approaches focus only on the referred object (typically the actor), even when the expression clearly describes an interaction involving multiple objects with distinct roles. In this paper, we introduce Interaction-Aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on explicit interaction modeling by requiring separate segmentation of actor and target objects. This formulation enables fine-grained understanding of object relationships, as many video events are defined by such interactions rather than individual objects. We present InterRVOS-127K, a large-scale dataset of over 127K automatically annotated expressions with distinct actor-target mask pairs, and propose ReVIOSa, an MLLM-based architecture that introduces interaction-aware special tokens and attention mask loss (AML) to enhance interaction-aware segmentation. We also propose a new evaluation protocol that separately evaluates actor and target segmentation for more accurate role distinction. Comprehensive experiments demonstrate that ReVIOSa outperforms existing baselines on the proposed InterRVOS-127K benchmark, with further analyses validating the necessity and effectiveness of both ReVIOSa and InterRVOS-127K. Code and datasets will be made publicly available.
PaperID: 329,   Poster  https://arxiv.org/pdf/2601.15288     GitHub
Authors: Jiwon Kang, Yeji Choi, JoungBin Lee, Wooseok Jang, Jinhyeok Choi, Taekeun Kang, Yongjae Park, Myungin Kim, Seungryong Kim
Title: Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping
Abstract: Face swapping aims to transfer the identity of a source face onto a target face while preserving target-specific attributes such as pose, expression, lighting, skin tone, and makeup. However, since real face-swapping ground truth is unavailable, achieving both accurate identity transfer and high-quality attribute preservation remains challenging. Although recent diffusion-based approaches attempt to improve visual fidelity through conditional inpainting on masked target images, the masked condition removes crucial appearance cues, resulting in plausible yet misaligned attributes due to the lack of explicit supervision. To address these limitations, we propose APPLE (Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping), a diffusion-based teacher–student framework that enhances attribute fidelity through attribute-aware pseudo-label supervision. First, we reformulate face swapping as a conditional deblurring task to more faithfully preserve target-specific attributes such as lighting, skin tone, and makeup. In addition, we introduce an attribute-aware inversion scheme to further improve detailed attribute preservation. Through an elaborate attribute-preserving design for teacher learning, APPLE produces high-quality pseudo triplets that explicitly provide the student with direct face-swapping supervision. Overall, APPLE achieves state-of-the-art performance in terms of attribute preservation and identity transfer, producing more photorealistic and target-faithful results. Code will be made publicly available.
PaperID: 330,   Poster  https://arxiv.org/pdf/2510.21697     GitHub
Authors: Nir Goren, Shai Yehezkel, Omer Dahary, Andrey Voynov, Or Patashnik, Daniel Cohen-Or
Title: Visual Diffusion Models are Geometric Solvers
Abstract: In this paper we show that visual diffusion models can serve as effective geometric solvers: they can directly reason about geometric problems by working in pixel space. We first demonstrate this on the Inscribed Square Problem, a longstanding problem in geometry that asks whether every Jordan curve contains four points forming a square. We then extend the approach to two other well-known hard geometric problems: the Steiner Tree Problem and the Maximum Area Polygon Problem. Our method treats each problem instance as an image and trains a standard visual diffusion model that transforms Gaussian noise into an image representing a valid approximate solution that closely matches the exact one. The model learns to transform noisy geometric structures into correct configurations, effectively recasting geometric reasoning as image generation. Unlike prior work that necessitates specialized architectures and domain-specific adaptations when applying diffusion to parametric geometric representations, we employ a standard visual diffusion model that operates on the visual representation of the problem. This simplicity highlights a surprising bridge between generative modeling and geometric problem solving. Beyond the specific problems studied here, our results point toward a broader paradigm: operating in image space provides a general and practical framework for approximating notoriously hard problems, and opens the door to tackling a far wider class of challenging geometric tasks.
PaperID: 331,   Poster  https://arxiv.org/pdf/2512.13592     GitHub
Authors: Fu-Yun Wang, Hao Zhou, Liangzhe Yuan, Sanghyun Woo, Boqing Gong, Bohyung Han, Ming-Hsuan Yang, Han Zhang, Yukun Zhu, Ting Liu, Long Zhao
Title: Image Diffusion Preview with Consistency Solver
Abstract: The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. In this paper, we propose ConsistencySolver, a lightweight, trainable high-order solver derived from general linear multistep methods and optimized via Reinforcement Learning, that enhances preview quality and consistency. Experimental results demonstrate that ConsistencySolver significantly improves generation quality in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality.
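Note: for readers unfamiliar with the solver family named in this abstract, the textbook form of a general s-step linear multistep method for an ODE y' = f(t, y) with step size h is sketched below. This is the standard numerical-analysis definition only; how ConsistencySolver parameterizes or learns the coefficients is not stated in the abstract, and the symbols here are generic rather than the paper's notation.

```latex
% General s-step linear multistep method (textbook form).
% \alpha_j, \beta_j are the method's coefficients, with \alpha_s \neq 0.
\sum_{j=0}^{s} \alpha_j \, y_{n+j}
  \;=\; h \sum_{j=0}^{s} \beta_j \, f\!\left(t_{n+j},\, y_{n+j}\right)
% The method is explicit when \beta_s = 0 and implicit otherwise.
```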
PaperID: 332,   Poster  https://arxiv.org/pdf/2603.26929     GitHub
Authors: Xinyu Yang, Haozheng Yu, Yihong Sun, Bharath Hariharan, Jennifer J. Sun
Title: Live Interactive Training for Video Segmentation
Abstract: Interactive video segmentation often requires many user interventions for robust performance in challenging scenarios (e.g., occlusions, object separations, camouflage). Yet, even state-of-the-art models like SAM2 use corrections only for immediate fixes without learning from this feedback, leading to inefficient, repetitive user effort. To address this, we introduce Live Interactive Training (LIT), a novel framework for prompt-based visual systems where models also learn online from human corrections at inference time. Our primary instantiation, LIT-LoRA, implements this by continually updating a lightweight LoRA module on-the-fly. When a user provides a correction, this module is rapidly trained on that feedback, allowing the vision system to improve performance on subsequent frames of the same video. Leveraging the core principles of LIT, our LIT-LoRA implementation achieves an average 18-34% reduction in total corrections on challenging video segmentation benchmarks, with a negligible training overhead of ~0.5s per correction. We further demonstrate its generality by successfully adapting it to other segmentation models and extending it to CLIP-based fine-grained image classification. Our work highlights the promise of live adaptation to transform interactive tools and significantly reduce redundant human effort in complex visual tasks.
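Note: the "lightweight LoRA module" mentioned in this abstract refers to Low-Rank Adaptation, whose standard weight update is sketched below. Only the generic LoRA definition is shown; which layers LIT-LoRA adapts and how it is trained per correction are not specified in the abstract.

```latex
% Standard LoRA reparameterization of a frozen weight W_0 \in \mathbb{R}^{d \times k}.
% Only the low-rank factors B and A are trained; r \ll \min(d, k).
W' \;=\; W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k}
% \alpha is a scalar scaling hyperparameter.
```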
PaperID: 333,   Poster  https://arxiv.org/pdf/2604.15923     GitHub
Authors: Jiaxin Ye, Gaoxiang Cong, Chenhui Wang, Xin-Cheng Wen, Zhaoyang Li, Boyuan Cao, Hongming Shan
Title: Hierarchical Codec Diffusion for Video-to-Speech Generation
Abstract: Video-to-Speech (VTS) generation aims to synthesize speech solely from a silent video without auditory signals, and holds substantial promise for applications such as film dubbing and voice restoration for individuals with aphonia. However, existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders direct alignment between visual and speech features at specific hierarchical levels during property matching. In this paper, leveraging the distinctive hierarchical structure of Residual Vector Quantization (RVQ)-based codecs, we propose HiCoDiT, a novel Hierarchical Codec Diffusion Transformer that exploits the inherent hierarchy of discrete speech tokens to achieve efficient alignment. Specifically, since lower-level tokens encode coarse speaker-aware content and higher-level tokens capture fine-grained prosody, HiCoDiT employs separate low-level and high-level blocks to generate tokens at corresponding codec layers. The low-level blocks condition on lip-synchronized motion and facial identity to capture speaker-aware content modeling, while the high-level blocks use facial expression to modulate prosodic dynamics. Finally, to enable more effective coarse-to-fine conditioning, we propose a dual-scale Adaptive Instance Layer Normalization (AdaLN) that jointly captures global vocal style through channel-wise normalization and local prosody dynamics through temporal-wise normalization. Extensive experiments demonstrate that HiCoDiT outperforms state-of-the-art baselines in fidelity, semantic consistency, and expressiveness, highlighting the effectiveness of integrating speech hierarchy for VTS generation.
PaperID: 334,   Poster  https://arxiv.org/pdf/2602.21581     GitHub
Authors: Yingcheng Hu, Haowen Gong, Chuanguang Yang, Zhulin An, Yongjun Xu, Songhua Liu
Title: MultiAnimate: Pose-Guided Image Animation Made Extensible
Abstract: Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. While diffusion-based methods have achieved remarkable success, most existing approaches are limited to single-character animation. We observe that naively extending these methods to multi-character scenarios often leads to identity confusion and implausible occlusions between characters. To address these challenges, in this paper, we propose an extensible multi-character image animation framework built upon modern Diffusion Transformers (DiTs) for video generation. At its core, our framework introduces two novel components, Identifier Assigner and Identifier Adapter, which collaboratively capture per-person positional cues and inter-person spatial relationships. This mask-driven scheme, along with a scalable training strategy, not only enhances flexibility but also enables generalization to scenarios with more characters than those seen during training. Remarkably, trained on only a two-character dataset, our model generalizes to multi-character animation while maintaining compatibility with single-character cases. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines. Code will be released.
PaperID: 335,   Poster  https://arxiv.org/pdf/2602.13172     GitHub
Authors: Chong Cheng, Xianda Chen, Tao Xie, Wei Yin, Weiqiang Ren, Qian Zhang, Xiaoyang Guo, Hao Wang
Title: LongStream: Long-Sequence Streaming Autoregressive Visual Geometry
Abstract: Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail when processing long sequences. They typically anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses. This reformulates long-range extrapolation into a constant-difficulty local task. Second, we introduce orthogonal scale learning. This method fully disentangles geometry from scale estimation to suppress drift. Finally, we solve Transformer cache issues such as attention-sink reliance and long-term KV-cache contamination. We propose cache-consistent training combined with periodic cache refresh. This approach suppresses attention degradation over ultra-long sequences and reduces the gap between training and inference. Experiments show LongStream achieves state-of-the-art performance. It delivers stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS.
PaperID: 336,   Poster  https://arxiv.org/pdf/2511.20648     GitHub
Authors: Yunze Man, Shihao Wang, Guowen Zhang, Johan Bjorck, Liangyan Gui, Linxi Fan, Jan Kautz, Yu-Xiong Wang, Zhiding Yu
Title: LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
Abstract: To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how people reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results with 49.89 AP, surpassing the previous best by +15.51 absolute even when the baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong calibration and robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.
PaperID: 337,   Poster  https://arxiv.org/pdf/2603.05888     GitHub
Authors: Xiang Zhang, Sohyun Yoo, Hongrui Wu, Chuan Li, Jianwen Xie, Zhuowen Tu
Title: PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
Abstract: We introduce PixARMesh, the first method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative modeling, we enrich a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream of context, pose, and mesh tokens, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes ready for downstream applications.
PaperID: 338,   Poster  https://arxiv.org/pdf/2512.23568     GitHub
Authors: Siyu Jiao, Yiheng Lin, Yujie Zhong, Qi She, Wei zhou, Xiaohan Lan, Zilong Huang, Fei Yu, Yingchen Yu, Yunqing Zhao, Yao Zhao, Yunchao Wei
Title: ThinkGen: Generalized Thinking for Visual Generation
Abstract: Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages the MLLM's CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and the DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks.
PaperID: 339,   Poster  https://arxiv.org/pdf/2511.20640     GitHub
Authors: Ryan Burgert, Charles Herrmann, Forrester Cole, Michael Ryoo, Neal Wadhwa, Andrey Voynov, Nataniel Ruiz
Title: MotionV2V: Editing Motion in a Video
Abstract: While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has extensively explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising, yet under-explored, paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a 'motion edit' and demonstrate that this representation, when coupled with a generative backbone, enables many powerful video editing capabilities. To achieve this, we introduce a novel pipeline for generating 'motion counterfactuals' (video pairs that share identical content but distinct motion) and fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a 4-way head-to-head user study, our model achieves over 65% preference against prior work.
PaperID: 340,   Poster  https://arxiv.org/pdf/2603.00682     GitHub
Authors: Yushan Han, Hui Zhang, Qiming Xia, Yi Jin, Yidong Li
Title: CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion
Abstract: Collaborative perception empowers autonomous agents to share complementary information and overcome perception limitations. While early fusion offers more perceptual complementarity and is inherently robust to model heterogeneity, its high communication cost has limited its practical deployment, prompting most existing works to favor intermediate or late fusion. To address this, we propose CoLC, a communication-efficient early Collaborative perception framework that incorporates LiDAR Completion to restore scene completeness under sparse transmission. Specifically, CoLC integrates three complementary designs. First, each neighbor agent applies Foreground-Aware Point Sampling (FAPS) to selectively transmit informative points that retain essential structural and contextual cues under bandwidth constraints. The ego agent then employs Completion-Enhanced Early Fusion (CEEF) to reconstruct dense pillars from the received sparse inputs and adaptively fuse them with its own observations, thereby restoring spatial completeness. Finally, the Dense-Guided Dual Alignment (DGDA) strategy enforces semantic and geometric consistency between the enhanced and dense pillars during training, ensuring consistent and robust feature learning. Experiments on both simulated and real-world datasets demonstrate that CoLC achieves superior perception-communication trade-offs and remains robust under heterogeneous model settings.
PaperID: 341,   Poster  https://arxiv.org/pdf/2603.28045     GitHub
Authors: Jae-Young Kang, Hoonhee Cho, Taeyeop Lee, Minjun Kang, Bowen Wen, Youngho Kim, Kuk-Jin Yoon
Title: Event6D: Event-based Novel Object 6D Pose Tracking
Abstract: Event cameras provide microsecond latency, making them suitable for 6D object pose tracking in fast, dynamic scenes where conventional RGB and depth pipelines suffer from motion blur and large pixel displacements. We introduce EventTrack6D, an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. Conditioned on the most recent depth measurement, our dual reconstruction recovers dense photometric and geometric cues from sparse event streams. EventTrack6D operates at over 120 FPS and maintains temporal consistency under rapid motion. To support training and evaluation, we introduce a comprehensive benchmark suite: a large-scale synthetic dataset for training and two complementary evaluation sets, including real and simulated event datasets. Trained exclusively on synthetic data, EventTrack6D generalizes effectively to real-world scenarios without fine-tuning, maintaining accurate tracking across diverse objects and motion patterns. Our method and datasets establish event cameras as a viable solution for event-based novel object 6D pose tracking.
PaperID: 342,   Poster  https://arxiv.org/pdf/2603.10780     GitHub
Authors: shilong han, Yuming Zhang, Hongxia Wang
Title: Guiding Diffusion Models with Semantically Degraded Conditions
Abstract: Classifier-Free Guidance (CFG) is a cornerstone of modern text-to-image models, yet its reliance on a semantically vacuous null prompt (∅) generates a guidance signal prone to geometric entanglement. This is a key factor limiting its precision, leading to well-documented failures in complex compositional tasks. We propose Condition-Degradation Guidance (CDG), a novel paradigm that replaces the null prompt with a strategically degraded condition, c_deg. This reframes guidance from a coarse "good vs. null" contrast to a more refined "good vs. almost good" discrimination, thereby compelling the model to capture fine-grained semantic distinctions. To synthesize c_deg adaptively, our method models the self-attention mechanism as a graph and employs Weighted PageRank to identify and degrade the most semantically salient tokens. Validated on state-of-the-art models like Stable Diffusion 3, CDG markedly improves compositional accuracy and text-image alignment, addressing key failure modes of the baseline. As a lightweight, plug-and-play module, it achieves this with negligible computational overhead. Our work challenges the reliance on static, information-sparse negative samples and establishes a new principle for diffusion guidance: the construction of adaptive, semantically-aware negative samples is critical to achieving precise semantic control.
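Note: the contrast this abstract draws can be written out as follows. The first line is the standard classifier-free guidance rule; the second is a literal reading of "replaces the null prompt with a degraded condition" and is an assumption, since the paper's exact formulation is not given in the abstract. Here f_θ stands for the model's prediction (noise or velocity), w for the guidance scale, and c for the text condition.

```latex
% Standard classifier-free guidance (CFG): contrast against the null prompt.
\hat{f}_{\mathrm{CFG}}(x_t, c)
  = f_\theta(x_t, \varnothing)
  + w \left[ f_\theta(x_t, c) - f_\theta(x_t, \varnothing) \right]

% Assumed CDG form: the null prompt is swapped for the degraded condition c_deg.
\hat{f}_{\mathrm{CDG}}(x_t, c)
  = f_\theta(x_t, c_{\mathrm{deg}})
  + w \left[ f_\theta(x_t, c) - f_\theta(x_t, c_{\mathrm{deg}}) \right]
```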
PaperID: 343,   Poster  https://arxiv.org/pdf/2603.07561     GitHub
Authors: Zhichao Liao, Xiaole Xian, Qingyu Li, Wenyu Qin, Meng Wang, Weicheng Xie, Siyang Song, Pingfa Feng, Long ZENG, Liang Pan
Title: PureCC: Pure Learning for Text-to-Image Concept Customization
Abstract: Existing concept customization methods have achieved remarkable outcomes in high-fidelity and multi-concept customization. However, they often neglect the influence on the original model's behavior and capabilities when learning new personalized concepts. To address this issue, we propose PureCC. PureCC introduces a novel decoupled learning objective for concept customization, which combines the implicit guidance of the target concept with the original conditional prediction. This separated form enables PureCC to substantially focus on the original model during training. Moreover, based on this objective, PureCC designs a dual-branch training pipeline that includes a frozen extractor providing a purified target concept representation as implicit guidance and a trainable flow model producing the original conditional prediction, jointly achieving pure learning for personalized concepts. Furthermore, PureCC introduces a novel adaptive guidance scale λ* to dynamically adjust the guidance strength of the target concept, balancing customization fidelity and model preservation. Extensive experiments show that PureCC achieves state-of-the-art performance in preserving the original behavior and capabilities while enabling high-fidelity concept customization.
PaperID: 344,   Poster  https://arxiv.org/pdf/2509.05342     GitHub
Authors: Gaspard Beaudouin, Minghan LI, Jaeyeon Kim, Sung-Hoon Yoon, Mengyu Wang
Title: Delta Rectified Flow Sampling for Text-to-Image Editing
Abstract: We propose Delta Rectified Flow Sampling (DRFS), a novel inversion-free, path-aware editing framework within rectified flow models for text-to-image editing. DRFS is a distillation-based method that explicitly models the discrepancy between the source and target velocity fields in order to mitigate over-smoothing artifacts rampant in prior distillation sampling approaches. We further introduce a time-dependent shift term to push noisy latents closer to the target trajectory, enhancing the alignment with the target distribution. We theoretically demonstrate that when this shift is disabled, DRFS reduces to Delta Denoising Score, thereby bridging score-based diffusion optimization and velocity-based rectified-flow optimization. Moreover, when the shift term follows a linear schedule under rectified-flow dynamics, DRFS generalizes the inversion-free method FlowEdit and provides a principled theoretical interpretation for it. We conduct an analysis to guide the design of our shift term, and experimental results on the widely used PIE Benchmark indicate that DRFS achieves superior editing quality, fidelity, and controllability while requiring no architectural modifications.
PaperID: 345,   Poster  https://arxiv.org/pdf/2604.02845     GitHub
Authors: Chengxing Lin, Jinhong Deng, Yinjie Lei, Wen Li
Title: Deformation-based In-Context Learning for Point Cloud Understanding
Abstract: Recent advances in point cloud In-Context Learning (ICL) have demonstrated strong multitask capabilities. Existing approaches typically adopt a Masked Point Modeling (MPM)-based paradigm for point cloud ICL. However, MPM-based methods directly predict the target point cloud from masked tokens without leveraging geometric priors, requiring the model to infer spatial structure and geometric details solely from token-level correlations via transformers. Additionally, these methods suffer from a training–inference objective mismatch, as the model learns to predict the target point cloud using target-side information that is unavailable at inference time. To address these challenges, we propose DeformPIC, a deformation-based framework for point cloud ICL. Unlike existing approaches that rely on masked reconstruction, DeformPIC learns to deform the query point cloud under task-specific guidance from prompts, enabling explicit geometric reasoning and consistent objectives. Extensive experiments demonstrate that DeformPIC consistently outperforms previous state-of-the-art methods, achieving reductions of 1.6, 1.8, and 4.7 points in average Chamfer Distance on reconstruction, denoising, and registration tasks, respectively. Furthermore, we introduce a new out-of-domain benchmark to evaluate generalization across unseen data distributions, where DeformPIC achieves state-of-the-art performance.
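Note: the Chamfer Distance reported in this abstract is a nearest-neighbour metric between two point sets. The sketch below shows one common variant (symmetric, squared Euclidean distances, averaged per direction). Conventions differ across papers (squared vs. unsquared, sum vs. mean, scaling factors), and the abstract does not say which one DeformPIC uses, so the function name and the chosen variant are illustrative assumptions.

```python
from typing import Sequence

Point = Sequence[float]


def _sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance (assumed variant: mean squared NN distance).

    For each point in `a`, take the squared distance to its nearest point
    in `b`, average over `a`, and add the same quantity from `b` to `a`.
    Brute force, O(len(a) * len(b)); fine for small illustrative inputs.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(min(_sq_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(_sq_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a
```

Under this variant, identical point sets score zero, and lower values mean the predicted cloud lies closer to the target.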
PaperID: 346,   Poster  https://arxiv.org/pdf/2603.20708     GitHub
Authors: Xiaoran Zhang, Jian Ding, Yuxing Duan, Haoyue Liu, Gang Chen, Yi Chang, Luxin Yan
Title: High-Quality and Efficient Turbulence Mitigation with Events
Abstract: Turbulence mitigation (TM) is highly ill-posed due to the stochastic nature of atmospheric turbulence. Most methods rely on multiple frames recorded by conventional cameras to capture stable patterns in natural scenarios. However, they inevitably suffer from a trade-off between accuracy and efficiency: more frames enhance restoration at the cost of higher system latency and larger data overhead. Event cameras, equipped with microsecond temporal resolution and efficient sensing of dynamic changes, offer an opportunity to break the bottleneck. In this work, we present EHETM, a high-quality and efficient TM method inspired by the superiority of events to model motions in continuous sequences. We discover two key phenomena: (1) turbulence-induced events exhibit distinct polarity alternation correlated with sharp image gradients, providing structural cues for restoring scenes; and (2) dynamic objects form spatiotemporally coherent "event tubes" in contrast to irregular patterns within turbulent events, providing motion priors for disentangling objects from turbulence. Based on these insights, we design two complementary modules that respectively leverage polarity-weighted gradients for scene refinement and event-tube constraints for motion decoupling, achieving high-quality restoration with few frames. Furthermore, we construct two real-world event-frame turbulence datasets covering atmospheric and thermal cases. Extensive experiments show that EHETM outperforms SOTA methods, especially under scenes with dynamic objects, while reducing data overhead and system latency by approximately 77.3% and 89.5%, respectively.
PaperID: 347,   Poster  https://arxiv.org/pdf/2603.24278     GitHub
Authors: Guan Luo, Xiu Li, Rui Chen, Xuanyu Yi, Jing Lin, Chia-Hao Chen, Jiahang Liu, Song-Hai Zhang, Jianfeng Zhang
Title: TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification
Abstract: The dominant paradigm for high-fidelity 3D generation relies on a VAE-Diffusion pipeline, where the VAE's reconstruction capability sets a firm upper bound on generation quality. A fundamental challenge limiting existing VAEs is the representation mismatch between ground-truth meshes and network predictions: GT meshes have arbitrary, variable topology, while VAEs typically predict fixed-structure implicit fields (e.g., SDF on regular grids). This inherent misalignment prevents establishing explicit mesh-level correspondences, forcing prior work to rely on indirect supervision signals such as SDF or rendering losses. Consequently, fine geometric details, particularly sharp features, are poorly preserved during reconstruction. To address this, we introduce TopoMesh, a sparse voxel-based VAE that unifies both GT and predicted meshes under a shared Dual Marching Cubes (DMC) topological framework. Specifically, we convert arbitrary input meshes into DMC-compliant representations via a remeshing algorithm that preserves sharp edges using an L∞ distance metric. Our decoder outputs meshes in the same DMC format, ensuring that both predicted and target meshes share identical topological structures. This establishes explicit correspondences at the vertex and face level, allowing us to derive explicit mesh-level supervision signals for topology, vertex positions, and face orientations with clear gradients. Our sparse VAE architecture employs this unified framework and is trained with Teacher Forcing and progressive resolution training for stable and efficient convergence. Extensive experiments demonstrate that TopoMesh significantly outperforms existing VAEs in reconstruction fidelity, achieving superior preservation of sharp features and geometric details, as shown in Fig. 1.
PaperID: 348,   Poster  https://arxiv.org/pdf/2602.09999     GitHub
Authors: Florian Hahlbohm, Linus Franke, Martin Eisemann, Marcus Magnor
Title: Faster-GS: Analyzing and Improving Gaussian Splatting Optimization
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have focused on accelerating optimization while preserving reconstruction quality. However, many proposed methods entangle implementation-level improvements with fundamental algorithmic modifications or trade performance for fidelity, leading to a fragmented research landscape that complicates fair comparison. In this work, we consolidate and evaluate the most effective and broadly applicable strategies from prior 3DGS research and augment them with several novel optimizations. We further investigate underexplored aspects of the framework, including numerical stability, Gaussian truncation, and gradient approximation. The resulting system, Faster-GS, provides a rigorously optimized algorithm that we evaluate across a comprehensive suite of benchmarks. Our experiments demonstrate that Faster-GS achieves up to 5× faster training while maintaining visual quality, establishing a new cost-effective and resource-efficient baseline for 3DGS optimization. Furthermore, we demonstrate that optimizations can be applied to 4D Gaussian reconstruction, leading to efficient non-rigid scene optimization.
PaperID: 349,   Poster  https://arxiv.org/pdf/2512.22489     GitHub
Authors: Tanish Baranwal, Himanshu Singh Singh, Jathushan Rajasegaran, Jitendra Malik
Title: Tracking by Predicting 3-D Gaussians Over Time
Abstract: We propose Video Gaussian Masked Autoencoders (VideoGMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pre-training a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to the state of the art. With small-scale finetuning, our models achieve a 34.6% improvement on Kinetics and a 13.1% improvement on Kubric datasets, surpassing existing self-supervised video approaches.
PaperID: 350,   Poster  https://arxiv.org/pdf/2511.13719     GitHub
Authors: Zhongang Cai, Wang Ruisi, Chenyang Gu, Fanyi Pu, Junxiang Xu, YUBO WANG, Wanqi Yin, Zhitao Yang, Chen Wei, Tongxi Zhou, Qingping SUN, Hui En Pang, Jiaqi Li, Oscar Qian, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
Title: Scaling Spatial Intelligence with Multimodal Foundation Models
Abstract: Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SSI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SSI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SSI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, and 54.6% on ViewSpatial, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
PaperID: 351,   Poster  https://arxiv.org/pdf/2511.21678     GitHub
Authors: Weihao Bo, Shan Zhang, Yanpeng Sun, Jingjing Wu, Qunyi Xie, Xiao Tan, KunbinChen KunbinChen, Wei He, Xiaofan Li, Na Zhao, Jingdong Wang, Zechao Li
Title: Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
Abstract: MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo, solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge, preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across nine multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction–hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning.
PaperID: 352,   Poster  https://arxiv.org/pdf/2602.20068     GitHub
Authors: Harry Anthony, Ziyun Liang, Hermione Warr, Konstantinos Kamnitsas
Title: The Invisible Gorilla Effect in Out-of-distribution Detection
Abstract: Deep Neural Networks achieve high performance in vision tasks by learning features from regions of interest (ROI) within images, but their performance degrades when deployed on out-of-distribution (OOD) data that differs from training data. This challenge has led to OOD detection methods that aim to identify and reject unreliable predictions. Although prior work shows that OOD detection performance varies by artefact type, the underlying causes remain underexplored. To this end, we identify a previously unreported bias in OOD detection: for hard-to-detect artefacts (near-OOD), detection performance typically improves when the artefact shares visual similarity (e.g. colour) with the model’s ROI and drops when it does not, a phenomenon we term the Invisible Gorilla Effect. For example, in a skin lesion classifier with red lesion ROI, we show that the Mahalanobis Score method achieves a 31.5% higher AUROC when detecting OOD red ink (similar to ROI) compared to black ink (dissimilar) annotations. We annotated artefacts by colour in 11,355 images from three public datasets (e.g. ISIC) and generated colour-swapped counterfactuals to rule out dataset bias. We then evaluated 40 OOD methods across 7 benchmarks and found significant performance drops for most methods when artefacts differed from the ROI. Our findings highlight an overlooked failure mode in OOD detection and provide guidance for more robust detectors. Code and annotations will be released upon acceptance.
PaperID: 353,   Poster  https://arxiv.org/pdf/2604.02252     GitHub
Authors: Naomi Kombol, Ivan Martinović, Siniša Šegvić, Giorgos Tolias
Title: SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation
Abstract: Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. We will release code and models.
PaperID: 354,   Poster  https://arxiv.org/pdf/2512.16920     GitHub
Authors: Jinjie Mai, Chaoyang Wang, Guocheng Qian, Willi Menapace, Sergey Tulyakov, Bernard Ghanem, Peter Wonka, Ashkan Mirzaei
Title: EasyV2V: A High-quality Instruction-based Video Editing Framework
Abstract: While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Code and data will be released upon approval.
PaperID: 355,   Poster  https://arxiv.org/pdf/2505.17132     GitHub
Authors: Tanqiu Jiang, Jiacheng Liang, Rongyi Zhu, Jiawei Zhou, Fenglong Ma, Ting Wang
Title: Dynamic Token Reweighting for Robust Vision-Language Models
Abstract: Large vision-language models (VLMs) are highly vulnerable to multimodal jailbreak attacks that exploit visual-textual interactions to bypass safety guardrails. In this paper, we present DTR, a novel inference-time defense that mitigates multimodal jailbreak attacks through optimizing the model’s key-value (KV) caches. Rather than relying on curated safety-specific data or costly image-to-text conversion, we introduce a new formulation of the safety-relevant distributional shift induced by the visual modality. This formulation enables DTR to dynamically adjust visual token weights, minimizing the impact of adversarial visual inputs while preserving the model’s general capabilities and inference efficiency. Extensive evaluation across diverse VLMs and attack benchmarks demonstrates that DTR outperforms existing defenses in both attack robustness and benign-task performance, marking the first successful application of KV cache optimization for safety enhancement in multimodal foundation models. (The code for replicating DTR is included in the supplementary materials.)
PaperID: 356,   Poster  https://arxiv.org/pdf/2601.08881     GitHub
Authors: Yu Xu, Hongbin Yan, Juan Cao, YIJI CHENG, Tiankai Hang, Runze He, Zijin Yin, Shiyi Zhang, Yuxin Zhang, Jintao Li, Chunyu Wang, qinglin lu, Tong-yee Lee, Fan Tang
Title: TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts
Abstract: Unified image generation and editing models suffer from severe task interference in dense diffusion transformer architectures, where a shared parameter space must compromise between conflicting objectives (e.g., local editing vs. subject-driven generation). While the sparse Mixture-of-Experts (MoE) paradigm is a promising solution, its gating networks remain task-agnostic, operating based on local features, unaware of global task intent. This task-agnostic nature prevents meaningful specialization and fails to resolve the underlying task interference. In this paper, we propose a novel framework to inject semantic intent into MoE routing. We introduce a Hierarchical Task Semantic Annotation scheme to create structured task descriptors (e.g., scope, type, preservation). We then design Predictive Alignment Regularization to align internal routing decisions with the task's high-level semantics. This regularization evolves the gating network from a task-agnostic executor to a dispatch center. Our model effectively mitigates task interference, outperforming dense baselines in fidelity and quality, and our analysis shows that experts naturally develop clear and semantically correlated specializations.
PaperID: 357,   Poster  https://arxiv.org/pdf/2603.03617     GitHub
Authors: Hao Li, Yuhao Wang, Wenning Hao, Pingping Zhang, Dong Wang, Huchuan Lu
Title: RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation
Abstract: RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions by fusing visible and thermal infrared modalities. However, existing RGBT trackers rely solely on initial-frame visual information for target modeling, failing to adapt to appearance variations due to the absence of language guidance. Furthermore, current methods suffer from redundant search regions and heterogeneous modality gaps, causing background distraction. To address these issues, we first introduce textual descriptions into RGBT tracking benchmarks. This is accomplished through a pipeline that leverages Multi-modal Large Language Models (MLLMs) to automatically produce textual annotations. Afterwards, we propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking. To this end, we introduce a Multi-modal Transformer Encoder (MTE) for unified visual-language modeling. Then, we design an Adaptive Token Fusion (ATF) to select target-relevant tokens and perform channel exchanges based on cross-modal correlations, mitigating search redundancies and modality gaps. Finally, we propose a Context-aware Reasoning Module (CRM) to maintain a dynamic knowledge base and employ Retrieval-Augmented Generation (RAG) to enable temporal linguistic reasoning for robust target modeling. Extensive experiments on four RGBT benchmarks demonstrate that our framework achieves state-of-the-art performance across various challenging scenarios.
PaperID: 358,   Poster  https://arxiv.org/pdf/2601.11035     GitHub
Authors: Long Ma, Zihao Xue, Yan Wang, Zhiyuan Yan, Jin Xu, Xiaorui Jiang, Haiyang Yu, Yong Liao, Zhen Bi
Title: Your One-Stop Solution for AI-Generated Video Detection
Abstract: Recent advances in generative modeling can create remarkably realistic synthetic videos, making it increasingly difficult for humans to distinguish them from real ones and necessitating reliable detection methods. However, two key limitations hinder the development of this field. From the dataset perspective, existing datasets are often limited in scale and constructed using outdated or narrowly scoped generative models, making it difficult to capture the diversity and rapid evolution of modern generative techniques. Moreover, the dataset construction process frequently prioritizes quantity over quality, neglecting essential aspects such as semantic diversity, scenario coverage, and technological representativeness. From the benchmark perspective, current benchmarks largely remain at the stage of dataset creation, leaving many fundamental issues and in-depth analyses yet to be systematically explored. Addressing this gap, we propose AIGVDBench, a benchmark designed to be comprehensive and representative, covering 31 state-of-the-art generation models and over 440,000 videos. By executing more than 1,500 evaluations on 33 existing detectors belonging to four distinct categories, this work presents 8 in-depth analyses from multiple perspectives and identifies 4 novel findings that offer valuable insights for the field. We hope this work provides a solid foundation for advancing the field of AI-generated video detection.
PaperID: 359,   Poster  https://arxiv.org/pdf/2603.27040     GitHub
Authors: Guanhe Huang, Oya Celiktutan
Title: Unified Number-Free Text-to-Motion Generation Via Flow Matching
Abstract: Generative models excel at motion synthesis for a fixed number of agents but struggle to generalize to a variable number of agents. Based on limited, domain-specific data, existing methods employ autoregressive models to generate motion recursively, which suffer from inefficiency and error accumulation. We propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes number-free motion generation into a single-pass motion prior generation stage and multi-pass reaction generation stages. Specifically, UMF utilizes a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. For motion prior generation, P-Flow operates on hierarchical resolutions conditioned on different noise levels, thereby mitigating computational overheads. For reaction generation, S-Flow learns a joint probabilistic path that adaptively performs reaction transformation and context reconstruction, alleviating error accumulation. Extensive results and user studies demonstrate UMF’s effectiveness as a generalist model for multi-person motion generation from text. We will release the code.
PaperID: 360,   Poster  https://arxiv.org/pdf/2507.06233     GitHub
Authors: Inès Hyeonsu Kim, Seokju Cho, Jahyeok Koo, Junghyun Park, Gabriel Huang, Honglak Lee, Joon-Young Lee, Seungryong Kim
Title: AnthroTAP: Learning Point Tracking with Real-World Motion
Abstract: Point tracking models often struggle to generalize to real-world videos because large-scale training data is predominantly synthetic, the only source currently feasible to produce at scale. Collecting real-world annotations, however, is prohibitively expensive, as it requires tracking hundreds of points across frames. We introduce AnthroTAP, an automated pipeline that generates large-scale pseudo-labeled point tracking data from real human motion videos. Leveraging the structured complexity of human movement (non-rigid deformations, articulated motion, and frequent occlusions), AnthroTAP fits Skinned Multi-Person Linear (SMPL) models to detected humans, projects mesh vertices onto image planes, resolves occlusions via ray-casting, and filters unreliable tracks using optical flow consistency. A model trained on the AnthroTAP dataset achieves state-of-the-art performance on TAP-Vid, outperforming recent self-supervised teacher-student models trained on vastly larger real datasets, while requiring only one day of training on 4 GPUs. AnthroTAP shows that structured human motion offers a scalable and effective source of real-world supervision for point tracking. Code and datasets will be made publicly available.
PaperID: 361,   Poster  https://arxiv.org/pdf/2602.18977     GitHub
Authors: Thinesh Thiyakesan Ponbagavathi, Constantin Seibold, Alina Roitberg
Title: Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding
Abstract: Adapting image-pretrained backbones to video typically relies on time-domain adapters tuned to a single temporal scale. Our experiments show that these modules pick up static image cues and very fast flicker changes, while overlooking medium-speed motion. Capturing dynamics across multiple time-scales is, however, crucial for fine-grained temporal analysis (e.g., opening vs. closing a bottle). To address this, we introduce Frame2Freq, a family of frequency-aware adapters that perform spectral encoding during image-to-video adaptation of pretrained Vision Foundation Models (VFMs), improving fine-grained action recognition. Frame2Freq uses the Fast Fourier Transform (FFT) along time and learns frequency-band specific embeddings that adaptively highlight the most discriminative frequency ranges. Across five fine-grained activity recognition datasets, Frame2Freq outperforms prior PEFT methods and even surpasses fully fine-tuned models on four of them. These results provide encouraging evidence that frequency analysis methods are a powerful tool for modeling temporal dynamics in image-to-video transfer.
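Illustrative sketch (not the authors' implementation): the idea of taking a Fourier transform along the time axis and summarizing the spectrum per frequency band can be shown on a single per-frame feature value. The function names, the naive DFT, the equal-width band split, and the toy signals are all assumptions made for this example; Frame2Freq learns band-specific embeddings, whereas this sketch only sums energy in fixed bands.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the one-sided DFT of a real 1-D signal (naive O(T^2))."""
    T = len(signal)
    mags = []
    for k in range(T // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / T) for t, x in enumerate(signal)
        )
        mags.append(abs(coeff))
    return mags


def band_energies(signal, num_bands=3):
    """Split the one-sided spectrum into contiguous bands and sum the energy in each."""
    mags = dft_magnitudes(signal)
    n = len(mags)
    # Hypothetical equal-width band edges over the frequency bins.
    edges = [round(i * n / num_bands) for i in range(num_bands + 1)]
    return [
        sum(m * m for m in mags[edges[i]:edges[i + 1]]) for i in range(num_bands)
    ]


def dominant_band(signal, num_bands=3):
    """Index of the band that holds the most spectral energy."""
    energies = band_energies(signal, num_bands)
    return energies.index(max(energies))


T = 16
static = [1.0] * T                                              # no temporal change
medium = [math.cos(2 * math.pi * 4 * t / T) for t in range(T)]  # medium-speed motion
flicker = [(-1.0) ** t for t in range(T)]                       # frame-to-frame flicker

print(dominant_band(static), dominant_band(medium), dominant_band(flicker))
```

In this toy setting the three signals fall into the low, middle, and high band respectively, which is the kind of separation a band-specific embedding could exploit.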
PaperID: 362,   Poster  https://arxiv.org/pdf/2603.17355     GitHub
Authors: Yiwen Zhao, Ce Zheng, Yufu Wang, Hsueh-Han Yang, Liting Wen, Laszlo Jeni
Title: OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery
Abstract: Human mesh recovery (HMR) models the 3D human body from monocular videos, with recent works extending it to world-coordinate human trajectory and motion reconstruction. However, most existing methods remain offline, relying on future frames or global optimization, which limits their applicability in interactive feedback and perception-action loop scenarios such as AR/VR and telepresence. To address this, we propose OnlineHMR, a fully online framework that jointly satisfies four essential criteria of online processing: system-level causality, efficiency, faithfulness, and temporal consistency. Built upon the TRAM architecture, OnlineHMR enables streaming inference via a causal key–value cache design and a curated sliding-window learning strategy. Meanwhile, a human-centric incremental SLAM provides online world-grounded alignment under physically plausible trajectory correction. Experimental results show that our method achieves performance comparable to existing chunk-based approaches on the standard EMDB benchmark and highly dynamic custom videos, while uniquely supporting online processing.
PaperID: 363,   Poster  https://arxiv.org/pdf/2602.11673     GitHub
Authors: Khanh Nguyen, Dasith de Silva Edirimuni, Ghulam Mubashar Hassan, Ajmal Mian
Title: RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval
Abstract: 3D assets have rapidly expanded in quantity and diversity due to the growing popularity of virtual reality and gaming. As a result, text-to-shape retrieval has become essential in facilitating intuitive search within large repositories. However, existing methods require canonical poses and support few object categories, limiting their real-world applicability where objects can belong to diverse classes and appear in random orientations. To address this challenge, we propose RI-Mamba, the first rotation-invariant state-space model for point clouds. RI-Mamba defines global and local reference frames to disentangle pose from geometry and uses Hilbert sorting to construct token sequences with meaningful geometric structure while maintaining rotation invariance. We further introduce a novel strategy to compute orientational embeddings and reintegrate them via feature-wise linear modulation, effectively recovering spatial context and enhancing model expressiveness. Our strategy is inherently compatible with state-space models and operates in linear time. To scale up retrieval, we adopt cross-modal contrastive learning with automated triplet generation, allowing training on diverse datasets without manual annotation. Extensive experiments demonstrate RI-Mamba's superior representational capacity and robustness, achieving state-of-the-art performance on the OmniObject3D benchmark across more than 200 object categories under arbitrary orientations.
PaperID: 364,   Poster  https://arxiv.org/pdf/2511.17952     GitHub
Authors: LIANGYANG OUYANG, Yifei Huang, Mingfang Zhang, Caixin Kang, Ryosuke Furuta, Yoichi Sato
Title: Multi-speaker Attention Alignment for Multimodal Social Interaction
Abstract: Understanding social interaction in video requires reasoning over a dynamic interplay of verbal and nonverbal cues: who is speaking, to whom, and with what gaze or gestures. While Multimodal Large Language Models (MLLMs) are natural candidates, simply adding visual inputs yields surprisingly inconsistent gains on social tasks. Our quantitative analysis of cross-modal attention inside state-of-the-art MLLMs reveals a core failure mode: in multi-speaker scenes, visual and textual tokens lack speaker-consistent alignment, exhibiting substantially weaker cross-modal attention than in object-centric images. To address this, we propose a multimodal multi-speaker attention alignment method that can be integrated into existing MLLMs. First, we introduce dynamic cross-modal head selection to identify attention heads most responsible for grounding. Then, an adaptive social-aware attention bias, computed from existing attention patterns and speaker locations, is injected into the attention mechanism. This bias reinforces alignment between a speaker’s visual representation and their utterances without introducing trainable parameters or architectural changes. We integrate our method into three distinct MLLMs (LLaVA-NeXT-Video, Qwen2.5-VL, and InternVL3) and evaluate on three benchmarks (TVQA+, MMSI, OnlineMMSI). Across four social tasks, results demonstrate that our approach improves the ability of MLLMs and achieves state-of-the-art results. Attention visualizations confirm our method successfully focuses the model on speaker-relevant regions, enabling more robust multi-party social reasoning.
PaperID: 365,   Poster  https://arxiv.org/pdf/2603.03101     GitHub
Authors: Jun Yeong Park, JunYoung Seo, Minji Kang, Yu Rang Park
Title: MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection
Abstract: The CLIP model's outstanding generalization has driven recent success in Zero-Shot Anomaly Detection (ZSAD) for detecting anomalies in unseen categories. The core challenge in ZSAD is to specialize the model for anomaly detection tasks while preserving CLIP's powerful generalization capability. Existing approaches attempting to solve this challenge share the fundamental limitation of a patch-agnostic design that processes all patches monolithically without regard for their unique characteristics. To address this limitation, we propose MoECLIP, a Mixture-of-Experts (MoE) architecture for the ZSAD task, which achieves patch-level adaptation by dynamically routing each image patch to a specialized Low-Rank Adaptation (LoRA) expert based on its unique characteristics. Furthermore, to prevent functional redundancy among the LoRA experts, we introduce (1) Frozen Orthogonal Feature Separation (FOFS), which orthogonally separates the input feature space to force experts to focus on distinct information, and (2) a simplex equiangular tight frame (ETF) loss to regulate the expert outputs to form maximally equiangular representations. Comprehensive experimental results across 14 benchmark datasets spanning industrial and medical domains demonstrate that MoECLIP outperforms existing state-of-the-art methods. We provide our code in the supplementary material.
PaperID: 366,   Poster  https://arxiv.org/pdf/2512.04534     GitHub
Authors: Youze Huang, Penghui Ruan, Bojia Zi, Xianbiao Qi, Jianan Wang, Rong Xiao
Title: Refaçade: Editing Object with Given Reference Texture
Abstract: Recent advances in diffusion models have brought remarkable progress in image and video editing, yet some tasks remain underexplored. In this paper, we introduce a new task, Object Retexture, which transfers local textures from a reference object to a target object in images or videos. To perform this task, a straightforward solution is to use ControlNet conditioned on the source structure and the reference texture. However, this approach suffers from limited controllability for two reasons: conditioning on the raw reference image introduces unwanted structural information, and this method fails to disentangle the visual texture and structure information of the source. To address this problem, we propose a method, namely Refaçade, that consists of two key designs to achieve precise and controllable texture transfer in both images and videos. First, we employ a texture remover trained on paired textured/untextured 3D mesh renderings to remove appearance information while preserving the geometry and motion of source videos. Second, we disrupt the reference’s global layout using a jigsaw permutation, encouraging the model to focus on local texture statistics rather than the object's global layout. Extensive experiments demonstrate superior visual quality, precise editing, and controllability, outperforming strong baselines in both quantitative and human evaluations.
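Illustrative sketch (not the authors' implementation): a jigsaw permutation of the kind mentioned in the abstract can be pictured as cutting the reference image into non-overlapping tiles and shuffling them, so that local texture inside each tile survives while the global layout is destroyed. The function name, the list-of-rows image representation, the tile size, and the seeding are assumptions made for this example.

```python
import random


def jigsaw_permute(image, patch, seed=0):
    """Shuffle non-overlapping patch x patch tiles of a 2-D grid (list of rows)."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    rows, cols = h // patch, w // patch

    # Cut the image into tiles, listed in row-major order.
    tiles = [
        [row[c * patch:(c + 1) * patch] for row in image[r * patch:(r + 1) * patch]]
        for r in range(rows)
        for c in range(cols)
    ]

    # Permute tile positions; pixel values inside each tile are left untouched.
    random.Random(seed).shuffle(tiles)

    # Reassemble the shuffled tiles into a grid of the original size.
    out = []
    for r in range(rows):
        for y in range(patch):
            out_row = []
            for c in range(cols):
                out_row.extend(tiles[r * cols + c][y])
            out.append(out_row)
    return out


# Toy 4x4 "image" whose pixel values encode their original position.
image = [[4 * y + x for x in range(4)] for y in range(4)]
shuffled = jigsaw_permute(image, patch=2, seed=1)
for row in shuffled:
    print(row)
```

Every 2x2 tile in the output is an intact tile from the input, only relocated, which is why local statistics are preserved.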
PaperID: 367,   Poster  https://arxiv.org/pdf/2601.10909     GitHub
Authors: Chuqiao Li, Xianghui Xie, Yong Cao, Andreas Geiger, Gerard Pons-Moll
Title: FrankenMotion: Part-level Human Motion Generation and Composition
Abstract: Human motion generation from text prompts has made remarkable progress in recent years. However, existing methods primarily rely on either sequence-level or action-level descriptions due to the absence of fine-grained, part-level motion annotations. This limits their controllability over individual body parts. In this work, we construct a high-quality motion captioning dataset with atomic, temporally-aware part-level annotations, leveraging the reasoning capabilities of large language models (LLMs). Unlike prior datasets that either provide synchronized part captions with fixed time segments or rely solely on global sequence labels, our dataset captures asynchronous and semantically distinct part movements at fine temporal resolution. Based on this dataset, we introduce a diffusion-based part-aware motion generation framework, namely FrankenMotion, where each body part is guided by its own temporally-structured textual prompt. This is, to our knowledge, the first work to provide atomic, temporally-aware part-level motion annotations and a model that allows motion generation with both spatial (body part) and temporal (atomic action) control. Experiments demonstrate that FrankenMotion outperforms all previous baseline models adapted and retrained for our setting, and our model can compose motions unseen during training. Our code and dataset will be publicly available upon publication.
PaperID: 368,   Poster  https://arxiv.org/pdf/2603.07057     GitHub
Authors: Tong Shao, Yusen Fu, Guoying Sun, Jingde Kong, Zhuotao Tian, Jingyong Su
Title: SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer
Abstract: Diffusion Transformers have become a dominant paradigm in visual generation, yet their low inference efficiency remains a key bottleneck hindering further advancement. Among common training-free techniques, caching offers high acceleration efficiency but often compromises fidelity, whereas pruning shows the opposite trade-off. Integrating caching with pruning achieves a balance between acceleration and generation quality. However, existing methods typically employ fixed and heuristic schemes to configure caching and pruning strategies. While they roughly follow the overall sensitivity trend of generation models to acceleration, they fail to capture fine-grained and complex variations, inevitably skipping highly sensitive computations and leading to quality degradation. Furthermore, such manually designed strategies exhibit poor generalization. To address these issues, we propose SODA, a Sensitivity-Oriented Dynamic Acceleration method that adaptively performs caching and pruning based on fine-grained sensitivity. SODA builds an offline sensitivity error modeling framework across timesteps, layers, and modules to capture the sensitivity to different acceleration operations. The cache intervals are optimized via dynamic programming with sensitivity error as the cost function, minimizing the impact of caching on model sensitivity. During pruning and cache reuse, SODA adaptively determines the pruning timing and rate to preserve computations of highly sensitive tokens, significantly enhancing generation fidelity. Extensive experiments on DiT-XL/2, PixArt-α, and OpenSora demonstrate that SODA achieves state-of-the-art generation fidelity under controllable acceleration ratios. The code will be made publicly available.
PaperID: 369,   Poster  https://arxiv.org/pdf/2506.08002     GitHub
Authors: Aadarsh Sahoo, Vansh Tibrewal, Georgia Gkioxari
Title: Aligning Text, Images and 3D Structure Token-by-Token
Abstract: Creating machines capable of understanding the world in 3D is essential in assisting designers that build and edit 3D environments and robots navigating and interacting within a three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes and provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance addressing key questions related to data representation, modality-specific objectives, and more. We show how to tokenize complex 3D objects to incorporate into our structured 3D scene modality. We evaluate performance across four core 3D tasks – rendering, recognition, instruction-following, and question-answering – and four 3D datasets, synthetic and real-world. We show our model’s effectiveness on reconstructing complete 3D scenes consisting of complex objects from a single image and on real-world 3D object recognition tasks.
PaperID: 370,   Poster  https://arxiv.org/pdf/2512.21003     GitHub
Authors: Xiangzuo Wu, Chengwei Ren, Jun Zhou, Xiu Li, Yuan Liu
Title: MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds
Abstract: Multi-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints. Existing single-view approaches often ignore cross-view relationships, leading to inconsistent results, while multi-view optimization methods rely on slow differentiable rendering and per-scene refinement, making them computationally expensive and hard to scale. To address these limitations, we introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallicity, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, our model captures both intra-view long-range lighting interactions and inter-view material consistency, enabling coherent scene-level reasoning within a single forward pass. Due to the scarcity of real-world training data, models trained on existing synthetic datasets often struggle to generalize to real-world scenes. To overcome this limitation, we propose a consistency-based finetuning strategy that leverages unlabeled real-world videos to enhance both multi-view coherence and robustness under in-the-wild conditions. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in terms of multi-view consistency, material and normal estimation quality, and generalization to real-world imagery.
PaperID: 371,   Poster  https://arxiv.org/pdf/2603.14176     GitHub
Authors: Bang-Dang Pham, Anh Tran, Cuong Pham, Minh Nguyen Nguyen
Title: BluRef: Unsupervised Image Deblurring with Dense-Matching References
Abstract: This paper introduces a novel unsupervised approach for image deblurring that utilizes a simple process for training data collection, thereby enhancing the applicability and effectiveness of deblurring methods. Our technique does not require meticulously paired data of blurred and corresponding sharp images; instead, it uses unpaired blurred and sharp images of similar scenes to generate pseudo-ground truth data by leveraging a dense matching model to identify correspondences between a blurry image and reference sharp images. Thanks to the simplicity of the training data collection process, our approach does not rely on existing paired training data or pre-trained networks, making it more adaptable to various scenarios and suitable for networks of different sizes, including those designed for low-resource devices. We demonstrate that this novel approach achieves state-of-the-art performance, marking a significant advancement in the field of image deblurring.
PaperID: 372,   Poster  https://arxiv.org/pdf/2603.27060     GitHub
Authors: Jihwan Hong, Jaeyoung Do
Title: VIRST: Video-Instructed Reasoning assistant for SpatioTemporal Segmentation
Abstract: Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, CLIP-based and keyframe-based approaches that couple a vision-language model with a separate propagation module often fail to capture rapidly changing spatiotemporal dynamics and to handle queries that require multi-step reasoning, which leads to sharp performance drops on motion-intensive and reasoning-oriented videos beyond static RVOS benchmarks. To address these limitations, we propose VIRST (Video-Instructed Reasoning assistant for Spatio-Temporal Segmentation), an end-to-end framework that unifies global video reasoning and pixel-level mask prediction within a single model. VIRST bridges semantic and segmentation representations through the Spatio-Temporal Fusion (STF), which fuses segmentation-aware video features into the vision-language backbone, and employs the Temporal Dynamic Anchor Updater (TDAU) to maintain dynamically updated anchor frames that provide stable temporal cues under large motion, occlusion, and reappearance. This unified design achieves state-of-the-art results across diverse RVOS benchmarks under realistic and challenging conditions, demonstrating strong generalization to both referring and reasoning-oriented settings.
PaperID: 373,   Poster  https://arxiv.org/pdf/2512.02268     GitHub
Authors: Jeremy Irvin, Jiaqi Han, Zikui Wang, Abdulaziz Alharbi, Yufei Zhao, Nomin-Erdene Bayarsaikhan, Daniele Visioni, Andrew Y. Ng, Duncan Watson-Parris
Title: Spatiotemporal Pyramid Flow Matching for Climate Emulation
Abstract: Generative models have the potential to transform the way we emulate Earth’s changing climate. Previous generative approaches rely on weather-scale autoregression for climate emulation, but this is inherently slow for long climate horizons and has yet to demonstrate stable rollouts under nonstationary forcings. Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. Inspired by cascaded video models, SPF partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution to reduce computation and coupling each stage with an associated timescale to enable direct sampling at any temporal level in the pyramid. This design, together with conditioning each stage on prescribed physical forcings (e.g., greenhouse gases or aerosols), enables efficient, parallel climate emulation at multiple timescales. On ClimateBench, SPF outperforms strong flow matching baselines and pre-trained models at yearly and monthly timescales while offering fast sampling, especially at coarser temporal levels. To scale SPF, we curate ClimateSuite, the largest collection of Earth system simulations to date, comprising over 33,000 simulation-years across ten climate models and the first dataset to include simulations of climate interventions. We find that the scaled SPF model demonstrates good generalization to held-out scenarios across climate models. Together, SPF and ClimateSuite provide a foundation for accurate, efficient, probabilistic climate emulation across temporal scales and realistic future scenarios. Data and code are publicly available at [anonymized for review].
PaperID: 374,   Poster  https://arxiv.org/pdf/2603.00918     GitHub
Authors: Seungwook Kim, Minsu Cho
Title: Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
Abstract: Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce ARC (Adaptive Rewarding by Self-Confidence), a post-training framework that replaces external reward supervision with an internal self-confidence signal, obtained by evaluating how accurately the model recovers injected noise under self-denoising probes. ARC converts this intrinsic signal into scalar rewards, enabling fully unsupervised optimization without additional datasets, annotators, or reward models. Empirically, by reinforcing high-confidence generations, ARC delivers consistent gains in compositional generation, text rendering and text-image alignment over the baseline. We also find that integrating ARC with external rewards results in a complementary improvement, with alleviated reward hacking.
PaperID: 375,   Poster  https://arxiv.org/pdf/2602.22917     GitHub
Authors: Hongzhao Li, Hao Dong, Hualei Wan, Shupan Li, Mingliang Xu, Muhammad Haris Khan
Title: Towards Multimodal Domain Generalization with Few Labels
Abstract: Multimodal models ideally should generalize to unseen domains while remaining data-efficient to reduce annotation costs. To this end, we introduce and study a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG), which aims to learn robust multimodal models from multi-source data with few labeled samples. We observe that existing approaches fail to address this setting effectively: multimodal domain generalization methods cannot exploit unlabeled data, semi-supervised multimodal learning methods ignore domain shifts, and semi-supervised domain generalization methods are confined to single-modality inputs. To overcome these limitations, we propose a unified framework featuring three key components: Consensus-Driven Consistency Regularization, which obtains reliable pseudo-labels through confident fused-unimodal consensus; Disagreement-Aware Regularization, which effectively utilizes ambiguous non-consensus samples; and Cross-Modal Prototype Alignment, which enforces domain- and modality-invariant representations while promoting robustness under missing modalities via cross-modal translation. We further establish the first SSMDG benchmarks, on which our method consistently outperforms strong baselines in both standard and missing-modality scenarios. Our benchmarks and code will be released to support future research.
PaperID: 376,   Poster  https://arxiv.org/pdf/2505.17012     GitHub
Authors: Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, Weidi Xie
Title: SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
Abstract: Existing studies on multimodal large language models (MLLMs) in spatial understanding are typically limited by fragmented assessments. This work considers a comprehensive evaluation of the spatial understanding abilities of existing MLLMs. Concretely, we make the following contributions in this paper: (i) we propose SpatialScore, the most comprehensive and diverse multimodal spatial intelligence benchmark to date, encompassing various visual data types, input modalities, and QA formats with around 5K manually verified samples across 30 distinct tasks; (ii) we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples for supervised fine-tuning of Qwen3-VL on spatial understanding; (iii) we develop SpatialAgent, a multi-agent system incorporating 12 specialized spatial perception tools, supporting both Plan-Execute and ReAct reasoning paradigms, to improve spatial reasoning in a training-free manner; and (iv) we conduct extensive evaluations on 40 representative MLLMs, revealing persistent challenges in spatial intelligence while demonstrating the effectiveness of our data-driven and agent-based solutions. All data, code, and models will be publicly available.
PaperID: 377,   Poster  https://arxiv.org/pdf/2604.06720     GitHub
Authors: Zhiqiang Liu, Rui Song, Chuanqi DuanMu, Jiaojiao Li, David Ferstl, Yinlin Hu
Title: Exploring 6D Object Pose Estimation with Deformation
Abstract: We present DeSOPE, a large-scale dataset designed for Deformed Six-DoF Object Pose Estimation. Most existing 6D object pose approaches assume rigid or articulated objects, leaving deformed daily objects largely unexplored. This gap limits the realism and robustness of current pose estimation methods, which often fail when objects deviate from their canonical shapes due to wear, collision, or deformation. To address this issue, we present DeSOPE, a large-scale real-world dataset specifically designed for deformed object pose estimation. DeSOPE contains two major components: (1) a collection of high-fidelity 3D scans of 26 common object categories, each captured in one canonical and three deformed states using a non-rigid alignment framework; and (2) a real-scene RGB-D dataset comprising 133K frames and 665K pose annotations across 104 deformed instances, recorded in both static and dynamic scenarios. The varying degrees of deformation introduce substantial geometric and textural changes, presenting new challenges for existing methods. We benchmark several state-of-the-art algorithms on DeSOPE and demonstrate significant performance degradation as deformation increases, highlighting the limitations of current pose estimators. As the first large-scale dataset designed for systematic study of deformed object pose estimation, DeSOPE lays the groundwork for developing 6D pose estimators capable of handling real-world deformation and variability.
PaperID: 378,   Poster  https://arxiv.org/pdf/2603.00609     GitHub
Authors: Changxing Liu, Zichen Chao, Siheng Chen
Title: Linking Modality Isolation in Heterogeneous Collaborative Perception
Abstract: Collaborative perception leverages data exchange among multiple agents to enhance overall perception capabilities. However, heterogeneity across agents introduces domain gaps that hinder collaboration, and this is further exacerbated by an underexplored issue: modality isolation. It arises when multiple agents with different modalities never co-occur in any training data frame, enlarging cross-modal domain gaps. Existing alignment methods rely on supervision from spatially overlapping observations and thus fail to handle modality isolation. To address this challenge, we propose CodeAlign, the first efficient, co-occurrence-free alignment framework that smoothly aligns modalities via cross-modal feature-code-feature (FCF) translation. The key idea is to explicitly identify the representation consistency through a codebook, and directly learn mappings between modality-specific feature spaces, thereby eliminating the need for spatial correspondence. Codebooks regularize feature spaces into code spaces, providing compact yet expressive representations. With a prepared code space for each modality, CodeAlign learns FCF translations that map features to the corresponding codes of other modalities, which are then decoded back into features in the target code space, enabling effective alignment. Experiments show that, when integrating three modalities, CodeAlign requires only 8% of the training parameters of prior alignment methods, reduces communication load by 1024x, and achieves state-of-the-art perception performance on both the OPV2V and DAIR-V2X datasets. Code will be released.
PaperID: 379,   Poster  https://arxiv.org/pdf/2511.18359     GitHub
Authors: Alexandros Stergiou
Title: TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
Abstract: How do video understanding models acquire their answers? Although current Vision Language Models (VLMs) reason over complex scenes with diverse objects, action performances, and scene dynamics, understanding and controlling their internal processes remains an open challenge. Motivated by recent advancements in text-to-video (T2V) generative models, this paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos that capture the underlying rules behind VLMs' predictions. Given the high-visual-fidelity produced by T2V models, TRANSPORTER learns an optimal transport coupling to VLM's high-semantic embedding spaces. In turn, logit scores define embedding directions for conditional video generation. TRANSPORTER generates videos that reflect caption changes over diverse object attributes, action adverbs, and scene context. Quantitative and qualitative evaluations across VLMs demonstrate that L2V can provide a fidelity-rich, novel direction for model interpretability that has not been previously explored.
PaperID: 380,   Poster  https://arxiv.org/pdf/2510.23116     GitHub
Authors: Hebaixu Wang, Jing Zhang, Haoyang Chen, Haonan Guo, Di Wang, Jiayi Ma, Bo Du
Title: Residual Diffusion Bridge Model for Image Restoration
Abstract: Diffusion bridge models establish probabilistic paths between arbitrary paired distributions and exhibit great potential for universal image restoration. Most existing methods merely treat them as simple variants of stochastic interpolants, lacking a unified analytical perspective. Besides, they indiscriminately reconstruct images through global noise injection and removal, inevitably distorting undegraded regions due to imperfect reconstruction. To address these challenges, we propose the Residual Diffusion Bridge Model (RDBM). Specifically, we theoretically reformulate the stochastic differential equations of generalized diffusion bridge and derive the analytical formulas of its forward and reverse processes. Crucially, we leverage the residuals from given distributions to modulate the noise injection and removal, enabling adaptive restoration of degraded regions while preserving intact others. Additionally, we unravel the fundamental mathematical essence of existing bridge models, all of which are special cases of RDBM and empirically demonstrate the optimality of our proposed models. Extensive experiments are conducted to demonstrate the state-of-the-art performance of our method both qualitatively and quantitatively across diverse image restoration tasks.
PaperID: 381,   Poster  https://arxiv.org/pdf/2512.14133     GitHub
Authors: Tianyi Xie, Yunuo Chen, Yaowei Guo, Yin Yang, Bolei Zhou, Demetri Terzopoulos, Ying Jiang, Chenfanfu Jiang
Title: AnimaMimic: Imitating 3D Animation from Video Priors
Abstract: Creating realistic 3D animation remains a time-consuming and expertise-dependent process, requiring manual rigging, keyframing, and fine-tuning of complex motions. Meanwhile, video diffusion models have recently demonstrated remarkable motion imagination in 2D, generating dynamic and visually coherent motion from text or image prompts. However, their results lack explicit 3D structure and cannot be directly used for animation or simulation. We present AnimaMimic, a framework that animates static 3D meshes using motion priors learned from video diffusion models. Starting from an input mesh, AnimaMimic synthesizes a monocular animation video, automatically constructs a skeleton with skinning weights, and refines joint parameters through differentiable rendering and video-based supervision. To further enhance realism, we integrate a differentiable simulation module that refines mesh deformation through physically grounded soft-tissue dynamics. Our method bridges the creativity of video diffusion and the structural control of 3D rigged animation, producing physically plausible, temporally coherent, and artist-editable motion sequences that integrate seamlessly into standard animation pipelines.
PaperID: 382,   Poster  https://arxiv.org/pdf/2512.02650     GitHub
Authors: Junwon Lee, Juhan Nam, Jiyoung Lee
Title: Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
Abstract: This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate single source-mixed sounds at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of the target source and modulates the video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MonoAudio, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available in the supplementary material.
PaperID: 383,   Poster  https://arxiv.org/pdf/2512.15716     GitHub
Authors: Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, Yan Lu
Title: Spatia: Video Generation with Updatable Spatial Memory
Abstract: Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory–aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic–static disentanglement design enhances spatial consistency throughout the generation process while preserving the model’s ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
PaperID: 384,   Poster  https://arxiv.org/pdf/2602.07544     GitHub
Authors: Sebastian Bock, Leonie Schüßler, Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth
Title: MUFASA: A Multi-Layer Framework for Slot Attention
Abstract: Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.
PaperID: 385,   Poster  https://arxiv.org/pdf/2603.24454     GitHub
Authors: Jiawen Zhu, Yunqi Miao, Xueyi Zhang, Jiankang Deng, Guansong Pang
Title: Unleashing Vision-Language Semantics for Video Deepfake Detection
Abstract: Recent video Deepfake Detection (DFD) studies have demonstrated that pretrained Vision-Language Models (VLMs) such as CLIP exhibit strong generalization capabilities in detecting artifacts across different identities. However, existing approaches focus on leveraging visual features only, overlooking their most distinctive strength: the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics in enhancing the model's discriminability in deepfake detection. This work i) enhances the visual perception of the VLM through a ForgePerceiver, which acts as an independent learner to capture subtle and diverse forgery cues both granularly and holistically, while preserving the pretrained Vision–Language Alignment (VLA) knowledge, and ii) provides a complementary discriminative cue, the Identity-aware VLA score, derived by coupling cross-modal semantics with the forgery cues learned by ForgePerceiver. Notably, the VLA score is augmented by an identity prior-informed text prompting to capture authenticity cues tailored to each identity, thereby enabling more discriminative cross-modal semantics. Comprehensive experiments on video DFD benchmarks, including classical face-swapping forgeries and recent full-face generation forgeries, demonstrate that our VLAForge substantially outperforms state-of-the-art methods at both frame and video levels.
PaperID: 386,   Poster  https://arxiv.org/pdf/2512.03041     GitHub
Authors: Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, Xu Jia
Title: MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Abstract: Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
PaperID: 387,   Poster  https://arxiv.org/pdf/2604.04681     GitHub
Authors: Qing Zhou, Bingxuan Zhao, Tao Yang, Hongyuan Zhang, Junyu Gao, Qi Wang
Title: Batch Loss Score for Dynamic Data Pruning
Abstract: Dynamic data pruning accelerates deep learning by selectively omitting less informative samples during training. While per-sample loss is a common importance metric, obtaining it can be challenging or infeasible for complex models or loss functions, often requiring significant implementation effort. This work proposes the Batch Loss Score (BLS), a computationally efficient alternative using an Exponential Moving Average (EMA) of readily available batch losses to assign scores to individual samples. We frame the batch loss, from the perspective of a single sample, as a noisy measurement of its scaled individual loss, with noise originating from stochastic batch composition. It is formally shown that the EMA mechanism functions as a first-order low-pass filter, attenuating high-frequency batch composition noise. This yields a score approximating the smoothed and persistent contribution of the individual sample to the loss, providing a theoretical grounding for BLS as a proxy for sample importance. BLS demonstrates remarkable code integration simplicity (three-line injection) and readily adapts existing per-sample loss-based methods (one-line proxy). Its effectiveness is demonstrated by enhancing two such methods to losslessly prune 20%-50% of samples across 14 datasets, 11 tasks and 18 models, highlighting its utility and broad applicability, especially for complex scenarios where per-sample loss is difficult to access.
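The EMA-of-batch-losses idea can be sketched as below. This is a minimal illustration under assumptions, not the authors' code: the class name `BatchLossScore`, the `momentum` parameter, and the choice to initialize a sample's score with the first batch loss it appears in are all hypothetical.

```python
from typing import Iterable, List

class BatchLossScore:
    """Per-sample score kept as an EMA of the losses of batches containing it."""

    def __init__(self, num_samples: int, momentum: float = 0.9) -> None:
        if not 0.0 <= momentum < 1.0:
            raise ValueError("momentum must be in [0, 1)")
        self.momentum = momentum
        self.scores: List[float] = [0.0] * num_samples
        self.seen: List[bool] = [False] * num_samples

    def update(self, sample_ids: Iterable[int], batch_loss: float) -> None:
        # Every sample in the batch receives the same scalar batch loss;
        # the EMA smooths out noise caused by random batch composition.
        m = self.momentum
        for i in sample_ids:
            if self.seen[i]:
                self.scores[i] = m * self.scores[i] + (1.0 - m) * batch_loss
            else:
                # Assumption: the first observation initializes the score.
                self.scores[i] = batch_loss
                self.seen[i] = True
```

In a training loop, `update` would be called once per step with the batch's sample indices and its scalar loss; the resulting `scores` could then stand in wherever a pruning method expects per-sample losses.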
PaperID: 388,   Poster  https://arxiv.org/pdf/2603.19176     GitHub
Authors: Amandine Brunetto
Title: Few-shot Acoustic Synthesis with Multimodal Flow Matching
Abstract: Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce FLow-matching ACoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint Acoustic–GeometRy EmbEdding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to acoustics, establishing a new direction for robust and data-efficient acoustic synthesis.
PaperID: 389,   Poster  https://arxiv.org/pdf/2507.23277     GitHub
Authors: Gyeongjin Kang, Seungtae Nam, Seung kwon Yang, Xiangyu Sun, Sameh Khamis, Abdelrahman Mohamed, Eunbyung Park
Title: iLRM: An Iterative Large 3D Reconstruction Model
Abstract: Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input images to enable compact 3D representations; (2) decomposing global multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed.
PaperID: 390,   Poster  https://arxiv.org/pdf/2512.08529     GitHub
Authors: Yunzhu Zhang, Zeyu Pan, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Linchao Zhu
Title: MVP: Multiple View Prediction improves GUI grounding
Abstract: GUI grounding, which translates natural language instructions into precise pixel coordinates, is essential for developing practical GUI agents. However, we observe that existing grounding models exhibit significant coordinate prediction instability: minor visual perturbations (e.g., cropping a few pixels) can drastically alter predictions, flipping results between correct and incorrect. This instability severely undermines model performance, especially for samples with high-resolution and small UI elements. To address this issue, we propose Multi-View Prediction (MVP), a training-free framework that enhances grounding performance through multi-view inference. Our key insight is that while single-view predictions may be unstable, aggregating predictions from multiple carefully cropped views can effectively distinguish correct coordinates from outliers. MVP comprises two components: (1) Attention-Guided View Proposal, which derives diverse views guided by instruction-to-image attention scores, and (2) Multi-Coordinates Clustering, which ensembles predictions by selecting the centroid of the densest spatial cluster. Extensive experiments demonstrate MVP's effectiveness across various models and benchmarks. Notably, on ScreenSpot-Pro, MVP boosts UI-TARS-1.5-7B to 56.1%, GTA1-7B to 61.7%, Qwen3VL-8B-Instruct to 65.3%, and Qwen3VL-32B-Instruct to 74.0%.
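Selecting the centroid of the densest spatial cluster can be illustrated with a simple radius-based density rule. This is a sketch under assumptions, not the paper's clustering procedure: the function name `densest_cluster_centroid` and the `radius` parameter are hypothetical, and the per-view predictions are assumed to be already mapped back to full-image pixel coordinates.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

def densest_cluster_centroid(points: Sequence[Point], radius: float) -> Point:
    """Return the mean of the largest group of predictions lying within
    `radius` of any single prediction (a simple density heuristic)."""
    if not points:
        raise ValueError("at least one prediction is required")
    best_members: List[Point] = []
    for cx, cy in points:
        # Neighbours of this candidate centre, including the point itself.
        members = [
            (x, y) for x, y in points
            if math.hypot(x - cx, y - cy) <= radius
        ]
        if len(members) > len(best_members):
            best_members = members
    n = len(best_members)
    return (
        sum(x for x, _ in best_members) / n,
        sum(y for _, y in best_members) / n,
    )
```

For example, with three views agreeing near (100, 100) and one outlier at (400, 50), the outlier falls outside the densest group and does not pull the final coordinate away from the consensus.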
PaperID: 391,   Poster  https://arxiv.org/pdf/2601.04090     GitHub
Authors: Jiaxin Huang, Yuanbo Yang, Bangbang Yang, Lin Ma, Yuewen Ma, Yiyi Liao
Title: Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction
Abstract: We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
PaperID: 392,   Poster  https://arxiv.org/pdf/2603.02692     GitHub
Authors: Aro Kim, Myeongjin Jang, Chaewon Moon, Youngjin Shin, Jinwoo Jeong, Sang-hyo Park
Title: FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution
Abstract: Diffusion-based approaches have recently driven remarkable progress in real-world image super-resolution (SR). However, existing methods still struggle to simultaneously preserve fine details and ensure high-fidelity reconstruction, often resulting in suboptimal visual quality. In this paper, we propose FiDeSR, a high-fidelity and detail-preserving one-step diffusion super-resolution framework. During training, we introduce a detail-aware weighting strategy that adaptively emphasizes regions where the model exhibits higher prediction errors. During inference, low- and high-frequency adaptive enhancers further refine the reconstruction without requiring model retraining, enabling flexible enhancement control. To further improve the reconstruction accuracy, FiDeSR incorporates a residual-in-residual noise refinement, which corrects prediction errors in the diffusion noise and enhances fine detail recovery. FiDeSR achieves superior real-world SR performance compared to existing diffusion-based methods, producing outputs with both high perceptual quality and faithful content restoration.
PaperID: 393,   Poster  https://arxiv.org/pdf/2512.12430     GitHub
Authors: Ke Zhang, Jiacong Xu, Yiqun Mei, Vishal M. Patel
Title: Endless World: Real-Time 3D-Aware Long Video Generation
Abstract: Producing long, coherent video sequences with stable 3D structure remains a major challenge, particularly in streaming scenarios. Motivated by this, we introduce Endless World, a real-time framework for infinite, 3D-consistent video generation. To support infinite video generation, we introduce a conditional autoregressive training strategy that aligns newly generated content with existing video frames. This design preserves long-range dependencies while remaining computationally efficient, enabling real-time inference on a single GPU without additional training overhead. Moreover, Endless World integrates global 3D-aware attention to provide continuous geometric guidance across time. Our 3D injection mechanism enforces physical plausibility and geometric consistency throughout extended sequences, addressing key challenges in long-horizon and dynamic scene synthesis. Extensive experiments demonstrate that Endless World produces long, stable, and visually coherent videos, achieving competitive or superior performance to existing methods in both visual fidelity and spatial consistency. Our code will be released after review.
PaperID: 394,   Poster  https://arxiv.org/pdf/2603.06242     GitHub
Authors: Han-Chen Zhang, Zi-Hao Zhou, Mao-Lin Luo, Shimin Di, Min-Ling Zhang, Tong Wei
Title: DC-Merge: Improving Model Merging with Directional Consistency
Abstract: Model merging aims to integrate multiple task-adapted models into a unified model that preserves the knowledge of each task. In this paper, we identify that the key to this knowledge retention lies in maintaining the directional consistency of singular spaces between the merged multi-task vector and the individual task vectors. However, this consistency is frequently compromised by two issues: i) an imbalanced energy distribution within task vectors, where a small fraction of singular values dominate the total energy, leading to the neglect of semantically important but weaker components upon merging, and ii) the geometric inconsistency of task vectors in parameter space, which causes direct merging to distort their underlying directional geometry. To address these challenges, we propose DC-Merge, a method for directional-consistent model merging. It first balances the energy distribution of each task vector by smoothing its singular values, ensuring all knowledge components are adequately represented. These energy-balanced vectors are then projected onto a shared orthogonal subspace to align their directional geometries with minimal reconstruction error. Finally, the aligned vectors are aggregated in the shared orthogonal subspace and projected back to the original parameter space. Extensive experiments on vision and vision-language benchmarks show that DC-Merge consistently achieves state-of-the-art performance in both full fine-tuning and LoRA settings.
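Illustrative note on the energy-imbalance issue described in the DC-Merge abstract: the abstract does not specify the smoothing function, so the sketch below assumes a power-law compression of the singular values (exponent `alpha`, a hypothetical parameter) purely to show how smoothing lowers the share of energy held by the dominant component. It operates on a list of singular values rather than on real task vectors.

```python
def energy_share(singular_values, k=1):
    """Fraction of total energy (sum of squares) held by the top-k singular values."""
    total = sum(s * s for s in singular_values)
    top = sum(s * s for s in sorted(singular_values, reverse=True)[:k])
    return top / total


def smooth(singular_values, alpha=0.5):
    """Assumed smoothing: raise each singular value to the power alpha (0 < alpha < 1)."""
    return [s ** alpha for s in singular_values]


sv = [16.0, 4.0, 1.0]            # toy spectrum dominated by one component
print(energy_share(sv))          # share of the largest value before smoothing
print(energy_share(smooth(sv)))  # share of the largest value after smoothing
```

With this toy spectrum the dominant component's share drops from 256/273 to 16/21, so weaker components carry relatively more weight before any merging takes place.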
PaperID: 395,   Poster  https://arxiv.org/pdf/2604.14062     GitHub
Authors: Jiun Tian Hoe, Weipeng Hu, Xudong Jiang, Yap-Peng Tan, Chee Seng Chan
Title: OneHOI: Unifying Human-Object Interaction Generation and Editing
Abstract: Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our new HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code and dataset will be made publicly available.
PaperID: 396,   Poster  https://arxiv.org/pdf/2512.02006     GitHub
Authors: Jahyeok Koo, Inès Hyeonsu Kim, Mungyeom Kim, Junghyun Park, Seohyeon Park, Jaeyeong Kim, Jung Yi, Seokju Cho, Seungryong Kim
Title: MV-TAP: Tracking Any Point in Multi-View Videos
Abstract: Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to many applications. Point tracking serves as a key mechanism for capturing dynamic motion; however, conventional single-view approaches often fail due to the limited geometric information available in monocular video, which becomes a critical bottleneck for multi-view scenarios. In this work, we present MV-TAP, a robust point tracker that tracks query points across multi-view videos of dynamic scenes by leveraging cross-view information. MV-TAP utilizes camera geometry and cross-view attention to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking. Extensive experiments demonstrate that MV-TAP outperforms existing point-tracking methods on challenging benchmarks, establishing an effective baseline for advancing research in multi-view point tracking.
PaperID: 397,   Poster  https://arxiv.org/pdf/2509.04394     GitHub
Authors: ZiDong Wang, Yiyuan Zhang, Xiaoyu Yue, Xiangyu Yue, Yangguang Li, Wanli Ouyang, Lei Bai
Title: Transition Models: Rethinking the Generative Learning Objective
Abstract: A fundamental dilemma in generative modeling persists: iterative diffusion models achieve outstanding fidelity, but at a significant computational cost, while efficient few-step alternatives are constrained by a hard quality ceiling. This conflict between generation steps and output quality arises from restrictive training objectives that focus exclusively on either infinitesimal dynamics (PF-ODEs) or direct endpoint prediction. We address this challenge by introducing an exact, continuous-time dynamics equation that analytically defines state transitions across any finite time interval (Δt). This leads to a novel generative paradigm, Transition Models (TiM), which adapt to arbitrary-step transitions, seamlessly traversing the generative trajectory from single leaps to fine-grained refinement with more steps. Despite having only 865M parameters, TiM achieves state-of-the-art performance, surpassing leading models such as SD3.5 (8B parameters) and FLUX.1 (12B parameters) across all evaluated step counts. Importantly, unlike previous few-step generators, TiM demonstrates monotonic quality improvement as the sampling budget increases. Additionally, when employing our native-resolution strategy, TiM delivers exceptional fidelity at resolutions up to 4096×4096. All code and model checkpoints will be released.
PaperID: 398,   Poster  https://arxiv.org/pdf/2604.15312     GitHub
Authors: Ninghui Xu, Fabio Tosi, Lihui Wang, Jiawei Han, Luca Bartolomei, Zhiting Yao, Matteo Poggi, Stefano Mattoccia
Title: Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo
Abstract: Conventional frame-based cameras capture rich contextual information but suffer from limited temporal resolution and motion blur in dynamic scenes. Event cameras offer an alternative visual representation with higher dynamic range free from such limitations. The complementary characteristics of the two modalities make event-frame asymmetric stereo promising for reliable 3D perception under fast motion and challenging illumination. However, the modality gap often leads to marginalization of domain-specific cues essential for cross-modal stereo matching. In this paper, we introduce Bi-CMPStereo, a novel bidirectional cross-modal prompting framework that fully exploits semantic and structural features from both domains for robust matching. Our approach learns finely aligned stereo representations within a target canonical space and integrates complementary representations by projecting each modality into both event and frame domains. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in accuracy and generalization.
PaperID: 399,   Poster  https://arxiv.org/pdf/2602.23153     GitHub
Authors: Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Yiming Wang, Fabio Poiesi
Title: Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
Abstract: Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pretrained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We introduce Fase3D, the first efficient encoder-free Fourier-based 3D LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space-filling curve serialization followed by an FFT enables efficient global context modeling and graph-based token merging. Lastly, our Fourier-augmented LoRA adapters inject global frequency-aware interactions into the LLMs at a negligible cost. Fase3D achieves performance comparable to encoder-based 3D LMMs while being significantly more efficient in computation and parameters. We will release code and models publicly.
PaperID: 400,   Poster  https://arxiv.org/pdf/2604.17082     GitHub
Authors: Xingyuan Yu, Yijin Li, Chong Zeng, Yuhang Ming, Hujun Bao, Guofeng Zhang
Title: D-Prism: Differentiable Primitives for Structured Dynamic Modeling
Abstract: Capturing both geometry and rigid motion for structured dynamic objects, like multi-part assemblies or jointed mechanisms, remains a key challenge. Existing dynamic methods, such as deformable meshes or 3DGS, rely on unstructured representations and fail to jointly model suitable geometry and articulated motion. Primitive-based methods excel at structured static scenes, but their dynamic potential is still unexplored. We propose D-Prism, the first framework to achieve high-fidelity structured dynamic modeling by extending differentiable primitives to the dynamic domain. Specifically, we bind 3DGS to primitive surfaces, leveraging their respective strengths in appearance and geometry. We introduce a deformation network to control primitive motion, ensuring it accurately matches the object's movement. Furthermore, we design a novel adaptive control strategy to dynamically adjust primitive counts, better matching objects' true spatial footprint. Experiments confirm that our method excels at structured dynamic modeling, providing both structured geometry and precise motion tracking.
PaperID: 401,   Poster  https://arxiv.org/pdf/2604.20570     GitHub
Authors: Muzhi Zhu, Shunyao Jiang, Huanyi Zheng, Zekai Luo, Hao Zhong, Anzhou Li, Kaijun Wang, Jintao Rong, Yang Liu, Hao Chen, Tao Lin, Chunhua Shen
Title: Exploring Spatial Intelligence from a Generative Perspective
Abstract: Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.
PaperID: 402,   Poster  https://arxiv.org/pdf/2512.00387     GitHub
Authors: Kaihang Pan, Weile Chen, Haiyi Qiu, Qifan Yu, Wendong Bu, Zehan Wang, Yun Zhu, Juncheng Li, Siliang Tang
Title: WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing
Abstract: Recent image editing models boast next-level intelligent capabilities, facilitating cognition- and creativity-informed image editing. Yet, existing benchmarks provide too narrow a scope for evaluation, failing to holistically assess these advanced abilities. To address this, we introduce WiseEdit, a knowledge-intensive benchmark for comprehensive evaluation of cognition- and creativity-informed image editing, featuring both task depth and knowledge breadth. Drawing an analogy to human cognitive creation, WiseEdit decomposes image editing into three cascaded steps (Awareness, Interpretation, and Imagination), each corresponding to a task that poses a challenge for models to complete at the specific step. It also encompasses complex tasks, where none of the three steps can be finished easily. Furthermore, WiseEdit incorporates three fundamental types of knowledge: Declarative, Procedural, and Metacognitive knowledge. Ultimately, WiseEdit comprises 1,220 test cases, objectively revealing the limitations of SoTA image editing models in knowledge-based cognitive reasoning and creative composition capabilities.
PaperID: 403,   Poster  https://arxiv.org/pdf/2604.07774     GitHub
Authors: Peiran Xu, Jiaqi Zheng, Yadong Mu
Title: Chaining Basic Capabilities for Embodied Task Planning
Abstract: This paper focuses on embodied task planning, where an agent acquires visual observations from the environment and executes atomic actions to accomplish a given task. Although recent Vision-Language Models (VLMs) have achieved impressive results in multimodal understanding and reasoning, their performance remains limited when applied to embodied planning that involves multi-turn interaction, long-horizon reasoning, and extended context analysis. To bridge this gap, we propose a capability-driven planning pipeline, in which the model actively invokes different sub-capabilities. Each capability maintains its own context and produces intermediate reasoning results or interacts with the environment according to the query given by a scheduler. This framework decomposes complex planning into a sequence of basic vision-language problems that VLMs can better address, enabling a more transparent and controllable reasoning process. The scheduler and all capabilities are implemented with a single VLM, without relying on external tools. To train this VLM, we adopt a multi-stage paradigm that consists of: (1) behavior cloning with expert plans, (2) DAgger training using trajectories collected by the model, and (3) reinforcement learning guided by an expert policy. Across these stages, we exploit the internal information of the environment simulator to construct high-quality supervision for each capability, and we further introduce augmented and synthetic data to enhance the model’s performance in more diverse scenarios. Extensive experiments on widely used embodied task planning benchmarks validate the effectiveness of the proposed approach.
PaperID: 404,   Poster  https://arxiv.org/pdf/2604.20336     GitHub
Authors: Jiahao Xu, Xiaohan Yuan, Xingchen Wu, Chongyang Xu, Kun Li, Buzhen Huang
Title: Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation
Abstract: Co-manipulation requires multiple humans to synchronize their motions with a shared object while ensuring reasonable interactions, maintaining natural poses, and preserving stable states. However, most existing motion generation approaches are designed for single-character scenarios or fail to account for payload-induced dynamics. In this work, we propose a flow-matching framework that ensures the generated co-manipulation motions align with the intended goals while maintaining naturalness and effectiveness. Specifically, we first introduce a generative model that derives explicit manipulation strategies from the object’s affordance and spatial configuration, which guide the motion flow toward successful manipulation. To improve motion quality, we then design an adversarial interaction prior that promotes natural individual poses and realistic inter-person interactions during co-manipulation. In addition, we incorporate a stability-driven simulation into the flow matching process, which refines unstable interaction states through sampling-based optimization and directly adjusts the vector field regression to promote more effective manipulation. The experimental results demonstrate that our method achieves higher contact accuracy, lower penetration, and better distributional fidelity compared to state-of-the-art human-object interaction baselines. The code will be made publicly available.
PaperID: 405,   Poster  https://arxiv.org/pdf/2512.05025     GitHub
Authors: Nicolas Houdré, Diego Marcos, Hugo Turckheim, Dino Ienco, Laurent Wendling, Camille Kurtz, Sylvain Lobry
Title: RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation
Abstract: Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low-resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders, limiting generalization across heterogeneous EO modalities. To overcome these limitations, we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder reconstructing masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, containing various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model will be released upon acceptance.
PaperID: 406,   Poster  https://arxiv.org/pdf/2603.10583     GitHub
Authors: Hongsong Wang, Renxi Cheng, Chaolei Han, Jie Gui
Title: Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution
Abstract: With the rapid advancement of AIGC technologies, image forensics will encounter unprecedented challenges. Traditional methods are incapable of dealing with increasingly realistic images generated by rapidly evolving image generation techniques. To facilitate the identification of AI-generated images and the attribution of their source models, generative image watermarking and AI-generated image attribution have emerged as key research focuses in recent years. However, existing methods are model-dependent, requiring access to the generative models and lacking generality and scalability to new and unseen generators. To address these limitations, this work presents a new paradigm for AI-generated image attribution by formulating it as an instance retrieval problem instead of a conventional image classification problem. We propose an efficient model-agnostic framework, called Low-bIt-plane-based Deepfake Attribution (LIDA). The input to LIDA is produced by a Low-Bit Fingerprint Generation module, while the training involves Unsupervised Pre-Training followed by Few-Shot Attribution Adaptation. Comprehensive experiments demonstrate that LIDA achieves state-of-the-art performance for both Deepfake detection and image attribution under zero- and few-shot settings.
PaperID: 407,   Poster  https://arxiv.org/pdf/2603.24800     GitHub
Authors: Danil Tokhchukov, Aysel Mirzoeva, Andrey Kuznetsov, Konstantin Sobolev
Title: Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration
Abstract: In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly enhance generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter can significantly improve the performance of DiT blocks. Building on this insight, we propose Calibri, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. Calibri frames DiT calibration as a black-box reward optimization problem, which is efficiently solved using an evolutionary algorithm and modifies just ~10^2 parameters. Additionally, Calibri introduces an innovative inference-time ensemble scaling strategy to further boost generative performance. Experimental results reveal that despite its lightweight design, Calibri consistently improves performance across various text-to-image models. Notably, Calibri also reduces the inference steps required for image generation, all while maintaining high-quality outputs.
PaperID: 408,   Poster  https://arxiv.org/pdf/2512.20563     GitHub
Authors: Long Nguyen, Micha Fauth, Bernhard Jaeger, Daniel Dauner, Maximilian Igl, Andreas Geiger, Kashyap Chitta
Title: LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving
Abstract: Simulation-generated datasets for autonomous driving rely on omniscient data collection 'expert' policies, which use unobservable scene information (e.g., from occluded regions) to make driving decisions. When such data is used for end-to-end policy training, it results in an information asymmetry between the expert and the 'learner' policy, which has limited sensor coverage and navigational intent information compared to the expert. We show that this asymmetry leads to a significant drop in the performance of the learner. To combat this, we present LEAD, a new high-quality synthetic dataset collected in the CARLA simulator with three key improvements. (1) The expert minimizes its use of unobservable information by removing entities from its input state that would be occluded in the learner's field of view. By providing the learner with (2) detailed driver intent information and (3) rich sensor modalities (cameras, LiDARs, radars, and odometry), the dataset narrows down the information gap between the learner and expert. We then propose TransFuser v6 (TFv6), a simple end-to-end learner policy trained on LEAD. As a result of our improvements, TFv6 substantially advances the state of the art on all publicly available CARLA closed-loop driving benchmarks, reaching driving scores of 95 on Bench2Drive, 62 on Longest6 v2, and 15 on the Town13 validation routes. Finally, we aggregate the LEAD dataset with several public real-world datasets under a unified repository to enable cross-dataset evaluation. We show that pre-training TFv6 on synthetic data from LEAD leads to consistent performance gains when followed by fine-tuning with real data from the NAVSIM v1, NAVSIM v2, and WOD-E2E benchmarks.
PaperID: 409,   Poster  https://arxiv.org/pdf/2511.10560     GitHub
Authors: Haosong Peng, Hao Li, Yalun Dai, Yushi Lan, Yihang Luo, Tianyu Qi, Zhengshen Zhang, Yufeng Zhan, Junfei Zhang, Wenchao Xu, Ziwei Liu
Title: OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer
Abstract: General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrate OmniVGGT into vision-language-action (VLA) models. The OmniVGGT-enhanced VLA model not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
PaperID: 410,   Poster  https://arxiv.org/pdf/2603.14021     GitHub
Authors: Wanhu Sun, Zhongjin Luo, Heliang Zheng, Jiahao Chang, Chongjie Ye, Huiang He, Shengchu Zhao, Rongfei Jia, Xiaoguang Han
Title: EI-Part: Explode for Completion and Implode for Refinement
Abstract: Part-level 3D generation is crucial for various downstream applications, including gaming, film production, and industrial design. However, decomposing a 3D shape into geometrically plausible and meaningful components remains a significant challenge. Previous part-based generation methods often struggle to produce well-constructed parts, exhibiting either poor structural coherence, geometric implausibility, inaccuracy, or inefficiency. To address these challenges, we introduce EI-Part, a novel framework specifically designed to generate high-quality 3D shapes with components distinguished by structural coherence, geometric plausibility, accuracy, and generation efficiency. We propose utilizing distinct representations at different stages: an Explode state for part completion and an Implode state for geometry refinement. This strategy allows us to fully leverage spatial resolution, enabling flexible part completion and fine geometric detail generation. To maintain structural coherence between parts, a self-attention mechanism is incorporated in both the exploded and imploded states, facilitating effective information perception and feature fusion among components during generation. Extensive experiments conducted on various benchmarks demonstrate that EI-Part efficiently yields semantically meaningful and structurally coherent parts with fine-grained geometric details, achieving state-of-the-art performance in part-level generation compared to existing methods.
PaperID: 411,   Poster  https://arxiv.org/pdf/2603.24965     GitHub
Authors: Yinyi Luo, Hrishikesh Gokhale, Marios Savvides, Jindong Wang, Shengfeng He
Title: Self-Corrected Image Generation with Explainable Latent Rewards
Abstract: Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable Latent Rewards. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors, offering a data-efficient and generalizable solution to bridging the gap between comprehension and synthesis.
PaperID: 412,   Poster  https://arxiv.org/pdf/2603.08064     GitHub
Authors: Zexi Jia, Pengcheng Luo, Yijia Zhong, Jinchao Zhang, Jie Zhou
Title: Evaluating Generative Models via One-Dimensional Code Distributions
Abstract: Most evaluations of generative models rely on feature-distribution metrics such as FID, which operate on continuous recognition features that are explicitly trained to be invariant to appearance variations, and thus discard cues critical for perceptual quality. We instead evaluate models in the space of discrete visual tokens, where modern 1D image tokenizers compactly encode both semantic and perceptual information and quality manifests as predictable token statistics. We introduce Codebook Histogram Distance (CHD), a training-free distribution metric in token space, and Code Mixture Model Score (CMMS), a no-reference quality metric learned from synthetic degradations of token sequences. To stress-test metrics under broad distribution shifts, we further propose VisForm, a benchmark of 210K images spanning 62 visual forms and 11 generative models with expert annotations. Across AGIQA, HPDv2/3, and VisForm, our token-based metrics achieve state-of-the-art correlation with human judgments, and we will release all code and datasets to facilitate future research.
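As a rough illustration of a distribution metric in token space, the sketch below builds normalized histograms of codebook indices for two sets of token sequences and compares them. The abstract does not say which distance CHD uses, so the total variation distance here is an assumption, and the function names and toy token sequences are hypothetical.

```python
from collections import Counter


def codebook_histogram(token_sequences, codebook_size):
    """Normalized frequency of each codebook index across all sequences."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    if any(not 0 <= tok < codebook_size for tok in counts):
        raise ValueError("token index outside the codebook")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tokens provided")
    return [counts[i] / total for i in range(codebook_size)]


def histogram_distance(p, q):
    """Total variation distance (an assumed choice, not taken from the paper)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Toy token sequences standing in for tokenized real and generated images.
real = [[0, 1, 1, 2], [2, 2, 3, 0]]
generated = [[0, 0, 0, 1], [1, 1, 3, 3]]

p = codebook_histogram(real, codebook_size=4)
q = codebook_histogram(generated, codebook_size=4)
distance = histogram_distance(p, q)
print(distance)
```

The distance is zero when both sets use the codebook identically and grows as their token statistics diverge; no training is involved, only counting.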
PaperID: 413,   Poster  https://arxiv.org/pdf/2604.15521     GitHub
Authors: Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Liang-Chieh Chen
Title: Frequency-Aware Flow Matching for High-Quality Image Generation
Abstract: Flow matching models have emerged as a powerful framework for realistic image generation by learning to reverse a corruption process that progressively adds Gaussian noise. However, because noise is injected in the latent domain, its impact on different frequency components is nonuniform. As a result, during inference, flow matching models tend to generate low-frequency components (global structure) in the early stages, while high-frequency components (fine details) emerge only later in the reverse process. Building on this insight, we propose Frequency-Aware Flow Matching (FreqFlow), a novel approach that explicitly incorporates frequency-aware conditioning into the flow matching framework via time-dependent adaptive weighting. We introduce a two-branch architecture: (1) a frequency branch that separately processes low- and high-frequency components to capture global structure and refine textures and edges, and (2) a spatial branch that synthesizes images in the latent domain, guided by the frequency branch's output. By explicitly integrating frequency information into the generation process, FreqFlow ensures that both large-scale coherence and fine-grained details are effectively modeled—low-frequency conditioning reinforces global structure, while high-frequency conditioning enhances texture fidelity and detail sharpness. On the class-conditional ImageNet-256 generation benchmark, our method achieves state-of-the-art performance with an FID of 1.38, surpassing the prior diffusion model DiT and flow matching model SiT by 0.79 and 0.58 FID, respectively. Code will be made available.
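To make the low-/high-frequency split concrete, here is a minimal 1-D sketch that separates a signal into a smooth component and a residual detail component. The abstract does not describe how FreqFlow decomposes frequencies, so the moving-average filter, the `radius` parameter, and the function name are all assumptions chosen for illustration.

```python
def split_frequencies(signal, radius=1):
    """Split a 1-D signal into a low-frequency part (moving average)
    and a high-frequency part (the residual). Illustrative only."""
    n = len(signal)
    low = []
    for i in range(n):
        window = signal[max(0, i - radius): min(n, i + radius + 1)]
        low.append(sum(window) / len(window))
    high = [s - l for s, l in zip(signal, low)]
    return low, high


# A slow ramp (global structure) with an alternating fine detail on top.
signal = [0.0, 1.5, 2.0, 3.5, 4.0, 5.5]
low, high = split_frequencies(signal)
reconstructed = [l + h for l, h in zip(low, high)]
print(low)
print(high)
```

By construction the two components sum back to the input, so conditioning on each one separately loses no information.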
PaperID: 414,   Poster  https://arxiv.org/pdf/2603.03197     GitHub
Authors: Samuele Angheben, Davide Berasi, Alessandro Conti, Elisa Ricci, Yiming Wang
Title: Specificity-aware reinforcement learning for fine-grained open-world classification
Abstract: Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, requires models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that models do possess the intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification under the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. We will release both code and model.
PaperID: 415,   Poster  https://arxiv.org/pdf/2604.08846     GitHub
Authors: Jinqi Luo, Jinyu Yang, Tal Neiman, Lei Fan, Bing Yin, Son Dinh Tran, Mubarak Shah, Rene Vidal
Title: Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
Abstract: Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent works use prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the intermediate activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that jointly utilizes a curated concept dictionary with a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli (we name the dataset DACO-400K) and summarizing their activations into per-concept directions. Second, we show that the curated dictionary can be directly applied to interventions via sparse coding. Third, we propose a new steering approach that uses our dictionary to initialize the training of an SAE and automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM-SafetyBench, JailBreakV) show that DACO significantly improves the MLLM safety while maintaining general-purpose capabilities.
PaperID: 416,   Poster  https://arxiv.org/pdf/2603.21701     GitHub
Authors: Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang
Title: Rethinking Token Reduction for Large Vision-Language Models
Abstract: Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns, and prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency–accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code will be released to facilitate future research.
PaperID: 417,   Poster  https://arxiv.org/pdf/2601.07603     GitHub
Authors: Zijian Wu, Boyao Zhou, Liangxiao Hu, Hongyu Liu, Yuan Sun, Xuan Wang, Xun Cao, Yujun Shen, Hao Zhu
Title: Uika: Universal Head Avatar from Pose-Free Images
Abstract: We present UIKA, a feed-forward animatable Gaussian head model from an arbitrary number of unposed inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike traditional avatar methods, which require a studio-level multi-view capture system and reconstruct a human-specific model through a long-time optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy, in which each input image is associated with a pixel-wise UV coordinate estimation. Such UV coordinate estimation allows us to project each valid pixel from screen space to UV space, which is independent of camera pose and character expression. We thus leverage this UV space to represent our Gaussian head avatar. To this end, we design learnable UV tokens on which the attention mechanism can be applied at both the screen and UV levels. The learned UV token can be decoded into canonical Gaussian attributes using aggregated UV information from all input views. Such a Gaussian avatar is directly animatable via standard linear blend skinning and supports real-time rendering. To train our large avatar model, we further prepare a large-scale, identity-rich training dataset with controllable views and motions, synthesized with a 3D GAN and a state-of-the-art image animation model. Our proposed method significantly outperforms existing approaches in rendering quality, 3D consistency, and inference efficiency on both single-view and multi-view input data.
PaperID: 418,   Poster  https://arxiv.org/pdf/2604.10994     GitHub
Authors: Joanna Kaleta, Piotr Wójcik, Kacper Marzol, Tomasz Trzciński, Kacper Kania, Marek Kowalski
Title: LumiMotion: Improving Gaussian Relighting with Scene Dynamics
Abstract: In 3D reconstruction, the problem of inverse rendering, namely recovering the illumination of the scene and the material properties, is fundamental. Existing Gaussian Splatting-based methods primarily target static scenes and often assume simplified or moderate lighting to avoid entangling shadows with surface appearance. This limits their ability to accurately separate lighting effects from material properties, particularly in real-world conditions. We address this limitation by leveraging dynamic elements (regions of the scene that undergo motion) as a supervisory signal for inverse rendering. Motion reveals the same surfaces under varying lighting conditions, providing stronger cues for disentangling material and illumination. This thesis is supported by our experimental results, which show that we improve LPIPS by 23% for albedo estimation and by 15% for scene relighting relative to the next-best baseline. To this end, we introduce LumiMotion, the first Gaussian-based approach that leverages dynamics for inverse rendering and operates in arbitrary dynamic scenes. Our method learns a dynamic 2D Gaussian Splatting representation that employs a set of novel constraints which encourage the dynamic regions of the scene to deform, while keeping static regions stable. As we demonstrate, this separation is crucial for correct optimization of the albedo. Finally, we release a new synthetic benchmark comprising five scenes under four lighting conditions, each in both static and dynamic variants, for the first time enabling systematic evaluation of inverse rendering methods in dynamic environments and challenging lighting.
PaperID: 419,   Poster  https://arxiv.org/pdf/2603.21217     GitHub
Authors: Shenghan Chen, Yiming Liu, Yanzhen Wang, Yujia Wang, Xiankai Lu
Title: Reframing Long-Tailed Learning via Loss Landscape Geometry
Abstract: Balancing the performance trade-off on long-tail data distributions remains a long-standing challenge. In this paper, we posit that this dilemma stems from a phenomenon called "catastrophic forgetting" in continual learning (the model tends to severely overfit on head classes while quickly forgetting tail classes) and propose a solution from a loss landscape perspective. We observe that different classes possess divergent convergence points in the loss landscape. Moreover, this divergence is aggravated when the model settles into sharp and non-robust minima, rather than a shared and flat solution that is beneficial for all classes. In light of this, we propose a continual-learning-inspired framework to prevent "catastrophic forgetting". To avoid inefficient per-class parameter preservation, a Grouped Knowledge Preservation module is proposed to memorize group-specific convergence parameters, promoting convergence towards a shared solution. Concurrently, our framework integrates a Grouped Sharpness Aware module to seek flatter minima by explicitly addressing the geometry of the loss landscape. Notably, our framework requires neither external training samples nor pre-trained models, facilitating broad applicability. Extensive experiments on four benchmarks demonstrate significant performance gains over state-of-the-art methods.
PaperID: 420,   Poster  https://arxiv.org/pdf/2603.20584     GitHub
Authors: Liangyu Yuan, Yufei Huang, Mingkun Lei, Tong Zhao, Ruoyu Wang, Chi Changxi, Yiwei Wang, Chi Zhang
Title: Improving Diffusion Generalization with Weak-to-Strong Segmented Guidance
Abstract: Diffusion models generate synthetic images through an iterative refinement process. However, the misalignment between the simulation-free objective and the iterative process often causes accumulated gradient error along the sampling trajectory, which leads to unsatisfactory results and a failure to generalize. Guidance techniques like Classifier-Free Guidance (CFG) and AutoGuidance (AG) alleviate this by extrapolating between the main and inferior signals for stronger generalization. Despite empirical success, the effective operational regimes of prevalent guidance methods are still under-explored, leading to ambiguity when selecting the appropriate guidance method given a precondition. In this work, we first conduct synthetic comparisons to isolate and demonstrate the effective regime of guidance methods represented by CFG and AG from the perspective of the weak-to-strong (W2S) principle. Based on this, we propose a hybrid instantiation called SEG under the principle, taking the benefits of both. Furthermore, we demonstrate that the W2S principle along with SEG can be migrated into the training objective, improving the generalization ability of unguided diffusion models. We validate our approach with comprehensive experiments. At inference time, evaluations on SD3 and SD3.5 confirm that SEG outperforms existing training-free guidance variants. Training-time experiments on transformer architectures demonstrate the effective migration and performance gains in both conditional and unconditional settings.
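The extrapolation that CFG and AG share can be sketched in a few lines: the guided prediction starts from the inferior signal and moves past the main signal along the direction between them. This shows only the standard guidance rule, not SEG itself, whose segmentation scheme is not described in the abstract; the function name and toy values are hypothetical.

```python
def extrapolate_guidance(main, inferior, scale):
    """Standard guidance rule: inferior + scale * (main - inferior).

    For CFG, `inferior` is the unconditional prediction; for AutoGuidance,
    it is the prediction of a weaker model. scale > 1 extrapolates past `main`.
    """
    if len(main) != len(inferior):
        raise ValueError("predictions must have the same length")
    return [w + scale * (m - w) for m, w in zip(main, inferior)]


main = [1.0, 2.0]      # toy prediction from the main (strong) signal
inferior = [0.5, 1.0]  # toy prediction from the inferior (weak) signal
guided = extrapolate_guidance(main, inferior, scale=2.0)
print(guided)
```

With `scale=1.0` the rule returns the main prediction unchanged, and with `scale=0.0` it returns the inferior one, which makes the role of the scale as an extrapolation weight easy to check.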
PaperID: 421,   Poster  https://arxiv.org/pdf/2603.10872     GitHub
Authors: Yan Zhang, Long Ma, Yuxin Feng, Zhe Huang, Fan Zhou, Zhuo Su
Title: Bilevel Layer-Positioning LoRA for Real Image Dehazing
Abstract: Learning-based real image dehazing methods have achieved notable progress, yet they still face adaptation challenges in diverse real haze scenes. These challenges mainly stem from the lack of effective unsupervised mechanisms for unlabeled data and the heavy cost of full model fine-tuning. To address these challenges, we propose the haze-to-clear text-directed loss that leverages CLIP’s cross-modal capabilities to reformulate real image dehazing as a semantic alignment problem in latent space, thereby providing explicit unsupervised cross-modal guidance in the absence of reference images. Furthermore, we introduce the Bilevel Layer-positioning LoRA (BiLaLoRA) strategy, which both learns the LoRA parameters and automatically searches for the injection layers, enabling targeted adaptation of critical network layers. Extensive experiments demonstrate our superiority against state-of-the-art methods on multiple real-world dehazing benchmarks. The source code will be publicly available upon acceptance.
PaperID: 422,   Poster  https://arxiv.org/pdf/2601.02785     GitHub
Authors: Mengtian Li, Jinshu Chen, Songtao Zhao, Wanquan Feng, Pengqi Tu, Qian HE
Title: DreamStyle: A Unified Framework for Video Stylization
Abstract: Video stylization, an important downstream task of video generation models, has not yet been thoroughly explored. Its input style conditions typically include text, style image, and stylized first frame. Each condition has a characteristic advantage: text is more flexible, style image provides a more accurate visual anchor, and stylized first frame makes long-video stylization feasible. However, existing methods are largely confined to a single type of style condition, which limits their scope of application. Additionally, their lack of high-quality datasets leads to style inconsistency and temporal flicker. To address these limitations, we introduce DreamStyle, a unified framework for video stylization, supporting (1) text-guided, (2) style-image-guided, and (3) first-frame-guided video stylization, accompanied by a well-designed data curation pipeline to acquire high-quality paired video data. DreamStyle is built on a vanilla Image-to-Video (I2V) model and trained using Low-Rank Adaptation (LoRA) with token-specific up matrices, which reduces the confusion among different condition tokens. Both qualitative and quantitative evaluations demonstrate that DreamStyle is competent in all three video stylization tasks, and outperforms the competitors in style consistency and video quality.
PaperID: 423,   Poster  https://arxiv.org/pdf/2603.09408     GitHub
Authors: Taesung Kwon, Lorenzo Bianchi, Lennart Wittke, Felix Watine, Fabio Carrara, Jong Chul Ye, Romann M. Weber, Vinicius C. Azevedo
Title: Reviving ConvNeXt for Efficient Convolutional Diffusion Models
Abstract: Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures. Yet the locality bias, parameter efficiency, and hardware friendliness (the attributes that established ConvNets as the efficient vision backbone) have seen limited exploration in modern generative modeling. Here we introduce the fully convolutional diffusion model (FCDM), a ConvNeXt-inspired backbone redesigned for conditional diffusion modeling. We find that FCDM-XL, using only 50% of the FLOPs of DiT-XL/2, achieves comparable performance while delivering 7× and 7.5× speedups at 256×256 and 512×512 resolutions, respectively. Remarkably, FCDM-XL can be trained on a 4-GPU system, highlighting the exceptional training efficiency of our architecture. Our results demonstrate that modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models, reviving ConvNeXt as a simple yet powerful building block for efficient generative modeling.
PaperID: 424,   Poster  https://arxiv.org/pdf/2603.00543     GitHub
Authors: Ke Cao, Xuanhua He, Xueheng Li, Lingting Zhu, Yingying Wang, Ao Ma, Zhanjie Zhang, Man Zhou, Chengjun Xie, Jie Zhang
Title: Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark
Abstract: Pansharpening aims to generate high-resolution multi-spectral (MS) images by fusing the spatial detail of panchromatic images with the spectral richness of low-resolution MS data. However, most existing methods are evaluated under limited, low-resolution settings, limiting their generalization to real-world, high-resolution scenarios. To bridge this gap, we systematically investigate the data, algorithmic, and computational challenges of cross-scale pansharpening. We first introduce PanScale, the first large-scale, cross-scale pansharpening dataset, accompanied by PanScale-Bench, a comprehensive benchmark for evaluating generalization across varying resolutions and scales. To realize scale generalization, we propose ScaleFormer, a novel architecture designed for multi-scale pansharpening. ScaleFormer reframes generalization across image resolutions as generalization across sequence lengths: it tokenizes images into patch sequences of the same resolution but variable length proportional to image scale. A Scale-Aware Patchify module enables training for such variations from fixed-size crops. ScaleFormer then decouples intra-patch spatial feature learning from inter-patch sequential dependency modeling, incorporating Rotary Positional Encoding to enhance extrapolation to unseen scales. Extensive experiments show that our approach outperforms SOTA methods in fusion quality and cross-scale generalization. The datasets and source code will be made available upon acceptance.
PaperID: 425,   Poster  https://arxiv.org/pdf/2603.16649     GitHub
Authors: Shihao Zhu, Ziheng Ouyang, Yijia Kang, Qilong Wang, Mi Zhou, Bo Li, Ming-Ming Cheng, Qibin Hou
Title: Mixture of Style Experts for Diverse Image Stylization
Abstract: Diffusion-based stylization has advanced significantly, yet existing methods are limited to color-driven transformations, neglecting complex semantics and material details. We introduce StyleExpert, a semantic-aware framework based on Mixture of Experts (MoE). Our framework employs a unified style encoder, trained on our large-scale dataset of content-style-stylized triplets, to embed diverse styles into a consistent latent space. This embedding is then used to condition a similarity-aware gating mechanism, which dynamically routes styles to specialized experts within the MoE architecture. Leveraging this MoE architecture, our method adeptly handles diverse styles spanning multiple semantic levels, from shallow textures to deep semantics. Extensive experiments show that StyleExpert outperforms existing approaches in preserving semantics and material details, while generalizing to unseen styles.
PaperID: 426,   Poster  https://arxiv.org/pdf/2603.08075     GitHub
Authors: Yanan Wu, Yuhan Yan, Tailai Chen, Zhixiang Chi, ZiZhang Wu, Yi Jin, Yang Wang, Zhenbo Li
Title: TALON: Test-time Adaptive Learning for On-the-Fly Category Discovery
Abstract: On-the-fly category discovery (OCD) aims to recognize known categories while simultaneously discovering novel ones from an unlabeled online stream, using a model trained only on labeled data. Existing approaches freeze the feature extractor trained offline and employ a hash-based framework that quantizes features into binary codes as class prototypes. However, discovering novel categories with a fixed knowledge base is counterintuitive, as the learning potential of incoming data is entirely neglected. In addition, feature quantization introduces information loss, diminishes representational expressiveness, and amplifies intra-class variance. It often results in category explosion, where a single class is fragmented into multiple pseudo-classes. To overcome these limitations, we propose a test-time adaptation framework that enables learning through discovery. It incorporates two complementary strategies: a semantic-aware prototype update and a stable test-time encoder update. The former dynamically refines class prototypes to enhance classification, whereas the latter integrates new information directly into the parameter space. Together, these components allow the model to continuously expand its knowledge base with newly encountered samples. Furthermore, we introduce a margin-aware logit calibration in the offline stage to enlarge inter-class margins and improve intra-class compactness, thereby reserving embedding space for future class discovery. Experiments on standard OCD benchmarks demonstrate that our method substantially outperforms existing hash-based state-of-the-art approaches, yielding notable improvements in novel-class accuracy and effectively mitigating category explosion.
PaperID: 427,   Poster  https://arxiv.org/pdf/2507.10800     GitHub
Authors: Ali Hojjat, Janek Haberer, Soeren Pirk, Olaf Landsiedel
Title: ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
Abstract: ViTs deliver SOTA performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent Matryoshka-style Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT first activates a small subset of the most important attention heads to produce an initial prediction. If the prediction confidence exceeds a predefined threshold, inference terminates early. Otherwise, within the same backbone, it activates a larger subset of attention heads and conducts a new forward pass. This process continues iteratively until the model reaches the predefined confidence level or exhausts its maximum capacity. To boost the performance of subsequent rounds, we introduce a Token Recycling approach that fuses the input embeddings with the embeddings from the previous stage. Experiments show that ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet-1K. We show that the backbone-preserving design of ThinkingViT allows it to serve as a plug-in upgrade for ViTs in downstream tasks such as semantic segmentation. We also demonstrate that ThinkingViT transfers effectively to other architectures such as Swin. The source code is available at submittedinzip.
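The confidence-gated loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: each "stage" is a hypothetical callable standing in for a forward pass with a given subset of attention heads, and passing `previous` to the next stage is only a stand-in for Token Recycling.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def progressive_inference(stages, x, threshold):
    """Run stages from smallest to largest and stop at the first prediction
    whose confidence reaches `threshold` (or after the last stage).

    Each stage takes (input, previous_embedding) and returns (logits, embedding).
    Returns (predicted_label, confidence, stages_used).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    previous = None
    for used, stage in enumerate(stages, start=1):
        logits, previous = stage(x, previous)
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            break
    return probs.index(confidence), confidence, used


# Hypothetical toy stages: the small one is unsure, the large one is confident.
def small_stage(x, previous):
    return [0.1, 0.2, 0.0], "small-embedding"


def large_stage(x, previous):
    return [4.0, 0.0, 0.0], "large-embedding"


label, confidence, used = progressive_inference(
    [small_stage, large_stage], x=None, threshold=0.9
)
print(label, used)
```

Easy inputs exit after the cheap first stage, while hard inputs pay for the larger pass, so average compute tracks input difficulty rather than being fixed.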
PaperID: 428,   Poster  https://arxiv.org/pdf/2512.05272     GitHub
Authors: Ahmet Berke Gökmen, Ajad Chhatkuli, Luc Van Gool, Danda Paudel
Title: Inferring Compositional 4D Scenes without Ever Seeing One
Abstract: Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in-the-wild, though extremely interesting, is equally hard. Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, and single object dynamics throughout the video on the other, thus completely avoiding reliance on 4D compositional training data. At inference time, our proposed attention mixing mechanism combines these independently learned attentions, without requiring any 4D composition examples. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos. COM4D provides state-of-the-art results in existing separate problems of 4D object and composed 3D reconstruction despite being purely data-driven.
PaperID: 429,   Poster  https://arxiv.org/pdf/2603.20741     GitHub
Authors: Xiefan Guo, Xinzhu Ma, Haiyu Zhang, Di Huang
Title: CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration
Abstract: Recent advancements in text-to-image synthesis have been largely propelled by diffusion-based models, yet achieving precise alignment between text prompts and generated images remains a persistent challenge. We find that this difficulty arises primarily from the limitations of conventional diffusion loss, which provides only implicit supervision for modeling fine-grained text-image correspondence. In this paper, we introduce Cross-Timestep Self-Calibration (CTCal), founded on the supporting observation that establishing accurate text-image alignment within diffusion models becomes progressively more difficult as the timestep increases. CTCal leverages the reliable text-image alignment (i.e., cross-attention maps) formed at smaller timesteps with less noise to calibrate the representation learning at larger timesteps with more noise, thereby providing explicit supervision during training. We further propose a timestep-aware adaptive weighting to achieve a harmonious integration of CTCal and diffusion loss. CTCal is model-agnostic and can be seamlessly integrated into existing text-to-image diffusion models, encompassing both diffusion-based (e.g., Stable Diffusion 2.1) and flow-based approaches (e.g., Stable Diffusion 3). Extensive experiments on T2I-Compbench++ and GenEval benchmarks demonstrate the effectiveness and generalizability of the proposed CTCal.
PaperID: 430,   Poster  https://arxiv.org/pdf/2603.03265     GitHub
Authors: Yufu Wang, Evonne Ng, Soyong Shin, Rawal Khirodkar, Yuan Dong, Zhaoen Su, Jinhyung Park, Kris Kitani, Alexander Richard, Fabian Prada, Michael Zollhoefer
Title: DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction
Abstract: We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly (bypassing parametric models). DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space MPJPE while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space MPJPE.
PaperID: 431,   Poster  https://arxiv.org/pdf/2511.02946     GitHub
Authors: Srikumar Sastry, Subash Khanal, Aayush Dhakal, Jiayu Lin, Daniel Cher, Phoenix Jarosz, Nathan Jacobs
Title: ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology
Abstract: We introduce ProM3E, a probabilistic masked multimodal embedding model for any-to-any generation of multimodal representations for ecology. ProM3E is based on masked modality reconstruction in the embedding space, learning to infer missing modalities given a few context modalities. By design, our model supports modality inversion in the embedding space. The probabilistic nature of our model allows us to analyse the feasibility of fusing various modalities for given downstream tasks, essentially learning what to fuse. Using these features of our model, we propose a novel cross-modal retrieval approach that mixes inter-modal and intra-modal similarities to achieve superior performance across all retrieval tasks. We further leverage the hidden representation from our model to perform linear probing tasks and demonstrate the superior representation learning capability of our model. All our code, datasets and model will be released.
PaperID: 432,   Poster  https://arxiv.org/pdf/2405.18716     GitHub
Authors: Chaitat Utintu, Yi-Zhe Song
Title: SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation
Abstract: We introduce SketchDeco, a training-free approach to sketch colourisation that bridges the gap between professional design needs and intuitive, region-based control. Our method empowers artists to use simple masks and colour palettes for precise spatial and chromatic specification, avoiding both the tediousness of manual assignment and the ambiguity of text-based prompts. We reformulate this task as a novel, training-free composition problem. Our core technical contribution is a guided latent-space blending process: we first leverage diffusion inversion to precisely "paint" user-defined colours into specified regions, and then use a custom self-attention mechanism to harmoniously blend these local edits with a globally consistent base image. This ensures both local colour fidelity and global harmony without requiring any model fine-tuning. Our system produces high-quality results in 15 to 20 inference steps on consumer GPUs, making professional-quality, controllable colourisation accessible.
PaperID: 433,   Poster  https://arxiv.org/pdf/2602.18996     GitHub
Authors: Shannan Yan, Leqi Zheng, Keyu Lv, Jingchen Ni, Hongyang Wei, Jiajun Zhang, Guangting Wang, Jing LYU, Chun Yuan, Fengyun Rao
Title: Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction
Abstract: We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance. The code will be released upon acceptance.
PaperID: 434,   Poster  https://arxiv.org/pdf/2604.01974     GitHub
Authors: Yuqing Huang, Guotian Zeng, Zhenqiao Yuan, Zhenyu He, Xin Li, Yaowei Wang, Ming-Hsuan Yang
Title: Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation
Abstract: Existing visual trackers mainly operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios that require human-in-the-loop adaptation. To overcome this limitation, we introduce Interactive Tracking, a new paradigm that allows users to guide the tracker at any time using natural language commands. To support research in this direction, we make three main contributions. First, we present InteractTrack, the first large-scale benchmark for interactive tracking, containing 150 videos with dense bounding box annotations and timestamped language instructions. Second, we propose a comprehensive evaluation protocol and evaluate 25 representative trackers, showing that state-of-the-art methods fail in interactive scenarios: strong performance on conventional benchmarks does not transfer. Third, we introduce Interactive Memory-Augmented Tracking (IMAT), a new baseline that employs a dynamic memory mechanism to learn from user feedback and update tracking behavior accordingly. Our benchmark, protocol, and baseline establish a foundation for developing more intelligent, adaptive, and collaborative tracking systems, bridging the gap between automated perception and human guidance.
PaperID: 435,   Poster  https://arxiv.org/pdf/2603.04239     GitHub
Authors: Mengping Yang, Stewart Tan, Binglei Li, Xiaomeng Yang, Hesen Chen, Hao li
Title: DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers
Abstract: Recent breakthroughs in Diffusion Transformers (DiTs) have revolutionized the field of visual synthesis due to their superior scalability. To facilitate DiTs' capability of capturing meaningful internal representations, recent works such as REPA incorporate external pretrained encoders for representation alignment. However, the underlying mechanisms governing representation learning within DiTs are not well understood. To this end, we first systematically investigate the representation dynamics of DiTs. Through analyzing the evolution and influence of internal representations under various settings, we reveal that representation diversity across blocks is a crucial factor for effective learning. Based on this key insight, we propose DiverseDiT, a novel framework that explicitly promotes representation diversity. DiverseDiT incorporates long residual connections to diversify input representations across blocks and a representation diversity loss to encourage blocks to learn distinct features. Extensive experiments on ImageNet 256 × 256 and 512 × 512 demonstrate that our DiverseDiT yields consistent performance gains and convergence acceleration when applied to different backbones with various sizes, even when tested on the challenging one-step generation setting. Furthermore, we show that DiverseDiT is complementary to existing representation learning techniques, leading to further performance gains. Our work provides valuable insights into the representation learning dynamics of DiTs and offers a practical approach for enhancing their performance. Our code and models will be released.
PaperID: 436,   Poster  https://arxiv.org/pdf/2604.12356     GitHub
Authors: Dongjian Yu, Weiqing Min, Qian Jiang, Xing Lin, Xin Jin, Shuqiang Jiang
Title: OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion
Abstract: Accurate estimation of food nutrition plays a vital role in promoting healthy dietary habits and personalized diet management. Most existing food datasets focus on Western cuisines, with limited coverage of Chinese dishes, leading to limitations in accurate nutritional estimation for Chinese meals. Moreover, many state-of-the-art nutrition prediction methods rely on depth sensors, restricting their applicability in daily scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food scenes with detailed nutritional annotations and multi-view images for each scene. In addition, to enhance models' capability in nutritional prediction, we construct NutritionSynth-115K, a large-scale synthetic dataset that introduces compositional variations while preserving precise nutritional labels. Moreover, we propose an end-to-end framework to predict nutritional information from a single RGB image. First, we predict a depth map from a single RGB image, then refine it using our Scale-Shift Residual Adapter (SSRA), which enforces global scale consistency and preserves local structural details. Second, the Frequency-Aligned Fusion Module (FAFM) hierarchically fuses RGB and adapted depth features, aligning multi-modal representations in the frequency domain across layers. Third, the Mask-based Prediction Head (MPH) emphasizes key ingredient regions via dynamic channel selection, improving prediction accuracy. Extensive experiments on multiple datasets demonstrate that our method outperforms existing approaches, providing a practical solution for daily dietary assessment.
PaperID: 437,   Poster  https://arxiv.org/pdf/2512.16615     GitHub
Authors: Yifan Zhou, Zeqi Xiao, Tianyi Wei, Shuai Yang, Xingang Pan
Title: Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers
Abstract: Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representations and selecting a small set of relevant key blocks, but still suffer from (i) quadratic selection cost on compressed tokens and (ii) increasing K required to maintain model quality as sequences grow. We identify that their inefficiency is due to the single-level design, as a single coarse level is insufficient to represent the global structure. In this paper, we introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by utilizing a hierarchical structure. LLSA performs hierarchical Top-K selection, progressively adopting sparse Top-K selection with the indices found at the previous level, and introduces a Hierarchical KV Enrichment mechanism that preserves global context while using fewer tokens of different granularity during attention computation. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices for both the forward and backward passes, eliminating the need for dense attention masks. We evaluate LLSA on high-resolution pixel-space image generation without using patchification and VAE encoding. LLSA accelerates attention inference by 28.27× and DiT training by 6.09× on 256 × 256 pixel token sequences, while maintaining generation quality. The results demonstrate that LLSA offers a promising direction for training long-sequence DiTs efficiently.
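The hierarchical Top-K selection described in this abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' GPU implementation: the mean-pooling pyramid, the dot-product score, and the names `build_pyramid` and `hierarchical_top_k` are assumptions made for illustration only. The point it shows is that each level scores only the children of the entries kept at the coarser level, so a query touches roughly `k * block` candidates per level across about `log(N)` levels.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors: List[List[float]], block: int) -> List[List[float]]:
    """Average each run of `block` consecutive vectors into one coarse vector."""
    pooled = []
    for start in range(0, len(vectors), block):
        chunk = vectors[start:start + block]
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled


def build_pyramid(keys: List[List[float]], block: int) -> List[List[List[float]]]:
    """levels[0] holds the full-resolution keys; each later level is `block` times coarser."""
    levels = [list(keys)]
    while len(levels[-1]) > block:
        levels.append(mean_pool(levels[-1], block))
    return levels


def hierarchical_top_k(query: Sequence[float], keys: List[List[float]],
                       block: int = 2, k: int = 2) -> List[int]:
    """Return indices of the keys selected by coarse-to-fine Top-K (hypothetical helper)."""
    levels = build_pyramid(keys, block)
    # Every entry of the coarsest level starts as a candidate (at most `block` of them).
    candidates = list(range(len(levels[-1])))
    for depth in range(len(levels) - 1, -1, -1):
        level = levels[depth]
        ranked = sorted(candidates, key=lambda i: dot(query, level[i]), reverse=True)
        kept = ranked[:k]
        if depth == 0:
            return sorted(kept)
        # Expand only the children of the kept entries at the next finer level.
        finer_len = len(levels[depth - 1])
        candidates = [child
                      for i in kept
                      for child in range(i * block, min((i + 1) * block, finer_len))]
    return []


keys = [[0.0], [0.1], [0.2], [0.1], [0.9], [1.0], [0.3], [0.2]]
selected = hierarchical_top_k([1.0], keys, block=2, k=2)
```

With these toy keys the search descends into the right half of the sequence and ends at the two highest-scoring keys without ever scoring keys 0 to 3 at full resolution.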
PaperID: 438,   Poster  https://arxiv.org/pdf/2511.18570     GitHub
Authors: Samarth Chopra, Jing Liang, Gershom Seneviratne, Dinesh Manocha
Title: PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation
Abstract: Understanding physical properties such as friction, stiffness, hardness, and material composition is essential for enabling robots to interact safely and effectively with their surroundings. However, existing 3D reconstruction methods focus on geometry and appearance and cannot infer these underlying physical properties. We present PhysGS, a Bayesian-inferred extension of 3D Gaussian Splatting that estimates dense, per-point physical properties from visual cues and vision-language priors. We formulate property estimation as Bayesian inference over Gaussian splats, where material and property beliefs are iteratively refined as new observations arrive. PhysGS also models aleatoric and epistemic uncertainties, enabling uncertainty-aware object and scene interpretation. Across object-scale (ABO-500), indoor, and outdoor real-world datasets, PhysGS improves the accuracy of mass estimation by up to 22.8%, reduces Shore hardness error by up to 61.2%, and lowers kinetic friction error by up to 18.1% compared to deterministic baselines. Our results demonstrate that PhysGS unifies 3D reconstruction, uncertainty modeling, and physical reasoning in a single, spatially continuous framework for dense physical property estimation.
PaperID: 439,   Poster  https://arxiv.org/pdf/2603.21426     GitHub
Authors: Jingchen Sun, Shaobo Han, Deep Patel, Wataru Kohno, Can Jin, Changyou Chen
Title: Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models
Abstract: Knowledge distillation establishes a learning paradigm that learns from both data supervision and teacher guidance. However, the optimal weighting between learning from data and learning from the teacher is hard to determine, as some samples are data-noisy while others are teacher-uncertain. This raises a pressing need to adaptively balance data and teacher supervision. We propose Beta-weighted Knowledge Distillation (\beta-KD), an adaptive, uncertainty-aware knowledge distillation framework that supports arbitrary distillation objectives under a unified Bayesian formulation. Specifically, we model teacher signals as a Gibbs prior over student activations and use amortized optimization to jointly infer activations and weighting parameters \beta, leading to a closed-form, uncertainty-aware weighting. Extensive experiments distilling a 1.7B-parameter student from MobileVLM-7B demonstrate that \beta-KD consistently outperforms existing methods under different loss combination settings. Moreover, large-scale distillation and evaluations on six multimodal benchmarks further confirm the effectiveness of the proposed approach.
PaperID: 440,   Poster  https://arxiv.org/pdf/2603.12766     GitHub
Authors: Shifeng Chen, Yihui Li, Jun Liao, Hongyu Yang, Di Huang
Title: Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation
Abstract: Recent advances in 3D scene editing using NeRF and 3DGS enable high-quality static scene editing. In contrast, dynamic scene editing remains challenging, as methods that directly extend 2D diffusion models to 4D often produce motion artifacts, temporal flickering, and inconsistent style propagation. We introduce Catalyst4D, a framework that transfers high-quality 3D edits to dynamic 4D Gaussian scenes while maintaining spatial and temporal coherence. At its core, Anchor-based Motion Guidance (AMG) builds a set of structurally stable and spatially representative anchors from both original and edited Gaussians. These anchors serve as robust region-level references, and their correspondences are established via optimal transport to enable consistent deformation propagation without cross-region interference or motion drift. Complementarily, Color Uncertainty-guided Appearance Refinement (CUAR) preserves temporal appearance consistency by estimating per-Gaussian color uncertainty and selectively refining regions prone to occlusion-induced artifacts. Extensive experiments demonstrate that Catalyst4D achieves temporally stable, high-fidelity dynamic scene editing and outperforms existing methods in both visual quality and motion coherence.
PaperID: 441,   Poster  https://arxiv.org/pdf/2511.13720     GitHub
Authors: Tianhong Li, Kaiming He
Title: Back to Basics: Let Denoising Generative Models Denoise
Abstract: Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
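The contrast between predicting clean data and predicting noised quantities can be written out under one common formulation. The linear noising process below is an assumption, since the abstract does not state which schedule is used; the symbols z_t, x, \epsilon, and v are introduced here for illustration only.

```latex
% Assumed noising process (not stated in the abstract):
% x is a clean image, \epsilon is Gaussian noise, t \in [0, 1].
z_t = t\,x + (1 - t)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% Three targets a network could regress from z_t:
\text{clean data: } x, \qquad
\text{noise: } \epsilon, \qquad
\text{velocity: } v = x - \epsilon

% For 0 < t < 1, the targets are algebraically interchangeable given z_t:
\epsilon = \frac{z_t - t\,x}{1 - t}, \qquad
v = \frac{x - z_t}{1 - t}
```

Although any one target determines the others once z_t is known, only x is expected to lie on a low-dimensional manifold, which is the property the abstract appeals to when it argues for predicting clean data directly.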
PaperID: 442,   Poster  https://arxiv.org/pdf/2603.21936     GitHub
Authors: Roy Amoyal, Oren Freifeld, Chaim Baskin
Title: Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment
Abstract: We present Gaussian Splatting Alignment (GSA), a novel method for aligning two independent 3D Gaussian Splatting (3DGS) models via a similarity transformation (rotation, translation, and scale), even when they are of different objects in the same category (e.g., different cars). In contrast, existing methods can only align 3DGS models of the same object (e.g., the same car) and often must be given the true scale as input, while we estimate it successfully. Our approach leverages viewpoint-guided spherical map features to obtain robust correspondences and introduces a two-step optimization framework that aligns models while keeping the 3DGS models fixed. First, we perform an iterative, feature-guided coarse registration that is robust to extremely poor initialization (e.g., 180° misalignment or a 10× scale gap), followed by a fine registration step enforcing multi-view feature consistency, inspired by inverse radiance-field formulations. The first step already achieves state-of-the-art performance, and the second further improves results. In the same-object case, GSA outperforms prior works, often by a large margin, even when the other methods are given the true scale. In the harder case of different objects in the same category, GSA vastly surpasses them, providing the first effective solution for category-level 3DGS registration and unlocking new applications. Code will be released upon acceptance.
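For readers unfamiliar with the term, a similarity transformation maps a point x to s·R·x + t, with scale s, rotation R, and translation t. The sketch below only applies and inverts such a transform on a single 3D point (for instance a Gaussian center); it does not estimate the transform and is not the GSA method. The helper names `rotation_z`, `apply_similarity`, and `invert_similarity` are hypothetical, and the example values reuse the 180° misalignment and 10× scale gap mentioned in the abstract.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Point = List[float]


def rotation_z(theta: float) -> Matrix:
    """Rotation by `theta` radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_similarity(point: Point, scale: float, rotation: Matrix,
                     translation: Point) -> Point:
    """Compute scale * rotation @ point + translation."""
    rotated = [sum(rotation[r][c] * point[c] for c in range(3)) for r in range(3)]
    return [scale * rotated[r] + translation[r] for r in range(3)]


def invert_similarity(scale: float, rotation: Matrix,
                      translation: Point) -> Tuple[float, Matrix, Point]:
    """Return (scale, rotation, translation) of the inverse map.

    From y = s R x + t it follows that x = (1/s) R^T y - (1/s) R^T t.
    """
    inv_scale = 1.0 / scale
    inv_rot = [[rotation[c][r] for c in range(3)] for r in range(3)]  # transpose
    rotated_t = [sum(inv_rot[r][c] * translation[c] for c in range(3)) for r in range(3)]
    inv_translation = [-inv_scale * v for v in rotated_t]
    return inv_scale, inv_rot, inv_translation


# A 180-degree rotation combined with a 10x scale gap and an offset.
R = rotation_z(math.pi)
moved = apply_similarity([1.0, 0.0, 0.0], 10.0, R, [1.0, 2.0, 3.0])
s_inv, R_inv, t_inv = invert_similarity(10.0, R, [1.0, 2.0, 3.0])
restored = apply_similarity(moved, s_inv, R_inv, t_inv)
```

A full 3DGS alignment would also have to rotate and rescale each Gaussian's covariance, which this sketch leaves out.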
PaperID: 443,   Poster  https://arxiv.org/pdf/2512.03520     GitHub
Authors: YIYI CAI, Yuhan Wu, Kunhang Li, YOU ZHOU, Bo Zheng, Haiyang Liu
Title: FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation
Abstract: We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency. Unlike existing methods that rely on chunk-by-chunk generation or auto-regressive models with a diffusion head, we adopt a diffusion forcing framework to model this time-series generation task under time-varying control events. We find that a straightforward implementation of vanilla diffusion forcing (as proposed for video models) fails to model real motion distributions. We demonstrate that, to guarantee modeling the output distribution, vanilla diffusion forcing must be tailored to: (i) train with bi-directional attention instead of causal attention; (ii) implement a lower-triangular time scheduler instead of a random one; (iii) introduce text conditioning in a continuous, time-varying manner. With these improvements, we demonstrate for the first time that a diffusion forcing-based framework achieves state-of-the-art performance on the streaming motion generation task, reaching an FID of 0.057 on the HumanML3D benchmark. Models, code, and weights are available.
PaperID: 444,   Poster  https://arxiv.org/pdf/2602.21810     GitHub
Authors: Xiankang He, Peile Lin, Ying Cui, Dongyan Guo, Chunhua Shen, Xiaoqin Zhang
Title: GeoMotion: Rethinking Motion Segmentation via Latent 4D Geometry
Abstract: Motion segmentation in dynamic scenes is highly challenging, as conventional methods heavily rely on estimating camera poses and point correspondences from inherently noisy motion cues. Existing statistical inference or iterative optimization techniques struggle to mitigate the cumulative errors in multi-stage pipelines, often leading to limited performance or high computational cost. In contrast, we propose a fully learning-based approach that directly infers moving objects from latent feature representations via attention mechanisms, thus enabling end-to-end feed-forward motion segmentation. Our key insight is to bypass explicit correspondence estimation and instead let the model learn to implicitly disentangle object and camera motion. Supported by recent advances in 4D scene geometry reconstruction (e.g., \pi^3), the proposed method leverages reliable camera poses and rich spatial-temporal priors, which ensure stable training and robust inference for the model. Extensive experiments demonstrate that by eliminating complex pre-processing and iterative refinement, our approach achieves state-of-the-art motion segmentation performance with high efficiency. Code will be made publicly available.
PaperID: 445,   Poster  https://arxiv.org/pdf/2602.21078     GitHub
Authors: Duowen Chen, Yan Wang
Title: ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning
Abstract: Federated Semi-Supervised Learning (FSSL) aims to collaboratively train a global model by leveraging unlabeled data and limited labeled data across clients in a privacy-preserving manner. In FSSL, data heterogeneity is a challenging issue that exists both across clients (external heterogeneity) and within clients (internal heterogeneity). Most FSSL methods design fixed or dynamic parameter aggregation strategies to collect client knowledge on the server (for external heterogeneity) and/or directly filter out low-confidence unlabeled samples with an empirical threshold to reduce mistakes on the local client (for internal heterogeneity). However, the former struggles to precisely fit the ideal global category distribution due to external heterogeneity, and the latter reduces the number of available samples that participate in training. To address these issues, we propose a proxy-guided framework called ProxyFL that simultaneously mitigates external and internal heterogeneity via a unified proxy. Specifically, we treat the learnable weights of the classifier as a proxy to simulate the category distribution both locally and globally. For external heterogeneity, we explicitly optimize the global proxy to better fit the category distribution across clients; for internal heterogeneity, we re-include the discarded samples together with other samples in training based on a positive-negative proxy pool rather than compromising on wrong pseudo-labels. Insightful experiments and theoretical analysis show that ProxyFL significantly boosts FSSL performance and convergence.
PaperID: 446,   Poster  https://arxiv.org/pdf/2603.07142     GitHub
Authors: Xijun Lu, Hongying Liu, Fanhua Shang, Yanming hui, Liang Wan
Title: PDD: Manifold-Prior Diverse Distillation for Medical Anomaly Detection
Abstract: Medical image anomaly detection faces unique challenges due to subtle, heterogeneous anomalies embedded in complex anatomical structures. Through systematic Grad-CAM analysis, we reveal that discriminative activation maps fail on medical data, unlike their success on industrial datasets, motivating the need for manifold-level modeling. We propose PDD (Manifold-Prior Diverse Distillation), a framework that unifies dual-teacher priors into a shared high-dimensional manifold and distills this knowledge into dual students with complementary behaviors. Specifically, frozen VMamba-Tiny and wide-ResNet50 encoders provide global contextual and local structural priors, respectively. Their features are unified through a Manifold Matching and Unification (MMU) module, while an Intra-Backbone Attention (InA) module enriches intermediate representations. The unified manifold is distilled into two students: one performs layer-wise distillation via InA for local consistency, while the other receives skip-projected representations through a Manifold Prior Affine (MPA) module to capture cross-layer dependencies. A diversity loss prevents representation collapse while maintaining detection sensitivity. Extensive experiments on multiple medical datasets demonstrate that PDD significantly outperforms existing state-of-the-art methods, achieving improvements of up to 11.8%, 5.1%, and 2.9% in AUROC on HeadCT, BrainMRI, and ZhangLab datasets, respectively, and 3.4% in F1 max on the Uni-Medical dataset, establishing new state-of-the-art performance in medical image anomaly detection.
PaperID: 447,   Poster  https://arxiv.org/pdf/2512.13157     GitHub
Authors: Peter Kocsis, Lukas Höllein, Matthias Nießner
Title: Intrinsic Image Fusion for Multi-View 3D Material Reconstruction
Abstract: We introduce Intrinsic Image Fusion, a method that reconstructs high-quality physically based materials from multi-view images. Material reconstruction is highly underconstrained and typically relies on analysis-by-synthesis, which requires expensive and noisy path tracing. To better constrain the optimization, we incorporate single-view priors into the reconstruction process. We leverage a diffusion-based material estimator that produces multiple, but often inconsistent, candidate decompositions per view. To reduce the inconsistency, we fit an explicit low-dimensional parametric function to the predictions. We then propose a robust optimization framework using soft per-view prediction selection together with a confidence-based soft multi-view inlier set to fuse the most consistent predictions of the most confident views into a consistent parametric material space. Finally, we use inverse path tracing to optimize for the low-dimensional parameters. Our results outperform state-of-the-art methods in material disentanglement on both synthetic and real scenes, producing sharp and clean reconstructions suitable for high-quality relighting.
PaperID: 448,   Poster  https://arxiv.org/pdf/2603.27542     GitHub
Authors: JongMin Lee, Seungyeop Kang, Sungjoo Yoo
Title: MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction
Abstract: Establishing consistent correspondences across images is essential for 3D vision tasks such as structure-from-motion (SfM), yet most existing matchers operate in a pairwise manner, often producing fragmented and geometrically inconsistent tracks when their predictions are chained across views. We propose MV-RoMa, a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible targets. Specifically, we design an efficient model architecture that avoids the high computational cost of full cross-attention for multi-view feature interaction: (i) a multi-view encoder that leverages pair-wise matching results as a geometric prior, and (ii) a multi-view matching refiner that refines correspondences using pixel-wise attention. Additionally, we propose a post-processing strategy that integrates our model's consistent multi-view correspondences as high-quality tracks for SfM. Across diverse and challenging benchmarks, MV-RoMa produces more reliable correspondences and substantially denser, more accurate 3D reconstructions than existing sparse and dense matching methods.
PaperID: 449,   Poster  https://arxiv.org/pdf/2604.05436     GitHub
Authors: Gwanghyun Kim, Junghun James Kim, Suh Yoon Jeon, Jason Park, Se Young Chun
Title: Human Interaction-Aware 3D Reconstruction from a Single Image
Abstract: Reconstructing textured 3D human models from a single image is fundamental for AR/VR and digital human applications. However, existing methods mostly focus on single individuals and thus fail in multi-human scenes, where naive composition of individual reconstructions often leads to artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group-level context and interaction priors. We introduce a holistic method that explicitly models both group- and instance-level information. To mitigate perspective-induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group-Instance Multi-View Diffusion (HUG-MVD), then generates complete multi-view normals and images by jointly modeling individuals and group context to resolve occlusions and proximity. Subsequently, the Human Group-Instance Geometric Reconstruction (HUG-GR) module optimizes the geometry by leveraging explicit, physics-based interaction priors to enforce physical plausibility and accurately model inter-human contact. Finally, the multi-view images are fused into a high-fidelity texture. Together, these components form our complete framework, HUG3D. Extensive experiments show that HUG3D significantly outperforms both single-human and existing multi-human methods, producing physically plausible, high-fidelity 3D reconstructions of interacting people from a single image.
PaperID: 450,   Poster  https://arxiv.org/pdf/2603.17307     GitHub
Authors: 海洋 闫, Hongyun Zhou, Peng Xu, FengXiaoxue FengXiaoxue, Mengyi Liu
Title: Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding
Abstract: Despite rapid developments and widespread applications of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Moreover, directly reducing the time context through embedding-based retrieval may lose key information of complex problems. In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations. By emulating human cognition patterns, Symphony decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving the reasoning capability. Additionally, Symphony provides a VLM-based grounding approach to analyze LVU tasks and assess the relevance of video segments, which significantly enhances the ability to locate complex problems with implicit intentions and large temporal spans. Experimental results show that Symphony achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state-of-the-art method on LVBench.
PaperID: 451,   Poster  https://arxiv.org/pdf/2603.07952     GitHub
Authors: Yanning Hou, Peiyuan Li, Zirui Liu, Yitong Wang, Yanran Ruan, Jianfeng Qiu, Ke Xu
Title: VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer
Abstract: Zero-shot anomaly detection (ZSAD) requires detecting and localizing anomalies without access to target-class anomaly samples. Mainstream methods rely on vision–language models (VLMs) such as CLIP: they build hand-crafted or learned prompt sets for normal and abnormal semantics, then compute image–text similarities for open-set discrimination. While effective, this paradigm depends on a text encoder and cross-modal alignment, which can lead to training instability and parameter redundancy. This work revisits the necessity of the text branch in ZSAD and presents VisualAD, a purely visual framework built on Vision Transformers. We introduce two learnable tokens within a frozen backbone to directly encode normality and abnormality. Through multi-layer self-attention, these tokens interact with patch tokens, gradually acquiring high-level notions of normality and anomaly while guiding patches to highlight anomaly-related cues. Additionally, we incorporate a Spatial-Aware Cross-Attention (SCA) module and a lightweight Self-Alignment Function (SAF): SCA injects fine-grained spatial information into the tokens, and SAF recalibrates patch features before anomaly scoring. VisualAD achieves state-of-the-art performance on 13 zero-shot anomaly detection benchmarks spanning industrial and medical domains, and adapts seamlessly to pretrained vision backbones such as the CLIP image encoder and DINOv2. Our code will be made publicly available.
PaperID: 452,   Poster  https://arxiv.org/pdf/2604.09367     GitHub
Authors: Shipeng Zhu, Ang Chen, Na Nie, Pengfei Fang, Min-Ling Zhang, Hui Xue
Title: EpiAgent: An Agent-Centric System for Ancient Inscription Restoration
Abstract: Ancient inscriptions, as repositories of cultural memory, have suffered centuries of environmental and human-induced degradation. Restoring their intertwined visual and textual integrity poses one of the most demanding challenges in digital heritage preservation. However, existing AI-based approaches often rely on rigid pipelines, struggling to generalize across such complex and heterogeneous real-world degradations. Inspired by the skill-coordinated workflow of human epigraphers, we propose EpiAgent, an agent-centric system that formalizes inscription restoration as a hierarchical planning problem. Following an Observe–Conceive–Execute–Reevaluate paradigm, an LLM-based central planner orchestrates collaboration among multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement. This agent-centric coordination enables a flexible and adaptive restoration process beyond conventional single-pass methods. Across real-world degraded inscriptions, EpiAgent delivers superior restoration quality and stronger generalization compared to existing methods. Our work marks a pivotal step toward expert-level agent-driven restoration of cultural heritage. Code will be released.
PaperID: 453,   Poster  https://arxiv.org/pdf/2508.19195     GitHub
Authors: Weixin Ye, Hongguang Zhu, Wei Wang, Yahui Liu, Mengyu Wang, Xuecheng Nie
Title: All-in-One Slider for Attribute Manipulation in Diffusion Models
Abstract: Text-to-image (T2I) diffusion models have made significant strides in generating high-quality images. However, progressively manipulating certain attributes of generated images to meet the desired user expectations remains challenging, particularly for content with rich details, such as human faces. Some studies have attempted to address this by training slider modules. However, they follow a One-for-One manner, where an independent slider is trained for each attribute, requiring additional training whenever a new attribute is introduced. This not only results in parameter redundancy accumulated by sliders but also restricts the flexibility of practical applications and the scalability of attribute manipulation. To address this issue, we introduce the All-in-One Slider, a lightweight module that decomposes the text embedding space into sparse, semantically meaningful attribute directions. Once trained, it functions as a general-purpose slider, enabling interpretable and fine-grained continuous control over various attributes. Moreover, by recombining the learned directions, the All-in-One Slider supports zero-shot manipulation of unseen attributes (e.g., races and celebrities) and the composition of multiple attributes. Extensive experiments demonstrate that our method enables accurate and scalable attribute manipulation, achieving notable improvements compared to previous methods. Furthermore, our method can be extended to integrate with the inversion framework to perform attribute manipulation on real images, broadening its applicability to various real-world scenarios. The code and trained model will be released.
PaperID: 454,   Poster  https://arxiv.org/pdf/2603.14238     GitHub
Authors: Huan Wang, Jun Shen, Jun Yan, Guansong Pang
Title: Domain-Skewed Federated Learning with Feature Decoupling and Calibration
Abstract: Federated learning (FL) allows distributed clients to collaboratively train a global model in a privacy-preserving manner. However, one major challenge is domain skew, where clients' data originating from diverse domains may hinder the aggregated global model from learning a consistent representation space, resulting in poor generalization across multiple domains. In this paper, we argue that the domain skew is reflected in the domain-specific biased features of each client, causing the local model's representations to collapse into a narrow low-dimensional subspace. We then propose Federated Feature Decoupling and Calibration (F^2DC), which liberates valuable class-relevant information by calibrating the domain-specific biased features, enabling more consistent representations across domains. A novel component, Domain Feature Decoupler (DFD), is first introduced in F^2DC to determine the robustness of each feature unit, thereby separating the local features into domain-robust features and domain-related features. A Domain Feature Corrector (DFC) is further proposed to calibrate these domain-related features by explicitly linking discriminative signals, capturing additional class-relevant clues that complement the domain-robust features. Finally, a domain-aware aggregation of the local models is performed to promote consensus among clients. Empirical results on three popular multi-domain datasets demonstrate the effectiveness of the proposed F^2DC and the contributions of its two modules.
PaperID: 455,   Poster  https://arxiv.org/pdf/2510.08527     GitHub
Authors: Zhiyuan Zhang, Can Wang, Dongdong Chen, Jing Liao
Title: FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control
Abstract: We present FlexTraj, a framework for image-to-video generation with flexible point trajectory control. FlexTraj introduces a unified point-based motion representation that encodes each point with a temporally consistent trajectory ID, a segmentation ID, and an optional color channel for appearance cues, enabling both dense and sparse trajectory control. Instead of injecting trajectory conditions into the video generator through token concatenation or ControlNet, FlexTraj employs an efficient sequence-concatenation scheme that achieves faster convergence, stronger controllability, and more efficient inference, while maintaining robustness under unaligned conditions. To train such a unified point trajectory-controlled video generator, FlexTraj adopts an annealing training strategy that gradually reduces reliance on complete supervision and aligned conditions. Experimental results demonstrate that FlexTraj enables multi-granularity, alignment-agnostic trajectory control for video generation, supporting various applications such as motion cloning, drag-based image-to-video, motion interpolation, camera redirection, flexible action control and mesh animations.
PaperID: 456,   Poster  https://arxiv.org/pdf/2512.24965     GitHub
Authors: Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou
Title: ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
Abstract: Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-π, the first flow-based generative model as a GUI dexterous hand, featuring the following designs: (i) Unified Discrete–Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents’ drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-π achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in the digital world.
PaperID: 457,   Poster  https://arxiv.org/pdf/2603.25247     GitHub
Authors: Taejin Jeong, Joohyeok Kim, Jinyeong Kim, Chanyoung Kim, Seong Jae Hwang
Title: FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics
Abstract: Spatial Transcriptomics (ST) provides spatially resolved gene expression, offering crucial insights into tissue architecture and complex diseases. However, its prohibitive cost limits widespread adoption, leading to significant attention on inferring spatial gene expression from readily available whole slide images. While graph neural networks have been proposed to model interactions between tissue regions, their reliance on pre-defined sparse graphs prevents them from considering potentially interacting spot pairs, resulting in a structural limitation in capturing complex biological relationships. To address this, we propose FEAST (Fully connected Expressive Attention for Spatial Transcriptomics), an attention-based framework that models the tissue as a fully connected graph, enabling the consideration of all pairwise interactions. To better reflect biological interactions, we introduce negative-aware attention, which models both excitatory and inhibitory interactions, capturing essential negative relationships that standard attention often overlooks. Furthermore, to mitigate the information loss from truncated or ignored context in standard spot image extraction, we introduce an off-grid sampling strategy that gathers additional images from intermediate regions, allowing the model to capture a richer morphological context. Experiments on public ST datasets show that FEAST surpasses state-of-the-art methods in gene expression prediction while providing biologically plausible attention maps that clarify positive and negative interactions.
PaperID: 458,   Poster  https://arxiv.org/pdf/2512.07802     GitHub
Authors: Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, Tian Xie
Title: OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
Abstract: Storytelling in real-world videos often unfolds through multiple shots: discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling. Our model and data will be released with the paper.
PaperID: 459,   Poster  https://arxiv.org/pdf/2511.22715     GitHub
Authors: Alberto Compagnoni, Marco Morini, Sara Sarto, Federico Cocchi, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Title: ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
Abstract: Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Source code and models will be made publicly available.
PaperID: 460,   Poster  https://arxiv.org/pdf/2603.15026     GitHub
Authors: Omer Ben Hayun, Roy Betser, Meir Yossef Levi, Levi Kassel, Guy Gilboa
Title: Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
Abstract: Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences that transform creative workflows. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection. We introduce STALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. Across two public benchmarks including 20 generative models, STALL consistently outperforms prior image- and video-based baselines. To further test generalization, we curate ComGenVid, a new benchmark featuring state-of-the-art models (Sora and Veo-3), on which STALL demonstrates consistent and robust results.
PaperID: 461,   Poster  https://arxiv.org/pdf/2512.17012     GitHub
Authors: Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen
Title: 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Abstract: Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.
PaperID: 462,   Poster  https://arxiv.org/pdf/2509.09676     GitHub
Authors: Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Bao Yajie, Chang Zeng, Yanxi Zhou, Xiao-Xiao Long, Hao Zhu, Zhaoxiang Zhang, Xun Cao, Yao Yao
Title: SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Abstract: Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community. Through extensive validation experiments, we demonstrate SpatialVID’s effectiveness across tasks such as controllable video generation, world simulation and geometric reconstruction, providing a strong foundation for spatial intelligence research.
PaperID: 463,   Poster  https://arxiv.org/pdf/2506.01078     GitHub
Authors: Yufei Zhan, Ziheng Wu, Yousong Zhu, Rongkun Xue, Guanghao Zhou, Ruipu Luo, Zhenghao Chen, Can Zhang, Yifan Li, Zhentao He, Zheming Yang, Ming Tang, Minghui Qiu, Jinqiao Wang
Title: GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking
Abstract: Despite recent advances in multimodal reasoning, Multimodal Large Language Models (MLLMs) still struggle on complex tasks where initial visual perceptions can be misleading. This performance gap stems from a critical reasoning flaw we term Visual Inertia: while MLLMs excel at iterative reflection in textual contexts, they tend to uncritically commit to their initial visual interpretations and rarely revise them. To overcome this limitation, we introduce GThinker, an MLLM equipped with a novel adaptive visual rethinking capability. GThinker leverages Cue-Rethinking, a flexible reasoning pattern that not only grounds reasoning in visual cues but also strategically triggers a re-examination of these cues to resolve inconsistencies. To instill this capability, we introduce a novel two-stage training framework. It begins with a pattern-guided cold start, enhanced by a judge-guided selective mechanism to learn from failure cases, followed by incentive reinforcement learning. We further curate the GThinker-11k dataset to power the training with an iterative multimodal annotation pipeline. Extensive experiments demonstrate that GThinker significantly mitigates visual inertia during reasoning, achieving a leading 81.5% on the M3CoT benchmark, which is rich in such challenges, surpassing the powerful O4-mini model. Furthermore, GThinker shows consistent improvements across a range of multimodal reasoning benchmarks with an average gain of 2.1%, showcasing the broad benefits of equipping MLLMs with the ability to rethink both what they see and how they think.
PaperID: 464,   Poster  https://arxiv.org/pdf/2511.17138     GitHub
Authors: Yushun Fang, Yuxiang Chen, Shibo Yin, Qiang Hu, Jiangchao Yao, Ya Zhang, Xiaoyun Zhang, Yanfeng Wang
Title: One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution
Abstract: Recent advances in diffusion-based real-world image super-resolution (Real-ISR) have demonstrated remarkable perceptual quality, yet the balance between fidelity and controllability remains a problem: multi-step diffusion-based methods suffer from generative diversity and randomness, resulting in low fidelity, while one-step methods lose control flexibility due to fidelity-specific finetuning. In this paper, we present ODTSR, a one-step diffusion transformer based on Qwen-Image that performs Real-ISR considering fidelity and controllability simultaneously: a newly introduced visual stream receives low-quality images (LQ) with adjustable noise (Control Noise), and the original visual stream receives LQs with consistent noise (Prior Noise), forming the Noise-hybrid Visual Stream (NVS) design. ODTSR further employs Fidelity-aware Adversarial Training (FAA) to enhance controllability and achieve one-step inference. Extensive experiments demonstrate that ODTSR not only achieves state-of-the-art (SOTA) performance on generic Real-ISR, but also enables prompt controllability on challenging scenarios such as real-world scene text image super-resolution (STISR) of Chinese characters without training on specific datasets.
PaperID: 465,   Poster  https://arxiv.org/pdf/2604.08536     GitHub
Authors: Onkar Susladkar, Dong-Hwan Jang, Tushar Prakash, Adheesh Juvekar, Vedant Shah, Ayush Barik, Nabeel Bashir, Muntasir Wahed, Ritish Shrirao, Ismini Lourentzou
Title: RewardFlow: Generate Images by Optimizing What You Reward
Abstract: RewardFlow is a zero-shot, training-free framework for text-guided image editing and generation based on reward-guided Langevin dynamics. We steer pretrained diffusion and flow-matching models at inference time using a diverse set of differentiable rewards, and control their influence with a prompt-aware adaptive policy that parses the text instruction, infers edit intent, and dynamically adjusts update steps. Our design includes a differentiable VQA-based reward for fine-grained semantic supervision and a SAM-guided reward for precise, localized edits with minimal leakage. Across standard image editing and compositional generation benchmarks, RewardFlow achieves state-of-the-art zero-shot edit fidelity and compositional alignment. We will release the code and an open-source demo upon acceptance of the paper.
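Illustrative note (not from the paper): the abstract above builds on reward-guided Langevin dynamics. The generic update can be sketched on a toy vector as below. This is only the textbook Langevin step with a reward gradient as the drift; the function name `reward_guided_langevin`, the quadratic toy reward, and all parameter values are assumptions made for illustration, and the sketch does not reproduce RewardFlow's adaptive policy, its VQA or SAM rewards, or any diffusion or flow-matching model.

```python
import numpy as np

def reward_guided_langevin(x0, reward_grad, step_size, n_steps,
                           noise_scale=1.0, rng=None):
    """Generic Langevin update: x <- x + eta * grad r(x) + sqrt(2 * eta) * noise.

    `reward_grad` is any callable returning the gradient of a differentiable
    reward at x. `noise_scale` is a hypothetical knob; 0.0 gives plain
    gradient ascent on the reward.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step_size * reward_grad(x) \
              + noise_scale * np.sqrt(2.0 * step_size) * noise
    return x

# Toy reward r(x) = -50 * ||x - target||^2, standing in for a learned reward.
target = np.array([1.0, -2.0])

def toy_reward_grad(x):
    return -100.0 * (x - target)

# With the noise switched off, the iterate contracts toward the reward maximum.
x_final = reward_guided_langevin(np.zeros(2), toy_reward_grad,
                                 step_size=1e-3, n_steps=200, noise_scale=0.0)
```

With `noise_scale=1.0` the same loop samples around the reward maximum instead of converging to it, which is the behaviour a Langevin sampler is meant to have.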
PaperID: 466,   Poster  https://arxiv.org/pdf/2603.17651     GitHub
Authors: Tae Eun Choi, Sumin Shim, Junhyeok Kim, Seong Jae Hwang
Title: Anchoring and Rescaling Attention for Semantically Coherent Inbetweening
Abstract: Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation. As sequences become sparser and motions larger, previous GI models struggle with inconsistent frames with unstable pacing and semantic misalignment. Since GI involves fixed endpoints and numerous plausible paths, this task requires additional guidance gained from the keyframes and text to specify the intended path. Thus, we give semantic and temporal guidance from the keyframes and text onto each intermediate frame through Keyframe-anchored Attention Bias. We also better enforce frame consistency with Rescaled Temporal RoPE, which allows self-attention to attend to keyframes more faithfully. TGI-Bench, the first benchmark specifically designed for text-conditioned GI evaluation, enables challenge-targeted evaluation to analyze GI models. Without additional training, our method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.
PaperID: 467,   Poster  https://arxiv.org/pdf/2602.23013     GitHub
Authors: Camile Lendering, Erkut Akdag, Egor Bondarev
Title: SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
Abstract: Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 98.0% and 97.6% on the MVTec-AD dataset, and 93.3% and 98.3% on the VisA dataset, respectively, surpassing prior state-of-the-art results.
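Illustrative note (not from the paper): the two stages described above, fitting PCA to normal features and scoring by reconstruction residual, can be sketched with NumPy alone. The function names `fit_normal_subspace` and `anomaly_scores`, the number of components, and the synthetic 5-dimensional vectors standing in for DINOv2 patch features are all assumptions for illustration; the authors' actual feature extraction and scoring details may differ.

```python
import numpy as np

def fit_normal_subspace(features, n_components):
    """Fit PCA to normal features; return (mean, principal directions)."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def anomaly_scores(features, mean, components):
    """Squared reconstruction residual of each feature w.r.t. the subspace."""
    centered = features - mean
    projected = centered @ components.T @ components
    residual = centered - projected
    return np.sum(residual ** 2, axis=1)

# Synthetic stand-in for patch features: normal data lies in a 2-D subspace of R^5.
rng = np.random.default_rng(0)
basis = np.eye(5)[:2]
normal_feats = rng.normal(size=(50, 2)) @ basis
mean, comps = fit_normal_subspace(normal_feats, n_components=2)

test_normal = rng.normal(size=(1, 2)) @ basis
# Push the same point 3 units off the normal subspace to mimic an anomaly.
test_anomalous = test_normal + np.array([[0.0, 0.0, 3.0, 0.0, 0.0]])
scores = anomaly_scores(np.vstack([test_normal, test_anomalous]), mean, comps)
```

In this toy setup the in-subspace point scores near zero while the shifted point scores roughly the squared offset, which is the sense in which the residual is directly interpretable.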
PaperID: 468,   Poster  https://arxiv.org/pdf/2603.02872     GitHub
Authors: Jialiang Zhang, Junlong Tong, Junyan Lin, Hao Wu, Yirong Sun, Yunpu Ma, Xiaoyu Shen
Title: Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
Abstract: Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities in Chain-of-Thought (CoT) reasoning. However, existing LVLM reasoning paradigms only begin reasoning after the entire video becomes available, introducing unnecessary latency and diminishing attention to early visual cues in dynamic scenes. Inspired by the human ability to think while watching, we introduce a streaming reasoning paradigm for LVLMs, where reasoning unfolds sequentially with incoming frames and deepens after the full video is observed. We instantiate this paradigm through Think-as-You-See (TaYS), a unified framework that enables LVLMs to reason while watching by integrating streaming CoT generation, stream-constrained training, and stream-parallel inference. Specifically, TaYS employs temporally aligned streaming reasoning units with precise CoT supervision, enforces ordered reasoning via streaming attention masks and positional encodings, and utilizes a parallel KV cache mechanism that decouples input encoding from reasoning generation, ensuring alignment and true concurrency. We evaluate TaYS on the Qwen2.5-VL model family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experimental results show that TaYS achieves superior reasoning performance compared with batch-mode CoT, while reducing pre-reasoning latency to under one second and overall answer delay by more than 50%. These findings demonstrate the effectiveness of the streaming paradigm in enabling real-time, human-like reasoning for LVLMs.
PaperID: 469,   Poster  https://arxiv.org/pdf/2603.26008     GitHub
Authors: Mahesh Bhosale, Abdul Wasi Lone, Shantam Srivastava, Shifa Latif, Tianyu Luan, Mingchen Gao, David Doermann, Xuan Gong
Title: FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants
Abstract: While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model’s representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on a large-scale chest radiology report-generation task show that FairLLaVA consistently reduces inter-group gaps while improving equity-scaled clinical performance and natural language generation metrics. Code and models will be open-sourced.
PaperID: 470,   Poster  https://arxiv.org/pdf/2603.27228     GitHub
Authors: Yanying Li, Jinyang Li, Shengfeng He, Yangyang Xu, Junyu Dong, Yong Du
Title: NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather
Abstract: We present NimbusGS, a unified framework for reconstructing high-quality 3D scenes from degraded multi-view inputs captured under diverse and mixed adverse weather conditions. Unlike existing methods that target specific weather types, NimbusGS addresses the broader challenge of generalization by modeling the dual nature of weather: a continuous, view-consistent medium that attenuates light, and dynamic, view-dependent particles that cause scattering and occlusion. To capture this structure, we decompose degradations into a global transmission field and per-view particulate residuals. The transmission field represents static atmospheric effects shared across views, while the residuals model transient disturbances unique to each input. To enable stable geometry learning under severe visibility degradation, we introduce a geometry-guided gradient scaling mechanism that mitigates gradient imbalance during the self-supervised optimization of 3D Gaussian representations. This physically grounded formulation allows NimbusGS to disentangle complex degradations while preserving scene structure, yielding superior geometry reconstruction and outperforming task-specific methods across diverse and challenging weather conditions.
PaperID: 471,   Poster  https://arxiv.org/pdf/2603.21356     GitHub
Authors: Yuqiu Liu, Jialin Song, Marissa Ramirez de Chanlatte, Rochishnu Chowdhury, Rushil Desai, Wuyang Chen, Daniel Martin, Michael Mahoney
Title: FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction
Abstract: Real objects inhabit a physical world and must behave plausibly during interaction with other physical objects. However, current methods that perform 3D reconstructions of real-world scenes from multi-view images optimize primarily for visual fidelity, i.e., they train with photometric losses and reason about uncertainty in the image or representation space. This appearance-centric view overlooks body contacts and couplings, conflates function-critical regions (e.g., aerodynamic or hydrodynamic surfaces) with ornamentation, and reconstructs structures suboptimally, even when physical regularizers are added. We consider the question: How can 3D reconstruction become aware of real-world interactions and underlying object function, beyond visual cues? We propose FluidGaussian, a plug-and-play method that tightly couples geometry reconstruction with ubiquitous fluid-structure interactions to assess surface quality at high granularity. (1) We define our simulation-based uncertainty induced from fluid simulations that capture physical plausibility. (2) We integrate our uncertainty with NBV (next-best-view) policies to prioritize views that improve both visual and physical fidelity. On NeRF Synthetic (Blender), Mip-NeRF 360, and DrivAerNet++, our method yields up to +8.6% PSNR, and -62.3% velocity divergence, with PSNR gains on function-critical surfaces of +7.7%.
PaperID: 472,   Poster  https://arxiv.org/pdf/2603.25968     GitHub
Authors: Zhuoli Zhuang, Yu-Cheng Chang, Yu-Kai Wang, Thomas Do, Chin-teng Lin
Title: Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control
Abstract: Recent advancements in computer vision have accelerated the development of autonomous driving. Despite these advancements, training machines to drive in a way that aligns with human expectations remains a significant challenge. Human factors are still essential, as humans possess a sophisticated cognitive system capable of rapidly interpreting scene information and making accurate decisions. Aligning machines with human intent has been explored with Reinforcement Learning with Human Feedback (RLHF). Conventional RLHF methods rely on collecting human preference data by manually ranking AI-generated outputs, which is time-consuming and indirect. In this work, we propose an electroencephalography (EEG)-guided decision-making framework to incorporate human cognitive insights into reinforcement learning (RL) for autonomous driving. We collected EEG signals from 20 participants in a realistic driving simulator and analyzed event-related potentials (ERP) in response to sudden environmental changes. Our proposed framework employs a neural network to predict the strength of ERP based on the cognitive information from visual scene information. Moreover, we explore the integration of such cognitive information into the reward signal of the RL algorithm. Experimental results show that our framework can improve the collision avoidance ability of the RL algorithm, highlighting the potential of neuro-cognitive feedback in enhancing autonomous driving systems.
PaperID: 473,   Poster  https://arxiv.org/pdf/2601.09708     GitHub
Authors: Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang
Title: Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Abstract: Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
PaperID: 474,   Poster  https://arxiv.org/pdf/2603.26316     GitHub
Authors: Cai Selvas-Sala, Lei Kang, Lluis Gomez
Title: SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning
Abstract: As multimodal models like CLIP become integral to downstream systems, the need to remove sensitive information is critical. However, machine unlearning for contrastively-trained encoders remains underexplored, and existing evaluations fail to diagnose fine-grained, association-level forgetting. We introduce SALMUBench (Sensitive Association-Level Multimodal Unlearning), a benchmark built upon a synthetic dataset of 60K persona-attribute associations and two foundational models: a Compromised model polluted with this data, and a Clean model without it. Both are trained from scratch on a 400M-pair retain set to isolate unlearning effects. We propose a novel evaluation protocol with structured holdout sets (holdout_identity, holdout_association) to precisely measure unlearning efficacy and collateral damage. Our benchmark reveals that while utility-efficient deletion is feasible, current methods exhibit distinct failure modes: they either fail to forget effectively or over-generalize by erasing more than intended. SALMUBench sets a new standard for comprehensive unlearning evaluation, and we publicly release our dataset, models, evaluation scripts, and leaderboards to foster future research.
PaperID: 475,   Poster  https://arxiv.org/pdf/2602.21395     GitHub
Authors: Yongxin Guo, Hao Lu, Onur Koyun, Zhengjie Zhu, Muhammet Demir, Metin Gurcan
Title: Momentum Memory for Knowledge Distillation in Computational Pathology
Abstract: Multimodal learning that integrates genomics and histopathology has shown strong potential in cancer diagnosis, yet its clinical translation is hindered by the limited availability of paired histology–genomics data. Knowledge distillation (KD) offers a practical solution by transferring genomic supervision into histopathology models, enabling accurate inference using histology alone. However, existing KD methods rely on batch-local alignment, which introduces instability due to limited within-batch comparisons and ultimately degrades performance. To address these limitations, we propose Momentum Memory Knowledge Distillation (MoMKD), a cross-modal distillation framework driven by a momentum-updated memory. This memory aggregates genomic and histopathology information across batches, effectively enlarging the supervisory context available to each mini-batch. Furthermore, we decouple the gradients of the genomics and histology branches, preventing genomic signals from dominating histology feature learning during training and eliminating the modality-gap issue at inference time. Extensive experiments on the TCGA-BRCA benchmark (HER2, PR, and ODX classification tasks) and an independent in-house testing dataset demonstrate that MoMKD consistently outperforms state-of-the-art MIL and multimodal KD baselines, delivering strong performance and generalization under histology-only inference. Overall, MoMKD establishes a robust and generalizable knowledge distillation paradigm for computational pathology.
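Note: as a rough illustration of what a momentum-updated memory means, the sketch below applies a generic exponential-moving-average update to a stored feature vector. The function name `momentum_update`, the momentum value 0.9, and the two-dimensional features are assumptions for illustration; the abstract does not specify how MoMKD structures or updates its memory.

```python
def momentum_update(memory, batch_feature, momentum=0.9):
    """Generic EMA update: memory <- momentum * memory + (1 - momentum) * batch_feature.

    Hypothetical helper; not taken from the MoMKD paper.
    """
    return [momentum * m + (1.0 - momentum) * x for m, x in zip(memory, batch_feature)]


# The memory accumulates information across several mini-batches
# instead of depending only on the current one.
memory = [0.0, 0.0]
for feat in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    memory = momentum_update(memory, feat)
print(memory)
```

With a constant input x, the memory after n updates equals (1 - momentum^n) * x, so it moves toward the running feature statistics gradually rather than jumping with each batch.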
PaperID: 476,   Poster  https://arxiv.org/pdf/2511.20620     GitHub
Authors: Xinhao Liu, Jiaqi Li, Youming Deng, Ruxin Chen, Yingjia Zhang, Yifei Ma, Li Guo, Yiming Li, Jing Zhang, Chen Feng
Title: Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI
Abstract: Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning and evaluation reliability. Beyond serving as a trusted testbed for embodied navigation, Wanderland's rich raw sensor data further allows benchmarking of 3D reconstruction and novel view synthesis models. Our work establishes a new foundation for reproducible research in open-world embodied AI.
PaperID: 477,   Poster  https://arxiv.org/pdf/2603.03026     GitHub
Authors: Wenqing Cui, Zhenyu Li, Mykola Lavreniuk, Jian Shi, Ramzi Idoughi, Xiangjun Tang, Peter Wonka
Title: Any Resolution Any Geometry: From Multi-View To Multi-Patch
Abstract: Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. We address this challenge by adapting the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth-normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sampling strategy that probabilistically samples grid configurations during training, improving inter-patch consistency and generalization. Our method achieves state-of-the-art results on UnrealStereo4K, jointly improving depth and normal estimation (reducing AbsRel from 0.0582 to 0.0291 and RMSE from 2.17 to 1.31, and lowering mean angular error from 23.36° to 18.27°) while producing sharper and more stable geometry. The proposed multi-patch framework also demonstrates strong zero-shot and cross-domain generalization and scales effectively to very high resolutions, offering an efficient and extensible solution for high-quality geometry refinement.
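Note: the three metrics quoted above (AbsRel, RMSE, mean angular error) have standard textbook definitions, sketched below for flat lists of values. The function names and the absence of any valid-pixel masking are assumptions; the abstract does not describe the paper's exact evaluation protocol.

```python
import math


def abs_rel(pred, gt):
    """Mean absolute relative depth error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def rmse(pred, gt):
    """Root mean squared depth error."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle in degrees between predicted and ground-truth normal vectors."""
    total = 0.0
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
        total += math.degrees(math.acos(cos))
    return total / len(gt_normals)


print(abs_rel([2.0, 4.0], [1.0, 4.0]))
print(rmse([2.0, 4.0], [1.0, 4.0]))
print(mean_angular_error_deg([(1.0, 0.0, 0.0)], [(0.0, 1.0, 0.0)]))
```

Lower is better for all three, which is why the reported changes are reductions.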
PaperID: 478,   Poster  https://arxiv.org/pdf/2512.04069     GitHub
Authors: Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithya Murali, Qing Qu, Stan Birchfield, Valts Blukis, Jonathan Tremblay
Title: SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
Abstract: Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, Internal Benchmark) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines.
PaperID: 479,   Poster  https://arxiv.org/pdf/2508.01450     GitHub
Authors: Xinlin Zhuang, Feilong Tang, Haolin Yang, Xiwei Liu, Ming Hu, Huifa Li, Haochen Xue, Junjun He, Zongyuan Ge, Yichen Li, Ying Qian, Imran Razzak
Title: Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data
Abstract: Supervised Fine-Tuning (SFT) plays a pivotal role in adapting Large Language Models (LLMs) to specialized domains such as medical reasoning. However, existing SFT practices often rely on unfiltered datasets that contain redundant and low-quality samples, leading to substantial computational costs and suboptimal performance. Although existing methods attempt to alleviate this problem by selecting data based on sample difficulty, defined by knowledge and reasoning complexity, they overlook each sample's optimization utility reflected in its gradient. Interestingly, we find that gradient-based influence alone favors easy-to-optimize samples that cause large parameter shifts but lack deep reasoning chains, while difficulty alone selects noisy or overly complex cases that fail to guide stable optimization. Based on this observation, we propose a data selection strategy, Difficulty-Influence Quadrant (DIQ), which prioritizes samples in the “high-difficulty–high-influence” quadrant to balance complex clinical reasoning with substantial gradient influence, enabling efficient medical reasoning with minimal fine-tuning data. Furthermore, human and LLM-as-a-judge evaluations show that DIQ-selected subsets demonstrate higher data quality and generate clinical reasoning that is more aligned with expert practices in differential diagnosis, safety check, and evidence citation, as DIQ emphasizes samples that foster expert-like reasoning patterns. Extensive experiments on medical reasoning benchmarks demonstrate that DIQ enables models fine-tuned on only 1% of selected data to match full-dataset performance, while using 10% consistently outperforms baseline methods, highlighting the superiority of principled data selection over brute-force scaling.
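Note: the quadrant idea can be sketched as a filter that keeps only samples scoring high on both axes. Everything concrete below is an assumption made for illustration: the `difficulty` and `influence` fields, the use of medians as quadrant boundaries, and the toy scores. The abstract does not say how DIQ computes either score or where it places the thresholds.

```python
from statistics import median


def select_high_high(samples):
    """Keep samples above the median on both difficulty and influence.

    Hypothetical helper: median thresholds are an illustrative choice.
    """
    d_thr = median(s["difficulty"] for s in samples)
    i_thr = median(s["influence"] for s in samples)
    return [s for s in samples if s["difficulty"] > d_thr and s["influence"] > i_thr]


samples = [
    {"id": "a", "difficulty": 0.9, "influence": 0.8},  # hard and influential
    {"id": "b", "difficulty": 0.9, "influence": 0.1},  # hard but low influence
    {"id": "c", "difficulty": 0.2, "influence": 0.9},  # easy but high influence
    {"id": "d", "difficulty": 0.1, "influence": 0.2},  # easy and low influence
]
selected = select_high_high(samples)
print([s["id"] for s in selected])
```

Filtering on either score alone would also admit sample b or sample c, which are the two failure cases the abstract describes.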
PaperID: 480,   Poster  https://arxiv.org/pdf/2603.02829     GitHub
Authors: Huanlei Guo, Hongxin Wei, Bingyi Jing
Title: Toward Early Quality Assessment of Text-to-Image Diffusion Models
Abstract: Recent text-to-image (T2I) diffusion models can produce highly realistic images from natural language prompts. In practice, users usually generate multiple candidates and select only a small subset for downstream use, guided by automatic metrics like CLIPScore and ImageReward. However, this post-hoc quality assessment is highly resource-intensive since quality is assessed after dozens to hundreds of denoising steps per image, leading to substantial waste on low-quality samples. To address this issue, we propose Probe-Select, a plug-in framework for early quality assessment in T2I generation. Our key observation is that certain intermediate features within the denoiser, often as early as 20% of the reverse process, already encode stable structural cues (e.g., object layout, spatial composition, and color harmony) that strongly correlate with final image fidelity. Building upon this phenomenon, Probe-Select attaches lightweight probes to these stable activations at an early checkpoint and trains them to align with external evaluators. During inference, the probes forecast image quality on the fly, enabling early pruning of unpromising trajectories so that computation is concentrated on promising ones. Experiments on MS-COCO across multiple generative backbones show that this early assessment mechanism reduces sampling cost by over 60% while improving the quality of the generated images, demonstrating that early structural signals can effectively guide efficient text-to-image generation.
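Note: a back-of-the-envelope count shows why pruning at an early checkpoint saves compute. The helper below counts denoising steps when every candidate runs up to the probe point and only a few continue. The candidate count, step count, and number of survivors are illustrative assumptions, not figures from the paper, and the cost of running the probes themselves is ignored.

```python
def denoising_steps(num_candidates, total_steps, probe_fraction, num_kept):
    """Total denoising steps when all candidates run to the probe point
    and only `num_kept` of them finish the remaining steps."""
    probe_steps = round(total_steps * probe_fraction)
    early = num_candidates * probe_steps
    late = num_kept * (total_steps - probe_steps)
    return early + late


# Hypothetical setting: 10 candidates, 50 steps each, probe at 20%, keep the top 2.
baseline = 10 * 50
pruned = denoising_steps(10, 50, 0.2, 2)
saving = 1 - pruned / baseline
print(baseline, pruned, saving)
```

In this toy setting the early phase costs 100 steps and the two survivors cost 80 more, so 180 steps replace 500. The actual saving depends on how many candidates are kept and how early the probe fires.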
PaperID: 481,   Poster  https://arxiv.org/pdf/2603.04002     GitHub
Authors: Tao Yang, Qing Zhou, Yanliang Li, Qi Wang
Title: Discriminative Perception via Anchored Description for Reasoning Segmentation
Abstract: Reasoning segmentation increasingly employs reinforcement learning to generate explanatory reasoning chains that guide Multimodal Large Language Models. While these geometric rewards are primarily confined to guiding the final localization, they are incapable of discriminating whether the reasoning process remains anchored on the referred region or strays into irrelevant context. Lacking this discriminative guidance, the model's reasoning often devolves into unfocused and verbose chains that ultimately fail to disambiguate and perceive the target in complex scenes. This suggests a need to complement the RL objective with Discriminative Perception, an ability to actively distinguish a target from its context. To realize this, we propose DPAD to compel the model to generate a descriptive caption of the referred object, which is then used to explicitly discriminate by sharply contrasting the caption’s semantic relevance to the referred object against the wider context. By optimizing for this discriminative capability, the model is forced to focus on the unique attributes of the target, leading to a more converged and efficient reasoning chain. The descriptive caption also serves as an interpretability rationale that aligns with the segmentation. Experiments on the benchmarks confirm the validity of our approach, delivering significant performance gains, with the cIoU on ReasonSeg increasing by 3.09% and the reasoning chain length decreasing by approximately 42%.
PaperID: 482,   Poster  https://arxiv.org/pdf/2512.21311     GitHub
Authors: Lilian Welschinger, Yilin Liu, Zican Wang, Niloy J. Mitra
Title: Learning to Solve PDEs on Neural Shape Representations
Abstract: Solving partial differential equations (PDEs) on shapes underpins many shape analysis and engineering tasks; yet, prevailing PDE solvers operate on polygonal/triangle meshes while modern 3D assets increasingly live as neural representations. This mismatch leaves no suitable method to solve surface PDEs directly within the neural domain, forcing explicit mesh extraction or per-instance residual training, preventing end-to-end workflows. We present a novel, mesh-free formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. The operator integrates naturally with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance optimization while preserving differentiability. Across analytic benchmarks (heat equation and Poisson solve on sphere) and real neural assets across different representations, our method slightly outperforms CPM while remaining reasonably close to FEM, and, to our knowledge, delivers the first end-to-end pipeline that solves surface PDEs on both neural and classical surface representations. Code will be released on acceptance.
PaperID: 483,   Poster  https://arxiv.org/pdf/2603.05506     GitHub
Authors: Weijie Lyu, Ming-Hsuan Yang, Zhixin Shu
Title: FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning
Abstract: We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on the Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior performance in camera controllability, visual quality, and identity and motion preservation.
PaperID: 484,   Poster  https://arxiv.org/pdf/2604.08537     GitHub
Authors: Mu Nan, Muquan Yu, Weijian Mai, Jacob S. Prince, Hossein Adeli, Rui Zhang, Jiahang Cao, Benjamin Becker, John Pyles, Margaret Marie Henderson, Chunfeng Song, Nikolaus Kriegeskorte, Michael J. Tarr, Xiaoqing Hu, Andrew Luo
Title: Meta-Learning In-Context Enables Training-Free Cross Subject Brain Decoding
Abstract: Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. A field-wide goal is to achieve generalizable, cross-subject models. A major obstacle towards this goal is the substantial variability in neural representations across individuals, which has so far required training bespoke models or fine-tuning separately for each subject. To address this challenge, we introduce a meta-optimized approach for semantic visual decoding from fMRI that generalizes to novel subjects without any fine-tuning. By simply conditioning on a small set of image-brain activation examples from the new individual, our model rapidly infers their unique neural encoding patterns to facilitate robust and efficient visual decoding. Our approach is explicitly optimized for in-context learning of the new subject's encoding model and performs decoding by hierarchical inference, inverting the encoder. First, for multiple brain regions, we estimate the per-voxel visual response encoder parameters by constructing a context over multiple stimuli and responses. Second, we construct a context consisting of encoder parameters and response values over multiple voxels to perform aggregated functional inversion. We demonstrate strong cross-subject and cross-scanner generalization across diverse visual backbones without retraining or fine-tuning. Moreover, our approach requires neither anatomical alignment nor stimulus overlap. This work is a critical step towards a generalizable foundation model for non-invasive brain decoding.
PaperID: 485,   Poster  https://arxiv.org/pdf/2602.21952     GitHub
Authors: Lingjun Zhang, Yujian Yuan, Changjie Wu, Xinyuan Chang, Xin Cai, Shuang Zeng, Linzhe Shi, Sijin Wang, Hang Zhang, Mu Xu
Title: MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving
Abstract: Vision-Language Models (VLMs) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. Chain-of-Thought (CoT), a widely used reasoning strategy for VLMs, faces critical challenges. Existing textual CoT has a large gap between the text semantic space and the trajectory physical space. Although a recent approach uses future images in place of text as the CoT process, it lacks clear planning-oriented objective guidance to generate images with accurate scene evolution. To address these issues, we propose MindDriver, a progressive multimodal reasoning framework that enables VLMs to imitate human-like progressive thinking for autonomous driving. MindDriver presents semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning. To achieve aligned reasoning processes in MindDriver, we develop a feedback-guided automatic data annotation pipeline to generate aligned multimodal reasoning training data. Furthermore, we develop a progressive reinforcement fine-tuning method to optimize the alignment through progressive high-level reward-based learning. MindDriver demonstrates superior performance in both nuScenes open-loop and Bench2Drive closed-loop evaluation. Our trained model and code will be released once accepted.
PaperID: 486,   Poster  https://arxiv.org/pdf/2601.04342     GitHub
Authors: Mohsen Ghafoorian, Amirhossein Habibian
Title: ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers
Abstract: Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt’s hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours, while being competitive in quality. Our light-weight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation.
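Note: the link between linear attention, recurrence, and constant memory can be illustrated with the generic causal linear-attention recurrence below. This is a plain token-by-token sketch, not ReHyAt's chunk-wise hybrid mechanism, which the abstract does not detail. The feature map elu(x) + 1 and all function names are assumptions chosen for illustration.

```python
import math


def feature_map(x):
    """elu(x) + 1: a positive feature map often used with linear attention."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention_recurrent(queries, keys, values):
    """Causal linear attention computed with a fixed-size running state."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    normalizer = [0.0] * d_k                   # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = feature_map(k)
        for i in range(d_k):
            normalizer[i] += phi_k[i]
            for j in range(d_v):
                state[i][j] += phi_k[i] * v[j]
        phi_q = feature_map(q)
        denom = sum(a * b for a, b in zip(phi_q, normalizer))
        outputs.append(
            [sum(phi_q[i] * state[i][j] for i in range(d_k)) / denom for j in range(d_v)]
        )
    return outputs


out = linear_attention_recurrent(
    [[0.5, -1.0], [0.1, 0.3]],
    [[1.0, 0.2], [-0.4, 0.7]],
    [[3.0, 4.0], [1.0, 0.0]],
)
print(out)
```

The state holds d_k * d_v + d_k numbers regardless of sequence length, whereas softmax attention needs every earlier key and value to be kept.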
PaperID: 487,   Poster  https://arxiv.org/pdf/2603.22870     GitHub
Authors: Yijia Zheng, Yu-Shan Tai, Raymond A. Yeh
Title: Designing to Forget: Deep Semi-parametric Models for Unlearning
Abstract: Recent advances in machine unlearning have focused on developing algorithms to remove specific training samples from a trained model. In contrast, we observe that not all models are equally easy to unlearn. Hence, we introduce a family of deep semi-parametric models (SPMs) that exhibit non-parametric behavior during unlearning. SPMs use a fusion module that aggregates information from each training sample, enabling explicit test-time deletion of selected samples without altering model parameters. Empirically, we demonstrate that SPMs achieve competitive task performance to parametric models in image classification and generation, while being significantly more efficient for unlearning. Notably, on ImageNet classification, SPMs reduce the prediction gap relative to a retrained (oracle) baseline by 11% and achieve over 10× faster unlearning compared to existing approaches on parametric models.
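Note: the appeal of non-parametric behavior for unlearning can be seen in a toy nearest-neighbor predictor, where forgetting a sample just means removing it from the stored set. This is only an analogy under stated assumptions: SPMs use a learned fusion module over training samples, which the abstract does not specify, and the class and method names below are hypothetical.

```python
from collections import Counter


class NearestSampleClassifier:
    """Toy non-parametric predictor: its only state is the stored samples."""

    def __init__(self, samples):
        # samples: list of (feature_vector, label) pairs
        self.samples = list(samples)

    def predict(self, x, k=1):
        """Majority label among the k stored samples closest to x."""
        ranked = sorted(
            self.samples,
            key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)),
        )
        votes = Counter(label for _, label in ranked[:k])
        return votes.most_common(1)[0][0]

    def forget(self, label):
        """Delete every stored sample with this label; nothing is retrained."""
        self.samples = [s for s in self.samples if s[1] != label]


clf = NearestSampleClassifier([
    ([0.0, 0.0], "cat"),
    ([0.1, 0.0], "cat"),
    ([1.0, 1.0], "dog"),
    ([1.1, 1.0], "dog"),
])
before = clf.predict([0.4, 0.4])
clf.forget("cat")
after = clf.predict([0.4, 0.4])
print(before, after)
```

After deletion the removed samples can no longer influence any prediction, which is the property that makes such models cheap to unlearn compared with retraining a parametric model.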
PaperID: 488,   Poster  https://arxiv.org/pdf/2512.10950     GitHub
Authors: Qitao Zhao, Hao Tan, Qianqian Wang, Sai Bi, Kai Zhang, Kalyan Sunkavalli, Shubham Tulsiani, Hanwen Jiang
Title: E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training
Abstract: Self-supervised pre-training has revolutionized foundation models for language, 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation, and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv2, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.
PaperID: 489,   Poster  https://arxiv.org/pdf/2604.16044     GitHub
Authors: Meng Yu, Lei Sun, Jianhao Zeng, Xiangxiang Chu, Kun Zhan
Title: Elucidating the SNR-t Bias of Diffusion Probabilistic Models
Abstract: Diffusion Probabilistic Models have demonstrated remarkable performance across a wide range of generative tasks. However, we have observed that these models often suffer from a Signal-to-Noise Ratio–timestep (SNR-t) bias. This bias refers to the misalignment between the SNR of the denoising sample and its corresponding timestep during the inference phase. Specifically, during training, the SNR of a sample is strictly coupled with its timestep; however, this correspondence is disrupted during inference, leading to error accumulation and performance degradation in generation quality. We provide comprehensive empirical evidence and theoretical analysis to substantiate this phenomenon and propose a simple yet effective differential correction method to mitigate the SNR-t bias. Recognizing that diffusion models typically reconstruct low-frequency components before focusing on high-frequency details during the reverse denoising process, we decompose samples into various frequency components and apply differential correction to each component individually. Extensive experiments show that our approach significantly improves the generation quality of various diffusion models (IDDPM, ADM, DDIM, A-DPM, EA-DPM, EDM, PFGM++, and SDXL) on datasets of various resolutions with negligible computational overhead.
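Note: the training-time coupling between SNR and timestep can be made concrete for a standard DDPM-style forward process, where the noisy sample is sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise and hence SNR(t) = alpha_bar_t / (1 - alpha_bar_t). The linear beta schedule and its endpoints below are a common default assumed here for illustration; the models studied in the paper use various schedules, and this sketch does not implement the proposed correction.

```python
def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM default; assumed here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def snr_per_timestep(betas):
    """SNR(t) = alpha_bar_t / (1 - alpha_bar_t), alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    snrs = []
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs


snrs = snr_per_timestep(linear_betas(1000))
print(snrs[0], snrs[499], snrs[-1])
```

During training each timestep therefore maps to exactly one SNR value. The bias described above arises when a sample produced during inference carries an SNR that differs from the value its timestep implies.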
PaperID: 490,   Poster  https://arxiv.org/pdf/2604.03878     GitHub
Authors: Lei Zhou, Haoyu Wu, Akshat Dave, Dimitris Samaras
Title: Learning 3D Reconstruction with Priors in Test Time
Abstract: We introduce a test-time framework for multiview Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve 3D tasks, without retraining or modifying the pre-trained image-only networks. Rather than feeding priors into the architecture, we cast them as constraints on the predictions and optimize the network at inference. The optimization loss is composed of a self-supervised objective and prior penalty terms. The self-supervised objective is defined as the compatibility among multi-view predictions, implemented by the photometric or geometric loss between the renderings from other views and each view itself. Any available priors are turned into the penalty terms on the corresponding output modalities. Across a series of 3D vision benchmarks, including point map estimation and camera pose estimation, our method consistently improves performance over base MVTs by a large margin. On ETH3D, 7-Scenes, and NRGBD datasets, our method cuts the point map distance error by more than half compared to the base image-only models. Our method also outperforms those re-trained prior-aware feed-forward methods, demonstrating the effectiveness of our test-time constrained optimization (TCO) framework in incorporating priors for 3D vision tasks.
PaperID: 491,   Poster  https://arxiv.org/pdf/2604.15857     GitHub
Authors: Taewoong Kang, Hyojin Jang, Sohyun Jeong, Seunggi Moon, Gihwi Kim, Hoon Jung, Jaegul Choo
Title: AHS: Adaptive Head Synthesis via Synthetic Data Augmentations
Abstract: Recent digital media advancements have created increasing demands for sophisticated portrait manipulation techniques, particularly head swapping, where one image's head is seamlessly integrated onto another's body. Current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. These methods struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that our approach achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve both identity and expression fidelity across various head orientations and hairstyles. Notably, our method shows exceptional robustness in maintaining facial identity under drastic expression changes and faithfully preserving accessories under significant head pose variations.
PaperID: 492,   Poster  https://arxiv.org/pdf/2604.19093     GitHub
Authors: Jinglin Xu, Yi Li, Chuxiong Sun, Xiao Xu, Jiangmeng Li, Fanjiang Xu
Title: Multi-modal Test-time adaptation via Adaptive Probabilistic Gaussian Calibration
Abstract: Multi-modal test-time adaptation (TTA) enhances the resilience of benchmark multi-modal models against distribution shifts by leveraging the unlabeled target data during inference. Despite the documented success, the advancement of multi-modal TTA methodologies has been impeded by a persistent limitation, i.e., the lack of explicit modeling of category-conditional distributions, which is crucial for yielding accurate predictions and reliable decision boundaries. Canonical Gaussian discriminant analysis (GDA) provides a vanilla modeling of category-conditional distributions and achieves moderate advancement in uni-modal contexts. However, in the multi-modal TTA scenario, the inherent modality distribution asymmetry undermines the effectiveness of modeling the category-conditional distribution via the canonical GDA. To this end, we introduce a tailored probabilistic Gaussian model for multi-modal TTA to explicitly model the category-conditional distributions, and further propose an adaptive contrastive asymmetry rectification technique to counteract the adverse effects arising from modality asymmetry, thereby deriving calibrated predictions and reliable decision boundaries. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts.
PaperID: 493,   Poster  https://arxiv.org/pdf/2603.09101     GitHub
Authors: Chenran Zhang, Ruiqi Wu, Tao Zhou, Yi Zhou
Title: MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration
Abstract: Medical vision-language pretraining (VLP) models have recently been investigated for their generalization to diverse downstream tasks. However, current medical VLP methods typically force the model to learn simple and complex concepts simultaneously. This anti-cognitive process leads to suboptimal feature representations, especially under distribution shift. To address this limitation, we propose a Knowledge-driven Cognitive Orchestration for Medical VLP (MedKCO) that involves both the ordering of the pretraining data and the learning objective of vision-language contrast. Specifically, we design a two-level curriculum by incorporating diagnostic sensitivity and intra-class sample representativeness for the ordering of the pretraining data. Moreover, considering the inter-class similarity of medical images, we introduce a self-paced asymmetric contrastive loss to dynamically adjust the participation of the pretraining objective. We evaluate the proposed pretraining method on three medical imaging scenarios in multiple vision-language downstream tasks, and compare it with several curriculum learning methods. Extensive experiments show that our method significantly surpasses all baselines. The source code will be released upon acceptance.
PaperID: 494,   Poster  https://arxiv.org/pdf/2602.21461     GitHub
Authors: Xiaoke Huang, Bhavul Gauri, Kam Woh Ng, Tony Ng, Mengmeng Xu, Zhiheng Liu, Weiming Ren, Zhaochong An, Zijian Zhou, Haonan Qiu, Yuyin Zhou, Sen He, Ziheng Wang, Tao Xiang, Xiao Han
Title: VecGlypher: Unified Vector Glyph Generation with Language Models
Abstract: Vector glyphs are the atomic units of digital typography, yet most learning-based pipelines still depend on carefully curated exemplar sheets and raster-to-vector postprocessing, which limits accessibility and editability. We introduce VecGlypher, a single multimodal language model that generates high-fidelity vector glyphs directly from text descriptions or image exemplars. Given a style prompt, optional reference glyph images, and a target character, VecGlypher autoregressively emits SVG path tokens, avoiding raster intermediates and producing editable, watertight outlines in one pass. A typography-aware data and training recipe makes this possible: (i) a large-scale continuation stage on 39K noisy Envato fonts to master SVG syntax and long-horizon geometry, followed by (ii) post-training on 2.5K expert-annotated Google Fonts with descriptive tags and exemplars to align language and imagery with geometry; preprocessing normalizes coordinate frames, canonicalizes paths, de-duplicates families, and quantizes coordinates for stable long-sequence decoding. On cross-family OOD evaluation, VecGlypher substantially outperforms both general-purpose LLMs and specialized vector-font baselines for text-only generation, while image-referenced generation reaches state-of-the-art performance, with marked gains over DeepVecFont-v2 and DualVector. Ablations show that model scale and the two-stage recipe are critical and that absolute-coordinate serialization yields the best geometry. VecGlypher lowers the barrier to font creation by letting users design with words or exemplars, and provides a scalable foundation for future multimodal design tools.
PaperID: 495,   Poster  https://arxiv.org/pdf/2604.01561     GitHub
Authors: Yanzhe Liang, Ruijie Zhu, Hanzhi Chang, Zhuoyuan Li, Jiahao Lu, Tianzhu Zhang
Title: ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction
Abstract: We present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, and therefore often resort to external dense motion guidance, such as pre-computed optical flow, to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation. To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision. The core of ReFlow is a novel self-correction flow matching mechanism, consisting of Full Flow Matching to align 3D scene flow with time-varying 2D observations, and Camera Flow Matching to enforce multi-view consistency for static objects. Together, these modules enable robust and accurate dynamic scene reconstruction. Extensive experiments across diverse scenarios demonstrate that ReFlow achieves superior reconstruction quality and robustness, establishing a novel self-correction paradigm for monocular 4D reconstruction.
PaperID: 496,   Poster  https://arxiv.org/pdf/2512.03043     GitHub
Authors: Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, shuang chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, Yan Feng, Peng Pei, Xunliang Cai, Xiangyu Yue
Title: OneThinker: All-in-one Reasoning Model for Image and Video
Abstract: Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Moreover, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data will be released.
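The idea of tracking task-wise moving averages of reward standard deviations can be illustrated with a small sketch. The abstract does not give the EMA-GRPO update rule, so the class below is an assumed form: group-relative advantages are centered on the group mean and scaled by an exponential moving average of that task's reward standard deviation. The class name, decay value, and task names are hypothetical.

```python
import numpy as np

class TaskwiseEmaStd:
    """Keeps one exponential moving average of reward std per task (assumed form)."""

    def __init__(self, decay=0.9, eps=1e-6):
        self.decay = decay
        self.eps = eps
        self.ema_std = {}

    def advantages(self, task, rewards):
        rewards = np.asarray(rewards, dtype=np.float64)
        batch_std = rewards.std()
        # The first batch of a task initializes its moving average.
        prev = self.ema_std.get(task, batch_std)
        self.ema_std[task] = self.decay * prev + (1.0 - self.decay) * batch_std
        # Center on the group mean, scale by the task-level moving average.
        return (rewards - rewards.mean()) / (self.ema_std[task] + self.eps)

norm = TaskwiseEmaStd(decay=0.9)
adv_qa = norm.advantages("qa", [0.0, 1.0, 1.0, 0.0])                  # sparse binary rewards
adv_seg = norm.advantages("segmentation", [0.61, 0.64, 0.66, 0.63])  # dense, low-variance rewards
```

Scaling by a slowly moving per-task statistic, rather than by each group's own standard deviation, is one way to keep tasks with very different reward spreads from dominating the update.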
PaperID: 497,   Poster  https://arxiv.org/pdf/2512.14697     GitHub
Authors: Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli, Philipp Krähenbühl
Title: Spherical Leech Quantization for Visual Tokenization and Generation
Abstract: Lookup-free quantization has received much attention due to its parameter efficiency and scalability to a large codebook. In this paper, we present a unified formulation of different non-parametric quantization methods through the lens of lattice coding. The geometry of lattice codes explains the necessity of auxiliary loss terms when training auto-encoders with certain existing lookup-free quantization variants such as BSQ. As a step forward, we explore a few possible candidates, including random lattices, generalized Fibonacci lattices, and densest sphere packing lattices. Among these, we find that the Leech lattice-based quantization method, dubbed Spherical Leech Quantization ($\Lambda_{24}$-SQ), leads to both a simplified training recipe and an improved reconstruction-compression tradeoff thanks to its high symmetry and even distribution on the hypersphere. In image tokenization and compression tasks, this quantization approach achieves better reconstruction quality across all metrics than BSQ, the best prior art, while consuming slightly fewer bits. The improvement also extends to state-of-the-art auto-regressive image generation frameworks.
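As background for what "lookup-free" means here, the sketch below shows a binary spherical quantizer in the spirit of the BSQ baseline named in the abstract: the latent is projected onto the unit hypersphere and snapped to the nearest corner of a scaled hypercube, so the code index is implied by the sign pattern and no learned codebook is stored. This is an illustration of the baseline family only, not the Leech-lattice method; function names are assumptions.

```python
import numpy as np

def binary_spherical_quantize(z):
    """Project onto the unit hypersphere, then snap to the nearest corner +-1/sqrt(d)."""
    d = z.shape[-1]
    u = z / np.linalg.norm(z, axis=-1, keepdims=True)
    code = np.where(u >= 0, 1.0, -1.0) / np.sqrt(d)
    return u, code

def code_to_index(code):
    """Implicit token id: read the sign pattern as a binary number."""
    bits = (code > 0).astype(np.int64)
    weights = 2 ** np.arange(bits.shape[-1])[::-1]
    return bits @ weights

rng = np.random.default_rng(0)
z = rng.normal(size=(3, 8))          # three 8-dimensional latents
u, q = binary_spherical_quantize(z)
token_ids = code_to_index(q)         # each id lies in [0, 2**8)
```

Every quantized vector again has unit norm, so the implicit codebook is a fixed set of points on the hypersphere; a lattice-based variant changes which point set is used.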
PaperID: 498,   Poster  https://arxiv.org/pdf/2504.05662     GitHub
Authors: Shunsuke Sakai, Xiangteng He, Chunzhi Gu, Leonid Sigal, Tatsuhito Hasegawa
Title: InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models
Abstract: Despite their remarkable success, recent reconstruction-based anomaly detection (AD) methods via diffusion modeling still involve fine-grained noise-strength tuning and computationally expensive multi-step denoising, leading to a fundamental tension between fidelity and efficiency. In this paper, we propose InvAD, a novel Inversion-based Anomaly Detection approach (“detection via noising in latent space”), which circumvents explicit reconstruction. Importantly, we contend that the limitations in prior reconstruction-based methods originate from the prevailing “detection via denoising in RGB space” paradigm. To address this, we model AD under a reconstruction-free formulation, which directly infers the final latent variable corresponding to the input image via DDIM inversion, and then measures the deviation based on the known prior distribution for anomaly scoring. Specifically, in approximating the original probability flow ODE using the Euler method, we enforce only a few inversion steps to noise the clean image to pursue inference efficiency. As the added noise is adaptively derived with the learned diffusion model, the original features for the clean testing image can still be leveraged to yield high detection accuracy. We perform extensive experiments and detailed analyses across four widely used industrial and medical AD benchmarks under the unsupervised unified setting to demonstrate the effectiveness of our model, achieving state-of-the-art AD performance and approximately 2× inference-time speedup without diffusion distillation.
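The inversion-then-score pipeline described above can be sketched in a few lines. The update below is the standard deterministic DDIM step run in the noising direction; the noise schedule, the number of steps, the scoring function, and the toy noise predictor are all assumptions for illustration (the toy predictor is the closed-form optimum for data drawn from a unit Gaussian, standing in for a learned network), and the paper's actual scoring rule is not given in the abstract.

```python
import numpy as np

def ddim_invert(x0, eps_model, alpha_bar, num_steps=3):
    """Noise a clean input to the final latent with a few deterministic DDIM steps."""
    timesteps = np.linspace(0, len(alpha_bar) - 1, num_steps + 1).astype(int)
    x = x0
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_cur, a_next = alpha_bar[t_cur], alpha_bar[t_next]
        eps = eps_model(x, t_cur)
        # Predict the clean sample, then re-noise it to the next (noisier) level.
        x0_pred = (x - np.sqrt(1.0 - a_cur) * eps) / np.sqrt(a_cur)
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x

def anomaly_score(latent):
    """Negative log-density under N(0, I), per dimension and up to a constant."""
    return 0.5 * float(np.mean(latent ** 2))

# Hypothetical schedule and stand-in noise predictor.
alpha_bar = np.linspace(0.9999, 0.01, 1000)

def toy_eps_model(x, t):
    return np.sqrt(1.0 - alpha_bar[t]) * x

rng = np.random.default_rng(0)
normal = rng.normal(size=16)
score_normal = anomaly_score(ddim_invert(normal, toy_eps_model, alpha_bar))
score_anomalous = anomaly_score(ddim_invert(3.0 * normal, toy_eps_model, alpha_bar))
```

In this toy setting the out-of-distribution input (the same sample scaled by three) ends up further from the prior than the in-distribution one, which is the signal a reconstruction-free detector would threshold.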
PaperID: 499,   Poster  https://arxiv.org/pdf/2603.27179     GitHub
Authors: yizhou jin, Yuezhu Feng, Jinjin Zhang, Peng Wang, Qingjie Liu, Yunhong Wang
Title: Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision
Abstract: Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning and perceptual abilities for anomaly detection. However, most approaches remain confined to image-level anomaly detection and textual reasoning, while pixel-level localization still relies on external vision modules and dense annotations. In this work, we activate the intrinsic reasoning potential of MLLMs to perform anomaly detection, pixel-level localization, and interpretable reasoning solely from image-level supervision, without any auxiliary components or pixel-wise labels. Specifically, we propose Reasoning-Driven Anomaly Localization (ReAL), which extracts anomaly-related tokens from the autoregressive reasoning process and aggregates their attention responses to produce pixel-level anomaly maps. We further introduce a Consistency-Guided Reasoning Optimization (CGRO) module that leverages reinforcement learning to align reasoning tokens with visual attentions, resulting in more coherent reasoning and accurate anomaly localization. Extensive experiments on four public benchmarks demonstrate that our method significantly improves anomaly detection, localization, and interpretability. Remarkably, despite relying solely on image-level supervision, our approach achieves performance competitive with MLLM-based methods trained under dense pixel-level supervision.
PaperID: 500,   Poster  https://arxiv.org/pdf/2512.03619     GitHub
Authors: Muhammed Burak Kızıl, Enes Şanlı, Niloy J. Mitra, Erkut Erdem, Aykut Erdem, Duygu Ceylan
Title: LAMP: Language-Assisted Motion Planning for Controllable Video Generation
Abstract: Recent advances in video generation have achieved remarkable progress in visual fidelity and controllability, enabling conditioning not only on text but also on structural layout and motion signals. Among these, motion control (i.e., specifying both object dynamics and camera trajectories) is particularly critical for directing complex, cinematic scenes, yet existing interfaces remain limited. To address this gap, we introduce LAMP, which leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for both dynamic objects and (relatively defined) cameras. Specifically, we finetune an LLM to generate frame-wise 3D bounding-box trajectories for objects and, conditioned on these, produce corresponding 3D camera paths, which are then converted into generator-compatible 2D control signals. We enable this by constructing a large-scale paired dataset through a combination of procedurally generated text–trajectory pairs and augmented real video datasets with 3D annotations. Experiments demonstrate improved controllability and alignment with user intent compared to state-of-the-art alternatives, establishing the first framework for joint object–camera trajectory generation directly from natural language.
PaperID: 501,   Poster  https://arxiv.org/pdf/2512.19546     GitHub
Authors: Ziqiao Peng, Yi Chen, Yifeng Ma, Guozhen Zhang, Zhiyao Sun, Zixiang Zhou, Youliang Zhang, zhengguang zhou, Zhaoxin Fan, Hongyan Liu, Yuan Zhou, qinglin lu, Jun He
Title: ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars
Abstract: Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally-anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process: early layers prioritize text for establishing action structure while deeper layers emphasize audio for refining lip movements, preventing modality interference; (3) A two-stage training strategy that first establishes robust audio-visual correspondence on diverse data, then injects action control through fine-tuning on structured annotations, maintaining both audio-visual alignment and the model’s text-following capabilities. Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality. The code will be made publicly available.
PaperID: 502,   Poster  https://arxiv.org/pdf/2512.00975     GitHub
Authors: Haotian Liang, Xinyi Chen, Bin Wang, MingKang Chen, Yitian Liu, Yuhao Zhang, Zanxin Chen, Tianshuo Yang, Yilun Chen, Jiangmiao Pang, Dong Liu, Xiaokang Yang, Yao Mu, Wenqi Shao, Ping Luo
Title: MM-ACT: Learn from Multimodal Parallel Generation to Act
Abstract: A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in a shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal task learning. Experiments were conducted on the LIBERO simulation and Franka real-robot setups as well as RoboTwin2.0 to assess in-domain and out-of-domain performance, respectively. Our approach achieves a success rate of 96.3% on LIBERO, 62.2% across four tasks of Franka, and 52.38% across eight bimanual tasks of RoboTwin2.0, with an additional gain of 9.25% from text-image co-training.
PaperID: 503,   Poster  https://arxiv.org/pdf/2505.18675     GitHub
Authors: Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, Xinchao Wang
Title: ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps
Abstract: Multimodal large language models (MLLMs) have demonstrated significant progress in semantic scene understanding and text-image alignment, with reasoning variants enhancing performance on more complex tasks involving mathematics and logic. However, their proficiency in tasks requiring both fine-grained visual understanding and spatial reasoning remains underexplored. To bridge this gap, we introduce ReasonMap, a novel benchmark specifically designed to evaluate these capabilities. ReasonMap encompasses high-resolution transit maps from 30 cities and includes 1,008 question-answer pairs spanning two question types and three templates. Furthermore, we design a two-level evaluation pipeline that properly assesses answer correctness and quality. Our comprehensive evaluation of 16 popular MLLMs reveals a counterintuitive pattern: among open-source models, base variants outperform their reasoning-tuned counterparts, whereas the opposite trend is observed in closed-source models. Further analysis under the visual-masking setting confirms that strong performance necessitates direct visual grounding, rather than relying solely on language priors. We further establish a training baseline with reinforcement fine-tuning, providing a reference for future exploration. We hope this benchmark study offers new insights into visual reasoning and helps investigate the gap between open- and closed-source models. Code and data samples are in the Supplementary.
PaperID: 504,   Poster  https://arxiv.org/pdf/2602.06393     GitHub
Authors: Geonmo Gu, Byeongho Heo, Jaemyung Yu, Jaehui Hwang, Taekyung Kim, Sangmin Lee, HeeJae Jun, Yoohoon Kang, Sangdoo Yun, Dongyoon Han
Title: MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model
Abstract: Universal multimodal embedding models built on Multimodal Large Language Models (MLLMs) have traditionally employed contrastive learning, which aligns representations of query-target pairs across different modalities. Yet, despite their empirical success, these models are primarily built on a "single-turn" formulation where each query-target pair is treated as an independent data point. This paradigm leads to computational inefficiency when scaling, as it requires a separate forward pass for each pair and overlooks potential contextual relationships between multiple queries that can relate to the same context. In this work, we introduce Multi-Turn Contrastive Learning (MuCo), a dialogue-inspired framework that revisits this process. MuCo leverages the conversational nature of MLLMs to process multiple, related query-target pairs associated with a single image within a single forward pass. This allows us to extract a set of multiple query and target embeddings simultaneously, conditioned on a shared context representation, amplifying the effective batch size and overall training efficiency. Experiments show that MuCo, trained with a newly curated 5M multimodal multi-turn dataset (M3T), yields state-of-the-art retrieval performance on MMEB and M-BEIR benchmarks, while markedly enhancing both training efficiency and representation coherence across modalities.
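To make the "amplified effective batch size" concrete, here is a minimal sketch of an InfoNCE-style contrastive loss applied to embeddings from several turns per image. The embeddings are random stand-ins for MLLM outputs, and treating every other pair (including pairs from the same image) as a negative is an assumption; the abstract does not specify how MuCo handles same-image pairs.

```python
import numpy as np

def info_nce(queries, targets, temperature=0.05):
    """Contrastive loss where the i-th query should match the i-th target."""
    q = queries / np.linalg.norm(queries, axis=-1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=-1, keepdims=True)
    logits = q @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
num_images, turns, dim = 4, 3, 16
# Stand-in embeddings: each image contributes `turns` query-target pairs.
targets = rng.normal(size=(num_images, turns, dim))
queries = targets + 0.1 * rng.normal(size=targets.shape)

# Flattening the turn axis gives num_images * turns pairs per loss computation.
loss = info_nce(queries.reshape(-1, dim), targets.reshape(-1, dim))
effective_batch = num_images * turns
```

With four images and three turns each, the loss sees twelve pairs, whereas a single-turn formulation with the same number of image passes would see four.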
PaperID: 505,   Poster  https://arxiv.org/pdf/2506.23690     GitHub
Authors: Shuai Tan, Biao Gong, Yujie Wei, Shiwei Zhang, Zhuoxin Liu, Ke Ma, Yan Wang, Kecheng Zheng, Xing Zhu, Yujun Shen, Hengshuang Zhao
Title: SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation
Abstract: Diffusion-based video motion customization facilitates the acquisition of human motion representations from a few video samples, while achieving arbitrary subject transfer through precise textual conditioning. Existing approaches often rely on semantic-level alignment, expecting the model to learn new motion concepts and combine them with other entities (e.g., cats or dogs) to produce visually appealing results. However, video data involve complex spatio-temporal patterns, and focusing solely on semantics causes the model to overlook the visual complexity of motion. Conversely, tuning only the visual representation leads to semantic confusion in representing the intended action. To address these limitations, we propose SynMotion, a new motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce the dual-embedding semantic comprehension mechanism which disentangles subject and motion representations, allowing the model to learn customized motion features while preserving its generative capabilities for diverse subjects. At the visual level, we integrate parameter-efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence. Furthermore, we introduce a new embedding-specific training strategy which alternately optimizes subject and motion embeddings, supported by the manually constructed Subject Prior Video (SPV) training dataset. This strategy promotes motion specificity while preserving generalization across diverse subjects. Lastly, we introduce MotionBench, a newly curated benchmark with diverse motion patterns. Experimental results across both T2V and I2V settings demonstrate that SynMotion outperforms existing baselines.
PaperID: 506,   Poster  https://arxiv.org/pdf/2508.11479     GitHub
Authors: Tatiana Zemskova, Aleksei Staroverov, Dmitry Yudin, Aleksandr Panov
Title: OVSegDT: Segmenting Transformer for Open-Vocabulary Object Goal Navigation
Abstract: Open-vocabulary Object Goal Navigation requires an embodied agent to reach objects described by free-form language, including categories never seen during training. Existing end-to-end policies tend to overfit small simulator datasets, achieving high success on training scenes but failing to generalize and often exhibiting unsafe behavior (frequent collisions). In our work, we are the first to show that a high degree of generalization to unseen categories in the open-vocabulary object goal navigation task can be achieved with a lightweight transformer model (130M parameters) using only RGB input. We introduce the OVSegDT approach, which has three key features. First, we add a goal binary mask encoder that grounds the textual goal and provides precise spatial cues. The second component is a proposed Entropy-Adaptive Loss Modulation (EALM), a per-sample scheduler that continuously balances imitation and reinforcement signals according to policy entropy, eliminating brittle manual phase switches. EALM reduces the sample complexity of training by 33% and cuts the collision count by 10% compared to the baseline. The final component improves the agent’s navigation quality even under noisy predicted segmentation by combining an auxiliary segmentation loss with a reward function based on the area of the true goal mask during fine-tuning on predicted segmentation. On HM3D-OVON, our model achieves performance on unseen categories comparable to that on seen ones and establishes state-of-the-art results (44.7% SR, 20.6% SPL on val unseen) without using depth, odometry, or large vision–language models.
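A per-sample scheduler that balances imitation and reinforcement signals by policy entropy could look roughly like the sketch below. The abstract does not state the EALM formula, so both the weighting function and its direction (high entropy leans on imitation, low entropy leans on reinforcement) are assumptions, as are the function name and the example numbers.

```python
import numpy as np

def entropy_adaptive_loss(action_probs, imitation_loss, rl_loss):
    """Blend per-sample losses with a weight derived from policy entropy (assumed form)."""
    probs = np.clip(action_probs, 1e-12, 1.0)
    entropy = -(probs * np.log(probs)).sum(axis=-1)
    # Normalize by the maximum entropy log(num_actions) so the weight lies in [0, 1].
    weight = entropy / np.log(probs.shape[-1])
    per_sample = weight * imitation_loss + (1.0 - weight) * rl_loss
    return float(per_sample.mean()), weight

action_probs = np.array([
    [0.25, 0.25, 0.25, 0.25],   # uncertain policy: weight close to 1
    [0.97, 0.01, 0.01, 0.01],   # confident policy: weight close to 0
])
imitation_loss = np.array([1.2, 0.1])
rl_loss = np.array([0.5, 0.4])
total, weights = entropy_adaptive_loss(action_probs, imitation_loss, rl_loss)
```

Because the weight is computed per sample and varies continuously, no manual switch between an imitation phase and a reinforcement phase is needed, which matches the motivation stated in the abstract.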
PaperID: 507,   Poster  https://arxiv.org/pdf/2509.19995     GitHub
Authors: Rui Xu, Tianyang Xue, Qiujie Dong, Le Wan, Zhe Zhu, Peng Li, Zhiyang Dou, Cheng Lin, Shiqing Xin, Yuan Liu, Wenping Wang, Taku Komura
Title: MeshMosaic: Scaling Artist Mesh Generation via Local-to-Global Assembly
Abstract: Scaling artist-designed meshes to high triangle counts remains challenging for autoregressive generative models. Existing transformer-based methods suffer from long-sequence bottlenecks and limited quantization resolution, primarily due to the large number of tokens required and constrained quantization granularity. These issues prevent faithful reproduction of fine geometric details and structured density patterns. We introduce MeshMosaic, a novel local-to-global framework for artist mesh generation that scales to over 100K triangles, substantially surpassing prior methods, which typically handle only around 8K faces. MeshMosaic first segments shapes into patches, generating each patch autoregressively and leveraging shared boundary conditions to promote coherence, symmetry, and seamless connectivity between neighboring regions. This strategy enhances scalability to high-resolution meshes by quantizing patches individually, resulting in more symmetrical and organized mesh density and structure. Extensive experiments across multiple public datasets demonstrate that MeshMosaic significantly outperforms state-of-the-art methods in both geometric fidelity and user preference, supporting superior detail representation and practical mesh generation for real-world applications.
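Why quantizing patches individually raises the effective resolution can be seen with a small numeric sketch: with a fixed number of bins per axis, spreading the bins over a patch's own bounding box gives a much smaller step than spreading them over the whole shape. The bin count of 128, the patch extent, and the helper name are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

def quantize(points, lo, hi, levels=128):
    """Uniformly quantize coordinates inside the box [lo, hi] to `levels` values per axis."""
    step = (hi - lo) / (levels - 1)
    return lo + np.round((points - lo) / step) * step

rng = np.random.default_rng(0)
# A small patch occupying one tenth of a unit-cube shape along each axis.
patch = rng.uniform(0.40, 0.50, size=(200, 3))

# Same bin budget, two choices of bounding box.
global_err = np.abs(quantize(patch, 0.0, 1.0) - patch).max()
local_err = np.abs(quantize(patch, patch.min(axis=0), patch.max(axis=0)) - patch).max()
```

The patch-local error is bounded by half the local step, which here is roughly ten times smaller than the global step, so fine details inside the patch survive quantization with the same vocabulary size.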
PaperID: 508,   Poster  https://arxiv.org/pdf/2512.11988     GitHub
Authors: Xianghui Xie, Bowen Wen, Yan Chang, Hesam Rabeti, Jiefeng Li, Ye Yuan, Gerard Pons-Moll, Stan Birchfield
Title: CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction
Abstract: Accurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming a ground-truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refines them through a learned render-and-compare paradigm to ensure spatial, temporal and pixel alignment, and finally reasons about intricate contacts for further refinement satisfying physical constraints. Experiments show that our method outperforms prior art by 38% on an in-distribution dataset and 36% on an unseen dataset in terms of reconstruction error. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos. Our code and pretrained models will be publicly released.
PaperID: 509,   Poster  https://arxiv.org/pdf/2604.03225     GitHub
Authors: Rongyuan Wu, Lingchen Sun, Zhengqiang ZHANG, Xiangtao Kong, Jixin Zhao, Shihao Wang, Lei Zhang
Title: VOSR: A Vision-Only Generative Model for Image Super-Resolution
Abstract: Large-scale pre-trained text-to-image (T2I) diffusion models, such as Stable Diffusion, can be finetuned for image super-resolution (SR) with highly realistic details. While impressive, pre-training such multi-modal models demands billions of high-quality text-image pairs and substantial computational resources, even though SR is fundamentally an image-to-image (I2I) task. This raises a critical question: do we truly need multi-modal priors and billion-scale text-image data to solve a purely vision task? In this paper, we propose VOSR, a Vision-Only Super-Resolution framework that eliminates the need for textual priors and multi-modal pretraining. We identify two key limitations in previous image-based, uni-modal diffusion models: limited visual semantic guidance and unstable unconditional training. To this end, we leverage a pretrained vision encoder to inject semantic cues, and introduce a relaxed unconditional objective that partially uses the low-quality condition to stabilize training. To accelerate inference, we adopt a modified shortcut model for one-step SR with minimal quality degradation. VOSR is trained from scratch with significantly less data and a lower computational cost compared to T2I-based diffusion models. Nevertheless, VOSR achieves comparable or even better performance than state-of-the-art T2I-tuned SR methods on both synthetic and real-world benchmarks, demonstrating its potential as a scalable and competitive alternative for generative SR. Code and models will be made publicly available.
PaperID: 510,   Poster  https://arxiv.org/pdf/2512.15693     GitHub
Authors: Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu
Title: Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
Abstract: The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection. Our code, models, and datasets will be made publicly available.
PaperID: 511,   Poster  https://arxiv.org/pdf/2603.00483     GitHub
Authors: Liyao Jiang, Ruichen Chen, Chao Gao, Di Niu
Title: RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
Abstract: Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt–image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision–language models, often overfitting to reflection-path data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions, including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test-time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state-of-the-art alignment (0.94 overall GenEval) while requiring fewer generated samples (reduced by 30-40%) and VLM calls (reduced by 80%) than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement.
PaperID: 512,   Poster  https://arxiv.org/pdf/2602.10102     GitHub
Authors: Zhongwei Ren, Yunchao Wei, Xiao Yu, Guixun Luo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin
Title: VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
Abstract: Learning transferable knowledge from unlabeled video data and applying it in new environments is a hallmark of advanced artificial intelligence. We present VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a disentangled Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related changes. These latent codes are then modeled autoregressively as a sequence to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on real-world video handcraft-making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. VideoWorld 2 achieves over a 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.
PaperID: 513,   Poster  https://arxiv.org/pdf/2512.07806     GitHub
Authors: Gyeongjin Kang, Seung kwon Yang, Seungtae Nam, Younggeun Lee, Jungwoo Kim, Eunbyung Park
Title: Multi-view Pyramid Transformer: Look Coarser to See Broader
Abstract: We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
PaperID: 514,   Poster  https://arxiv.org/pdf/2603.04825     GitHub
Authors: Rui Zhao, Bin Shi, Kai Sun, Bo Dong
Title: Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning
Abstract: Partial label learning is a prominent weakly supervised classification task, where each training instance is ambiguously labeled with a set of candidate labels. In real-world scenarios, candidate labels are often influenced by instance features, leading to the emergence of instance-dependent PLL (ID-PLL), a setting that more accurately reflects this relationship. A significant challenge in ID-PLL is instance entanglement, where instances from similar classes share overlapping features and candidate labels, resulting in increased class confusion. To address this issue, we propose a novel Class-specific Augmentation based Disentanglement (CAD) framework, which tackles instance entanglement by both intra- and inter-class regulations. For intra-class regulation, CAD amplifies class-specific features to generate class-wise augmentations and aligns same-class augmentations across instances. For inter-class regulation, CAD introduces a weighted penalty loss function that applies stronger penalties to more ambiguous labels, encouraging larger inter-class distances. By jointly applying intra- and inter-class regulations, CAD improves the clarity of class boundaries and reduces class confusion caused by entanglement. Extensive theoretical and experimental results demonstrate the effectiveness of CAD in mitigating the entanglement problem and enhancing ID-PLL performance.
PaperID: 515,   Poster  https://arxiv.org/pdf/2601.01386     GitHub
Authors: Xiaobao Wei, Zhangjie Ye, Yuxiang Gu, Zunjie Zhu, Yunfei Guo, Yingying Shen, Shan Zhao, Ming Lu, Haiyang Sun, Bing Wang, Guang Chen, Rongfeng Lu, Hangjun Ye
Title: ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking
Abstract: Parking is a critical task for autonomous driving systems (ADS), with unique challenges in crowded parking slots and GPS-denied environments. However, existing works focus on 2D parking slot perception, mapping, and localization, while 3D reconstruction, which is crucial for capturing complex spatial geometry in parking scenarios, remains underexplored. Naively improving the visual quality of reconstructed parking scenes does not directly benefit autonomous parking, as the key entry point for parking is the slot perception module. To address these limitations, we curate the first benchmark named ParkRecon3D, specifically designed for parking scene reconstruction. It includes sensor data from four surround-view fisheye cameras with calibrated extrinsics and dense parking slot annotations. We then propose ParkGaussian, the first framework that integrates 3D Gaussian Splatting (3DGS) for parking scene reconstruction. To further improve the alignment between reconstruction and downstream parking slot detection, we introduce a slot-aware reconstruction strategy that leverages existing parking perception methods to enhance the synthesis quality of slot regions. Experiments on ParkRecon3D demonstrate that ParkGaussian achieves state-of-the-art reconstruction quality and better preserves perception consistency for downstream tasks. The code and dataset will be released.
PaperID: 516,   Poster  https://arxiv.org/pdf/2512.08912     GitHub
Authors: Simon de Moreau, Andrei Bursuc, Hafid EL IDRISSI, Fabien Moutarde
Title: LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception
Abstract: Nighttime environments pose significant challenges for camera-based perception, as existing methods passively rely on the scene lighting. We introduce Lighting-driven Dynamic Active Sensing (LiDAS), a closed-loop active illumination system that combines off-the-shelf visual perception models with high-definition headlights. Rather than uniformly brightening the scene, LiDAS dynamically predicts an optimal illumination field that maximizes downstream perception performance, i.e., decreasing light on empty areas to reallocate it to object regions. LiDAS enables zero-shot nighttime generalization of daytime-trained models through adaptive illumination control. Trained on synthetic data and deployed zero-shot in real-world closed-loop driving scenarios, LiDAS yields +18.7% mAP50 and +5.0% mIoU over standard low-beam at equal power. It maintains performance while reducing energy use by 40%. LiDAS complements domain-generalization methods, further strengthening robustness without retraining. By turning readily available headlights into active vision actuators, LiDAS offers a cost-effective solution to robust nighttime perception.
PaperID: 517,   Poster  https://arxiv.org/pdf/2603.27201     GitHub
Authors: Ji Ma, Wei Suo, Peng Wang, Yanning Zhang
Title: Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models
Abstract: Multimodal Chain-of-Thought (MCoT) models have demonstrated impressive capability in complex visual reasoning tasks. Unfortunately, recent studies reveal that they suffer from severe hallucination problems due to diminished visual attention during the generation process. However, visual attention decay is a well-studied problem in Large Vision-Language Models (LVLMs). Considering the fundamental differences in reasoning processes between MCoT models and traditional LVLMs, we raise a basic question: do MCoT models have unique causes of hallucinations? To answer this question, we systematically investigate the hallucination patterns of MCoT models and find that fabricated texts are primarily generated in associative reasoning steps, which we term divergent thinking. Leveraging these insights, we introduce a simple yet effective strategy that localizes divergent thinking steps and intervenes in the decoding process to mitigate hallucinations. Extensive experiments show that our method outperforms existing methods by a large margin. More importantly, our proposed method can conveniently integrate with other hallucination mitigation methods and further boost their performance. The code will be released.
PaperID: 518,   Poster  https://arxiv.org/pdf/2603.02133     GitHub
Authors: Chong Xia, Kai Zhu, Zizhuo Wang, Fangfu Liu, Zhizheng Zhang, Yueqi Duan
Title: SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
Abstract: Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which is natively applicable for simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize visual appearance and show limited generalization ability to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a "Perception-Generation-Simulation" pipeline towards cluttered scene reconstruction, which first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of generated assets and physical implausibility of the final scene, a problem particularly severe for complex scenes. Thus, we further propose two bridging modules between the three stages to address this problem. To be specific, for the transition from Perception to Generation, critical for visual fidelity, we introduce Active Viewpoint Optimization, which actively searches in 3D space to acquire optimal projected images as conditions for single-object completion. Moreover, for the transition from Generation to Simulation, essential for physical plausibility, we propose a Scene Graph Synthesizer, which guides the construction from scratch in 3D simulators, mirroring the native, constructive principle of the real world. Extensive experiments on the ScanNet dataset validate our method's superior performance over previous state-of-the-art approaches.
PaperID: 519,   Poster  https://arxiv.org/pdf/2512.15715     GitHub
Authors: Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang, Abdelrahman Mohamed, Saining Xie, Hengshuang Zhao, Kaiming He, Hu Xu
Title: In Pursuit of Pixel Supervision for Visual Pre-training
Abstract: Pixels provide a lightweight, scalable way to encode the physical world, preserving rich visual information with minimal human inductive bias. We demonstrate that visual pre-training using pixel supervision alone can learn desirable visual properties and produce strong representations, while remaining simple, stable, and efficient. We present Pixo, a capable self-supervised model trained by purely predicting pixels. It is instantiated on the masked autoencoding (MAE) framework, but enhances MAE with a deeper decoder, larger-block masking, and additional class tokens. It is trained on 2B web-crawled images with a self-curated strategy. Pixo performs well on many downstream tasks, covering monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), object segmentation (e.g., SAM 2), and embodied AI. We will release the training code and pre-trained models.
PaperID: 520,   Poster  https://arxiv.org/pdf/2511.20343     GitHub
Authors: Hengyi Wang, Lourdes Agapito
Title: AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend
Abstract: We present AMB3R, a multi-view feed-forward model for dense 3D reconstruction at metric scale that addresses diverse 3D vision tasks. The key idea is to leverage a sparse, yet compact, volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without the need for task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation as well as 3D reconstruction, and even surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks.
PaperID: 521,   Poster  https://arxiv.org/pdf/2604.04161     GitHub
Authors: Yuanchang Liang, Xiaobo Wang, Kai Wang, Shuo Wang, Xiaojiang Peng, Haoyu Chen, David Chua, Prahlad Vadakkepat
Title: Adaptive Action Chunking at Inference-time for Vision-Language-Action Models
Abstract: In Vision-Language-Action (VLA) models, action chunking (i.e., executing a sequence of actions without intermediate replanning) is a key technique to improve robotic manipulation abilities. However, a large chunk size reduces the model's responsiveness to new information, while a small one increases the likelihood of mode-jumping, i.e., jerky behavior resulting from discontinuities between chunks. Selecting an appropriate chunk size is therefore essential to balance the model's reactivity and consistency. Unfortunately, most current VLA models use an empirically chosen, fixed chunk length at inference time, which limits their performance and scalability across diverse manipulation tasks. To address this issue, we propose a novel Adaptive Action Chunking (AAC) strategy, which exploits action entropy as the cue to adaptively determine the chunk size based on current predictions. Extensive experiments on a wide range of simulated and real-world robotic manipulation tasks demonstrate that our approach substantially improves performance over state-of-the-art alternatives. The videos and source code will be made publicly available.
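The abstract does not say how action entropy is mapped to a chunk size, so the following is only one plausible reading, sketched under stated assumptions: the policy exposes a discrete probability distribution for each step of the predicted chunk, and execution continues while the per-step Shannon entropy stays at or below a threshold. The names `step_entropy` and `adaptive_chunk_size`, and the `threshold` and `min_size` parameters, are hypothetical.

```python
import math
from typing import List, Sequence


def step_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one step's action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_chunk_size(
    step_probs: List[Sequence[float]],
    threshold: float,
    min_size: int = 1,
) -> int:
    """Count the leading steps whose entropy stays at or below the threshold.

    Assumption (not from the abstract): low entropy means the policy is
    confident, so more of the predicted chunk can be executed before
    replanning. At least `min_size` actions are always executed.
    """
    size = 0
    for probs in step_probs:
        if step_entropy(probs) > threshold:
            break  # prediction became uncertain: stop and replan here
        size += 1
    return max(min_size, size)
```

For a four-step chunk whose third step is a coin flip between two actions, this rule executes only the first two actions and then replans, whereas a fixed chunk length would execute all four.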
PaperID: 522,   Poster  https://arxiv.org/pdf/2602.19870     GitHub
Authors: Qiankun Ma, Ziyao Zhang, Haofei Wang, Zhen Song, Jie Chen, Hairong Zheng
Title: ApET: Approximation-Error Guided Token Compression for Efficient VLMs
Abstract: Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet the redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically rely on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introducing positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an Approximation-Error Token compression framework. ApET first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks, while compressing the token budgets by 88.9% and 87.5%, respectively. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical.
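The two steps named in this abstract (linear reconstruction from basis tokens, then pruning by approximation error) can be sketched with NumPy. This is an illustration under assumptions, not the paper's algorithm: the function name `compress_tokens`, the uniformly strided choice of basis tokens, the rule that basis tokens are always retained, and the reading that a small residual marks a redundant token are all choices made here for the example.

```python
import numpy as np


def compress_tokens(tokens: np.ndarray, num_basis: int, num_keep: int) -> np.ndarray:
    """Return sorted indices of the tokens to keep.

    tokens: (N, D) array of visual token embeddings.
    """
    n = tokens.shape[0]
    # Assumption: choose basis tokens by uniform striding over the sequence.
    basis_idx = np.unique(np.linspace(0, n - 1, num_basis).round().astype(int))
    basis = tokens[basis_idx]                      # (K, D)

    # Least-squares fit of every token as a linear combination of the basis:
    # solve basis.T @ coeffs ~= tokens.T, with coeffs of shape (K, N).
    coeffs, *_ = np.linalg.lstsq(basis.T, tokens.T, rcond=None)
    recon = (basis.T @ coeffs).T                   # (N, D)

    # Per-token approximation error; a small error means the basis already
    # explains the token, so it is treated as redundant (assumption).
    err = np.linalg.norm(tokens - recon, axis=1)

    # Assumption: basis tokens are always kept, then the highest-error rest.
    priority = err.copy()
    priority[basis_idx] = np.inf
    keep = np.argsort(-priority, kind="stable")[:num_keep]
    return np.sort(keep)
```

No attention weights are read anywhere in this procedure, which is the property the abstract credits for compatibility with fused attention kernels.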
PaperID: 523,   Poster  https://arxiv.org/pdf/2511.20253     GitHub
Authors: Andrey Lemeshko, Bulat Gabdullin, Nikita Drozdov, Anton Konushin, Danila Rukhovich, Maksim Kolodiazhnyi
Title: Zoo3D: Zero-Shot 3D Object Detection at Scene Level
Abstract: 3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing Zoo3D, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. Zoo3D operates in two modes: the zero-shot Zoo3D_0, which requires no training at all, and the self-supervised Zoo3D_1, which refines 3D box prediction by training a class-agnostic detector on Zoo3D_0-generated pseudo labels. Furthermore, we extend Zoo3D beyond point clouds to work directly with posed and even unposed images. Across ScanNet200 and ARKitScenes benchmarks, both Zoo3D_0 and Zoo3D_1 achieve state-of-the-art results in open-vocabulary 3D detection. Remarkably, our zero-shot Zoo3D_0 outperforms all existing self-supervised methods, hence demonstrating the power and adaptability of training-free, off-the-shelf approaches for real-world 3D understanding.
PaperID: 524,   Poster  https://arxiv.org/pdf/2602.23963     GitHub
Authors: Qiuyang Zhang, Jiujun Cheng, Qichao Mao, Cong Liu, Yu Fang, Yuhong Li, Mengying Ge, Shangce Gao
Title: SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking
Abstract: Spiking Neural Networks (SNNs) promise energy-efficient vision, but applying them to RGB visual tracking remains difficult: existing SNN tracking frameworks either do not fully align with spike-driven computation or do not fully leverage neurons' spatiotemporal dynamics, leading to a trade-off between efficiency and accuracy. To address this, we introduce SpikeTrack, a spike-driven framework for energy-efficient RGB object tracking. SpikeTrack employs a novel asymmetric design that uses asymmetric timestep expansion and unidirectional information flow, harnessing spatiotemporal dynamics while cutting computation. To ensure effective unidirectional information transfer between branches, we design a memory-retrieval module inspired by neural inference mechanisms. This module recurrently queries a compact memory initialized by the template to retrieve target cues and sharpen target perception over time. Extensive experiments demonstrate that SpikeTrack achieves state-of-the-art performance among SNN-based trackers and remains competitive with advanced ANN trackers. Notably, it surpasses TransT on the LaSOT dataset while consuming only 1/26 of its energy. To our knowledge, SpikeTrack is the first spike-driven framework to make RGB tracking both accurate and energy efficient.
PaperID: 525,   Poster  https://arxiv.org/pdf/2602.20913     GitHub
Authors: Jihao Qiu, Lingxi Xie, Xinyue Huo, Qi Tian, Qixiang Ye
Title: LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
Abstract: This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of LongVideo-R1, which achieves a superior trade-off between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available.
PaperID: 526,   Poster  https://arxiv.org/pdf/2603.25316     GitHub
Authors: Yunuo Chen, Bing He, Zezheng Lyu, Hongwei Hu, Qunshan Gu, Yuan Tian, Guo Lu
Title: Adaptive Learned Image Compression with Graph Neural Networks
Abstract: Efficient image compression relies on the accurate detection and elimination of both local and global redundancy. While most state-of-the-art (SOTA) learned image compression (LIC) methods are built on Convolutional Neural Networks (CNNs) or Transformer architectures, these frameworks are inherently rigid. Standard CNN kernels and window-based attention mechanisms impose fixed receptive fields and static connectivity patterns, which potentially couple non-redundant pixels simply due to their proximity in Euclidean space. This rigidity limits the model's ability to adaptively capture spatially varying redundancy across the image, particularly at the global level. To overcome these limitations, we propose a content-adaptive image compression framework based on Graph Neural Networks (GNNs). Specifically, our approach constructs dual-scale graphs that enable flexible, data-driven receptive fields. Furthermore, we introduce adaptive connectivity by dynamically adjusting the number of neighbors for each node based on local content complexity. These innovations empower our Graph-based Learned Image Compression (GLIC) model to effectively model diverse redundancy patterns across images, leading to more efficient and adaptive compression. Experiments demonstrate that GLIC achieves SOTA performance, outperforming VTM-9.1 by -19.29%, -21.69%, and -18.71% in BD-rate on the Kodak, Tecnick, and CLIC datasets, respectively. Code will be released.
PaperID: 527,   Poster  https://arxiv.org/pdf/2511.22989     GitHub
Authors: Yuta Oshima, Daiki Miyake, Kohsei Matsutani, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
Title: MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
Abstract: Recent text-to-image generation models have acquired the ability of multi-reference generation and editing: the ability to inherit the appearance of subjects from multiple reference images and re-render them under new contexts. However, existing benchmark datasets often focus on generation with a single or a few reference images, which prevents us from measuring progress or pinpointing model weaknesses under different multi-reference conditions. In addition, their task definitions are vague, typically limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce MultiBanana, which is carefully designed to assess the edge of model capabilities by widely covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis across a variety of text-to-image models reveals their superior performance, typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries and establish a standardized basis for fair comparison in multi-reference image generation.
PaperID: 528,   Poster  https://arxiv.org/pdf/2604.17041     GitHub
Authors: Yifei Zhao, Qian Lou, Mengxin Zheng
Title: SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
Abstract: The public accessibility of Large Vision–Language Models (LVLMs) raises serious concerns about unauthorized model reuse and intellectual property infringement. Existing ownership verification approaches often rely on semantically abnormal queries or out-of-distribution responses as fingerprints, which are easily recognized and removed by adversaries. We first expose this vulnerability through the Semantic Divergence Attack (SDA), which detects and filters fingerprint checks by measuring semantic divergence between a stolen model and a reference model, showing that existing fingerprints are not semantic-preserving, are easy to detect and bypass, and lack robustness. To address these weaknesses, we propose SIF (Semantically In-Distribution Fingerprints), a non-intrusive ownership verification framework requiring no parameter modification. SIF introduces Semantic-Aligned Fingerprint Distillation (SAFD), which distills text-generation watermark signals (originally designed for text ownership protection rather than model protection) into the visual modality, enabling semantically coherent yet fingerprinted responses. Robust-Fingerprint Optimization (RFO) further simulates worst-case representation perturbations, ensuring resilience to perturbations such as fine-tuning and quantization. Extensive experiments on LLaVA-1.5 and Qwen2.5-VL demonstrate that SIF achieves superior stealthiness and robustness, providing a practical solution for LVLM copyright protection.
PaperID: 529,   Poster  https://arxiv.org/pdf/2603.07832     GitHub
Authors: Gil Shapira, Ishay Goldin, Evgeny Artyomov, Donghoon Kim, Yosi Keller, Niv Zehngut
Title: GazeShift: Unsupervised Gaze Estimation and Dataset for VR
Abstract: Gaze estimation is instrumental in modern virtual reality (VR) systems. Despite significant progress in remote-camera gaze estimation, VR gaze research remains constrained by data scarcity, particularly the lack of large-scale, accurately labeled datasets captured with the off-axis camera configurations typical of modern headsets. Gaze annotation is difficult since fixation on intended targets cannot be guaranteed. To address these challenges, we introduce VRGaze, the first large-scale off-axis gaze estimation dataset for VR, comprising 2.1 million near-eye infrared images collected from 68 participants. We further propose GazeShift, an attention-guided unsupervised framework for learning gaze representations without labeled data. Unlike prior redirection-based methods that rely on multi-view or 3D geometry, GazeShift is tailored to near-eye infrared imagery, achieving effective gaze–appearance disentanglement in a compact, real-time model. A lightweight few-shot calibration can optionally adapt embeddings to individual users, achieving 1.84° mean error on VRGaze under per-person calibration and 7.15° on MPIIGaze under person-agnostic calibration, with a tenfold reduction in parameters and 5 ms runtime on a VR headset GPU. Quantitative robustness analyses confirm invariance to illumination variations, demonstrating a label-efficient and deployable solution for VR gaze estimation. VRGaze and GazeShift are released at https://github.com/gazeshift3/gazeshift.
PaperID: 530,   Poster  https://arxiv.org/pdf/2603.07476     GitHub
Authors: WENQI CAI, Yawen Zou, Guang Li, Chunzhi Gu, Chao Zhang
Title: EVLF: Early Vision-Language Fusion for Generative Dataset Distillation
Abstract: Dataset distillation (DD) aims to synthesize compact training sets that enable models to achieve high accuracy with significantly fewer samples. Recent diffusion-based DD methods commonly introduce semantic guidance through late-stage cross-attention, where textual prompts tend to dominate the generative process. Although this strategy enforces label relevance, it diminishes the contribution of visual latents, resulting in over-corrected samples that mirror prompt patterns rather than reflecting intrinsic visual features. To solve this problem, we introduce an Early Vision–Language Fusion (EVLF) method that aligns textual and visual embeddings at the transition between the encoder and the generative backbone. By incorporating a lightweight cross-attention module at this transition, the early representations simultaneously encode local textures and global semantic directions across the denoising process. Importantly, EVLF is plug-and-play and can be easily integrated into any diffusion-based dataset distillation pipeline with an encoder. It works across different denoiser architectures and sampling schedules without any task-specific modifications. Extensive experiments demonstrate that EVLF generates semantically faithful and visually coherent synthetic data, yielding consistent improvements in downstream classification accuracy across varied settings. Code will be released.
PaperID: 531,   Poster  https://arxiv.org/pdf/2512.13683     GitHub
Authors: Lu Ling, Yunhao Ge, Yichen Sheng, Aniket Bera
Title: 3D Instance Models are Implicit Generalizable Spatial Learners
Abstract: Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene datasets, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene-level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing the widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation.
PaperID: 532,   Poster  https://arxiv.org/pdf/2604.10359     GitHub
Authors: Alexandru Brateanu, Tingting Mu, Codruta Ancuti, Cosmin Ancuti
Title: Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex
Abstract: Low-light image enhancement (LLIE) aims to restore natural visibility, color fidelity, and structural detail under severe illumination degradation. State-of-the-art (SOTA) LLIE techniques often depend on large model size and multi-stage training, limiting practicality for edge deployment. Moreover, they often rely on a single color space, which introduces instability and visible exposure or color artifacts. To achieve low-cost, effective LLIE, we present Multinex, an ultra-lightweight structured framework that integrates multiple fine-grained representations within a principled Retinex formulation. It decomposes an image into illumination and color prior stacks derived from distinct analytic representations, and learns to fuse these representations into the luminance and reflectance adjustments required to correct exposure. We emphasize enhancement over reconstruction to enable drastic reduction of computational overhead, supported by lightweight neural operations. Accordingly, we develop a lightweight Multinex (45K parameters) and a micro version (2.6K parameters). Across extensive benchmarks, both variants significantly outperform existing lightweight and micro SOTA models, and reach performance comparable to more complex approaches. Code will be released upon publication.
PaperID: 533,   Poster  https://arxiv.org/pdf/2603.05181     GitHub
Authors: Yuanfu Sun, Kang Li, Pengkang Guo, Jiajin Liu, Qiaoyu Tan
Title: Mario: Multimodal Graph Reasoning with Large Language Models
Abstract: Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision–language models (VLMs) to encode image–text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address these, we propose Mario, a unified framework that resolves both challenges simultaneously and enables effective LLM-based reasoning over MMGs. Mario consists of two stages. The first is a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. The second is a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code is available in supplementary materials.
PaperID: 534,   Poster  https://arxiv.org/pdf/2511.04555     GitHub
Authors: Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, Lixing Zou, Zhaoye Zhou, Gen Li, Bo Zhao
Title: Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
Abstract: Vision-Language-Action (VLA) models have emerged as a powerful framework that unifies perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically contain massive parameters and rely heavily on large-scale robot data pretraining, leading to high computational costs during training, as well as limited deployability for real-time inference. Moreover, most training paradigms degrade the perceptual representations of the Vision–Language backbone, resulting in overfitting and poor generalization to downstream tasks. In this work, we present Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency, while maintaining strong performance without pretraining on robot data. Evo-1 builds on a native multimodal Vision–Language model (VLM), incorporating a novel cross-modulated diffusion transformer along with an optimized integration module, together forming an effective architecture. We further introduce a two-stage training paradigm that progressively aligns action with perception, preserving the representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1 achieves state-of-the-art results on the Meta-World and RoboTwin suites, surpassing the previous best models by 12.4% and 6.9%, respectively, and also attains a competitive result of 94.8% on LIBERO. In real-world evaluations, Evo-1 attains a 78% success rate with high inference frequency and low memory overhead, outperforming all baseline methods. We release code, data, and model weights to facilitate future research on lightweight and efficient VLA models.
PaperID: 535,   Poster  https://arxiv.org/pdf/2603.19862     GitHub
Authors: Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew Bagdanov
Title: IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment
Abstract: Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP models.
PaperID: 536,   Poster  https://arxiv.org/pdf/2601.02994     GitHub
Authors: Youngjoon Jeong, Junha Chun, Taesup Kim
Title: Learning to Act Robustly with View-Invariant Latent Actions
Abstract: Vision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. This challenge becomes more pronounced in real-world settings, where viewpoint variability is unavoidable and can significantly disrupt policy performance. Existing methods typically learn invariance from multi-view observations at the scene level, but such approaches rely on visual appearance and fail to incorporate the physical dynamics essential for robust generalization. We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. VILA aligns these latent actions across viewpoints using an action-guided objective based on ground-truth action sequences. Experiments in both simulation and the real world show that VILA-based policies generalize effectively to unseen viewpoints and transfer well to new tasks, establishing VILA as a strong pretraining framework that improves robustness and downstream learning performance.
PaperID: 537,   Poster  https://arxiv.org/pdf/2512.12982     GitHub
Authors: Ziheng Qin, Yuheng Ji, Renshuai Tao, Yuxuan Tian, Yuyang Liu, Yipu Wang, Xiaolong Zheng
Title: Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes
Abstract: The pursuit of a universal AI-generated image (AIGI) detector often relies on aggregating data from numerous generators to improve generalization. However, this paper identifies a paradoxical phenomenon we term the Benefit then Conflict dilemma, where detector performance stagnates and eventually degrades as source diversity expands. Our systematic analysis, leveraging Linear Discriminant Analysis (LDA), diagnoses this failure by identifying two core issues: severe data-level heterogeneity, which causes the feature distributions of real and synthetic images to increasingly overlap, and a critical model-level bottleneck from fixed, pretrained encoders that cannot adapt to the rising complexity. To address these challenges, we propose Generator-Aware Prototype Learning (GAPL), a framework that replaces unconstrained aggregation with a structured learning paradigm. GAPL learns a compact set of canonical forgery prototypes to create a unified, low-variance feature space, effectively countering data heterogeneity. To resolve the model bottleneck, it employs a two-stage training scheme with Low-Rank Adaptation (LoRA) to fine-tune the feature extractor, enhancing its discriminative power while preserving valuable pretrained knowledge. This approach establishes a more robust and generalizable decision boundary. Through extensive experiments, we demonstrate that GAPL achieves state-of-the-art performance, showing superior detection accuracy across a wide variety of unseen GAN and diffusion-based generators.
PaperID: 538,   Poster  https://arxiv.org/pdf/2603.29931     GitHub
Authors: Yuhang Yang, Fan Zhang, Huaijin Pi, Ailing Zeng, Shuai Guo, Guowei Xu, Wei Zhai, Yang Cao, Zheng-Jun Zha
Title: Gloria: Consistent Character Video Generation via Content Anchors
Abstract: Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the "memory", leading to suboptimal consistency. Recognizing that character video generation inherently resembles an "outside-looking-in" scenario, we propose to represent the character’s visual attributes through a compact set of anchor frames. This design provides stable references for consistency, but reference-based video generation inherently faces challenges of copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, providing intra- and extra-training clip cues to prevent duplication, and RoPE as Weak Condition, encoding positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from massive videos. Experiments show our method generates high-quality character videos exceeding 10 minutes, and achieves expressive identity and appearance consistency across views, surpassing existing methods.
PaperID: 539,   Poster  https://arxiv.org/pdf/2511.18794     GitHub
Authors: Zhongtao Wang, Jiaqi Dai, Qingtian Zhu, Yilong Li, Mai Su, Fei Zhu, Meng GAI, Shaorong Wang, Chengwei Pan, Yisong Chen, Guoping Wang
Title: ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes
Abstract: Multi-period image collections are common in real-world applications. Cities are re-scanned for mapping, construction sites are revisited for progress tracking, and natural regions are monitored for environmental change. Such data form multi-period scenes, where geometry and appearance evolve. Reconstructing such scenes is an important yet underexplored problem. Existing pipelines rely on incompatible assumptions: static and in-the-wild methods enforce a single geometry, while dynamic ones assume smooth motion, both failing under long-term, discontinuous changes. To solve this problem, we introduce ChronoGS, a temporally modulated Gaussian representation that reconstructs all periods within a unified anchor scaffold. It is also designed to disentangle stable and evolving components, achieving temporally consistent reconstruction of multi-period scenes. To catalyze relevant research, we release the ChronoScene dataset, a benchmark of real and synthetic multi-period scenes, capturing geometric and appearance variation. Experiments demonstrate that ChronoGS consistently outperforms baselines in reconstruction quality and temporal consistency. Our code and the ChronoScene dataset will be made publicly available.
PaperID: 540,   Poster  https://arxiv.org/pdf/2603.02557     GitHub
Authors: Maoyuan Shao, Yutong Gao, Xinyang Huang, Lijuan Sun, Guoshun Nan, Chuang Zhu
Title: CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment
Abstract: Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories. We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model’s intrinsic bias and limited fine-grained discriminative ability. To address this, we propose CAPT, a Confusion-Aware Prompt Tuning framework that enables models to learn from their own misalignment. Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and misclassified samples. On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative misclassified instances from the bank and capture sample-level cues through a Diff-Manner Adapter that integrates global and local contexts. To further unify confusion information across different granularities, a Multi-Granularity Difference Expert (MGDE) module is designed to jointly leverage semantic- and sample-level experts for more robust confusion-aware reasoning. Extensive experiments on 11 benchmark datasets demonstrate that our method significantly reduces confusion-induced errors while enhancing the discriminability and generalization of both base and novel classes, successfully resolving 50.72% of confusable sample pairs.
PaperID: 541,   Poster  https://arxiv.org/pdf/2512.15599     GitHub
Authors: Tobias Kirschstein, Simon Giebenhain, Matthias Nießner
Title: FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
Abstract: We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and flexible fitting to an arbitrary number of input observations. In extensive evaluations on single-view, few-shot, and monocular avatar creation tasks, we verify the efficacy of FlexAvatar. Many existing methods struggle with view extrapolation while FlexAvatar generates complete 3D head avatars with realistic facial animations.
PaperID: 542,   Poster  https://arxiv.org/pdf/2603.01034     GitHub
Authors: Yangyang Xu, Junbo Ke, You-Wei Wen, Chao Wang
Title: Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery
Abstract: Tensor Ring (TR) decomposition is a powerful tool for high-order data modeling, but is inherently restricted to discrete forms defined on fixed meshgrids. In this work, we propose a TR functional decomposition for both meshgrid and non-meshgrid data, where factors are parameterized by Implicit Neural Representations (INRs). However, optimizing this continuous framework to capture fine-scale details is intrinsically difficult. Through a frequency-domain analysis, we demonstrate that the spectral structure of TR factors determines the frequency composition of the reconstructed tensor and limits the high-frequency modeling capacity. To mitigate this, we propose a reparameterized TR functional decomposition, in which each TR factor is a structured combination of a learnable latent tensor and a fixed basis. This reparameterization is theoretically shown to improve the training dynamics of TR factor learning. We further derive a principled initialization scheme for the fixed basis and prove the Lipschitz continuity of our proposed model. Extensive experiments on image inpainting, denoising, super-resolution, and point cloud recovery demonstrate that our method achieves consistently superior performance over existing approaches.
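Editorial sketch (not from the paper): for context, the discrete TR form that the abstract generalizes writes each tensor entry as the trace of a product of core slices, T(i_1, ..., i_d) = Tr(G_1[i_1] G_2[i_2] ... G_d[i_d]). The NumPy snippet below illustrates only this standard discrete reconstruction. The function name `tr_reconstruct` and the core layout (r_k, n_k, r_{k+1}) are assumptions made for illustration, and it does not implement the INR-parameterized or reparameterized factors described above.

```python
import numpy as np

def tr_reconstruct(cores):
    """Rebuild a full tensor from Tensor Ring cores (illustrative helper).

    Assumed layout: cores[k] has shape (r_k, n_k, r_{k+1}), and the last
    core's trailing rank equals the first core's leading rank (the "ring").
    """
    result = cores[0]  # shape (r_0, n_1, r_1)
    for core in cores[1:]:
        # Contract the shared rank index of the running product and the next core.
        result = np.tensordot(result, core, axes=([result.ndim - 1], [0]))
    # result now has shape (r_0, n_1, ..., n_d, r_0); the trace closes the ring.
    return np.trace(result, axis1=0, axis2=result.ndim - 1)
```

In the functional setting described in the abstract, each discrete core slice would instead be produced by a continuous function of the coordinate; that part is deliberately left out here.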
PaperID: 543,   Poster  https://arxiv.org/pdf/2603.04846     GitHub
Authors: Yuanbo Li, Tianyang Xu, Cong Hu, Tao Zhou, Xiaojun Wu, Josef Kittler
Title: Multi-Paradigm Collaborative Adversarial Attack Against Multimodal Large Language Models
Abstract: The rapid progress of MultiModal Large Language Models (MLLMs) has significantly advanced downstream applications. However, this progress also exposes the public community to transferable adversarial threats. In general, existing adversarial attacks against MLLMs typically rely on surrogate models trained within a single learning paradigm and perform independent optimisation in their respective feature spaces. This straightforward setting naturally restricts the richness of feature representations, limiting the search space and thus impeding the diversity of adversarial perturbations. To address this, we propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. In principle, MPCAttack aggregates semantic representations, from both visual images and language texts, to facilitate joint adversarial optimisation on the aggregated features through a Multi-Paradigm Collaborative Optimisation (MPCO) strategy. By performing contrastive matching on multi-paradigm features, MPCO adaptively balances the importance of different paradigm representations and guides the global perturbation optimisation, effectively alleviating the representation bias. Extensive experimental results on multiple benchmarks demonstrate the superiority of MPCAttack, indicating that our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs. The code will be released.
PaperID: 544,   Poster  https://arxiv.org/pdf/2603.05235     GitHub
Authors: ZHENYU ZHANG, Guangyao Chen, Yixiong Zou, Yuhua Li, Ruixuan Li
Title: Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning
Abstract: Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where CLIP has recently shown promising results due to its generalizability to downstream tasks. Current works indicate CLIP's text encoder is more suitable for cross-domain tasks; however, we find that removing certain middle layers of the text encoder, which we call the Lost Layers, can effectively improve performance in SF-CDFSL. In this paper, we delve into this phenomenon for a deeper understanding. We discover that instead of being harmful for the SF-CDFSL task, the information in these layers is actually beneficial, but visual gaps prevent this useful information from being fully utilized, making these layers seem redundant. Based on this understanding, unlike current works that simply remove these layers, we propose a method that teaches the model to re-utilize information in these lost layers at both the layer and encoder levels, guiding the re-learning of the visual branch under domain shifts. Our approach effectively addresses the issue of underutilized information in the text encoder. Extensive experiments across various settings, backbones (CLIP, SigLIP, PE-Core), and tasks (4 CDFSL datasets and 10 Meta-dataset datasets) demonstrate the effectiveness of our method. We will release the code.
PaperID: 545,   Poster  https://arxiv.org/pdf/2602.01639     GitHub
Authors: tianyu yang, ChenWei He, xiangzhao hao, Tianyue Wang, Jiarui Guo, Haiyun Guo, Leigang Qu, Jinqiao Wang, Tat-seng Chua
Title: ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval
Abstract: Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision–Language Models (VLMs) struggle with the cross-modality compositional reasoning required for this task. Recently, adapting generative Multimodal Large Language Models (MLLMs) for retrieval offers a promising direction. However, we identify that this adaptation strategy overlooks a fundamental issue: adapting a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, which leads to Capability Degradation, the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL (Recalibrating Capability Degradation), a model-agnostic framework that follows a diagnose–generate–refine pipeline. First, we diagnose cognitive blind spots of the retriever via self-guided informative instance mining. Next, we generate corrective instructions and triplets by CoT prompting the foundation MLLM and conduct quality control with VQA-based consistency filtering. Finally, we refine the retriever through continual training on these triplets with a grouped contrastive scheme, thereby internalizing fine-grained visual–semantic distinctions and realigning the discriminative embedding space of the retriever with the intrinsic compositional reasoning within the MLLM. Extensive experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance. Code will be released soon.
PaperID: 546,   Poster  https://arxiv.org/pdf/2510.22213     GitHub
Authors: Yaokun Li, Lihe Ding, Xiao Chen, Guang Tan, Tianfan Xue
Title: DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum
Abstract: Generating dynamic and interactive 3D trees has wide applications in virtual reality, games, and world simulation. However, existing methods still face various challenges in generating structurally consistent and realistic 4D motion for complex real trees. In this paper, we propose DynamicTree, the first framework that can generate long-term, interactive 3D motion for 3DGS reconstructions of real trees. Unlike prior optimization-based methods, our approach generates dynamics in a fast feed-forward manner. The key to our approach is the use of a compact sparse voxel spectrum to represent the tree movement. Given a 3D tree from Gaussian Splatting reconstruction, our pipeline first generates mesh motion using the sparse voxel spectrum and then binds Gaussians to deform the mesh. Additionally, the proposed sparse voxel spectrum can also serve as a basis for fast modal analysis under external forces, allowing real-time interactive responses. To train our model, we also introduce 4DTree, the first large-scale synthetic 4D tree dataset containing 8,786 animated tree meshes with semantic labels and 100-frame motion sequences. Extensive experiments demonstrate that our method achieves realistic and responsive tree animations, significantly outperforming existing approaches in both visual quality and computational efficiency.
PaperID: 547,   Poster  https://arxiv.org/pdf/2603.22509     GitHub
Authors: Delin An, Chaoli Wang
Title: Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation
Abstract: Diffusion probabilistic models have demonstrated significant potential in generating high-quality, realistic medical images, providing a promising solution to the persistent challenge of data scarcity in the medical field. Nevertheless, producing 3D medical volumes with anatomically consistent structures under multimodal conditions remains a complex and unresolved problem. We introduce Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation, jointly guided by a user-provided 2D sketch and a textual description that captures 3D geometric semantics. The framework initially generates 3D segmentation masks of the target organ from random noise, conditioned on both modalities. To effectively align and fuse these inputs, we propose two key modules that refine sketch features with localized textual cues and integrate global sketch-text representations. Built upon a capsule-attention backbone, these modules leverage the complementary strengths of sketches and text to produce anatomically accurate organ shapes. The synthesized segmentation masks subsequently guide a latent diffusion model for 3D CT volume synthesis, enabling realistic reconstruction of organ appearances that are consistent with user-defined sketches and descriptions. Extensive experiments on public CT datasets demonstrate that Sketch2CT achieves superior performance in generating multimodal medical volumes. Its controllable, low-cost generation pipeline enables principled, efficient augmentation of medical datasets.
PaperID: 548,   Poster  https://arxiv.org/pdf/2604.17736     GitHub
Authors: Haotian Qin, Dongliang Chang, Yueying Gao, Yuexuan Tan, Lei Chen, Zhanyu Ma
Title: IncreFA: Breaking the Static Wall of Generative Model Attribution
Abstract: As AI generative models evolve at unprecedented speed, image attribution has become a moving target. New diffusion, adversarial and autoregressive generators appear almost monthly, making existing watermark, classifier and inversion methods obsolete upon release. The core problem lies not in model recognition, but in the inability to adapt attribution itself. We introduce IncreFA, a framework that redefines attribution as a structured incremental learning problem, allowing the system to learn continuously as new generative models emerge. IncreFA departs from conventional incremental learning by exploiting the hierarchical relationships among generative architectures and coupling them with continual adaptation. It integrates two mutually reinforcing mechanisms: (1) Hierarchical Constraints, which encode architectural hierarchies through learnable orthogonal priors to disentangle family-level invariants from model-specific idiosyncrasies; and (2) a Latent Memory Bank, which replays compact latent exemplars and mixes them to generate pseudo-unseen samples, stabilising representation drift and enhancing open-set awareness. On the newly constructed Incremental Attribution Benchmark (IABench) covering 28 generative models released between 2022 and 2025, IncreFA achieves state-of-the-art attribution accuracy and 98.9% unseen detection under a temporally ordered open-set protocol.
PaperID: 549,   Poster  https://arxiv.org/pdf/2604.17052     GitHub
Authors: Zhijia Liang, Jiaming Li, Weikai Chen, Yanhao Zhang, Haonan Lu, Guanbin Li
Title: OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
Abstract: Streaming video reasoning requires models to operate in a setting where history grows without bound while meaningful evidence remains scarce. In such a landscape, relevant signal is like an oasis: small, critical, and easily lost in a desert of redundancy. Enlarging memory only widens the desert; aggressive compression dries up the oasis. The real difficulty lies in discovering where to look, not how much to remember. We therefore introduce OASIS, a novel framework for streaming video reasoning that tackles this challenge through structured, on-demand retrieval. It organizes streaming history into hierarchical events and performs reasoning as controlled refinement: short-context inference first, followed by semantically grounded retrieval only when uncertainty arises. As the retrieval is driven by high-level intent rather than embedding similarity, the retrieved memory is substantially more accurate and less noisy. Additionally, the mechanism is plug-and-play, training-free, and compatible with any streaming MLLM. Experiments across multiple benchmarks show that OASIS achieves strong gains in long-horizon accuracy and compositional reasoning with a far smaller memory budget.
PaperID: 550,   Poster  https://arxiv.org/pdf/2604.08008     GitHub
Authors: Felix Embacher, Jonas Uhrig, Marius Cordts, Markus Enzweiler
Title: SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving
Abstract: Retrieving rare and safety-critical driving scenarios from large-scale datasets is essential for building robust autonomous driving (AD) systems. As dataset sizes continue to grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples. We introduce SearchAD, a large-scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. SearchAD provides high-quality manual annotations of more than 386k bounding boxes covering 64 rare categories. It specifically targets the “needle-in-a-haystack” problem of locating extremely rare classes, with some appearing fewer than 50 times across the entire dataset. Unlike existing benchmarks, which focus on instance-level retrieval, SearchAD emphasizes semantic image retrieval with a well-defined data split, enabling text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal retrieval models. Comprehensive zero-shot evaluations show that text-based methods outperform image-based ones due to stronger inherent semantic grounding. Models that directly align spatial visual features with language achieve the best relative results, yet none demonstrate satisfactory retrieval capability in absolute terms. With a held-out test set on a public benchmark server, SearchAD establishes the first large-scale benchmark for retrieval-driven data curation and long-tail perception research in AD.
PaperID: 551,   Poster  https://arxiv.org/pdf/2506.08456     GitHub
Authors: William June Suk Choi, Kyungmin Lee, Sihyun Yu, Yisol Choi, Jinwoo Shin, Kimin Lee
Title: Improving Motion in Image-to-Video Models via Adaptive Low-Pass Guidance
Abstract: Recent text-to-video (T2V) models have demonstrated strong capabilities in producing high-quality, dynamic videos. To improve the visual controllability, recent works have considered fine-tuning pre-trained T2V models to support image-to-video (I2V) generation. However, such adaptation frequently suppresses motion dynamics of generated outputs, resulting in more static videos compared to their T2V counterparts. In this work, we analyze this phenomenon and identify that it stems from the premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image. To address this, we propose adaptive low-pass guidance (ALG), a simple training-free fix to the I2V model sampling procedure to generate more dynamic videos without compromising per-frame image quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying a low-pass filter at the early stage of denoising. Extensive experiments show ALG significantly improves the temporal dynamics of generated videos, while preserving or even improving image fidelity and text alignment. For instance, on the VBench test suite, ALG achieves a 33% average improvement across models in dynamic degree while maintaining the original video quality.
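Editorial sketch (not from the paper): the idea of low-pass filtering the conditioning image only during early denoising steps can be illustrated as follows. Everything specific here is an assumption for illustration: the choice of a Gaussian blur as the low-pass filter, the linear decay schedule, and the names `filter_strength`, `conditioning_for_step`, `sigma_max`, and `cutoff`. The abstract does not specify the authors' filter or schedule.

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def low_pass(image, sigma):
    """Separable Gaussian blur of an (H, W, C) float image; sigma <= 0 is a no-op."""
    if sigma <= 0:
        return image.copy()
    radius = max(1, int(3 * sigma))
    k = gaussian_kernel1d(sigma, radius)
    # Edge-replicate padding keeps the output the same size as the input.
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, padded)
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, out)
    return out

def filter_strength(step, num_steps, sigma_max=4.0, cutoff=0.3):
    """Hypothetical schedule: strongest blur at step 0, decaying linearly to
    zero once a fraction `cutoff` of the denoising steps has elapsed."""
    frac = step / max(1, num_steps - 1)
    if frac >= cutoff:
        return 0.0
    return sigma_max * (1.0 - frac / cutoff)

def conditioning_for_step(image, step, num_steps):
    """Conditioning image a sampler would see at a given denoising step."""
    return low_pass(image, filter_strength(step, num_steps))
```

Under these assumptions, a sampler would call `conditioning_for_step(image, t, num_steps)` at each step `t`, so that early steps see a blurred reference and later steps see the original image with its high-frequency detail.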
PaperID: 552,   Poster  https://arxiv.org/pdf/2603.13660     GitHub
Authors: Yunhe Gao, Yabin Zhang, Chong Wang, Jiaming Liu, Maya Varma, Jean-Benoit Delbrouck, Akshay Chaudhari, Curtis Langlotz
Title: Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision
Abstract: Foundation models have transformed vision and language by learning general-purpose representations from large-scale unlabeled data, yet 3D medical imaging lacks analogous approaches. Existing self-supervised methods rely on low-level reconstruction or contrastive objectives that fail to capture the anatomical semantics critical for medical image analysis, limiting transfer to downstream tasks. We present MASS (MAsk-guided Self-Supervised learning), which treats in-context segmentation as the pretext task for learning general-purpose medical imaging representations. MASS's key insight is that automatically generated class-agnostic masks provide sufficient structural supervision for learning semantically rich representations. By training on thousands of diverse mask proposals spanning anatomical structures and pathological findings, MASS learns what semantically defines medical structures: the holistic combination of appearance, shape, spatial context, and anatomical relationships. We demonstrate effectiveness across data regimes: from small-scale pretraining on individual datasets (20-200 scans) to large-scale multi-modal pretraining on 5K CT, MRI, and PET volumes, all without annotations. MASS demonstrates: (i) few-shot segmentation on novel structures, (ii) matching full supervision with only 20-40% labeled data while outperforming self-supervised baselines by over 20 in Dice score in low-data regimes, and (iii) frozen-encoder classification on unseen pathologies that matches full supervised training with thousands of samples. These results validate that mask-guided pretraining captures broadly generalizable knowledge, opening a path toward 3D medical imaging foundation models without expert annotations. Code and models will be made publicly available.
PaperID: 553,   Poster  https://arxiv.org/pdf/2603.00697     GitHub
Authors: Yihui Li, Chengxin Lv, Zichen Tang, Hongyu Yang, Di Huang
Title: TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction
Abstract: We present TokenSplat, a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images. At its core, TokenSplat introduces a Token-aligned Gaussian Prediction module that aligns semantically corresponding information across views directly in the feature space. Guided by coarse token positions and fusion confidence, it aggregates multi-scale contextual features to enable long-range cross-view reasoning and reduce redundancy from overlapping Gaussians. To further enhance pose robustness and disentangle viewpoint cues from scene semantics, TokenSplat employs learnable camera tokens and an Asymmetric Dual-Flow Decoder (ADF-Decoder) that enforces directionally constrained communication between camera and image tokens. This maintains clean factorization within a feed-forward architecture, enabling coherent reconstruction and stable pose estimation without iterative refinement. Extensive experiments demonstrate that TokenSplat achieves higher reconstruction fidelity and novel-view synthesis quality in pose-free settings, and significantly improves pose estimation accuracy compared to prior pose-free methods.
PaperID: 554,   Poster  https://arxiv.org/pdf/2512.18692     GitHub
Authors: Minh-Quan Viet Bui, Jongmin Park, Juan Luis Gonzalez Bello, Jaeho Moon, Jihyong Oh, Munchurl Kim
Title: EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images
Abstract: Feed-forward 3D Gaussian Splatting (3DGS) enables efficient one-pass scene reconstruction, providing 3D representations for novel view synthesis without per-scene optimization. However, existing methods typically predict pixel-aligned primitives per view, producing an excessive number of primitives in dense-view settings and offering no explicit control over the number of predicted Gaussians. To address this, we propose EcoSplat, the first efficiency-controllable feed-forward 3DGS framework that adaptively predicts the 3D representation for any given target primitive count at inference time. EcoSplat adopts a two-stage optimization process. The first stage is Pixel-aligned Gaussian Training (PGT), where our model learns initial primitive prediction. The second stage is Importance-aware Gaussian Finetuning (IGF), where our model learns to rank primitives and adaptively adjust their parameters based on the target primitive count. Extensive experiments across multiple dense-view settings show that EcoSplat is robust and outperforms state-of-the-art methods under strict primitive-count constraints, making it well-suited for flexible downstream rendering tasks. Code and project page will be released.
PaperID: 555,   Poster  https://arxiv.org/pdf/2603.06178     GitHub
Authors: Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Longtao Huang, Qingming Huang
Title: Making Training-Free Diffusion Segmentors Scale with the Generative Power
Abstract: As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks as well. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to what we call training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model’s attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly, and in some cases, segmentation performance even degrades when using more powerful models. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation; (ii) even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage model capability. We extensively evaluate our approach on standard semantic segmentation benchmarks and further integrate it into an advanced generative framework, demonstrating both its broad applicability and improved performance.
PaperID: 556,   Poster  https://arxiv.org/pdf/2512.16295     GitHub
Authors: Zhenyu Wu, JingJing Xie, Zehao Li, Bowen Yang, Qiushi Sun, Zhaoyang Liu, Zhoumianze Liu, Yu Qiao, Xiangyu Yue, Zun Wang, Zichen Ding
Title: OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models
Abstract: The deployment of autonomous agents in Graphical User Interface (GUI) environments confronts significant challenges, notably error accumulation in long-horizon tasks and the severe consequences of irreversible operations. While critic models that provide real-time action assessment offer a promising solution, their effectiveness is hindered by the lack of diverse, high-quality GUI feedback data and public critic benchmarks for computer use. To bridge these gaps, we introduce OS-Oracle, which makes three core contributions: (1) a scalable data pipeline for synthesizing cross-platform GUI critic data; (2) a two-stage training paradigm combining supervised fine-tuning (SFT) and consistency-preserving group relative policy optimization (CP-GRPO); (3) OS-Critic Bench, a holistic benchmark for evaluating critic model performance across Mobile, Web, and Desktop platforms. Leveraging this framework, we curate a high-quality dataset containing 310k critic samples. The resulting critic model, OS-Oracle-7B, achieves impressive performance, and further reduces error rates, which improves the capability of GUI agents in dynamic environments. All code, data and checkpoints will be made public.
PaperID: 557,   Poster  https://arxiv.org/pdf/2602.13195     GitHub
Authors: Aadarsh Sahoo, Georgia Gkioxari
Title: Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
Abstract: Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., “left-most apple”) and overlooks functional and physical reasoning (e.g., “where can I safely store the knife?”). We address this gap and introduce Conversational Image Segmentation (CIS) and ConvSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConvSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt–mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConvSeg-Net trained on our data engine achieves significant gains on ConvSeg and maintains strong performance on existing language-guided segmentation benchmarks.
PaperID: 558,   Poster  https://arxiv.org/pdf/2604.01646     GitHub
Authors: Junyoung Jung, Seokwon Kim, Jung Uk Kim
Title: MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label
Abstract: Monocular 3D Object Detection has achieved impressive performance on densely annotated datasets. However, it struggles when only a fraction of objects are labeled due to the high cost of 3D annotation. This sparsely-annotated setting is common in real-world scenarios where annotating every object is impractical. To address this, we propose a novel framework for sparsely-annotated monocular 3D object detection with two key modules. First, we propose Road-Aware Patch Augmentation (RAPA), which leverages sparse annotations by augmenting segmented object patches onto road regions while preserving 3D geometric consistency. Second, we propose Prototype-Based Filtering (PBF), which generates high-quality pseudo-labels by filtering predictions through prototype similarity and depth uncertainty. PBF maintains global 2D RoI feature prototypes and selects pseudo-labels that are both feature-consistent with learned prototypes and have reliable depth estimates. Our training strategy combines geometry-preserving augmentation with prototype-guided pseudo-labeling to achieve robust detection under sparse supervision. Extensive results demonstrate the effectiveness of the proposed method. The source code will be publicly available.
PaperID: 559,   Poster  https://arxiv.org/pdf/2604.19954     GitHub
Authors: Xinxuan Lu, Charless Fowlkes, Alex Berg
Title: Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens
Abstract: Current text-to-image models struggle to express precise camera control using natural language alone. In this work, we present a framework for precise camera control with global scene understanding in text-to-image generation by learning parametric camera tokens. We fine-tune a unified multimodal model for viewpoint-conditioned text-to-image generation on a curated dataset that combines high-volume rendered images for geometric supervision with low-volume photorealistic augmentations for appearance diversity. Qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art accuracy across all camera parameters while preserving image quality and prompt fidelity. Unlike prior methods that overfit to object-specific appearance correlations, our viewpoint tokens learn factorized geometric representations that transfer to unseen object categories. Our work shows that text-vision latent spaces can be endowed with explicit 3D camera structure, offering a pathway toward geometrically-aware prompts for text-to-image generation.
PaperID: 560,   Poster  https://arxiv.org/pdf/2512.04939     GitHub
Authors: Zhijian Shu, Cheng Lin, Tao Xie, Wei Yin, Ben Li, Zhiyuan Pu, Weize Li, Yao Yao, Xun Cao, Xiaoyang Guo, Xiao-Xiao Long
Title: LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging
Abstract: 3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However, VGGT is time-consuming and memory-intensive for long sequences, limiting application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, achieving up to 10× speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: 1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; 2) token similarity across adjacent network layers remains stable, allowing for reusable merge decisions. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze each token’s geometric importance, optimizing anchor token selection to better preserve key information for reconstruction. We also cache and reuse merge indices across layers, substantially reducing latency with minimal accuracy impact. This strategy retains VGGT’s core performance, enabling efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT’s effectiveness, scalability, and robustness.
PaperID: 561,   Poster  https://arxiv.org/pdf/2604.07812     GitHub
Authors: Qihui Zhu, Tao Zhang, yuchen wang, Shuangwu chen, Xiaobin Tan, jianyang jianyang, liuyang liuyang, PanYinfei PanYinfei
Title: HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
Abstract: In multimodal large language models (MLLMs), the surge of visual tokens significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. Existing work usually assumes that all attention heads contribute equally to the visual interpretation. However, our study reveals that different heads may capture distinct visual semantics and inherently play distinct roles in visual processing. In light of this observation, we propose HAWK, a head importance-aware visual token pruning method that perceives the varying importance of attention heads in visual tasks to maximize the retention of crucial tokens. By leveraging head importance weights and text-guided attention to assess visual token significance, HAWK effectively retains task-relevant visual tokens while removing redundant ones. The proposed HAWK is entirely training-free and can be seamlessly applied to various MLLMs. Extensive experiments on multiple mainstream vision-language benchmarks demonstrate that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens. Additionally, it reduces end-to-end latency to 74.4% of the original and further decreases GPU memory usage across the tested models.
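Illustrative note (not from the paper): the idea of scoring visual tokens with head importance weights and text-guided attention can be sketched roughly as below. The function name `prune_visual_tokens`, the `attn[h][t][v]` layout, the averaging over text tokens, and the weighted sum over heads are all assumptions made for illustration; the abstract does not specify how HAWK computes head weights or combines them.

```python
def prune_visual_tokens(attn, head_weights, keep):
    """Keep the `keep` visual tokens with the highest weighted attention score.

    attn[h][t][v] is the attention from text token t to visual token v in
    head h (hypothetical layout); head_weights[h] is an assumed per-head
    importance weight.
    """
    num_visual = len(attn[0][0])
    scores = [0.0] * num_visual
    for head, weight in zip(attn, head_weights):
        num_text = len(head)
        for v in range(num_visual):
            # Average text-to-visual attention within this head.
            mean_attn = sum(row[v] for row in head) / num_text
            # Heads judged more important contribute more to the score.
            scores[v] += weight * mean_attn
    ranked = sorted(range(num_visual), key=lambda v: scores[v], reverse=True)
    # Return kept indices in their original order so token positions stay sorted.
    return sorted(ranked[:keep]), scores


# Toy input: 2 heads, 2 text tokens, 4 visual tokens.
attn = [
    [[0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]],  # head 0 focuses on token 0
    [[0.1, 0.1, 0.1, 0.7], [0.1, 0.2, 0.1, 0.6]],  # head 1 focuses on token 3
]
kept_a, _ = prune_visual_tokens(attn, [0.2, 0.8], keep=1)
kept_b, _ = prune_visual_tokens(attn, [0.8, 0.2], keep=1)
print(kept_a, kept_b)
```

In this toy case the retained token changes with the head weights, which is the behavior a uniform-head assumption would miss.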
PaperID: 562,   Poster  https://arxiv.org/pdf/2507.20630     GitHub
Authors: Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang
Title: TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
Abstract: Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational cost due to the large number of input visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in identifying which tokens are truly important. Most existing approaches rely on attention- or similarity-based criteria to estimate token importance. However, they inherently suffer from certain limitations, such as being task-agnostic and exhibiting positional bias. In this work, we explore a new perspective on token importance assignment based on token transitions in LVLMs, where token transitions are defined as the changes in token representations occurring as they propagate through the model’s modules. We observe that the transition of token representations provides a meaningful signal of semantic information. Based on this insight, we propose TransPrune, a training-free and efficient token pruning method. Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV), which measures changes in both the magnitude and direction of token representations, and Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to visual tokens via attention. Extensive experiments on various LVLM architectures, such as LLaVA-v1.5, LLaVA-Next and Qwen2.5-VL, demonstrate that TransPrune maintains comparable multimodal performance while reducing inference TFLOPs by more than half.
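Illustrative note (not from the paper): one plausible way to measure a change in both magnitude and direction of a token representation across a module is the difference of L2 norms together with one minus the cosine similarity. The helper `transition_variation` and both formulas are assumptions for illustration only; the abstract does not give TransPrune's actual TTV definition or how it is combined with IGA.

```python
import math


def transition_variation(before, after):
    """Return (magnitude_change, direction_change) for one token.

    before/after are the token's representation before and after a module,
    given as equal-length lists of floats. Both are assumed to be non-zero.
    """
    norm_before = math.sqrt(sum(x * x for x in before))
    norm_after = math.sqrt(sum(x * x for x in after))
    # Magnitude change: absolute difference of L2 norms (assumed measure).
    magnitude_change = abs(norm_after - norm_before)
    # Direction change: 1 - cosine similarity (assumed measure).
    dot = sum(b * a for b, a in zip(before, after))
    direction_change = 1.0 - dot / (norm_before * norm_after)
    return magnitude_change, direction_change


# A token that is only rescaled changes magnitude but not direction.
scaled = transition_variation([1.0, 0.0], [2.0, 0.0])
# A token that is only rotated changes direction but not magnitude.
rotated = transition_variation([1.0, 0.0], [0.0, 1.0])
print(scaled, rotated)
```

Returning the two components separately leaves the choice of how to weight them open, since that choice is not stated in the abstract.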
PaperID: 563,   Poster  https://arxiv.org/pdf/2602.18424     GitHub
Authors: Xia Su, Ruiqi Chen, Benlin Liu, Jingwei Ma, Zonglin Di, Ranjay Krishna, Jon Froehlich
Title: CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
Abstract: Vision-Language Models (VLMs) have shown remarkable progress in Vision-Language Navigation (VLN), offering new possibilities for navigation decision-making that could benefit both robotic platforms and human users. However, real-world navigation is inherently conditioned by the agent’s mobility constraints. For example, a sweeping robot cannot traverse stairs while a quadruped can. We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent’s specific physical and operational capabilities. CapNav defines five representative human and robot agents, each described with physical dimensions, mobility capabilities, and environmental interaction abilities. CapNav provides 45 real-world indoor scenes, 473 navigation tasks, and 2365 QA pairs to test if VLMs can traverse indoor environments based on agent capabilities. We evaluate 13 modern VLMs and find that current VLMs' navigation performance drops sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that require reasoning on spatial dimensions. We close by discussing the implications for capability-aware navigation and the opportunities for advancing embodied spatial reasoning in future VLMs.
PaperID: 564,   Poster  https://arxiv.org/pdf/2603.02351     GitHub
Authors: Leo Kaixuan Cheng, Abdus Shaikh, Ruofan Liang, Zhijie Wu, Yushi Guan, Nandita Vijaykumar
Title: MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry
Abstract: Recent advancements in neural visual geometry, including transformer-based models such as VGGT and Pi3, have achieved impressive accuracy on 3D reconstruction tasks. However, their reliance on full attention makes them fundamentally limited by GPU memory capacity, preventing them from scaling to large, unordered image collections. We introduce MERG3R, a training-free divide-and-conquer framework that enables geometric foundation models to operate far beyond their native memory limits. MERG3R first reorders and partitions unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. It then merges the resulting local reconstructions through an efficient global alignment and confidence-weighted bundle adjustment procedure, producing a globally consistent 3D model. Our framework is model-agnostic and can be paired with existing neural geometry models. Across large-scale datasets, including 7-Scenes, NRGBD, Tanks & Temples, and Cambridge Landmarks, MERG3R consistently improves reconstruction accuracy, memory efficiency, and scalability, enabling high-quality reconstruction when the dataset exceeds memory capacity limits.
PaperID: 565,   Poster  https://arxiv.org/pdf/2512.22324     GitHub
Authors: Jianrong Zhang, Hehe Fan, Yi Yang
Title: Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models
Abstract: Human motions are compositional: complex behaviors can be described as combinations of simpler primitives. However, existing approaches primarily focus on forward modeling, e.g., learning holistic mappings from text to motion or composing a complex motion from a set of motion concepts. In this paper, we consider the inverse perspective: decomposing a holistic motion into semantically meaningful subcomponents. We propose DeMoGen, a compositional training paradigm for decompositional learning that employs an energy-based diffusion model. This energy formulation directly captures the composed distribution of multiple motion concepts, enabling the model to discover them without relying on ground-truth motions for individual concepts. Within this paradigm, we introduce three training variants to encourage a decompositional understanding of motion: (1) DeMoGen-Exp explicitly trains on decomposed text prompts; (2) DeMoGen-OSS performs orthogonal self-supervised decomposition; (3) DeMoGen-SC enforces semantic consistency between original and decomposed text embeddings. These variants enable our approach to disentangle reusable motion primitives from complex motion sequences. We also demonstrate that the decomposed motion concepts can be flexibly recombined to generate diverse and novel motions, generalizing beyond the training distribution. Additionally, we construct a text-decomposed dataset to support compositional training, serving as an extended resource to facilitate text-to-motion generation and motion composition. Our implementation will be released.
PaperID: 566,   Poster  https://arxiv.org/pdf/2603.28020     GitHub
Authors: Huimin Zeng, Yue Bai, hailing wang, Yun Fu
Title: Physically Inspired Gaussian Splatting for HDR Novel View Synthesis
Abstract: High dynamic range novel view synthesis (HDR-NVS) reconstructs scenes with dynamic details by fusing multi-exposure low dynamic range (LDR) views, yet it struggles to capture ambient illumination-dependent appearance. Implicitly supervising HDR content by constraining tone-mapped results fails to correct abnormal HDR values, and results in limited gradients for Gaussians in under/over-exposed regions. To this end, we introduce PhysHDR-GS, a physically inspired HDR-NVS framework that models scene appearance via intrinsic reflectance and adjustable ambient illumination. PhysHDR-GS employs a complementary image-exposure (IE) branch and Gaussian-illumination (GI) branch to faithfully reproduce standard camera observations and capture illumination-dependent appearance changes, respectively. During training, the proposed cross-branch HDR consistency loss provides explicit supervision for HDR content, while an illumination-guided gradient scaling strategy mitigates exposure-biased gradient starvation and reduces under-densified representations. Experimental results across realistic and synthetic datasets demonstrate our superiority in reconstructing HDR details (e.g., a PSNR gain of 2.04 dB over HDR-GS), while maintaining real-time rendering speed (up to 76 FPS).
PaperID: 567,   Poster  https://arxiv.org/pdf/2511.11007     GitHub
Authors: Xinlei Yu, Chengming Xu, Guibin Zhang, Zhangquan Chen, Yudong Zhang, Yongbo He, Peng-Tao Jiang, Jiangning Zhang, Xiaobin Hu, Shuicheng Yan
Title: VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
Abstract: Despite the remarkable success of Vision-Language Models (VLMs), their performance on a range of complex visual tasks is often hindered by a "visual processing bottleneck": a propensity to lose grounding in visual evidence and exhibit a deficit in contextualized visual experience during prolonged generation. Drawing inspiration from human cognitive memory theory, which distinguishes short-term visually-dominant memory and long-term semantically-dominant memory, we propose VisMem, a cognitively-aligned framework that equips VLMs with dynamic latent vision memories, a short-term module for fine-grained perceptual retention and a long-term module for abstract semantic consolidation. These memories are seamlessly invoked during inference, allowing VLMs to maintain both perceptual fidelity and semantic consistency across thinking and generation. Extensive experiments across diverse benchmarks for understanding, reasoning, and generation reveal that VisMem delivers a significant average performance boost of 11.8% relative to the vanilla model and outperforms all counterparts, establishing a new paradigm for latent-space memory enhancement. The source code will be made publicly available.
PaperID: 568,   Poster  https://arxiv.org/pdf/2603.25275     GitHub
Authors: Weijia Li, Haoen Xiang, Tianxu Wang, Shuaibing Wu, Qiming Xia, Cheng Wang, Chenglu Wen
Title: V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative Perception
Abstract: Modern autonomous vehicle perception systems are often constrained by occlusions, blind spots, and limited sensing range, hindering progress toward Level 5 autonomy. While existing cooperative perception paradigms such as Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) have demonstrated their effectiveness in mitigating these challenges, they remain limited to ground-level collaboration and cannot fully address large-scale occlusions or long-range perception in complex 3D environments. To advance research in cross-view cooperative perception, we present V2U4Real, the first large-scale real-world multi-modal dataset for Vehicle-to-UAV (V2U) cooperative perception. V2U4Real is collected by a ground vehicle and a UAV equipped with multi-view LiDARs and RGB cameras. The dataset covers urban streets, university campuses, and rural roads under diverse traffic scenarios, comprising over 56K LiDAR frames, 56K multi-view camera images, and 700K manually annotated 3D bounding boxes across four classes. To support a wide range of research tasks, we establish benchmarks for single-agent 3D object detection, cooperative 3D object detection and object tracking. Comprehensive evaluations of several state-of-the-art models demonstrate the effectiveness of V2U cooperation in enhancing perception robustness and long-range awareness, particularly under severe occlusion conditions.
PaperID: 569,   Poster  https://arxiv.org/pdf/2412.14233     GitHub
Authors: Yanpeng Sun, JING HAO, Ke Zhu, Jiang-Jiang Liu, Xiaofan Li, Na Zhao, Zechao Li, Jingdong Wang
Title: Enhancing Descriptive Captions with Visual Attributes for Multimodal Perception
Abstract: Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect image and language. Existing methods for generating such captions often rely on distilling the captions from pretrained LMMs, constructing them from publicly available internet images, or even generating them through human annotation. However, these strategies can fall short in terms of precision and granularity, particularly when dealing with complex visual reasoning tasks. In this paper, we propose to leverage off-the-shelf visual specialists, which were originally trained on annotated images for tasks other than image captioning, to enhance the image caption. Our approach, named EDC, explores low-level and fine-grained object attributes (e.g., depth, emotion and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines these attributes into the descriptive caption. By systematically integrating these rich attributes into the generated captions, EDC significantly improves the descriptive quality of the captions, providing a deeper and more nuanced understanding of the visual content. Experiments demonstrate that such visual specialists are able to improve the performance for visual understanding tasks as well as reasoning that benefits from more accurate visual understanding.
PaperID: 570,   Poster  https://arxiv.org/pdf/2511.19827     GitHub
Authors: Byeongjun Park, Byung-Hoon Kim, Hyungjin Chung, Jong Chul Ye
Title: ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding
Abstract: We present ReDirector, a novel camera-controlled video retake generation method for dynamically captured variable-length videos. In particular, we rectify a common misuse of RoPE in previous works by aligning the spatiotemporal positions of the input video and the target retake. Moreover, we introduce Rotary Camera Encoding (RoCE), a camera-conditioned RoPE phase shift that captures and integrates multi-view relationships within and across the input and target videos. By integrating camera conditions into RoPE, our method generalizes to out-of-distribution camera trajectories and video lengths, yielding improved dynamic object localization and static background preservation. Extensive experiments further demonstrate significant improvements in camera controllability, geometric consistency, and video quality across various trajectories and lengths.
PaperID: 571,   Poster  https://arxiv.org/pdf/2604.03318     GitHub
Authors: Zhenghao Chen, Huiqun Wang, Di Huang
Title: EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
Abstract: Multimodal large language models (MLLMs) are increasingly applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most concurrent works enhance spatial reasoning by introducing 3D priors or geometric supervision, which improves performance but incurs substantial data preparation and alignment costs. Purely 2D approaches, however, struggle with multi-frame spatial reasoning due to missing viewpoint transitions and overlooked implicit objects that act as spatial bridges. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Captioning and Progressive Spatial Analysis, jointly constructing a coherent linguistic scene graph across frames. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, STI-Bench and SPBench, demonstrating its effectiveness in reinforcing the spatial reasoning capabilities of MLLMs and highlighting the potential of linguistic reasoning for spatial cognition.
PaperID: 572,   Poster  https://arxiv.org/pdf/2604.08192     GitHub
Authors: Yunxiang Peng, Mengmeng Ma, Ziyu Yao, Xi Peng
Title: Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
Abstract: Reliable generalization metrics are fundamental to both the development and evaluation of machine learning models. Especially in high-stakes applications where labeled target data are scarce, evaluation of models' generalization performance under distribution shift is a pressing need. We focus on two practical scenarios: (1) Before deployment, how to select the best model for unlabeled target data? (2) After deployment, how to monitor model performance under distribution shift? The central need in both cases is a reliable, label-free proxy metric. Yet existing proxy metrics, such as model confidence or accuracy-on-the-line, are often unreliable as they only assess model outputs while ignoring the internal mechanisms that produce them. We address this limitation by introducing a new perspective: using a model’s inner workings, i.e., circuits, as a predictive metric of generalization performance. Leveraging circuit discovery, we extract the causal interactions between internal representations as a circuit, from which we derive two metrics tailored to the two practical scenarios. (1) Before deployment, we introduce Dependency Depth Bias, which measures different models' generalization capability on target data. (2) After deployment, we propose Circuit Shift Score, which predicts a model's generalization under different distribution shifts. Across diverse tasks, both metrics demonstrate significantly improved correlation with generalization performance, outperforming existing proxies by an average of 11.0% and 45.3%, respectively.
PaperID: 573,   Poster  https://arxiv.org/pdf/2601.11514     GitHub
Authors: Mohd Yawar Nihal Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Henry Howard-Jenkins, Daniel DeTone, Pierre Moulon, Qirui Wu, Zhengqin Li, Julian Straub, Richard Newcombe, Jakob Engel
Title: ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
Abstract: Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms and VLMs to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to SoTA.
PaperID: 574,   Poster  https://arxiv.org/pdf/2603.03871     GitHub
Authors: Jinyuan Liu, Xingyuan Li, Qingyun Mei, HaoYuan Xu, Zhiying Jiang, Long Ma, Risheng Liu, Xin Fan
Title: Bridging Human Evaluation to Infrared and Visible Image Fusion
Abstract: Infrared and visible image fusion (IVIF) integrates complementary modalities to enhance scene perception. Current methods predominantly focus on optimizing handcrafted losses and objective metrics, often resulting in fusion outcomes that do not align with human visual preferences. This challenge is further exacerbated by the ill-posed nature of IVIF, which severely limits its effectiveness in human perceptual environments such as security surveillance and driver assistance systems. To address these limitations, we propose a feedback reinforcement framework that bridges human evaluation to infrared and visible image fusion. To address the lack of human-centric evaluation metrics and data, we introduce the first large-scale human feedback dataset for IVIF, containing multidimensional subjective scores and artifact annotations, and enriched by a fine-tuned large language model with expert review. Based on this dataset, we design a domain-specific reward function and train a reward model to quantify perceptual quality. Guided by this reward, we fine-tune fusion networks through Group Relative Policy Optimization, achieving state-of-the-art performance that better aligns fused images with human aesthetics.
PaperID: 575,   Poster  https://arxiv.org/pdf/2512.01248     GitHub
Authors: JUNYUAN ZHANG, Bin Wang, Qintong Zhang, Fan Wu, Zichen Wen, Jialin Lu, Junjie Shan, Ziqi Zhao, Shuya Yang, Ziling Wang, Ziyang Miao, Huaping Zhong, Yuhang Zang, Xiaoyi Dong, Ka-Ho Chow, Conghui He
Title: TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
Abstract: Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown. As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) using labeled data. While VLMs have brought TR to the next level, pushing performance further demands large-scale labeled data that is costly to obtain. Consequently, although proprietary models have continuously pushed the performance boundary, open-source models, often trained with limited resources and, in practice, the only viable option for many due to privacy regulations, still lag far behind. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to interpret the recognition results and answer them correctly provides feedback to optimize the TR model. This closed-loop process allows the TR model to autonomously learn to recognize, structure, and reason over tables without labeled data. Leveraging this pipeline, we present TRivia-3B, an open-sourced, compact, and state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks.
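The combination of a question-answering reward with Group Relative Policy Optimization can be sketched roughly as below. This is not TRivia's implementation. The names `qa_reward` and `group_relative_advantages` are hypothetical, the reward is assumed to be the fraction of generated questions answered correctly from a recognition result, the expected answers are made-up placeholders, and the advantage uses the commonly cited GRPO normalization (reward minus group mean, divided by group standard deviation).

```python
from statistics import mean, pstdev

def qa_reward(answers_from_recognition, expected_answers):
    """Assumed reward: fraction of questions answered correctly."""
    if not expected_answers:
        return 0.0
    correct = sum(a == e for a, e in zip(answers_from_recognition, expected_answers))
    return correct / len(expected_answers)

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output relative to the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# One table image, three questions, four sampled recognition results.
expected = ["12", "Q3", "yes"]
sampled_answers = [
    ["12", "Q3", "yes"],
    ["12", "Q3", "no"],
    ["11", "Q2", "no"],
    ["12", "Q1", "yes"],
]
rewards = [qa_reward(a, expected) for a in sampled_answers]
advantages = group_relative_advantages(rewards)
```

When every sample in a group earns the same reward, all advantages collapse to zero and the image contributes no gradient signal, which is one plausible reason a method would prefer images whose samples disagree.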
PaperID: 576,   Poster  https://arxiv.org/pdf/2508.13911     GitHub
Authors: chunji lv, Zequn Chen, Donglin Di, Weinan Zhang, Hao Li, Chen Wei, Yinjie Lei, Changsheng Li
Title: PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis
Abstract: Despite advances in physics-based 3D motion synthesis, current methods face key limitations: reliance on pre-reconstructed 3D Gaussian Splatting (3DGS) built from dense multi-view images with time-consuming per-scene optimization; physics integration via either inflexible, hand-specified attributes or unstable, optimization-heavy guidance from video models using Score Distillation Sampling (SDS); and naïve concatenation of prebuilt 3DGS with physics modules, which ignores physical information embedded in appearance and yields suboptimal performance. To address these issues, we propose PhysGM, a feed-forward framework that jointly predicts 3D Gaussian representation and physical properties from a single image, enabling immediate simulation and high-fidelity 4D rendering. Unlike slow appearance-agnostic optimization methods, we first pre-train a physics-aware reconstruction model that directly infers both Gaussian and physical parameters. We further refine the model with Direct Preference Optimization (DPO), aligning simulations with the physically plausible reference videos and avoiding the high-cost SDS optimization. To address the absence of a supporting dataset for this task, we propose PhysAssets, a dataset of 50K+ 3D assets annotated with physical properties and corresponding reference videos. Experiments show that PhysGM produces high-fidelity 4D simulations from a single image in one minute, achieving a significant speedup over prior work while delivering realistic renderings.
PaperID: 577,   Poster  https://arxiv.org/pdf/2603.13341     GitHub
Authors: ZHENYU ZHANG, Yixiong Zou, Yuhua Li, Ruixuan Li, Guangyao Chen
Title: Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning
Abstract: Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where Vision-Language Models (VLMs) such as CLIP and SigLIP have shown promising results. Current works in traditional visual models suggest that improving visual discriminability enhances performance. However, in VLM-based SF-CDFSL tasks, we find that strengthening visual-modal discriminability actually suppresses VLMs’ performance. In this paper, we aim to delve into this phenomenon for an interpretation and a solution. By both theoretical and experimental proofs, our study reveals that fine-tuning with the typical cross-entropy loss (L_vlm) inherently includes a visual learning part and a cross-modal learning part, where the cross-modal part is crucial for rectifying the heavily disrupted modality misalignment in SF-CDFSL. However, we find that the visual learning essentially acts as a shortcut that encourages the model to reduce L_vlm without considering the cross-modal part, therefore hindering the cross-modal alignment and harming the performance. Based on this interpretation, we further propose an approach to address this problem: first, we perturb the visual learning to guide the model to focus on the cross-modal alignment. Then, we use the visual-text semantic relationships to gradually align the visual and textual modalities during the fine-tuning. Extensive experiments on various settings, backbones (CLIP, SigLIP, PE-Core), and tasks (4 CDFSL datasets and 11 FSL datasets) show that we consistently set new state-of-the-art results. We will release the code.
PaperID: 578,   Poster  https://arxiv.org/pdf/2603.25336     GitHub
Authors: Yongsung Kim, Wooseok Song, Jaihyun Lew, Hun Hwangbo, Jaehoon Lee, Sungroh Yoon
Title: HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT
Abstract: Visual Geometry Grounded Transformer (VGGT) has shown significant progress in 3D vision tasks. However, its global attention layers incur quadratic computational cost with respect to the number of input views, becoming a critical bottleneck for scalability. Several sparsification-based acceleration techniques have been proposed to alleviate this issue, but they often suffer from substantial accuracy degradation. We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits head-wise sparsification sensitivity. In the first stage, we measure head-wise sparsification sensitivity using a novel metric, the Head Sensitivity Score (HeSS), which approximates the Hessian with respect to two distinct error terms on a small calibration set. In the inference stage, we perform HeSS-Guided Sparsification, leveraging the pre-computed HeSS to reallocate the total attention budget, assigning denser attention to sensitive heads and sparser attention to more robust ones. We demonstrate that HeSS effectively captures head-wise sparsification sensitivity and empirically confirm that attention heads in the global attention layers exhibit heterogeneous sensitivity characteristics. Extensive experiments further show that our method effectively mitigates performance degradation under high sparsity, demonstrating strong robustness across varying sparsification levels.
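The budget reallocation idea can be illustrated with a small sketch. It assumes per-head sensitivity scores are already available (how HeSS computes them is not reproduced here) and uses a simple proportional rule, which is an assumption rather than the paper's allocation scheme. The function name `reallocate_keep_ratios` and the parameter `mean_keep_ratio` are hypothetical.

```python
def reallocate_keep_ratios(sensitivities, mean_keep_ratio):
    """Split a fixed average keep ratio across heads in proportion to sensitivity.

    A keep ratio is the fraction of attention entries a head retains, so a
    higher ratio means denser attention for that head.
    """
    if not sensitivities:
        return []
    total = sum(sensitivities)
    if total <= 0:
        # No usable signal: fall back to the uniform pattern.
        return [mean_keep_ratio] * len(sensitivities)
    mean_sensitivity = total / len(sensitivities)
    # Clamping at 1.0 can leave part of the budget unused; a fuller version
    # would redistribute that excess to the remaining heads.
    return [min(1.0, mean_keep_ratio * s / mean_sensitivity) for s in sensitivities]

# Four heads sharing an average keep ratio of 25%.
scores = [1.0, 2.0, 3.0, 2.0]
keep = reallocate_keep_ratios(scores, mean_keep_ratio=0.25)
```

In this example the average keep ratio stays at the uniform baseline's 25%, while the most sensitive head retains three times as many entries as the least sensitive one.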
PaperID: 579,   Poster  https://arxiv.org/pdf/2512.23592     GitHub
Authors: Damiano Marsili, Aditya Mehta, Ryan Lin, Georgia Gkioxari
Title: Same or Not? Enhancing Visual Perception in Vision-Language Models
Abstract: Vision–language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition (“Is it a cat or a dog?”) over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models.
PaperID: 580,   Poster  https://arxiv.org/pdf/2511.16407     GitHub
Authors: Xizhou Bu, Jiexi Lyu, Fulei Sun, Ruichen Yang, Zhiqiang Ma, Wei Li
Title: LAOF: Robust Latent Action Learning with Optical Flow Constraints
Abstract: Learning latent actions from large-scale videos is crucial for the pre-training of scalable embodied foundation models, yet existing methods often struggle with action-irrelevant distractors. Although incorporating action supervision can alleviate these distractions, its effectiveness is restricted by the scarcity of available action labels. Optical flow represents pixel-level motion between consecutive frames, naturally suppressing background elements and emphasizing moving objects. Motivated by this, we propose robust Latent Action learning with Optical Flow constraints (LAOF), a pseudo-supervised framework that leverages the agent’s optical flow as an action-driven signal to learn latent action representations robust to distractors. Experimental results show that the latent representations learned by LAOF outperform existing methods on downstream imitation learning and reinforcement learning tasks. This superior performance arises from optical flow constraints, which substantially stabilize training and improve the quality of latent representations under extremely label-scarce conditions, while remaining effective as the proportion of action labels increases to 10%. Importantly, even without action supervision, LAOF matches or surpasses action-supervised methods trained with 1% of action labels.
PaperID: 581,   Poster  https://arxiv.org/pdf/2603.12240     GitHub
Authors: Jiacheng Liu, Shengkun Tang, Jiacheng Cui, Dongkuan Xu, Zhiqiang Shen
Title: BiGain: Unified Token Compression for Joint Generation and Classification
Abstract: Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize for synthesis quality under reduced compute, yet they often ignore the model's latent discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while markedly improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate–Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision without retraining. Across DiT- and U-Net–based backbones and multiple datasets (ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017), our proposed operators consistently improve the speed–accuracy trade-off for diffusion-based classification, while maintaining, and sometimes even enhancing, generation quality under comparable acceleration. For instance, on ImageNet-1K, with a token merging ratio of 70% on Stable Diffusion 2.0, BiGain improves classification accuracy by 7.15% while also reducing FID for generation by 0.34 (1.85%). Our comprehensive analyses indicate that balanced spectral retention, preserving high-frequency detail alongside low/mid-frequency semantic content, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, offering a practical path toward deployable, dual-purpose generative systems.
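One way to read the second operator is as a single blend coefficient that moves between nearest and average pooling. The sketch below is an interpretation, not BiGain's code: it works on a 1D token sequence, takes the first token of each window as the "nearest" sample, and uses a hypothetical coefficient `alpha`, where 0 gives nearest pooling, 1 gives average pooling, and values outside [0, 1] extrapolate. Only keys and values would be passed through it; queries stay at full length.

```python
def interp_extrap_downsample(tokens, window, alpha):
    """Downsample a token sequence by blending nearest and average pooling.

    tokens: list of equal-length feature vectors (lists of floats).
    window: number of consecutive tokens collapsed into one output token.
    alpha:  assumed blend coefficient; 0 -> nearest, 1 -> average,
            outside [0, 1] -> extrapolation past either endpoint.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    pooled = []
    # Trailing tokens that do not fill a whole window are dropped here.
    for start in range(0, len(tokens) - window + 1, window):
        group = tokens[start:start + window]
        nearest = group[0]
        average = [sum(channel) / window for channel in zip(*group)]
        pooled.append([(1 - alpha) * n + alpha * a for n, a in zip(nearest, average)])
    return pooled

keys = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
nearest_keys = interp_extrap_downsample(keys, window=2, alpha=0.0)
average_keys = interp_extrap_downsample(keys, window=2, alpha=1.0)
extrapolated_keys = interp_extrap_downsample(keys, window=2, alpha=1.5)
```

Because the operator is a fixed linear blend with no learned parameters, it is compatible with the training-free setting the abstract describes.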
PaperID: 582,   Poster  https://arxiv.org/pdf/2603.13370     GitHub
Authors: Jiajin Liu, Dongzhe Fan, Chuanhao Ji, Daochen Zha, Qiaoyu Tan
Title: GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in aligning and understanding multimodal signals, yet their potential to reason over structured data, where multimodal entities are connected through explicit relational graphs, remains largely underexplored. Unlocking this capability is crucial for real-world applications such as social networks, recommendation systems, and scientific discovery, where multimodal information is inherently structured. To bridge this gap, we present GraphVLM, a systematic benchmark designed to evaluate and harness the capabilities of VLMs for multimodal graph learning (MMGL). GraphVLM investigates three complementary paradigms for integrating VLMs with graph reasoning: (1) VLM-as-Encoder, which enriches graph neural networks through multimodal feature fusion; (2) VLM-as-Aligner, which bridges modalities in latent or linguistic space to facilitate LLM-based structured reasoning; and (3) VLM-as-Predictor, which directly employs VLMs as multimodal backbones for graph learning tasks. Extensive experiments across six datasets from diverse domains demonstrate that VLMs enhance multimodal graph learning via all three roles. Among these paradigms, VLM-as-Predictor achieves the most substantial and consistent performance gains, revealing the untapped potential of vision–language models as a new foundation for multimodal graph learning.
PaperID: 583,   Poster  https://arxiv.org/pdf/2604.07997     GitHub
Authors: Yun Zhu, Jianjun Qian, Jian Yang, Jin Xie, Na Zhao
Title: Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments
Abstract: Incremental 3D object perception is a critical step toward embodied intelligence in dynamic indoor environments. However, existing incremental 3D detection methods rely on extensive annotations of novel classes for satisfactory performance. To address this limitation, we propose FI3Det, a Few-shot Incremental 3D Detection framework that enables efficient 3D perception with only a few novel samples by leveraging vision-language models (VLMs) to learn knowledge of unseen categories. FI3Det introduces a VLM-guided unknown object learning module in the base stage to enhance perception of unseen categories. Specifically, it employs VLMs to mine unknown objects and extract comprehensive representations, including 2D semantic features and class-agnostic 3D bounding boxes. To mitigate noise in these representations, a weighting mechanism is further designed to re-weight the contributions of point- and box-level features based on their spatial locations and feature consistency within each box. Moreover, FI3Det proposes a gated multimodal prototype imprinting module, where category prototypes are constructed from aligned 2D semantic and 3D geometric features to compute classification scores, which are then fused via a multimodal gating mechanism for novel object detection. As the first framework for few-shot incremental 3D object detection, we establish both batch and sequential evaluation settings on two datasets, ScanNet V2 and SUN RGB-D, where FI3Det achieves strong and consistent improvements over baseline methods.
PaperID: 584,   Poster  https://arxiv.org/pdf/2601.06391     GitHub
Authors: Saksham Singh Kushwaha, Sayan Nag, Yapeng Tian, Kuldeep Kulkarni
Title: Object-WIPER: Training-Free Object and Associated Effect Removal in Videos
Abstract: In this paper, we introduce Object-WIPER, a training-free framework for removing dynamic objects and their associated visual effects from videos, and inpainting them with semantically consistent and temporally coherent content. Our approach leverages a pre-trained text-to-video diffusion transformer (DiT). Given an input video, a user-provided object mask, and query tokens describing the target object and its effects, we localize relevant visual tokens via visual-text cross-attention and visual self-attention. This produces an intermediate effect mask that we fuse with the user mask to obtain a final foreground token mask to replace. We first invert the video through the DiT to obtain structured noise, then reinitialize the masked tokens with Gaussian noise while preserving background tokens. During denoising, we copy values for the background tokens saved during inversion to maintain scene fidelity. To address the lack of suitable evaluation, we introduce a new object removal metric that rewards temporal consistency among foreground tokens across consecutive frames, coherence between foreground and background tokens within each frame, and dissimilarity between the input and output foreground tokens. Experiments on DAVIS and a newly curated real-world associated effect benchmark WIPER-Bench show that Object-WIPER surpasses both training-based and training-free baselines in terms of the metric, achieving clean removal and temporally stable reconstruction without any retraining. Our new benchmark, source code, and pre-trained models will be publicly available.
PaperID: 585,   Poster  https://arxiv.org/pdf/2602.21552     GitHub
Authors: Changqing Zhou, Yueru Luo, Changhao Chen
Title: Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction
Abstract: Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction. We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric samples, which are represented as Gaussian primitives for probabilistic occupancy inference. To handle streaming input, we further design a training-free incremental update strategy that fuses per-frame Gaussians into a unified global representation. Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains: GPOcc improves mIoU by +9.99 in the monocular setting and +11.79 in the streaming setting over prior state of the art. Under the same depth prior, it achieves +6.73 mIoU while running 2.65× faster. These results highlight that GPOcc leverages geometry priors more effectively and efficiently. Code will be released.
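The geometric step of extending surface points inward along camera rays can be sketched as follows. This covers only sample placement, under the assumption of evenly spaced samples behind the visible surface; the function name `extend_inward` and the parameters `num_samples` and `step` are hypothetical, and turning the samples into Gaussian primitives is not shown.

```python
import math

def extend_inward(camera_origin, surface_point, num_samples, step):
    """Place samples behind a visible surface point, along its camera ray.

    Samples lie at surface_point + k * step * ray_direction for k = 1..num_samples,
    i.e. farther from the camera than the observed surface.
    """
    direction = [p - o for p, o in zip(surface_point, camera_origin)]
    length = math.sqrt(sum(c * c for c in direction))
    if length == 0.0:
        raise ValueError("surface point coincides with the camera origin")
    unit = [c / length for c in direction]
    return [
        [p + k * step * u for p, u in zip(surface_point, unit)]
        for k in range(1, num_samples + 1)
    ]

# A surface point 2 m in front of a camera at the origin, three samples 10 cm apart.
samples = extend_inward([0.0, 0.0, 0.0], [0.0, 0.0, 2.0], num_samples=3, step=0.1)
```

Each sample sits deeper along the ray than the last, which is what lets a surface-only prior seed a volumetric representation.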
PaperID: 586,   Poster  https://arxiv.org/pdf/2505.23253     GitHub
Authors: Yixun Liang, Kunming Luo, Xiao Chen, Rui Chen, Jiawei Zhou, Weiyu Li, Jiarui Liu, Fei-Peng Tian, Ping Tan
Title: UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes
Abstract: We present UniTEX, a novel two-stage 3D texture generation framework to create high-quality, consistent textures for 3D assets. Existing approaches predominantly rely on UV-based models in the second stage to refine textures after reprojecting the generated multi-view images onto the 3D shapes, which introduces challenges related to topological ambiguity. To address this, we bypass the limitations of UV mapping by introducing a Large Texturing Model (LTM) that directly regresses textures in a unified 3D functional space. Moreover, to enable more effective and complete supervision of LTM, we propose to extend surface-defined textures into a continuous volumetric field to serve as an advanced training objective, which we refer to as Texture Functions (TF). Finally, we develop an advanced LoRA-based strategy for efficiently adapting large-scale 2D Diffusion Transformers (DiTs) for high-quality multi-view texture synthesis as our first stage. Extensive experiments demonstrate that UniTEX achieves superior visual quality and texture integrity compared to existing approaches, offering a generalizable and scalable solution for automated 3D texture generation.
PaperID: 587,   Poster  https://arxiv.org/pdf/2604.17135     GitHub
Authors: Zedong Dan, Zijie Wang, Wei Zhang, Xiangru Lin, Weiming Zhang, Xiao Tan, Jingdong Wang, Liang Lin, Guanbin Li
Title: OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives
Abstract: Offline vectorized maps constitute critical infrastructure for high-precision autonomous driving and mapping services. Existing approaches rely predominantly on single ego-vehicle trajectories, which fundamentally suffer from viewpoint insufficiency: while memory-based methods extend observation time by aggregating ego-trajectory frames, they lack the spatial diversity needed to reveal occluded regions. Incorporating views from surrounding vehicles offers complementary perspectives, yet naive fusion introduces three key challenges: computational cost from large candidate pools, redundancy from near-collinear viewpoints, and noise from pose errors and occlusion artifacts. We present OptiMVMap, which reformulates multi-vehicle mapping as a select-then-fuse problem to address these challenges systematically. An Optimal Vehicle Selection (OVS) module strategically identifies a compact subset of helpers that maximally reduce ego-centric uncertainty in occluded regions, addressing computation and redundancy challenges. Cross-Vehicle Attention (CVA) and Semantic-aware Noise Filter (SNF) then perform pose-tolerant alignment and artifact suppression before BEV-level fusion, addressing the noise challenge. This targeted pipeline yields more complete and topologically faithful maps with substantially fewer views than indiscriminate aggregation. On nuScenes and Argoverse2, OptiMVMap improves MapTRv2 by +10.5 mAP and +9.3 mAP, respectively, and surpasses memory-augmented baselines MVMap and HRMapNet by +6.2 mAP and +3.8 mAP on nuScenes. These results demonstrate that uncertainty-guided selection of helper vehicles is essential for efficient and accurate multi-vehicle vectorized mapping.
PaperID: 588,   Poster  https://arxiv.org/pdf/2509.24948     GitHub
Authors: Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, Qing Zhang
Title: RehearseVLA: Simulated Post-Training for VLAs with Physically-Consistent World Model
Abstract: Vision-Language-Action (VLA) models trained via imitation learning suffer from significant performance degradation in data-scarce scenarios due to their reliance on large-scale demonstration datasets. Although reinforcement learning (RL)-based post-training has proven effective in addressing data scarcity, its application to VLA models is hindered by the non-resettable nature of real-world environments. This limitation is particularly critical in high-risk domains such as industrial automation, where interactions often induce state changes that are costly or infeasible to revert. Furthermore, existing VLA approaches lack a reliable mechanism for detecting task completion, leading to redundant actions that reduce overall task success rates. To address these challenges, we propose RehearseVLA, an RL-based post-training framework that replaces physical interaction with a low-cost world model-based virtual simulator. RehearseVLA consists of two key components: (1) a physically-consistent world simulator that generates temporally consistent future visual observations, and (2) a vision-language model (VLM)-guided instant reflector that provides continuous reward signals and predicts action termination. This simulated environment enables VLA models to safely explore and generalize beyond their initial imitation learning distribution. Our method achieves notable performance gains with as few as five expert demonstrations per task. Experiments on complex robotic manipulation tasks demonstrate that RehearseVLA effectively overcomes the data inefficiency, safety constraints, and inefficient execution of conventional VLA models that rely on real-world interaction, offering a practical and scalable solution for post-training in resource-constrained settings.
PaperID: 589,   Poster  https://arxiv.org/pdf/2603.11605     GitHub
Authors: Junkun JIANG, Ho Yin Au, Jingyu Xiang, Jie Chen
Title: LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference
Abstract: Human motion is highly expressive and naturally aligned with language, yet prevailing methods relying heavily on joint text-motion embeddings struggle to synthesize temporally accurate, detailed motions and often lack explainability. To address these limitations, we introduce LabanLite, a motion representation developed by adapting and extending the Labanotation system. Unlike black-box text–motion embeddings, LabanLite encodes each atomic body-part action (e.g., a single left-foot step) as a discrete Laban symbol paired with a textual template. This abstraction decomposes complex motions into interpretable symbol sequences and body-part instructions, establishing a symbolic link between high-level language and low-level motion trajectories. Building on LabanLite, we present LaMoGen, a Text-to-LabanLite-to-Motion Generation framework that enables large language models (LLMs) to compose motion sequences through symbolic reasoning. The LLM interprets motion patterns, relates them to textual descriptions, and recombines symbols into executable plans, producing motions that are both interpretable and linguistically grounded. To support rigorous evaluation, we introduce a Labanotation-based benchmark with structured description–motion pairs and three metrics that jointly measure text–motion alignment across symbolic, temporal, and harmony dimensions. Experiments demonstrate that LaMoGen establishes a new baseline for both interpretability and controllability, outperforming prior methods on our benchmark and two public datasets. These results highlight the advantages of symbolic reasoning and agent-based design for language-driven motion synthesis. All code and data are available at https://github.com/xxx/xxx.
PaperID: 590,   Poster  https://arxiv.org/pdf/2603.15811     GitHub
Authors: Malte Prinzler, Paulo Gotardo, Siyu Tang, Timo Bolkart
Title: Feed-forward Gaussian Registration for Head Avatar Creation and Editing
Abstract: We present MATCH (Multiview Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view head avatars require time-consuming head tracking, which is followed by an expensive avatar optimization, often resulting in a total creation time that exceeds one day. MATCH instead directly predicts Gaussian splat textures in correspondence from calibrated multi-view images in 0.5 seconds per frame. While the learned intra-subject correspondence across frames allows us to quickly build personalized head avatars, correspondence across subjects enables various applications such as expression transfer, optimization-free tracking, semantic editing, and identity interpolation. We learn to establish such correspondences end-to-end, with a transformer-based model that predicts textures of Gaussian splats in the fixed UV layout of a template mesh. To this end, we introduce a novel registration-guided attention block, in which each UV map token attends exclusively to image tokens depicting its corresponding mesh region. MATCH outperforms existing methods for novel-view synthesis, geometry registration, and head avatar generation, the latter being 10× faster than the qualitatively closest baseline. Code and model weights will be published upon acceptance.
PaperID: 591,   Poster  https://arxiv.org/pdf/2603.20738     GitHub
Authors: Qunjie Huang, Weina Zhu
Title: SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval
Abstract: Cross-subject EEG-to-image retrieval for visual decoding is hampered by subject shift and hubness in the embedding space, which distort similarity geometry and destabilize top-k rankings, making small candidate shortlists unreliable. We introduce SATTC (Structure-Aware Test-Time Calibration), a label-free test-time calibration head that operates directly on the similarity matrix of frozen EEG–image encoders. SATTC combines a geometric expert, which pairs subject-adaptive whitening of EEG embeddings with an adaptive variant of Cross-domain Similarity Local Scaling (CSLS), and a structural expert that leverages mutual nearest neighbors, bidirectional top-k ranks, and class popularity, fused via a simple Product-of-Experts rule. On the THINGS-EEG cross-subject benchmark with a strict leave-one-subject-out protocol, standardizing inference with cosine similarities, ℓ2-normalized embeddings, and candidate whitening already yields a strong baseline that improves Top-1 and Top-5 accuracy over the original ATM retrieval setup. Adding SATTC on top of this standardized inference further improves Top-1 and Top-5 accuracy and substantially reduces hubness, yielding more reliable small-k shortlists across multiple EEG encoders and establishing SATTC as a generic test-time calibration head for zero-shot neural decoding from EEG.
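To make the hubness problem concrete, the sketch below applies plain CSLS to a small query-by-candidate similarity matrix. This is the standard, non-adaptive formulation (twice the similarity, minus the mean similarity of each side to its k nearest neighbors), not SATTC's adaptive variant, and it omits whitening and the structural expert. The function names and the toy matrix are made up for illustration.

```python
def top_k_mean(values, k):
    """Mean of the k largest values (or of all values if fewer than k exist)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)

def csls(similarity, k):
    """Standard CSLS rescoring of a queries x candidates similarity matrix."""
    row_density = [top_k_mean(row, k) for row in similarity]
    col_density = [top_k_mean(col, k) for col in zip(*similarity)]
    # Candidates that are close to many queries (hubs) have a high column
    # density and are penalized accordingly.
    return [
        [2 * s - row_density[i] - col_density[j] for j, s in enumerate(row)]
        for i, row in enumerate(similarity)
    ]

def top1(matrix):
    """Index of the best-scoring candidate for each query."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]

# Candidate 0 is a hub: it is the nearest neighbor of every query.
similarity = [
    [0.80, 0.75, 0.10],
    [0.80, 0.10, 0.75],
    [0.80, 0.10, 0.10],
]
raw_top1 = top1(similarity)
calibrated_top1 = top1(csls(similarity, k=2))
```

With raw cosine scores every query retrieves the hub; after rescoring, the first two queries switch to their distinctive neighbors while the third keeps candidate 0, which is the kind of shortlist correction the abstract attributes to calibration.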
PaperID: 592,   Poster  https://arxiv.org/pdf/2512.14061     GitHub
Authors: Hao Chen, Junyang Chen, Jinshan Pan, Jiangxin Dong
Title: Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution
Abstract: Recent diffusion-based one-step methods have shown remarkable progress in the field of image super-resolution, yet they remain constrained by three critical limitations: (1) inferior fidelity performance caused by the information loss from compression encoding of low-quality (LQ) inputs; (2) insufficient region-discriminative activation of generative priors; (3) misalignment between text prompts and their corresponding semantic regions. To address these limitations, we propose CODSR, a controllable one-step diffusion network for image super-resolution. First, we propose an LQ-guided feature modulation module that leverages original uncompressed information from LQ inputs to provide high-fidelity conditioning for the diffusion process. We then develop a region-adaptive generative prior activation method to effectively enhance perceptual richness without sacrificing local structural fidelity. Finally, we employ a text-matching guidance strategy to fully harness the conditioning potential of text prompts. Extensive experiments demonstrate that CODSR achieves superior perceptual quality and competitive fidelity compared with state-of-the-art methods while maintaining efficient one-step inference.
PaperID: 593,   Poster  https://arxiv.org/pdf/2511.20351     GitHub
Authors: Heyang Yu, Yinan Han, Xiangyu Zhang, Baiqiao Yin, Bowen Chang, Xiangyu Han, Xinhao Liu, Jing Zhang, Marco Pavone, Chen Feng, Saining Xie, Yiming Li
Title: Thinking in 360°: Humanoid Visual Search in the Wild
Abstract: Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. To study visual search in visually-crowded real-world scenarios, we build H Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, urban streets, and public institutions. Our experiments first reveal that even top-tier proprietary models falter, achieving only ~30% success in object and path search. We then use post-training techniques to enhance the open-source Qwen2.5-VL, increasing its success rate by over threefold for both object search (14.83% → 47.38%) and path search (6.44% → 24.94%). Notably, the lower ceiling of path search reveals its inherent difficulty, which we attribute to the demand for sophisticated spatial commonsense. Our results not only show a promising path forward but also quantify the immense challenge that remains in building MLLM agents that can be seamlessly integrated into everyday human life.
PaperID: 594,   Poster  https://arxiv.org/pdf/2510.01982     GitHub
Authors: Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang, Jiaqi Wang, Li Niu, Guangtao Zhai
Title: Fine-Grained GRPO for Precise Preference Alignment in Flow Models
Abstract: The incorporation of online reinforcement learning (RL) into diffusion and flow-based generative models has recently gained attention as a powerful paradigm for aligning model behavior with human preferences. By leveraging stochastic sampling via Stochastic Differential Equations (SDEs) during the denoising phase, these models can explore a variety of denoising trajectories, enhancing the exploratory capacity of RL. However, despite their ability to discover potentially high-reward samples, current approaches often struggle to effectively align with preferences due to the sparsity and narrowness of reward feedback. To overcome this limitation, we introduce a novel framework called Granular-GRPO (G^2RPO), which enables fine-grained and comprehensive evaluation of sampling directions in the RL training of flow models. Specifically, we propose a Singular Stochastic Sampling mechanism that supports step-wise stochastic exploration while ensuring strong correlation between injected noise and reward signals, enabling more accurate credit assignment to each SDE perturbation. Additionally, to mitigate the bias introduced by fixed-granularity denoising, we design a Multi-Granularity Advantage Integration module that aggregates advantages computed across multiple diffusion scales, resulting in a more robust and holistic assessment of sampling trajectories. Extensive experiments on various reward models, including both in-domain and out-of-domain settings, demonstrate that our G^2RPO outperforms existing flow-based GRPO baselines, highlighting its effectiveness and generalization capability.
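For readers unfamiliar with the GRPO baseline this abstract builds on, the sketch below shows the usual group-relative advantage: rewards of samples generated from the same prompt are standardized within the group. It does not reproduce the paper's Singular Stochastic Sampling or Multi-Granularity Advantage Integration, which the abstract describes only at a high level; the function name and the `eps` constant are assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Standardize rewards within one sampling group (plain GRPO style).

    rewards: scalar rewards for samples drawn from the same prompt.
    Returns (r - mean) / (std + eps), so samples better than the group
    average get a positive advantage and worse ones a negative advantage.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for four samples of one prompt.
example_adv = group_relative_advantages([0.2, 0.5, 0.5, 0.8])
```

When all rewards in a group are identical, every advantage is zero, which is one way to see why sparse or narrow reward feedback gives little training signal.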
PaperID: 595,   Poster  https://arxiv.org/pdf/2505.21541     GitHub
Authors: Hang Zhao, Hang Zhao, Qianyu Zhou, Xuequan Lu, Xiangtai Li, Hao Yang, Bo Yang, Yiren Song
Title: DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers
Abstract: Diffusion models have recently achieved great success in many generation tasks such as object removal. Nevertheless, existing image decomposition methods struggle to disentangle semi-transparent or transparent layer occlusions due to mask prior dependencies, static object assumptions, and the lack of datasets. In this paper, we delve into a novel task: Layer-Wise Decomposition of Alpha-Composited Images, aiming to recover constituent layers from single overlapped images under the condition of semi-transparent/transparent alpha layer non-linear occlusion. To address challenges in layer ambiguity, generalization, and data scarcity, we first introduce AlphaBlend, the first large-scale and high-quality dataset for transparent and semi-transparent layer decomposition, containing six subtasks with different characteristics (e.g., translucent flare removal, semi-transparent cell decomposition, glassware decomposition). Building on this dataset, we present DiffDecompose, a diffusion Transformer-based framework that learns the posterior over possible layer decompositions conditioned on the input image, semantic prompts, and blending type. Rather than regressing alpha mattes directly, DiffDecompose performs In-Context Decomposition, enabling the model to predict one or multiple layers without per-layer supervision, and introduces Layer Position Encoding Cloning to maintain pixel-level correspondence across layers. Extensive experiments on the proposed AlphaBlend dataset and the public LOGO dataset verify the effectiveness of DiffDecompose. The code and dataset will be available upon paper acceptance.
PaperID: 596,   Poster  https://arxiv.org/pdf/2601.03782     GitHub
Authors: Wenlong Huang, Yu-Wei Chao, Arsalan Mousavian, Ming-Yu Liu, Dieter Fox, Kaichun Mo, Li Fei-Fei
Title: PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation
Abstract: Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pretrained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots, crucial for contact reasoning, while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single-arm Franka and a bimanual humanoid. Through rigorous, large-scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large-scale 3D world modeling. With a real-time (0.1s) inference speed, PointWorld can be efficiently integrated in the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post-training and all from a single image captured in-the-wild.
PaperID: 597,   Poster  https://arxiv.org/pdf/2603.29409     GitHub
Authors: Andrew Jeong, Jaemin Kim, Sebin Lee, Sung-Eui Yoon
Title: CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics
Abstract: We propose CLaD (Cross-modal Latent Dynamics), a framework for learning temporally consistent cross-modal representations in robotic manipulation. Our approach models transition dynamics rather than static state correspondences: asymmetric cross-attention enables proprioceptive transitions to query semantic ones, extracting shared dynamics structure that respects the causal ordering imposed by actions. We formalize grounded latent foresight as predictions anchored through EMA-based targets from observed trajectories and auxiliary reconstruction to observable space, preventing collapse to abstract representations. A diffusion policy conditions on these learned foresights via feature modulation, decoupling dynamics learning from control optimization. Evaluated on LIBERO-LONG, our method achieves 94.9% success with 0.66B parameters, demonstrating that explicit cross-modal transition modeling enables parameter-efficient planning outperforming larger VLAs.
PaperID: 598,   Poster  https://arxiv.org/pdf/2512.19686     GitHub
Authors: Zixuan Ye, Quande Liu, Cong Wei, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhan Luo
Title: Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models
Abstract: Recently, the introduction of Chain-of-Thought (CoT) has largely improved the generation ability of unified models. However, the current thinking process during generation mainly focuses on text consistency with the text prompt, ignoring visual context consistency with the visual reference images during multi-modal generation, e.g., multi-reference generation. The lack of such consistency results in a failure to maintain key visual features (such as human ID, object attributes, and style). To this end, we integrate visual context consistency into the reasoning of unified models, explicitly motivating the model to sustain such consistency through 1) Adaptive Visual Planning: generating a structured visual checklist that identifies the visual elements whose consistency must be kept, and 2) Iterative Visual Correction: performing self-reflection under the guidance of the checklists and refining the generated result iteratively. To achieve this, we use supervised finetuning to teach the model how to plan the visual checking and conduct self-reflection and self-refinement, and use flow-GRPO to further enhance visual consistency through a customized visual checking reward. The experiments show that our method outperforms both zero-shot unified models and those with text CoTs in multi-modal generation, demonstrating higher visual context consistency.
PaperID: 599,   Poster  https://arxiv.org/pdf/2511.21113     GitHub
Authors: YuAn Wang, Xiaofan Li, Chi Huang, Wenhao Zhang, Hao Li, Bosheng Wang, Xun Sun, Jun Wang
Title: FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain
Abstract: In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce FaithFusion, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications. Extensive experiments on the Waymo dataset demonstrate that our approach attains SOTA performance across NTA-IoU, NTL-IoU, and FID, maintaining an FID of 107.47 even at a 6-meter lane shift.
PaperID: 600,   Poster  https://arxiv.org/pdf/2512.20013     GitHub
Authors: Zepeng Xin, Kaiyu Li, Luodi Chen, Wanchen Li, Xiao Yuchen, Hui Qiao, Weizhan Zhang, Deyu Meng, Xiangyong Cao
Title: SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images
Abstract: Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation across four critical dimensions of language-guided segmentation: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. By capturing these dimensions, LaSeRS moves beyond simple commands, providing a benchmark for complex geospatial reasoning. This addresses a critical gap: existing datasets oversimplify, leading to sensitivity-prone real-world models. We also propose SegEarth-R2, an MLLM architecture designed for comprehensive language-guided segmentation in RS, which directly confronts these challenges. The model's effectiveness stems from two key improvements: (1) a spatial attention supervision mechanism that specifically handles the localization of small objects and their components, and (2) a flexible and efficient segmentation query mechanism that handles both single-target and multi-target scenarios. Experimental results demonstrate that our SegEarth-R2 achieves outstanding performance on LaSeRS and other benchmarks, establishing a powerful baseline for the next generation of geospatial segmentation. All data and code will be released.
PaperID: 601,   Poster  https://arxiv.org/pdf/2512.04426     GitHub
Authors: Sidan Zhu, Hongteng Xu, Dixin Luo
Title: Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation
Abstract: As a challenging video editing task, movie trailer generation involves selecting and reorganizing movie shots to create engaging trailers. Currently, most existing automatic trailer generation methods employ a "selection-then-ranking" paradigm (i.e., first selecting key shots and then ranking them), which suffers from inevitable error propagation and limits the quality of the generated trailers. Beyond this paradigm, we propose a new self-paced and self-corrective masked prediction method called SSMP, which achieves state-of-the-art results in automatic trailer generation via bi-directional contextual modeling and progressive self-correction. In particular, SSMP trains a Transformer encoder that takes the movie shot sequences as prompts and generates corresponding trailer shot sequences accordingly. The model is trained via masked prediction, reconstructing each trailer shot sequence from its randomly masked counterpart. The mask ratio is self-paced, allowing the task difficulty to adapt to the model and thereby improving model performance. When generating a movie trailer, the model fills the shot positions with high confidence at each step and re-masks the remaining positions for the next prediction, forming a progressive self-correction mechanism that is analogous to how human editors work. Both quantitative results and user studies demonstrate the superiority of SSMP in comparison to existing automatic movie trailer generation methods.
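The inference loop described at the end of this abstract (commit high-confidence positions, re-mask the rest, predict again) can be sketched generically as below. This is an illustration under stated assumptions, not the SSMP implementation: `predict` stands in for the trained encoder, `mask_id` is a hypothetical placeholder token, and the even per-step commit schedule is assumed because the abstract does not specify one.

```python
import numpy as np

def iterative_fill(predict, length: int, steps: int, mask_id: int = -1) -> np.ndarray:
    """Fill a fully masked sequence over several prediction rounds.

    predict(seq) must return an array of shape (length, n_candidates) with a
    score for every candidate shot at every position (hypothetical interface).
    """
    seq = np.full(length, mask_id, dtype=np.int64)
    for step in range(steps):
        masked = np.flatnonzero(seq == mask_id)
        if masked.size == 0:
            break
        probs = predict(seq)
        best = probs.argmax(axis=1)   # most likely candidate per position
        conf = probs.max(axis=1)      # confidence of that candidate
        # Assumed schedule: split the remaining positions evenly over the
        # remaining steps, so the final step commits everything left.
        n_commit = int(np.ceil(masked.size / (steps - step)))
        commit = masked[np.argsort(-conf[masked])[:n_commit]]
        seq[commit] = best[commit]    # uncommitted positions stay masked
    return seq

# Toy stand-in for a model: fixed scores for 4 positions and 2 candidates.
toy_probs = np.array([[0.1, 0.9], [0.8, 0.2], [0.4, 0.6], [0.55, 0.45]])
toy_out = iterative_fill(lambda seq: toy_probs, length=4, steps=2)
```

With a real model, `predict` would depend on the partially filled sequence, so positions left masked in early rounds are re-predicted with more context in later rounds.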
PaperID: 602,   Poster  https://arxiv.org/pdf/2511.16669     GitHub
Authors: Junhao Cheng, Liang Hou, Xin Tao, Jing Liao
Title: Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
Abstract: While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective output, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization.
PaperID: 603,   Poster  https://arxiv.org/pdf/2602.19089     GitHub
Authors: Qi Sun, Can Wang, Jiaxiang Shang, Yingchun Liu, Jing Liao
Title: Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling
Abstract: Current 3D human animation methods fail at photorealism: kinematics-based approaches lack non-rigid dynamics like clothing, while methods reconstructing from generated videos suffer from low-quality artifacts and identity loss. To overcome these limitations, we present Ani3DHuman, a framework that marries kinematics-based animation with video diffusion priors. We first introduce a layered motion representation that disentangles rigid motion and residual non-rigid motion. Then, we use a pretrained video diffusion model to restore a coarse rendering from the mesh-rigged animation, which provides supervision for the motion field. However, this restoration task, based on diffusion sampling, is highly challenging, as the initial renderings are out-of-distribution, causing standard deterministic ODE samplers to fail. Therefore, our core technical contribution is self-guided stochastic sampling, which effectively solves the out-of-distribution problem by combining stochastic sampling (for photorealistic quality) with self-guidance (for identity fidelity). These restored videos provide high-quality supervision, enabling the optimization of a realistic 4D motion field. Ani3DHuman achieves state-of-the-art results, and our ablations validate that both components of our sampler are essential for high-fidelity restoration.
PaperID: 604,   Poster  https://arxiv.org/pdf/2602.21668     GitHub
Authors: Junmyeong Lee, Hoseung Choi, Minsu Cho
Title: Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping
Abstract: Forecasting dynamic scenes remains a fundamental challenge in computer vision, as limited observations make it difficult to capture coherent object-level motion and long-term temporal evolution. We present Motion Group-aware Gaussian Forecasting (MoGaF), a framework for long-term scene extrapolation built upon the 4D Gaussian Splatting representation. MoGaF introduces motion-aware Gaussian grouping and group-wise optimization to enforce physically consistent motion across both rigid and non-rigid regions, yielding spatially coherent dynamic representations. Leveraging this structured space-time representation, a lightweight forecasting module predicts future motion, enabling realistic and temporally stable scene evolution. Experiments on synthetic and real-world datasets demonstrate that MoGaF consistently outperforms existing baselines in rendering quality, motion plausibility, and long-term forecasting stability.
PaperID: 605,   Poster  https://arxiv.org/pdf/2507.14137     GitHub
Authors: Shashanka Venkataramanan, Valentinos Pariza, Mohammadreza Salehi, Lukas Knobel, Elias Ramzi, Spyros Gidaris, Andrei Bursuc, Yuki M Asano
Title: Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
Abstract: We present Franca (pronounced Franka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, and SigLIPv2. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in self-supervised learning clustering methods. Existing approaches assign image features to large codebooks via clustering algorithms such as Sinkhorn-Knopp, but they often overlook the inherent ambiguity in cluster semantics. To address this, we introduce a multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, producing higher-quality dense representations. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community.
PaperID: 606,   Poster  https://arxiv.org/pdf/2512.04012     GitHub
Authors: Jisang Han, Sunghwan Hong, Jaewoo Jung, Wooseok Jang, Honggyu An, Qianqian Wang, Seungryong Kim, Chen Feng
Title: Emergent Outlier View Rejection in Visual Geometry Grounded Transformers
Abstract: Reliable 3D reconstruction from in-the-wild image collections is often hindered by noisy images, i.e., irrelevant inputs with little or no view overlap with others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that an existing feed-forward reconstruction model, e.g., VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable an effective noise-filtering capability, which we simply leverage to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios. Code will be released.
PaperID: 607,   Poster  https://arxiv.org/pdf/2603.03744     GitHub
Authors: Tuan Duc Ngo, Gabriel Huang, Seoung Wug Oh, Kevin Blackburn-Matzen, Evangelos Kalogerakis, Chuang Gan, Joon-Young Lee
Title: DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation
Abstract: Estimating accurate, view-consistent geometry and camera poses from uncalibrated multi-view/video inputs remains challenging, especially at high spatial resolutions and over long sequences. We present DAGE, a dual-stream transformer whose main novelty is to disentangle global coherence from fine detail. A low-resolution stream operates on aggressively downsampled frames with alternating frame/global attention to build a view-consistent representation and estimate cameras efficiently, while a high-resolution stream processes the original images per-frame to preserve sharp boundaries and small structures. A lightweight adapter fuses these streams via cross-attention, injecting global context without disturbing the pretrained single-frame pathway. This design scales resolution and clip length independently, supports inputs up to 2K, and maintains practical inference cost. DAGE delivers sharp depth/pointmaps, strong cross-view consistency, and accurate poses, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.
PaperID: 608,   Poster  https://arxiv.org/pdf/2512.20615     GitHub
Authors: Xuanhua He, Tianyu Yang, Ke Cao, Rui-Qi Wu, Meng Cheng, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Qifeng Chen
Title: Active Intelligence in Video Avatars via Closed-loop World Modeling
Abstract: Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.
PaperID: 609,   Poster  https://arxiv.org/pdf/2603.00486     GitHub
Authors: Qihang Fan, Yuang Ai, Huaibo Huang, Ran He
Title: Random Wins All: Rethinking Grouping Strategies for Vision Tokens
Abstract: Since Transformers were introduced into vision architectures, their quadratic complexity has been a significant issue that many research efforts aim to address. A representative approach involves grouping tokens and then either performing self-attention within each group or pooling the tokens within each group into a single token. To this end, various carefully designed grouping strategies have been proposed to enhance the performance of Vision Transformers. Here, we pose the following questions: Are these carefully designed grouping methods truly necessary? Is there a simpler and more unified token grouping method that can replace these diverse methods? In response, we propose a simple and fast random grouping strategy for vision tokens. We validate this approach on multiple baselines, and experiments show that random grouping outperforms almost all other grouping methods. For example, compared to the classic Swin Transformer, our random grouping strategy achieves improvements of +1.3, +0.9, and +0.9 across three model sizes. When transferred to downstream tasks, such as object detection, random grouping demonstrates even more pronounced advantages. In response to this phenomenon, we conduct a detailed analysis of the advantages of random grouping from multiple perspectives and identify several crucial elements for the design of grouping strategies: positional information, head feature diversity, global receptive field, and fixed grouping pattern. We demonstrate that as long as these four conditions are met, vision tokens require only an extremely simple grouping strategy to efficiently and effectively handle various visual tasks.
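The idea of attending only within randomly formed token groups can be sketched as follows. This is a minimal single-head illustration, not the paper's architecture: queries, keys, and values are the raw tokens (no learned projections), and the permutation `perm` is passed in so that it can be held fixed, which is one assumed reading of the "fixed grouping pattern" element named in the abstract.

```python
import numpy as np

def random_group_attention(x: np.ndarray, group_size: int, perm: np.ndarray) -> np.ndarray:
    """Self-attention restricted to random token groups.

    x:    (n_tokens, dim) token features.
    perm: a permutation of range(n_tokens) that defines the groups.
    Cost scales with n_tokens * group_size instead of n_tokens ** 2.
    """
    n, d = x.shape
    assert n % group_size == 0, "n_tokens must be divisible by group_size"
    # Shuffle tokens, then cut the shuffled sequence into contiguous groups.
    groups = x[perm].reshape(n // group_size, group_size, d)
    # Scaled dot-product attention inside each group only.
    scores = groups @ groups.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = (weights @ groups).reshape(n, d)
    # Scatter results back to the original token order.
    result = np.empty_like(out)
    result[perm] = out
    return result

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
fixed_perm = rng.permutation(8)  # drawn once and reused
mixed = random_group_attention(tokens, group_size=4, perm=fixed_perm)
```

Because the groups are drawn from the whole token set rather than from local windows, each group can mix tokens from distant image regions, which is one way such a scheme could provide a global receptive field.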
PaperID: 610,   Poster  https://arxiv.org/pdf/2604.09415     GitHub
Authors: Siyuan Zhou, Hejun Wang, Hu Cheng, Jinxi Li, Dongsheng Wang, Junwei Jiang, Yixiao Jin, Jiayue Huang, Shiwei Mao, Shangjia Liu, Yafei Yang, Hongkang Song, Shenxing Wei, Zihui Zhang, DataTeam vLAR, Bing Wang, Zhihua Wang, Chuhang Zou, Bo Yang
Title: PhysInOne: Visual Physics Learning and Reasoning in One Suite
Abstract: We present PhysInOne, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 1.4 million videos across 129,400 dynamic 3D scenes, covering 68 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multi-object interactions against complex backgrounds, with comprehensive ground-truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne’s efficacy across four emerging applications: physics-aware video generation, long-/short-term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine-tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics-grounded world models in generation, simulation, and embodied AI.
PaperID: 611,   Poster  https://arxiv.org/pdf/2508.20088     GitHub
Authors: Yuxin Guo, Teng Wang, Yuying Ge, Shijie Ma, Yixiao Ge, Wei Zou
Title: AudioStory: Generating Long-Form Narrative Audio with Large Language Models
Abstract: Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To fill this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features: (1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, i.e., a bridging query for intra-event semantic alignment and a residual query for inter-event coherence preservation. (2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish a benchmark AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity. Our code and dataset will be publicly available.
PaperID: 612,   Poster  https://arxiv.org/pdf/2602.20409     GitHub
Authors: Mainak Singha, Sarthak Mehrotra, Paolo Casari, Subhasis Chaudhuri, Elisa Ricci, Biplab Banerjee
Title: CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation
Abstract: Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines.
PaperID: 613,   Poster  https://arxiv.org/pdf/2511.02607     GitHub
Authors: Xu Zhang, Danyang Li, Xiaohang Dong, Tianhao Wu, Hualong Yu, Jianye Wang, Qicheng Li, Xiang Li
Title: UniChange: Unifying Change Detection with Multimodal Large Language Model
Abstract: Change detection (CD) is a fundamental task for monitoring and analysing land cover dynamics. While recent high-performance models and high-quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalisation and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialised CD functionalities. We introduce three special tokens: [T1], [T2], and [CHANGE], utilising their embeddings as the key to query variations. This approach successfully accommodates both BCD and SCD tasks. Furthermore, UniChange utilises text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods.
PaperID: 614,   Poster  https://arxiv.org/pdf/2603.29494     GitHub
Authors: Anmin Liu, Ruixuan Yang, Huiqiang Jiang, Bin Lin, Minmin Sun, Yong Li, CHEN ZHANG, Tao Xie
Title: VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference
Abstract: Long-context video understanding and generation pose a significant computational challenge for Transformer-based video models due to the quadratic complexity of self-attention. While existing sparse attention methods employ coarse-grained patterns to improve efficiency, they typically incur redundant computation and suboptimal performance. To address this issue, we propose VecAttention, a novel vector-wise sparse attention framework that achieves superior accuracy-efficiency trade-offs for video models. We observe that video attention maps exhibit a strong vertical-vector sparse pattern, and further demonstrate that this vertical-vector pattern offers consistently better accuracy–sparsity trade-offs compared with existing coarse-grained sparse patterns. Based on this observation, VecAttention dynamically selects and processes only informative vertical vectors through a lightweight important-vector selection that minimizes memory access overhead and an optimized vector sparse attention kernel. Comprehensive evaluations on video understanding (VideoMME, LongVideoBench, and VCRBench) and generation (VBench) tasks show that VecAttention delivers a 2.65× speedup over full attention and a 1.83× speedup over state-of-the-art sparse attention methods, with comparable accuracy to full attention.
PaperID: 615,   Poster  https://arxiv.org/pdf/2512.14884     GitHub
Authors: Huzheng Yang, Katherine Xu, Andrew Lu, Michael Grossberg, Yutong Bai, Jianbo Shi
Title: Vibe Spaces for Creatively Connecting and Expressing Visual Concepts
Abstract: Creating new visual concepts often requires connecting distinct ideas through their most relevant shared attributes: their vibe. We introduce Vibe Blending, a novel task for generating coherent and meaningful hybrids that reveal these shared attributes between images. Achieving such blends is challenging for current methods, which struggle to identify and traverse nonlinear paths linking distant concepts in latent space. We propose Vibe Space, a hierarchical graph manifold that learns low-dimensional geodesics in feature spaces like CLIP, enabling smooth and semantically consistent transitions between concepts. To evaluate creative quality, we design a cognitively inspired framework combining human judgments, LLM reasoning, and a geometric path-based difficulty score. We find that Vibe Space produces blends that humans consistently rate as more creative and coherent than current methods.
PaperID: 616,   Poster  https://arxiv.org/pdf/2511.22466     GitHub
Authors: Xiyan Liu, Han Wang, Yuhu Wang, JUNJIE CAI, Zhe Cao, Yangjianzhong Yangjianzhong, Zhen Lu
Title: RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding
Abstract: Understanding mid-level road semantics, the structural and contextual cues that bridge low-level perception and high-level planning, is essential for reliable autonomous driving and digital map construction. However, existing benchmarks primarily target perception tasks such as detection or segmentation, overlooking the reasoning capabilities required to infer road topology and dynamic scene structure. To address this gap, we present RoadSceneBench, a lightweight yet information-rich benchmark designed to evaluate and advance visual reasoning in complex road environments. Unlike large-scale perception datasets, RoadSceneBench emphasizes relational understanding and structural consistency, encouraging models to capture the underlying logic of real-world road scenes. Furthermore, to enhance reasoning reliability, we propose Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T), a training framework for Vision-Language Models (VLMs) in which reward signals adaptively promote spatial coherence and semantic alignment throughout the reasoning process. This paradigm enables models to move beyond static recognition toward geometry-aware and temporally consistent reasoning. Extensive experiments demonstrate that our method achieves state-of-the-art performance across diverse road configurations. RoadSceneBench thus provides a compact yet powerful foundation for studying mid-level road semantics and fostering structure-aware autonomous perception. Our dataset will be made publicly available in the final version.
PaperID: 617,   Poster  https://arxiv.org/pdf/2602.03796     GitHub
Authors: Zhixue Fang, Xu He, Songlin Tang, Haoxian Zhang, Qingfeng Li, Xiaoqiang Liu, Pengfei Wan, Kun Gai
Title: 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
Abstract: Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (single-view, multi-view, and moving-camera videos), forcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator's priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.
PaperID: 618,   Poster  https://arxiv.org/pdf/2603.14052     GitHub
Authors: Yichang Xu, Gaowen Liu, Ramana Kompella, Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Tekin, Zachary Yahn, Ling Liu
Title: A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning
Abstract: This paper presents a multi-agent perception-action exploration alliance, dubbed A4VL, for efficient long-video reasoning. A4VL operates in a multi-round perception–action exploration loop with a selection of VLM agents. In each round, the team of agents performs video question answering (VideoQA) via perception exploration followed by action exploration. During perception exploration, each agent learns to extract query-specific perception clue(s) from a few sampled frames and performs clue-based alignment to find the video block(s) that are most relevant to the query-specific event. During action exploration, A4VL performs video reasoning in three steps: (1) each agent produces its initial answer with a rationale, (2) all agents collaboratively score one another through cross-reviews and relevance ranking, and (3) based on whether a satisfactory consensus is reached, the decision is made either to start a new round of perception-action deliberation by pruning (e.g., filtering out the lowest-performing agent) and re-staging (e.g., new-clue and matching-block based perception-action exploration), or to conclude by producing the final answer. The integration of the multi-agent alliance through multi-round perception-action exploration, coupled with event-driven partitioning and cue-guided block alignment, enables A4VL to effectively scale to real-world long videos while preserving high-quality video reasoning. Evaluation results on five popular VideoQA benchmarks show that A4VL outperforms 18 existing representative VLMs and 10 recent methods optimized for long-video reasoning, while achieving significantly lower inference latency.
PaperID: 619,   Poster  https://arxiv.org/pdf/2510.20822     GitHub
Authors: Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Shuailei Ma, Yixuan LI, Chen Cheng, Yanhong Zeng, Xing Zhu, Yujun Shen, Huamin Qu
Title: HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
Abstract: State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives that are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state of the art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future.
PaperID: 620,   Poster  https://arxiv.org/pdf/2601.02457     GitHub
Authors: Souhail Hadgi, Bingchen Gong, Ramana Sundararaman, Emery Pierson, Lei Li, Peter Wonka, Maks Ovsjanikov
Title: PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding
Abstract: Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. Recent approaches leverage vision and language foundation models to directly solve dense tasks through multi-view renderings and text queries. While promising, these pipelines require expensive inference over multiple renderings, depend heavily on large language-model (LLM) prompt engineering for captions, and fail to exploit the inherent 3D geometry of shapes. We address this gap by introducing an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our pre-training approach builds on existing data engines that generate part-annotated 3D shapes by pairing multi-view SAM regions with VLM captioning. Using this data, we train a point cloud transformer encoder in two stages: (1) distillation of dense 2D features from visual encoders such as DINOv2 into 3D patches, and (2) alignment of these patch embeddings with part-level text embeddings through a multi-positive contrastive objective. Our 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference without any test-time multi-view rendering, while significantly outperforming previous rendering-based and feed-forward approaches across several 3D part segmentation benchmarks.
PaperID: 621,   Poster  https://arxiv.org/pdf/2509.17704     GitHub
Authors: Bo Li, Yunkuo Lei, Tingting Bao, Hang Yan, Yaxian Wang, Weiping Fu, Lingling Zhang, Jun Liu
Title: Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion
Abstract: Multi-focus image fusion (MFIF) is a crucial technique in image processing, with a key challenge being the generation of decision maps with precise boundaries. However, traditional methods based on heuristic rules and deep learning methods with black-box networks struggle to generate high-quality decision maps. To overcome this challenge, we introduce neurodynamics-driven coupled neural P (CNP) systems, which are biological neural computation models inspired by spiking mechanisms, to enhance the accuracy of decision maps. Specifically, we first conduct an in-depth analysis of the model's neurodynamics to identify the constraints between the network parameters and the input signals. This analysis avoids abnormal continuous firing of neurons and ensures the model accurately distinguishes between focused and unfocused regions, generating high-quality decision maps for MFIF. Based on this analysis, we propose a Neurodynamics-Driven CNP Fusion model (ND-CNPFuse) tailored for the challenging MFIF task. Unlike current approaches to decision map generation, ND-CNPFuse distinguishes between focused and unfocused regions by mapping the source images into interpretable spike matrices. By comparing the number of spikes, an accurate decision map can be generated directly without any post-processing. Extensive experimental results show that ND-CNPFuse achieves new state-of-the-art performance on four classical MFIF datasets, including Lytro, MFFW, MFI-WHU, and Real-MFF. Code is provided in the supplementary material and will be released.
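Illustrative sketch (not from the paper): the abstract says a decision map is obtained by comparing spike counts between the source images. The snippet below shows only that comparison step on toy data. The spike-count matrices are taken as given inputs; how the CNP system produces them is not covered here. The function names `decision_map` and `fuse`, the tie-breaking rule in favour of the first image, and the per-pixel selection in `fuse` are all assumptions made for illustration.

```python
def decision_map(spikes_a, spikes_b):
    """Binary map: 1 where image A has at least as many spikes as image B.

    Ties go to image A; this tie-breaking rule is an assumption.
    """
    return [
        [1 if a >= b else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(spikes_a, spikes_b)
    ]


def fuse(img_a, img_b, dmap):
    """Pick each pixel from image A where the map is 1, otherwise from image B."""
    return [
        [a if d else b for a, b, d in zip(row_a, row_b, row_d)]
        for row_a, row_b, row_d in zip(img_a, img_b, dmap)
    ]


# Toy 2x2 spike-count matrices (hypothetical values).
spikes_a = [[5, 1], [3, 3]]
spikes_b = [[2, 4], [3, 0]]
dmap = decision_map(spikes_a, spikes_b)

# Toy 2x2 source images (hypothetical pixel values).
img_a = [[10, 20], [30, 40]]
img_b = [[11, 21], [31, 41]]
fused = fuse(img_a, img_b, dmap)
```

With these toy inputs, only the top-right pixel is taken from the second image, because that is the only position where its spike count is higher.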
PaperID: 622,   Poster  https://arxiv.org/pdf/2511.17405     GitHub
Authors: Yesheng Liu, Hao Li, Haiyu Xu, Baoqi Pei, Jiahao Wang, Mingxuan Zhao, Jing-Shu Zheng, Zheqi He, Jin-Ge Yao, Xi Yang, Bowen Qin, Jiajun Zhang
Title: Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
Abstract: Multiple-choice question answering (MCQA) has been a popular format for both evaluation and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes the accuracy metrics unreliable for indicating real capabilities and encourages explicit or implicit answer-guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions according to their answer types and applies a different rewriting and verification scheme to each. When applied to RFT, we convert 20k MCQA examples and use GRPO to finetune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.
PaperID: 623,   Poster  https://arxiv.org/pdf/2512.03932     GitHub
Authors: Donghun Ryou, Inju Ha, sanghyeok chu, Bohyung Han
Title: Beyond the Ground Truth: Enhanced Supervision for Image Restoration
Abstract: Deep learning-based image restoration has achieved significant success. However, when addressing real-world degradations, model performance is limited by the quality of ground-truth images in datasets due to practical constraints in data acquisition. To address this limitation, we propose a novel framework that enhances existing ground-truth images to provide higher-quality supervision for real-world restoration. Our framework generates perceptually enhanced ground-truth variants using super-resolution, and employs a conditional frequency mask generator to produce adaptive frequency masks. These masks guide the optimal fusion of frequency components from the original ground truth and its super-resolved variants to yield enhanced ground-truth images. This frequency-domain mixup preserves the semantic consistency of the original content while selectively enriching perceptual details, preventing hallucinated artifacts that could compromise fidelity. The enhanced ground-truth images are used to train a lightweight output refinement network that can be seamlessly integrated with existing restoration models. Extensive experiments demonstrate that our approach consistently improves the quality of restored images. We further validate the effectiveness of both supervision enhancement and output refinement through user studies. We will publicly release our code, enhanced images and model weights to support reproducibility.
PaperID: 624,   Poster  https://arxiv.org/pdf/2505.16278     GitHub
Authors: Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan
Title: DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
Abstract: End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensor data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that expert specialization enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. First, we introduce Drive-π0, a Vision-Language-Action (VLA) baseline adapted from Embodied AI for autonomous driving, which serves as the foundation model for DriveMoE. Building on this, we strengthen perception through a carefully designed Vision MoE, where a router adaptively selects context-relevant camera views. This mechanism is inspired by human driving cognition, in which attention is directed to key visual cues rather than to all sensory inputs simultaneously. Beyond perception, we introduce an Action MoE that augments the framework by training a router to activate specialized expert modules tailored to distinct driving behaviors. Within the Action MoE, we implement two distinct styles (Token-level Router and Trajectory-level Router) and extensively explore their applicability in autonomous driving. In Bench2Drive closed-loop evaluations, DriveMoE demonstrates robust performance across diverse driving scenarios, alleviates the mode-averaging effect that limits existing models, and achieves state-of-the-art results with significant improvements over Drive-π0. We will release our code and models of DriveMoE and Drive-π0.
PaperID: 625,   Poster  https://arxiv.org/pdf/2604.08147     GitHub
Authors: Linge Wang, Yingying Chen, Bingke Zhu, Lu Zhou, Jinqiao Wang
Title: Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
Abstract: Recent advances in audio–visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, when these objectives are jointly optimized within a single representation space, the contrastive branch is forced to rely on randomly visible patches that often lack semantic relevance. This coupling injects semantic noise into global tokens and creates interference between generative and discriminative objectives, ultimately weakening fine-grained cross-modal alignment. We revisit this formulation and propose TG-DP, a Teacher-Guided Dual-Path framework that separates reconstruction and alignment into independent optimization paths while injecting stable semantic structure into the contrastive branch. A teacher model provides holistic, unmasked semantic targets that guide the student’s token selection, allowing the alignment pathway to focus on consistently meaningful regions without being constrained by reconstruction dynamics. TG-DP yields substantial improvements in zero-shot retrieval, increasing R@1 from 35.2% to 37.4% (Vision→Audio) and 27.9% to 37.1% (Audio→Vision) on AudioSet, and from 27.9% to 31.3% and 23.2% to 30.3% on VGGSound. Despite prioritizing alignment fidelity, the learned representations remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our findings show that decoupling multimodal objectives while imposing teacher-guided semantic structure provides a simple yet powerful principle for advancing large-scale audio–visual pretraining.
PaperID: 626,   Poster  https://arxiv.org/pdf/2509.01552     GitHub
Authors: Chen junjie, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen
Title: Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
Abstract: Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (i.e., V^2Drop), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks demonstrate that V^2Drop maintains 94.0% and 98.6% of the original performance for image and video understanding tasks, respectively, while reducing LLM generation latency by 31.5% and 74.2%. When combined with efficient operators, V^2Drop further reduces GPU peak memory usage. Code is available in supplementary materials.
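Illustrative sketch (not the authors' implementation): the abstract describes dropping visual tokens whose hidden states vary least. The snippet below shows one way a single dropping step could look on toy data, under assumptions the abstract does not confirm: variation is measured as the L2 distance between a token's hidden state in two consecutive layers, and a fixed `keep_ratio` decides how many tokens survive. The function names and the toy values are hypothetical.

```python
import math


def token_variation(prev, curr):
    """L2 distance between one token's hidden state in two consecutive layers.

    Using L2 distance as the variation measure is an assumption.
    """
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))


def drop_low_variation_tokens(prev_states, curr_states, keep_ratio):
    """Return the indices of tokens to keep, in their original order.

    Tokens are ranked by variation and the lowest-variation ones are dropped.
    """
    n = len(curr_states)
    n_keep = max(1, math.ceil(n * keep_ratio))
    scores = [token_variation(p, c) for p, c in zip(prev_states, curr_states)]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Four toy visual tokens with 2-dimensional hidden states (hypothetical values).
prev_states = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
curr_states = [[0.0, 0.1], [2.0, 1.0], [2.0, 2.0], [3.0, 5.0]]

kept = drop_low_variation_tokens(prev_states, curr_states, keep_ratio=0.5)
```

Here tokens 1 and 3 change the most between layers and are kept, while tokens 0 and 2 are dropped. Applying such a step at several layers in turn would give the progressive removal the abstract mentions.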
PaperID: 627,   Poster  https://arxiv.org/pdf/2512.11799     GitHub
Authors: Ye Fang, Tong Wu, Valentin Deschaintre, Duygu Ceylan, Iliyan Georgiev, Chun-Hao P. Huang, Yiwei Hu, Xuelin Chen, Tuanfeng Y. Wang
Title: V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
Abstract: Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.
PaperID: 628,   Poster  https://arxiv.org/pdf/2601.14602     GitHub
Authors: Oindrila Saha, Vojtech Krs, Radomir Mech, Subhransu Maji, Matheus Gadelha, Kevin Blackburn-Matzen
Title: 3D Space as a Scratchpad for Editable Text-to-Image Generation
Abstract: Recent progress in large language models (LLMs) has shown that reasoning improves when intermediate thoughts are externalized into explicit workspaces, such as chain-of-thought traces or tool-augmented reasoning. Yet, visual language models (VLMs) lack an analogous mechanism for spatial reasoning, limiting their ability to generate images that accurately reflect geometric relations, object identities, and compositional intent. We introduce the concept of a spatial scratchpad, a 3D reasoning substrate that bridges linguistic intent and image synthesis. Given a text prompt, our framework parses subjects and background elements, instantiates them as editable 3D meshes, and employs agentic scene planning for placement, orientation, and viewpoint selection. The resulting 3D arrangement is rendered back into the image domain with identity-preserving cues, enabling the VLM to generate spatially consistent and visually coherent outputs. Unlike prior 2D layout-based methods, our approach supports intuitive 3D edits that propagate reliably into final images. Empirically, it achieves a 32% improvement in text alignment on GenAI-Bench, demonstrating the benefit of explicit 3D reasoning for precise, controllable image generation. Our results highlight a new paradigm for vision–language models that deliberate not only in language, but also in space.
PaperID: 629,   Poster  https://arxiv.org/pdf/2603.00717     GitHub
Authors: Qinghui He, Haifeng Zhang, Qiao Qin, Bo Liu, Xiuli Bi, Bin Xiao
Title: Diversity over Uniformity: Rethinking Representation in Generated Image Detection
Abstract: With the rapid advancement of generative models, generated image detection has become an important task in visual forensics. Although existing methods have achieved remarkable progress, they often rely, after training, on only a small subset of highly salient forgery cues, which limits their ability to generalize to unseen generative mechanisms. We argue that reliable generated image detection should not depend on a single decision path but should preserve multiple judgment perspectives, enabling the model to understand the differences between real and generated images from diverse viewpoints. Based on this idea, we propose an anti-feature-collapse learning framework that filters task-irrelevant components and suppresses excessive overlap among different forgery cues in the representation space, preventing discriminative information from collapsing into a few dominant feature directions. This design maintains diverse and complementary evidence within the model, reduces reliance on a small set of salient cues, and enhances robustness under unseen generative settings. Extensive experiments on multiple public benchmarks demonstrate that the proposed method significantly outperforms state-of-the-art approaches in cross-model scenarios, achieving an accuracy improvement of 5.02% and exhibiting superior generalization and detection reliability. The source code will be publicly available upon publication.
PaperID: 630,   Poster  https://arxiv.org/pdf/2601.09452     GitHub
Authors: Ahmad Rahimi, Valentin Gerard, Éloi Zablocki, Matthieu Cord, Alex Alahi
Title: MAD: Motion Appearance Decoupling for efficient Driving World Models
Abstract: Recent video diffusion models generate photorealistic, temporally coherent videos, yet they fall short as reliable world models for autonomous driving, where structured motion and physically consistent interactions are essential. Adapting these generalist video models to driving domains has shown promise but typically requires massive domain-specific data and costly fine-tuning. We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion learning from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, effectively “dressing” the motion with texture and lighting. This two-stage process mirrors a reasoning-rendering paradigm: first infer dynamics, then render appearance. Our experiments show this decoupled approach is exceptionally efficient: adapting SVD, we match prior SOTA models with less than 6% of their compute. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors, and supports a comprehensive suite of text, ego, and object controls. Code will be released.
PaperID: 631,   Poster  https://arxiv.org/pdf/2511.18786     GitHub
Authors: Junyang Chen, Jiangxin Dong, Long Sun, Yixin Yang, Jinshan Pan
Title: STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution
Abstract: We present STCDiT, a video super-resolution framework built upon a pre-trained video diffusion model, aiming to restore structurally faithful and temporally stable videos from degraded inputs, even under complex camera motions. The main challenges lie in maintaining temporal stability during reconstruction and preserving structural fidelity during generation. To address these challenges, we first develop a motion-aware VAE reconstruction method that performs segment-wise reconstruction, with each segment clip exhibiting uniform motion characteristics, thereby effectively handling videos with complex camera motions. Moreover, we observe that the first-frame latent extracted by the VAE encoder in each clip, termed the anchor-frame latent, remains unaffected by temporal compression and retains richer spatial structural information than subsequent frame latents. We further develop an anchor-frame guidance approach that leverages structural information from anchor frames to constrain the generation process and improve the structural fidelity of video features. Coupling these two designs enables the video diffusion model to achieve high-quality video super-resolution. Extensive experiments show that STCDiT outperforms state-of-the-art methods in terms of structural fidelity and temporal consistency.
PaperID: 632,   Poster  https://arxiv.org/pdf/2510.08532     GitHub
Authors: Rishubh Parihar, Or Patashnik, Daniil Ostashev, R. Venkatesh Babu, Daniel Cohen-Or, Kuan-Chieh Wang
Title: Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing
Abstract: Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. Yet, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength, which is then paired with the edit instruction, enabling explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model's modulation space. For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach for fine-grained control over edit strength for instruction-driven editing, from subtle to strong, across diverse operations such as stylization, attribute, material, background, and shape changes, without requiring attribute-specific training.
PaperID: 633,   Poster  https://arxiv.org/pdf/2512.10286     GitHub
Authors: Xiaoxue Wu, Xinyuan Chen, Yaohui Wang, Yu Qiao
Title: ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions
Abstract: Shot transitions play a pivotal role in multi-shot video generation, as they determine the overall narrative expression and the directorial design of visual storytelling. However, recent progress has primarily focused on low-level visual consistency across shots, neglecting how transitions are designed and how cinematographic language contributes to coherent narrative expression. This often leads to mere sequential shot changes without intentional film-editing patterns. To address this limitation, we propose ShotDirector, an efficient framework that integrates parameter-level camera control and hierarchical editing-pattern-aware prompting. Specifically, we adopt a camera control module that incorporates 6-DoF poses and intrinsic settings to enable precise camera information injection. In addition, a shot-aware mask mechanism is employed to introduce hierarchical prompts aware of professional editing patterns, allowing fine-grained control over shot content. Through this design, our framework effectively combines parameter-level conditions with high-level semantic guidance, achieving film-like controllable shot transitions. To facilitate training and evaluation, we construct ShotWeaver40K, a dataset that captures the priors of film-like editing patterns, and develop a set of evaluation metrics for controllable multi-shot video generation. Extensive experiments demonstrate the effectiveness of our framework.
PaperID: 634,   Poster  https://arxiv.org/pdf/2603.27268     GitHub
Authors: Renaud Vandeghen, Fida Mohammad Thoker, Marc Van Droogenbroeck, Bernard Ghanem
Title: TrackMAE: Video Representation Learning via Track Mask and Predict
Abstract: Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but only models motion information implicitly, limiting the encoding of temporal dynamics in the learned representations. As a result, such models struggle on motion-centric tasks that require fine-grained motion awareness. To address this, we propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. In TrackMAE, we use an off-the-shelf point tracker to sparsely track points in the input videos, generating motion trajectories. Furthermore, we exploit the extracted trajectories to improve the random tube masking with a motion-aware masking strategy. We enhance video representations learned in both pixel and feature semantic reconstruction space by providing a complementary supervision signal in the form of motion targets. We evaluate on six datasets across diverse downstream settings and find that TrackMAE consistently outperforms the state-of-the-art video SSL baselines, thereby learning more discriminative and generalizable representations.
PaperID: 635,   Poster  https://arxiv.org/pdf/2603.06213     GitHub
Authors: Xiaoxing You, Qiang Huang, Lingyu Li, Xiaojun Chang, Jun Yu
Title: Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events
Abstract: Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating key information across multiple modalities such as videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce CoE, a training-free MMS framework that performs structured reasoning through a Chain-of-Events guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into a structured prior that serves as a global scaffold for cross-modal reasoning. Guided by this hierarchy, CoE first performs cross-modal grounding to localize key visual cues, followed by event-evolution reasoning to capture temporal dependencies and causal transitions across the video. A lightweight style adaptation module further refines the generated summaries to match domain-specific linguistic conventions. Extensive experiments on eight diverse datasets demonstrate that CoE consistently outperforms state-of-the-art video CoT baselines, achieving average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore, highlighting its robustness, interpretability, and superior cross-domain generalization.
PaperID: 636,   Poster  https://arxiv.org/pdf/2603.27516     GitHub
Authors: Jiahao Niu, Rongjia Zheng, Wenju Xu, Wei-Shi Zheng, Qing Zhang
Title: SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering
Abstract: We present SGS-Intrinsic, an indoor inverse rendering framework that works well for sparse-view images. Unlike existing 3D Gaussian Splatting (3DGS) based methods that focus on object-centric reconstruction and fail to work under sparse-view settings, our method achieves high-quality geometry reconstruction and accurate disentanglement of material and illumination. The core idea is to construct a dense and geometry-consistent Gaussian semantic field guided by semantic and geometric priors, providing a reliable foundation for subsequent inverse rendering. Building upon this, we perform material–illumination disentanglement by combining a hybrid illumination model and a material prior to effectively capture illumination–material interactions. To mitigate the impact of cast shadows and enhance the robustness of material recovery, we introduce an illumination-invariant material constraint together with a de-shadowing model. Extensive experiments on benchmark datasets show that our method consistently improves both reconstruction fidelity and inverse rendering quality over existing 3DGS-based inverse rendering approaches. Our code will be made publicly available.
PaperID: 637,   Poster  https://arxiv.org/pdf/2603.25267     GitHub
Authors: Yuhan Chen, Pengwen Dai, Chuan Wang, Dayan Wu, Xiaochun Cao
Title: EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval
Abstract: Text-video retrieval tasks have seen significant improvements due to the recent development of large-scale vision-language pre-trained models. Traditional methods primarily focus on video representations or cross-modal alignment, while recent works shift toward enriching text expressiveness to better match the rich semantics in videos. However, these methods use only interactions between text and frames/video, and ignore rich interactions among the internal frames within a video, so the final expanded text cannot capture frame contextual information, leading to disparities between text and video. In response, we introduce the Energy-Aware Fine-Grained Relationship Learning Network (EagleNet) to generate accurate and context-aware enriched text embeddings. Specifically, the proposed Fine-Grained Relationship Learning mechanism (FRL) first constructs a text-frame graph from the generated text candidates and frames, then learns relationships among texts and frames, which are finally used to aggregate text candidates into an enriched text embedding that incorporates frame contextual information. To further improve fine-grained relationship learning in FRL, we design Energy-Aware Matching (EAM) to model the energy of text-frame interactions and thus accurately capture the distribution of real text-video pairs. Moreover, for more effective cross-modal alignment and stable training, we replace the conventional softmax-based contrastive loss with the sigmoid loss. Extensive experiments have demonstrated the superiority of EagleNet across MSRVTT, DiDeMo, MSVD, and VATEX.
PaperID: 638,   Poster  https://arxiv.org/pdf/2603.06449     GitHub
Authors: Yitong Chen, Zuxuan Wu, Xipeng Qiu, Yu-Gang Jiang
Title: CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
Abstract: Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains nontrivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the "next-token prediction" pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying a nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization, REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 22.72 PSNR and 0.681 SSIM with fewer training epochs, and the AR model attains performance comparable to leading approaches.
PaperID: 639,   Poster  https://arxiv.org/pdf/2604.10485     GitHub
Authors: Haopeng Chen, Yihao Ai, Kabeen Kim, Robby T. Tan, Yixin Chen, Bo Wang
Title: UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation
Abstract: Low-visibility scenarios, such as low-light conditions, pose significant challenges to human pose estimation due to the scarcity of annotated low-light datasets and the loss of visual information under poor illumination. Recent domain adaptation techniques attempt to utilize well-lit labels by augmenting well-lit images to mimic low-light conditions. But handcrafted augmentations oversimplify noise patterns, while learning-based methods often fail to preserve high-frequency low-light characteristics, producing unrealistic images that lead pose models to generalize poorly to real low-light scenes. Moreover, recent pose estimators rely on image cues through image-to-keypoint cross-attention, but these cues become unreliable under low-light conditions. To address these issues, we propose Unsupervised Domain Adaptation for Pose Estimation (UDAPose), a novel framework that synthesizes realistic low-light images and dynamically fuses visual cues with pose priors for improved pose estimation. Specifically, our synthesis method incorporates a Direct-Current-based High-Pass Filter (DHF) and a Low-light Characteristics Injection Module (LCIM) to inject high-frequency details from input low-light images, overcoming rigidity or the detail loss in existing approaches. Furthermore, we introduce a Dynamic Control of Attention (DCA) module that adaptively balances image cues with learned pose priors in the Transformer architecture. Experiments show that UDAPose outperforms state-of-the-art methods, with notable AP gains of 10.1 (56.4%) on the challenging ExLPose-test hard set (LL-H) and 7.4 (31.4%) in cross-dataset validation on EHPT-XC.
PaperID: 640,   Poster  https://arxiv.org/pdf/2512.09824     GitHub
Authors: Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran ZHAO, Songchun Zhang, Anyi Rao
Title: Composing Concepts from Images and Videos via Concept-prompt Binding
Abstract: Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
PaperID: 641,   Poster  https://arxiv.org/pdf/2507.10065     GitHub
Authors: Chenguo Lin, Yuchen Lin, Panwang Pan, Yifan Yu, Tao Hu, Honglei Yan, Katerina Fragkiadaki, Yadong Mu
Title: MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
Abstract: We present MoVieS, a motion-aware view synthesis model that reconstructs 4D dynamic scenes from monocular videos in one second. It represents dynamic 3D scenes with pixel-aligned Gaussian primitives and explicitly supervises their time-varying motions. This allows, for the first time, the unified modeling of appearance, geometry and motion from monocular videos, and enables reconstruction, view synthesis and 3D point tracking within a single learning-based framework. By bridging view synthesis with geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision. As a result, it also naturally supports a wide range of zero-shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive performance while offering several orders of magnitude speedups.
PaperID: 642,   Poster  https://arxiv.org/pdf/2602.19863     GitHub
Authors: Filip Wolf, Blaz Rolih, Luka Cehovin Zajc
Title: Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation
Abstract: Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student’s pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources.
PaperID: 643,   Poster  https://arxiv.org/pdf/2602.20223     GitHub
Authors: Wall Kim, Chaeyoung Song, Hanul Kim
Title: MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
Abstract: Recently, TabPFN has gained attention as a foundation model for tabular data. However, it struggles to integrate heterogeneous modalities such as images and text, which are common in domains like healthcare and marketing, thereby limiting its applicability. To address this, we present the MultiModal Prior-data Fitted Network (MMPFN), which extends TabPFN to handle tabular and non-tabular modalities in a unified manner. MMPFN comprises per-modality encoders, modality projectors, and pre-trained foundation models. The modality projectors serve as the critical bridge, transforming non-tabular embeddings into tabular-compatible tokens for unified processing. To this end, we introduce a multi-head gated MLP and a cross-attention sampler that extract richer context from non-tabular inputs while mitigating the attention imbalance issue in multimodal learning. Extensive experiments on medical and general-purpose multimodal datasets demonstrate that MMPFN consistently outperforms competitive state-of-the-art methods and effectively exploits non-tabular modalities alongside tabular features. These results highlight the promise of extending prior-data fitted networks to the multimodal setting, offering a scalable and effective framework for heterogeneous data learning.
PaperID: 644,   Poster  https://arxiv.org/pdf/2506.18512     GitHub
Authors: Yuting Zhang, Kaishen Yuan, Hao Lu, Yutao Yue, Jintai Chen, Kaishun Wu
Title: MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis
Abstract: Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to integrate clinical multimodal data for reasoning and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction dataset that provides question-answer pairs for physiological-level interpretations and disease-level diagnoses with a Chain of Evidence approach. MedTVT-R1 incorporates a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions. Additionally, we employ Group Relative Policy Optimization (GRPO)-based Reinforcement Fine-Tuning with a Jaccard Reward function to enhance diagnostic reasoning. Experimental results demonstrate MedTVT-R1's superiority in multimodal feature utilization and multi-disease diagnosis, offering significant potential for clinical applications such as diagnostic report generation and comorbidity reasoning. The dataset and code will be available on GitHub.
PaperID: 645,   Poster  https://arxiv.org/pdf/2506.21742     GitHub
Authors: Swetha Sirnam, Rohit Gupta, Parth Parag Kulkarni, David Shatwell, Jeffrey A. Chan-Santiago, Nyle Siddiqui, Joseph Fioresi, Mubarak Shah
Title: VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues
Abstract: Video Question Answering (VideoQA) has made significant strides by leveraging multimodal learning to align visual and textual modalities. However, current benchmarks overwhelmingly focus on questions answerable through explicit visual content: actions, objects, and events directly observable within individual frames or short clips. To truly understand videos as humans do, models must go beyond what is directly shown, inferring hidden relationships and contextual cues that are only implied across frames. Humans naturally excel at such implicit reasoning, seamlessly integrating partial visual cues over time to infer motion dynamics, spatial layout and context, constructing a coherent mental model of the scene even when such relationships are never explicitly depicted. Current benchmarks fail to capture this essential aspect of video understanding. To address this gap, we introduce VRR-QA, a benchmark for Visual Relational Reasoning Beyond Explicit Cues. We curate our benchmark from creative and cinematic videos, such as movies, that deliberately employ storytelling techniques which omit direct depictions of certain events or relations, requiring viewers to infer them. VRR-QA comprises 1K meticulously expert-annotated QA pairs drawn from 1K creative video clips covering 15 genres across 7 decades of content, from both live-action and animated titles. These annotations are deliberately challenging, crafted by authors, validated through multiple annotators, and benchmarked against human performance to ensure high quality. Our extensive evaluations on 11 leading VideoQA models reveal consistent and significant performance degradation, underscoring their reliance on surface-level visual cues and highlighting the difficulty of implicit reasoning. Even the best model substantially underperforms human baselines with only 64% accuracy. Performance variations across models further illustrate the complexity and diversity of the challenges presented by VRR-QA. By releasing both dataset and data collection framework, VRR-QA establishes a rigorous, diverse, and reproducible testbed for advancing VideoQA.
PaperID: 646,   Poster  https://arxiv.org/pdf/2512.10954     GitHub
Authors: Sicheng Mo, Thao Nguyen, Richard Zhang, Nicholas Kolkin, Siddharth Iyer, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, Bolei Zhou, Yuheng Li
Title: Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration
Abstract: In this work, we explore an untapped signal in diffusion model inference. While all previous methods generate images independently at inference, we instead ask if samples can be generated collaboratively. We propose Group Diffusion, unlocking the attention mechanism to be shared across images, rather than limited to just the patches within an image. This enables images to be jointly denoised at inference time, learning both intra- and inter-image correspondence. We observe a clear scaling effect: larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a qualitative measure to capture this behavior and show that its strength closely correlates with FID. Built on standard diffusion transformers, our GroupDiff achieves up to 32.2% FID improvement on ImageNet-256×256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.
PaperID: 647,   Poster  https://arxiv.org/pdf/2604.15310     GitHub
Authors: Sumit Chaturvedi, Yannick Hold-Geoffroy, Mengwei Ren, Jingyuan Liu, He Zhang, Yiqun Mei, Julie Dorsey, ZHIXIN SHU
Title: TokenLight: Precise Lighting Control in Images using Attribute Tokens
Abstract: This paper presents a method for image relighting that enables precise and continuous control over multiple illumination attributes in a photograph. We formulate relighting as a conditional image generation task and introduce attribute tokens to encode distinct lighting factors such as intensity, color, ambient illumination, diffuse level, and 3D light positions. The model is trained on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real captures to enhance realism and generalization. We validate our approach across a variety of relighting tasks, including controlling in-scene lighting fixtures and editing environment illumination using virtual light sources, on synthetic and real images. Our method achieves state-of-the-art quantitative and qualitative performance compared to prior work. Remarkably, without explicit inverse rendering supervision, the model exhibits an inherent understanding of how light interacts with scene geometry, occlusion, and materials, yielding convincing lighting effects even in traditionally challenging scenarios such as placing lights within objects or relighting transparent materials plausibly.
PaperID: 648,   Poster  https://arxiv.org/pdf/2512.04761     GitHub
Authors: Yizi Chen, Sidi Wu, Tianyi Xiao, Nina Wiedemann, Loic Landrieu
Title: Order Matters: 3D Shape Generation from Sequential VR Sketches
Abstract: VR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD software. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for 3D shape generation from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates ordered VR sketches from arbitrary shapes, (ii) a dataset comprising over 20k synthetic and 900 hand-drawn sketch–shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work and generalizes effectively from synthetic to real sketches with minimal supervision. All data and models will be released in open access.
PaperID: 649,   Poster  https://arxiv.org/pdf/2512.08269     GitHub
Authors: Taewoong Kang, Kinam Kim, Dohyeon Kim, Minho Park, Junha Hyung, Jaegul Choo
Title: EgoX: Egocentric Video Generation from a Single Exocentric Video
Abstract: Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio–temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width- and channel-wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.
PaperID: 650,   Poster  https://arxiv.org/pdf/2603.04800     GitHub
Authors: Lulu Hu, Xiao Wenhu, Chen Xin, Xinhua Xu, Bowen Xu, Kun Li, Yongliang Tao
Title: MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models
Abstract: Post-training quantization (PTQ) with computational equivalence for Large Language Models (LLMs) has demonstrated remarkable advances; however, its application to Multimodal Large Language Models (MLLMs) presents substantial challenges. In this paper, we analyze SmoothQuant as a case study and identify two critical issues: Smoothing Misalignment and Cross-Modal Computational Invariance. To address these issues, we propose Modality-Aware Smoothing Quantization (MASQuant), a novel framework that introduces (1) Modality-Aware Smoothing (MAS), which learns separate, modality-specific smoothing factors to prevent Smoothing Misalignment, and (2) Cross-Modal Compensation (CMC), which addresses Cross-Modal Computational Invariance by using SVD whitening to transform multi-modal activation differences into low-rank forms, enabling unified quantization across modalities. MASQuant demonstrates stable quantization performance across both dual-modal and tri-modal MLLMs. Experimental results show that MASQuant is competitive among the state-of-the-art PTQ algorithms.
PaperID: 651,   Poster  https://arxiv.org/pdf/2604.04484     GitHub
Authors: Junyoung Park, Youngjin Oh, Nam Ik Cho
Title: TM-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising
Abstract: Blind-spot networks (BSNs) enable self-supervised image denoising by preventing access to the target pixel during training, allowing the network to estimate clean signals without ground-truth supervision. However, this approach assumes pixel-wise noise independence, which is violated in real-world sRGB images due to spatially correlated noise introduced by the camera's image signal processing (ISP) pipeline. While several methods employ downsampling strategies to decorrelate noise, these approaches alter noise statistics and limit the network's ability to utilize full contextual information. In this paper, we propose the Triangular-Masked Blind-Spot Network (TM-BSN), a novel blind-spot architecture that accurately models the spatial correlation of real sRGB noise. This correlation originates from the demosaicing process, where each pixel is reconstructed from neighboring samples with weights that decay spatially, resulting in a characteristic diamond-shaped pattern. To align the receptive field with this geometry, we introduce a triangular-masked convolution that restricts the kernel to its upper-triangular region, creating a diamond-shaped blind spot at the original image resolution. This design effectively excludes correlated pixels while fully leveraging uncorrelated contextual information, eliminating the need for downsampling or post-processing. Furthermore, we use knowledge distillation to transfer complementary knowledge from multiple blind-spot predictions into a lightweight U-Net, improving both accuracy and efficiency. Extensive experiments on real-world denoising benchmarks demonstrate that our method achieves state-of-the-art performance, significantly outperforming existing self-supervised approaches.
PaperID: 652,   Poster  https://arxiv.org/pdf/2512.06581     GitHub
Authors: Yuhao Su, Anwesa Choudhuri, Zhongpai Gao, Benjamin Planche, Van Nguyen Nguyen, Meng Zheng, Yuhan Shen, Arun Innanje, Terrence Chen, Ehsan Elhamifar, Ziyan Wu
Title: MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
Abstract: Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video-, segment-, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation. While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations: (1) cross-dataset reward normalization that maps each dataset's median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a medical LLM judge that evaluates caption quality on five clinical dimensions through comparative similarity scoring. Supervised fine-tuning of Qwen2.5-VL-7B on MedVidBench substantially outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, demonstrating MedVidBench's efficacy, while our MedGRPO framework further improves upon the SFT baseline across grounding and captioning tasks. Our work establishes a foundational benchmark and robust training methodology for advancing vision-language models in medical domains.
PaperID: 653,   Poster  https://arxiv.org/pdf/2512.10894     GitHub
Authors: Peiying Zhang, Nanxuan Zhao, Matthew Fisher, Yiran Xu, Jing Liao, Difan Liu
Title: DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
Abstract: Recent vision–language model (VLM)–based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model’s native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.
PaperID: 654,   Poster  https://arxiv.org/pdf/2603.03726     GitHub
Authors: Guohua Zhang, Jian Jin, Meiqin Liu, Chao Yao, Weisi Lin
Title: QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment
Abstract: No-Reference Point Cloud Quality Assessment (NR-PCQA) still struggles with generalization, primarily due to the scarcity of annotated point cloud datasets. Since the Human Visual System (HVS) drives perceptual quality assessment independently of media types, prior knowledge on quality learned from images can be repurposed for point clouds. This insight motivates adopting Unsupervised Domain Adaptation (UDA) to transfer quality-relevant priors from labeled images to unlabeled point clouds. However, existing UDA-based PCQA methods often overlook key characteristics of perceptual quality, such as sensitivity to quality ranking and quality-aware feature alignment, thereby limiting their effectiveness. To address these issues, we propose a novel Quality-aware Domain adaptation framework for PCQA, termed QD-PCQA. The framework comprises two main components: i) a Rank-weighted Conditional Alignment (RCA) strategy that aligns features under consistent quality levels and adaptively emphasizes misranked samples to reinforce perceptual quality ranking awareness; and ii) a Quality-guided Feature Augmentation (QFA) strategy, which includes quality-guided style mixup, multi-layer extension, and dual-domain augmentation modules to augment perceptual feature alignment. Extensive cross-domain experiments demonstrate that QD-PCQA significantly improves generalization in NR-PCQA tasks.
PaperID: 655,   Poster  https://arxiv.org/pdf/2603.19678     GitHub
Authors: Kunlun Xu, Haotong Cheng, Jiangmeng Li, Xu Zou, Jiahuan Zhou
Title: Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification
Abstract: Lifelong person re-identification (LReID) aims to learn from varying domains to obtain a unified person retrieval model. Existing LReID approaches typically focus on learning from scratch or a visual classification-pretrained model, while the Vision-Language Model (VLM) has shown generalizable knowledge in a variety of tasks. Although existing methods can be directly adapted to the VLM, since they only consider global-aware learning, the fine-grained attribute knowledge is underleveraged, leading to limited acquisition and anti-forgetting capacity. To address this problem, we introduce a novel VLM-driven LReID approach named Vision-Language Attribute Disentanglement and Reinforcement (VLADR). Our key idea is to explicitly model the universally shared human attributes to improve inter-domain knowledge transfer, thereby effectively utilizing historical knowledge to reinforce new knowledge learning and alleviate forgetting. Specifically, VLADR includes a Multi-grain Text Attribute Disentanglement mechanism that mines the global and diverse local text attributes of an image. Then, an Inter-domain Cross-modal Attribute Reinforcement scheme is developed, which introduces cross-modal attribute alignment to guide visual attribute extraction and adopts inter-domain attribute alignment to achieve fine-grained knowledge transfer. Experimental results demonstrate that our VLADR outperforms the state-of-the-art methods by 1.9%-2.2% and 2.1%-2.5% on anti-forgetting and generalization capacity. Our code will be released.
PaperID: 656,   Poster  https://arxiv.org/pdf/2511.22265     GitHub
Authors: Yuan Yao, Lixu Wang, Jiaqi Wu, Jin Song, Simin Chen, Zehua Wang, Zijian Tian, Wei Chen, Huixia Li, Xiaoxiao Li
Title: FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning
Abstract: Federated learning (FL) enables collaborative training across clients without compromising privacy. While most existing FL methods assume homogeneous model architectures, client heterogeneity in data and resources renders this assumption impractical, motivating model-heterogeneous FL. To address this problem, we propose Federated Representation Entanglement (FedRE), a framework built upon a novel form of client knowledge termed entangled representation. In FedRE, each client aggregates its local representations into a single entangled representation using normalized random weights and applies the same weights to integrate the corresponding one-hot label encodings into the entangled-label encoding. Those are then uploaded to the server to train a global classifier. During training, each entangled representation is supervised across categories via its entangled-label encoding, while random weights are re-sampled each round to introduce diversity, mitigating the global classifier’s overconfidence and promoting smoother decision boundaries. Furthermore, each client uploads a single cross-category entangled representation along with its entangled-label encoding, mitigating the risk of representation inversion attacks and reducing communication overhead. Extensive experiments demonstrate that FedRE achieves an effective trade-off among model performance, privacy protection, and communication overhead.
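The client-side entanglement step described above can be sketched in a few lines. This is a minimal illustration assuming uniform random weights normalized to sum to one; the abstract does not specify the sampling distribution, and the function name `entangle` and its parameters are hypothetical.

```python
import random


def entangle(representations, labels, num_classes, rng):
    """Mix local representations and their one-hot labels with shared random weights.

    The uniform sampling of the raw weights is an assumption; the abstract only
    says the weights are random and normalized.
    """
    raw = [rng.random() for _ in representations]
    total = sum(raw)
    weights = [w / total for w in raw]  # normalized: weights sum to 1

    dim = len(representations[0])
    # Weighted sum of the representation vectors, one coordinate at a time.
    entangled_rep = [
        sum(w * rep[d] for w, rep in zip(weights, representations))
        for d in range(dim)
    ]
    # The same weights applied to one-hot labels give a soft label vector.
    entangled_label = [0.0] * num_classes
    for w, y in zip(weights, labels):
        entangled_label[y] += w
    return entangled_rep, entangled_label
```

Calling `entangle` with a fresh draw from `rng` each round corresponds to the re-sampling of weights mentioned in the abstract: the client uploads only the single mixed vector and its soft label, not the individual representations.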
PaperID: 657,   Poster  https://arxiv.org/pdf/2603.23883     GitHub
Authors: Risa Shinoda, Kaede Shiohara, Nakamasa Inoue, Kuniaki Saito, Hiroaki Santo, Fumio Okura
Title: BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment
Abstract: Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology. While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA involves (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset comprising 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels. Second, building upon BioCLIP2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all possible directional retrieval across the three modalities (i.e., image-to-audio, audio-to-text, text-to-image, and their reverse directions), with three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding.
PaperID: 658,   Poster  https://arxiv.org/pdf/2512.00074     GitHub
Authors: Qiwei Liang, Boyang Cai, Minghao Lai, Sitong Zhuang, Tao Lin, Yan Qin, Yixuan Ye, Jiaming Liang, Renjing Xu
Title: Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning
Abstract: Despite strong results on recognition and segmentation, current 3D visual pretraining methods often underperform on robotic manipulation. We attribute this gap to two factors: the lack of state–action–state dynamics modeling and the unnecessary redundancy of explicit geometric reconstruction. We introduce AFRO, a scalable self-supervised framework that learns dynamics-aware 3D representations directly from point clouds without action or label supervision. AFRO casts state prediction as a generative diffusion process and jointly models forward and inverse dynamics in a shared latent space to capture causal transition structure. To prevent feature leakage in action learning, we employ feature differencing and inverse-consistency supervision, improving the quality and stability of visual features. When combined with Diffusion Policy for control, AFRO substantially increases manipulation success rates across 16 simulated and 4 real-world tasks, outperforming existing pre-training approaches. The framework also scales favorably with data volume and task complexity. Qualitative visualizations indicate that AFRO learns semantically rich, discriminative features, offering an effective pre-training solution for dynamics-aware 3D representation learning in robotics.
PaperID: 659,   Poster  https://arxiv.org/pdf/2603.24326     GitHub
Authors: Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Lin Manhui, Yue Zhang, yubo zhang, Jing Zhang, Jun Zhang, Xing Wei, Yi Liu, Dianhai Yu, Yanjun Ma
Title: Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
Abstract: Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial redundancy in the visual regions of document images, such as background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding.
PaperID: 660,   Poster  https://arxiv.org/pdf/2509.25339     GitHub
Authors: Paul Gavrikov, Wei Lin, Muhammad Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, James Glass, Serena Yeung, Hilde Kuehne
Title: VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
Abstract: Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question–answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs, and encoding and reasoning over details is still a challenging task for them, especially if they are confronted with densely populated scenes. Indeed, we observe that even the best model (o3) out of 37 tested models only achieves 19.8% accuracy on our hardest test split and overall 69.5% accuracy on all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failure in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models.
PaperID: 661,   Poster  https://arxiv.org/pdf/2512.00885     GitHub
Authors: Masatoshi Tateno, Gido Kato, Hirokatsu Kataoka, Yoichi Sato, Takuma Yagi
Title: HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics
Abstract: Hand–object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatiotemporal effects on objects. However, existing semantic HOI benchmarks focus either on manipulation or on the resulting effects at a coarse level, lacking the fine-grained spatio-temporal reasoning needed to capture the underlying dynamics in HOI. We introduce HanDyVQA, a fine-grained video question-answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple-choice QA pairs. The collected QA pairs involve recognizing manipulation styles, hand/object motions, and part-level state changes. HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts questions, enabling the evaluation of object/part-level reasoning in video object segmentation. We evaluated recent video foundation models on our benchmark and found that even the best-performing model, Gemini-2.5-Pro, reached only 73% average accuracy, which is far from human performance (97%). Further analysis shows the remaining challenges in spatial relationship, motion, and part-level geometric understanding. We also found that integrating explicit HOI-related cues into visual features improves performance, offering insights for developing future models with a deeper understanding of HOI dynamics.
PaperID: 662,   Poster  https://arxiv.org/pdf/2603.10341     GitHub
Authors: Chen-Chen Zong, Sheng-Jun Huang
Title: Federated Active Learning Under Extreme Non-IID and Global Class Imbalance
Abstract: Federated active learning (FAL) seeks to reduce annotation cost under privacy constraints, yet its effectiveness degrades in realistic settings with severe global class imbalance and highly heterogeneous clients. We conduct a systematic study of query-model selection in FAL and uncover a central insight: the model that achieves more class-balanced sampling, especially for minority classes, consistently leads to better final performance. Moreover, global-model querying is beneficial only when the global distribution is highly imbalanced and client data are relatively homogeneous; otherwise, the local model is preferable. Based on these findings, we propose FairFAL, an adaptive class-fair FAL framework. FairFAL (1) infers global imbalance and local–global divergence via lightweight prediction discrepancy, enabling adaptive selection between global and local query models; (2) performs prototype-guided pseudo-labeling using global features to promote class-aware querying; and (3) applies a two-stage uncertainty–diversity balanced sampling strategy with k-center refinement. Experiments on five benchmarks show that FairFAL consistently outperforms state-of-the-art approaches under challenging long-tailed and non-IID settings.
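The k-center refinement mentioned in step (3) is commonly realized with the greedy farthest-point heuristic from core-set selection. The sketch below shows that generic heuristic only; how FairFAL combines it with uncertainty scores is not stated in the abstract, and the function name `k_center_greedy` and the choice of index 0 as the default seed are assumptions.

```python
import math


def k_center_greedy(points, k, seed_indices=None):
    """Greedy k-center: repeatedly add the point farthest from the chosen centers."""
    centers = list(seed_indices or [0])  # assumed default: start from point 0
    k = min(k, len(points))
    # Distance from every point to its nearest chosen center.
    dist = [min(math.dist(p, points[c]) for c in centers) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=dist.__getitem__)
        centers.append(nxt)
        # Adding a center can only shrink each point's nearest-center distance.
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return centers


# Hypothetical 2D features: two tight clusters far apart.
features = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
picked = k_center_greedy(features, k=2)
```

In this toy example the second pick comes from the distant cluster, which is the diversity effect the refinement step is meant to provide.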
PaperID: 663,   Poster  https://arxiv.org/pdf/2604.19141     GitHub
Authors: Johannes Schusterbauer, Ming Gui, Yusong Li, Pingchuan Ma, Felix Krause, Björn Ommer
Title: Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
Abstract: Diffusion and flow-based models typically allocate compute uniformly across space, updating every patch with the same noise level and number of steps. However, images are highly heterogeneous and not all regions are equally difficult to denoise. We introduce Patch Forcing (PF), a framework that dynamically allocates compute to regions that require more refinement than others. Using an additional head that predicts per-patch difficulty, we can formulate adaptive samplers that dynamically allocate compute where it is most needed. With noise scales that can vary over space and diffusion time, combined with our adaptive solvers, we can advance easier regions earlier to provide context for harder ones. We show that our framework achieves competitive results on class-conditional ImageNet, while remaining orthogonal to guidance methods. We further show that our method also scales to text-to-image synthesis. With Patch Forcing we hope to open a path towards a new family of samplers that allocate compute adaptively, focusing effort on the hardest parts of an image.
PaperID: 664,   Poster  https://arxiv.org/pdf/2512.11141     GitHub
Authors: Yiwei Lyu, Chenhui Zhao, Soumyanil Banerjee, Shixuan Liu, Akshay Rao, Akhil Kondepudi, Honglak Lee, Todd C. Hollon
Title: Learning complete and explainable visual representations from itemized text supervision
Abstract: Training vision models with language supervision enables general and transferable representations. However, many visual domains, especially non-object-centric domains such as medical imaging and remote sensing, contain itemized text annotations: multiple text items describing distinct and semantically independent findings within a single image. Such supervision differs from standard multi-caption supervision, where captions are redundant or highly overlapping. Here, we introduce ItemizedCLIP, a framework for learning complete and explainable visual representations from itemized text supervision. ItemizedCLIP employs a cross-attention module to produce text item-conditioned visual embeddings and a set of tailored objectives that jointly enforce item independence (distinct regions for distinct items) and representation completeness (coverage of all items). Across four domains with naturally itemized text supervision (brain MRI, head CT, chest CT, remote sensing) and one additional synthetically itemized dataset, ItemizedCLIP achieves substantial improvements in zero-shot performance and fine-grained interpretability over baselines. The resulting ItemizedCLIP representations are semantically grounded, item-differentiable, complete, and visually interpretable.
PaperID: 665,   Poster  https://arxiv.org/pdf/2604.02966     GitHub
Authors: Wenhao Li, Zimeng Wu, Yu Wu, Zehua Fu, Jiaxin Chen
Title: Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection
Abstract: Unmanned aerial vehicle (UAV) based object detection is a critical but challenging task when applied in dynamically changing scenarios with limited annotated training data. Layout-to-image generation approaches have proved effective in promoting detection accuracy by synthesizing labeled images based on diffusion models. However, they frequently produce artifacts, especially near layout boundaries of tiny objects, which substantially limits their performance. To address these issues, we propose UAVGen, a novel layout-to-image generation framework tailored for UAV-based object detection. Specifically, UAVGen designs a Visual Prototype Conditioned Diffusion Model (VPC-DM) that constructs representative instances for each class and integrates them into latent embeddings for high-fidelity object generation. Moreover, a Focal Region-Enhanced Data Pipeline (FRE-DP) is introduced to emphasize object-concentrated foreground regions in synthesis, combined with a label refinement to correct missing, extra and misaligned generations. Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art approaches, and consistently promotes accuracy when integrated with distinct detectors. We will release source code upon acceptance.
PaperID: 666,   Poster  https://arxiv.org/pdf/2604.19202     GitHub
Authors: Bo Li, Jiahao Kang, Yubo Ma, Feng-Lin Liu, Bin Liu, Fang-Lue Zhang, Lin Gao
Title: SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting
Abstract: 3D Gaussian representations have emerged as a powerful paradigm for digital head modeling, achieving photorealistic quality with real-time rendering. However, intuitive and interactive creation or editing of 3D Gaussian head models remains challenging. Although 2D sketches provide an ideal interaction modality for fast, intuitive conceptual design, they are sparse, depth-ambiguous, and lack high-frequency appearance cues, making it difficult to infer dense, geometrically consistent 3D Gaussian structures from strokes, especially under real-time constraints. To address these challenges, we propose SketchFaceGS, the first sketch-driven framework for real-time generation and editing of photorealistic 3D Gaussian head models from 2D sketches. Our method uses a feed-forward, coarse-to-fine architecture. A Transformer-based UV feature-prediction module first reconstructs a coarse but geometrically consistent UV feature map from the input sketch, and a 3D UV feature enhancement module refines it with high-frequency, photorealistic detail to produce a high-fidelity 3D head. For editing, we introduce a UV Mask Fusion technique combined with a layer-by-layer feature-fusion strategy, enabling precise, real-time, free-viewpoint modifications. Extensive experiments show that SketchFaceGS outperforms existing methods in both generation fidelity and editing flexibility, producing high-quality, editable 3D heads from sketches in a single forward pass.
PaperID: 667,   Poster  https://arxiv.org/pdf/2604.03069     GitHub
Authors: Zicheng Zhang, Xiangting Meng, Ke Wu, Wenchao Ding
Title: SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction
Abstract: Recent progress in feed-forward 3D Gaussian Splatting (3DGS) has notably improved rendering quality. However, the spatially uniform and highly redundant 3DGS map generated by previous feed-forward 3DGS methods limits their integration into downstream reconstruction tasks. We propose SparseSplat, the first feed-forward 3DGS model that adaptively adjusts Gaussian density according to scene structure and information richness of local regions, yielding highly compact 3DGS maps. To achieve this, we propose entropy-based probabilistic sampling, generating large, sparse Gaussians in textureless areas and assigning small, dense Gaussians to regions with rich information. Additionally, we designed a specialized point cloud network that efficiently encodes local context and decodes it into 3DGS attributes, addressing the receptive field mismatch between the general 3DGS optimization pipeline and feed-forward models. Extensive experimental results demonstrate that SparseSplat can achieve state-of-the-art rendering quality with only 22% of the Gaussians and maintain reasonable rendering quality with only 1.5% of the Gaussians.
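The idea of entropy-based probabilistic sampling can be illustrated with a toy sketch: regions whose intensities carry more information are sampled more often and therefore receive more Gaussians. The abstract does not say how entropy is computed or how sampling probabilities are derived, so the histogram-based Shannon entropy, the proportional weighting, and the names `patch_entropy` and `sample_patches` below are assumptions for illustration only.

```python
import math
import random
from collections import Counter


def patch_entropy(patch, bins=8):
    """Shannon entropy (in bits) of intensities in [0, 1], quantized into `bins`."""
    counts = Counter(min(int(v * bins), bins - 1) for v in patch)
    n = len(patch)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def sample_patches(patches, num_samples, rng):
    """Draw patch indices with probability proportional to patch entropy."""
    # A small floor keeps textureless patches from having exactly zero weight.
    weights = [patch_entropy(p) + 1e-6 for p in patches]
    return rng.choices(range(len(patches)), weights=weights, k=num_samples)


# Hypothetical flattened patches: one flat, one with two intensity levels.
flat = [0.5] * 16
textured = [0.0] * 8 + [0.9] * 8
draws = sample_patches([flat, textured], num_samples=1000, rng=random.Random(0))
```

Here the flat patch has zero entropy and is almost never drawn, while the textured patch (1 bit of entropy) receives nearly all samples, mirroring the sparse-versus-dense allocation the abstract describes.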
PaperID: 668,   Poster  https://arxiv.org/pdf/2604.05761     GitHub
Authors: Amadou S. SANGARE, Adrien Maglo, Mohamed Chaouch, Bertrand Luvison
Title: Improving Controllable Generation: Faster Training and Better Performance via x0-Supervision
Abstract: Text-to-Image (T2I) diffusion/flow models have recently achieved remarkable progress in visual fidelity and text alignment. However, they remain limited when users need to precisely control image layouts, something that natural language alone cannot reliably express. Controllable generation methods augment the initial T2I model with additional conditions that more easily describe the scene. Prior works straightforwardly train the augmented network with the same loss as the initial network. Although natural at first glance, this can lead to very long training times in some cases before convergence. In this work, we revisit the training objective of controllable diffusion models through a detailed analysis of their denoising dynamics. We show that direct supervision on the clean target image, dubbed x_0-supervision, or an equivalent re-weighting of the diffusion loss, yields faster convergence. Experiments on multiple control settings demonstrate that our formulation accelerates convergence by up to 2x according to our novel metric (mean Area Under the Convergence Curve - mAUCC), while also improving both visual quality and conditioning accuracy.
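The stated equivalence between x_0-supervision and a re-weighted diffusion loss can be seen from a short worked identity. It assumes the common forward process x_t = \alpha_t x_0 + \sigma_t \epsilon with a noise-prediction network \hat{\epsilon}_\theta; this parameterization and notation are assumptions here, and the paper's exact formulation (in particular for flow models) may differ.

```latex
\begin{aligned}
x_t &= \alpha_t x_0 + \sigma_t \epsilon,
\qquad
\hat{x}_0 = \frac{x_t - \sigma_t\, \hat{\epsilon}_\theta(x_t, t)}{\alpha_t}, \\
\hat{x}_0 - x_0 &= \frac{\sigma_t}{\alpha_t}\,\bigl(\epsilon - \hat{\epsilon}_\theta(x_t, t)\bigr), \\
\lVert \hat{x}_0 - x_0 \rVert^2 &= \frac{\sigma_t^2}{\alpha_t^2}\,\lVert \hat{\epsilon}_\theta(x_t, t) - \epsilon \rVert^2 .
\end{aligned}
```

Under these assumptions, supervising the clean-image estimate is the usual noise-prediction loss multiplied by \sigma_t^2 / \alpha_t^2, which puts more weight on noisier timesteps.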
PaperID: 669,   Poster  https://arxiv.org/pdf/2512.04884     GitHub
Authors: Tim Engelbracht, René Zurbrügg, Matteo Wohlrapp, Martin Büchner, Abhinav Valada, Marc Pollefeys, Hermann Blum, Zuria Bauer
Title: Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation
Abstract: We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction. The dataset contains 3048 sequences across 381 articulated objects in 38 environments. Each object is operated under four embodiments - (i) human hand, (ii) human hand with a wrist-mounted camera, (iii) handheld UMI gripper, and (iv) a custom Hoi! gripper - where the tool embodiments provide synchronized end-effector forces and tactile sensing. Our dataset offers a holistic view of interaction understanding from video, enabling researchers to evaluate how well methods transfer between human and robotic viewpoints, and also to investigate underexplored modalities such as force sensing and prediction.
PaperID: 670,   Poster  https://arxiv.org/pdf/2603.24078     GitHub
Authors: Yuheng Feng, Wen Zhang, Haodong Duan, Xingxing Zou
Title: PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation
Abstract: We present PosterIQ, a design-driven benchmark for poster understanding and generation, annotated across composition structure, typographic hierarchy, and semantic intent. It includes 7,765 image–annotation instances and 822 generation prompts spanning real, professional, and synthetic cases. To bridge visual design cognition and generative modeling, we define tasks for layout parsing, text–image correspondence, typography/readability and font perception, design quality assessment, and controllable, composition-aware generation with metaphor. We evaluate state-of-the-art MLLMs and diffusion-based generators, finding persistent gaps in visual hierarchy, typographic semantics, saliency control, and intention communication; commercial models lead on high-level reasoning but act as insensitive automatic raters, while generators render text well yet struggle with composition-aware synthesis. Extensive analyses show PosterIQ is both a quantitative benchmark and a diagnostic tool for design reasoning, offering reproducible, task-specific metrics. We aim to catalyze models' creativity and integrate human-centred design principles into generative vision–language systems.
PaperID: 671,   Poster  https://arxiv.org/pdf/2603.29034     GitHub
Authors: Kushal Vyas, Alper Kayabasi, Daniel Kim, Vishwanath Saragadam, Ashok Veeraraghavan, Guha Balakrishnan
Title: The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations
Abstract: The approximation and convergence properties of implicit neural representations (INRs) are known to be highly sensitive to parameter initialization strategies. Several data-driven INR parameter initialization methods demonstrate significant improvement over standard random sampling, but the reason for their successes -- whether they encode classical statistical signal priors or something more sophisticated -- is not well understood. In this study, we explore this topic with a series of experimental analyses leveraging noise pretraining. In particular, we pretrain INRs on noise signals of different classes (e.g., Gaussian, Dead Leaves, Spectral), and measure their abilities at both fitting unseen signals and encoding priors for an inverse imaging task (denoising). Our analyses on image and video data reveal the highly surprising finding that simply pretraining on unstructured noise (Uniform, Gaussian) results in a dramatic improvement in signal fitting capacity compared to all other baselines. However, unstructured noise also yields poor deep image priors for denoising. In contrast, noise with the classic 1/|f^\alpha| spectral structure of natural images yields an excellent balance of both signal fitting and inverse imaging capabilities on par with the best data-driven initialization methods. This finding can enable more efficient training of INRs in applications without sufficient prior domain-specific data.
PaperID: 672,   Poster  https://arxiv.org/pdf/2503.03222     GitHub
Authors: Zhumei Wang, zechenhu zechenhu, Ruoxi Guo, Huaijin Pi, Ziyong Feng, Liang Zhang, Mingtao Pei, Siyuan Huang
Title: Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining
Abstract: Human motion recovery for real-world interaction demands both precise action details and metric-scale trajectories. Recovering absolute human pose from monocular input presents a viable solution, but faces two main challenges: (1) models' reliance on 3D training data from constrained environments limits their out-of-distribution generalization; and (2) the inherent difficulty of estimating metric-scale poses from monocular observations. This paper introduces Mocap-2-to-3, a novel framework that differs from prior HMR methods by recovering absolute poses from monocular input and leveraging abundant 2D data to enhance 3D motion recovery. To effectively utilize the action priors and diversity in large-scale 2D datasets, we reformulate 3D motion as a multi-view synthesis process and divide the training into two stages: a single-view diffusion model is first pre-trained on extensive 2D data, followed by multi-view fine-tuning on public 3D data, thus achieving a combination of strong priors and geometric constraints. Furthermore, to recover absolute poses, we introduce a novel human motion representation that decouples the learning of local pose and global movements, while encoding ground geometric priors to accelerate convergence, thereby yielding more precise positioning in the physical world. Experiments on in-the-wild benchmarks show that our method outperforms state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning, while exhibiting strong generalization capability. Our code will be made publicly available.
PaperID: 673,   Poster  https://arxiv.org/pdf/2603.05921     GitHub
Authors: Feiran Li, Qianqian Xu, Shilong Bao, Zhiyong Yang, Xilin Zhao, Xiaochun Cao, Qingming Huang
Title: BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation
Abstract: This paper investigates the challenging task of detecting backdoored text-to-image generative models under black-box settings and introduces a novel detection framework BlackMirror. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit more significant cross-sample consistency than those from clean ones. Despite their success, such global signals struggle to generalize to recently emerging backdoor attacks, where backdoored generations can also appear visually diverse. Our BlackMirror is motivated by an insightful observation: across a wide range of backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two core components: MirrorMatch, which aligns extracted visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. Note that BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module for detecting backdoor risks in real-world Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of existing attacks. It surpasses prior methods by over 15% in detection performance and reduces false positives by more than 30%.
PaperID: 674,   Poster  https://arxiv.org/pdf/2603.12267     GitHub
Authors: Tianwei Xiong, Jun Hao Liew, Zilong Huang, Zhijie Lin, Jiashi Feng, Xihui Liu
Title: EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
Abstract: Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce EVATok, a framework to produce Efficient Video Adaptive Tokenizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.
PaperID: 675,   Poster  https://arxiv.org/pdf/2604.06795     GitHub
Authors: Huy Quang Le, Loc Nguyen, Yu Qiao, Seong Tae Kim, Eui-Nam Huh, Choong Seon Hong
Title: FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift
Abstract: Federated Learning (FL) enables decentralized model training across multiple clients without exposing private data, making it ideal for privacy-sensitive applications. However, in real-world FL scenarios, clients often hold data from distinct domains, leading to severe domain shift and degraded global model performance. To address this, prototype learning, which leverages class-wise feature representations, has emerged as a promising solution. Yet, existing methods face two key limitations: (1) existing prototype-based FL methods typically construct a single global prototype per class by aggregating local prototypes from all clients without preserving domain information; and (2) current feature-prototype alignment is domain-agnostic, forcing clients to align with global prototypes regardless of domain origin. To address these challenges, we propose Federated Domain-Aware Prototypes (FedDAP) to construct domain-specific global prototypes by aggregating local client prototypes within the same domain using a similarity-weighted fusion mechanism. These global domain-specific prototypes are then used to guide local training by aligning local features with prototypes from the same domain, while encouraging separation from prototypes of different domains. This dual alignment enhances domain-specific learning at the local level and enables the global model to generalize across diverse domains. Finally, we conduct extensive experiments on three different datasets (DomainNet, Office-10, and PACS) to demonstrate the effectiveness of our proposed framework in addressing domain shift.
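Editor's illustration (not from the paper): the abstract does not spell out the similarity-weighted fusion, so the following is only a minimal sketch of one plausible reading. It assumes that same-class local prototypes from clients of one domain are weighted by a softmax over their cosine similarity to the plain average. The names `cosine`, `fuse_domain_prototypes`, and `temperature` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse_domain_prototypes(local_protos, temperature=1.0):
    """Fuse same-class prototypes from clients of ONE domain.

    Assumption (not confirmed by the abstract): each prototype is weighted
    by a softmax over its cosine similarity to the unweighted mean.
    Returns (fused_prototype, weights).
    """
    n = len(local_protos)
    dim = len(local_protos[0])
    mean = [sum(p[d] for p in local_protos) / n for d in range(dim)]
    sims = [cosine(p, mean) for p in local_protos]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [sum(w * p[d] for w, p in zip(weights, local_protos)) for d in range(dim)]
    return fused, weights
```

Under this assumed weighting, a client prototype that disagrees with the rest of its domain receives a smaller weight, and running the fusion separately per domain keeps one global prototype per (class, domain) pair rather than one per class.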
PaperID: 676,   Poster  https://arxiv.org/pdf/2603.12245     GitHub
Authors: Moayed Haji Ali, Willi Menapace, Ivan Skorokhodov, Dogyun Park, Anil Kag, Michael Vasilkovsky, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin
Title: One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
Abstract: Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resources on unimportant regions. We introduce Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations with earlier latents capturing global structure while later ones contain information to refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, ELIT delivers an average gain of 35.3% and 39.6% in FID and FDD scores.
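Editor's illustration (not from the paper): a minimal sketch of what "random dropping of tail latents" could look like, assuming a prefix length is drawn uniformly at each training step and the same prefix rule is applied with a fixed budget at inference. The function names and the uniform sampling rule are assumptions; the abstract does not state the actual schedule.

```python
import random


def drop_tail_latents(latents, min_keep=1, rng=random):
    """Training-time sketch: keep a random-length prefix of the latent sequence.

    Because only the tail is ever removed, early latents are always present
    and later ones are optional, which is what pushes the sequence toward an
    importance ordering.
    """
    keep = rng.randint(min_keep, len(latents))  # inclusive on both ends
    return latents[:keep]


def truncate_for_budget(latents, budget):
    """Inference-time sketch: keep at most `budget` latents, and at least one."""
    keep = max(1, min(budget, len(latents)))
    return latents[:keep]
```

Here `latents` is any Python sequence standing in for the latent tokens; in a real model it would be a tensor sliced along the token axis.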
PaperID: 677,   Poster  https://arxiv.org/pdf/2604.07786     GitHub
Authors: Chanhyuk Choi, Taesoo Kim, Donggyu Lee, Siyeol Jung, Taehwan Kim
Title: Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
Abstract: Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals, and even benefit from expressive text-to-speech (TTS) synthesis, but they fail to express the target emotions because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions based on speech by modeling emotion semantic vectors between speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos, even for unseen extended emotions. All source code and checkpoints will be released upon acceptance, including video samples.
PaperID: 678,   Poster  https://arxiv.org/pdf/2603.21386     GitHub
Authors: Nikolay Kormushev, Josip Šarić, Matej Kristan
Title: Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation
Abstract: Open-vocabulary panoptic segmentation remains hindered by two coupled issues: (i) mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories not observed in training, and (ii) limited regional understanding in vision–language models such as CLIP, which were optimized for global image classification rather than localized segmentation. We introduce OVRCOAT, a simple, modular framework that tackles both. First, a CLIP-conditioned objectness adjustment (COAT) updates background/foreground probabilities, preserving high-quality masks for out-of-vocabulary objects. Second, an open-vocabulary mask-to-text refinement (OVR) strengthens CLIP’s region-level alignment to improve classification of both seen and unseen classes with markedly lower memory cost than prior fine-tuning schemes. The two components combine to jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains. Despite its simplicity, OVRCOAT sets a new state of the art on ADE20K (+5.5% PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1% and +3% PQ, respectively). The code will be available here.
PaperID: 679,   Poster  https://arxiv.org/pdf/2602.20496     GitHub
Authors: Jintu Zheng, Qizhe Liu, Huangxin Xu, zhuojie Chen
Title: Pip-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching
Abstract: While iterative stereo matching achieves high accuracy, its dependence on Recurrent Neural Networks (RNNs) hinders edge deployment, a challenge underexplored in existing research. We analyze iterative refinement and reveal that disparity updates are spatially sparse and temporally redundant. First, we introduce a progressive iteration pruning strategy that suppresses redundant update steps, effectively collapsing the recursive computation into a near-single-pass inference. Second, we propose a collaborative monocular prior transfer framework that implicitly embeds depth priors without requiring a dedicated monocular encoder, thereby eliminating its associated computational burden. Third, we develop FlashGRU, a hardware-aware RNN operator leveraging structured sparsity and I/O-conscious design, achieving a 7.28× speedup, a 76.6% reduction in peak memory, and an 80.9% reduction in global memory requests over native ConvGRUs at 2K resolution. Our Pip-Stereo enables real-time, high-fidelity stereo matching on edge hardware: it processes 320×640 frames in just 75ms on an NVIDIA Jetson Orin NX (FP16) and 19ms on an RTX 4090, matching the accuracy of large iterative models, while its generalization ability and accuracy far exceed those of existing real-time methods.
PaperID: 680,   Poster  https://arxiv.org/pdf/2602.22613     GitHub
Authors: Minh Kha Do, Wei Xiang, Kang Han, Di Wu, Khoa T. Phan, Yi-Ping Phoebe Chen, Gaowen Liu, Ramana Kompella
Title: Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery
Abstract: Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, and linear probing by 2.7% over state-of-the-art baselines, demonstrating an efficient and deployable path toward spectrum-aware vision-language learning for Earth observation.
PaperID: 681,   Poster  https://arxiv.org/pdf/2604.20286     GitHub
Authors: Md Maklachur Rahman, Soon Ki Jung, Tracy Hammond
Title: MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation
Abstract: Recent segmentation models have demonstrated promising efficiency by aggressively reducing parameter counts and computational complexity. However, these models often struggle to accurately delineate fine lesion boundaries and texture patterns essential for early skin cancer diagnosis and treatment planning. In this paper, we propose MambaLiteUNet, a compact yet robust segmentation framework that integrates Mamba state space modeling into a UNet architecture, along with three key modules: Adaptive Multi-Branch Mamba Feature Fusion (AMF), Local Global Feature Mixing (LGFM), and Cross-Gated Attention (CGA). These modules are designed to enhance local–global feature interaction, preserve spatial details, and improve the quality of skip connections. MambaLiteUNet achieves an average IoU of 87.12% and average Dice score of 93.09% across ISIC2017, ISIC2018, HAM10000, and PH2 benchmarks, outperforming state-of-the-art models. Compared to U-Net, our model improves average IoU and Dice by 7.72 and 4.61 points, respectively, while reducing parameters by 93.6% and GFLOPs by 97.6%. Additionally, in domain generalization with six unseen lesion categories, MambaLiteUNet achieves 77.61% IoU and 87.23% Dice, performing best among all evaluated models. Our extensive experiments demonstrate that MambaLiteUNet achieves a strong balance between accuracy and efficiency, making it a competitive and practical solution for dermatological image segmentation. Code is available at https://github.com/abcdef987412365/MambaLiteUNet.
PaperID: 682,   Poster  https://arxiv.org/pdf/2512.11782     GitHub
Authors: Peiqing Yang, Shangchen Zhou, Kai Hao, Qingyi Tao
Title: MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
Abstract: Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Quality Evaluator (QE) that assesses semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The QE scales up video matting in two ways: (1) as online matting-quality feedback during training to suppress erroneous regions, providing comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This process allows us to build a large-scale real-world video matting dataset, VMReal, containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.
PaperID: 683,   Poster  https://arxiv.org/pdf/2603.18834     GitHub
Authors: Hesong Li, Ziqi Wu, Ruiwen Shao, Ying Fu
Title: Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging
Abstract: High-Resolution Transmission Electron Microscopy (HRTEM) enables atomic-scale observation of nucleation dynamics, which advances the study of advanced solid materials. Nonetheless, because nucleation changes rapidly on a millisecond scale, it requires short-exposure rapid imaging, leading to severe noise that obscures atomic positions. In this work, we propose a statistical characteristic-guided denoising network, which utilizes statistical characteristics to guide the denoising process in both spatial and frequency domains. In the spatial domain, we present spatial deviation-guided weighting to select appropriate convolution operations for each spatial position based on deviation characteristics. In the frequency domain, we present frequency band-guided weighting to enhance signals and suppress noise based on band characteristics. We also develop an HRTEM-specific noise calibration method and generate a dataset with disordered structures and realistic HRTEM image noise, which helps ensure the denoising performance of models on real images for nucleation observation. Experiments on synthetic and real data show our method outperforms the state-of-the-art methods in HRTEM image denoising, with effectiveness in the localization downstream task.
PaperID: 684,   Poster  https://arxiv.org/pdf/2505.19795     GitHub
Authors: Sajjad Shahabodini, Mobina Mansoori, Farnoush Bayatmakou, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi
Title: The Missing Point in Vision Transformers for Universal Image Segmentation
Abstract: Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation.
PaperID: 685,   Poster  https://arxiv.org/pdf/2603.20778     GitHub
Authors: Xiaoya Cheng, Long Wang, Yan Liu, Xinyi Liu, Hanlin Tan, Yu Liu, Maojun Zhang, Shen Yan
Title: PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization
Abstract: We present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation, and active sensors like laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware costs and complexity. PiLoT breaks this paradigm by directly registering a live video stream against a geo-referenced 3D map. To achieve robust, accurate, and real-time performance, we introduce three key contributions: 1) a Dual-Thread Engine that decouples map rendering from the core localization thread, ensuring low latency while maintaining drift-free accuracy; 2) a large-scale synthetic dataset with precise geometric annotations (camera pose, depth maps), which enables the training of a lightweight network that generalizes in a zero-shot manner from simulation to real data; and 3) a Joint Neural-Guided Stochastic-Gradient Optimizer (JNGO) that achieves robust convergence even under aggressive motion. Evaluations on a comprehensive set of public and newly collected benchmarks show that PiLoT outperforms state-of-the-art methods while running at over 25 FPS on an NVIDIA Jetson Orin platform. Our code and dataset will be made publicly available.
PaperID: 686,   Poster  https://arxiv.org/pdf/2511.19413     GitHub
Authors: Zhaolong Su, Wang Lu, Hao Chen, Yixuan Li, Jindong Wang
Title: UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Abstract: Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces <1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models.
PaperID: 687,   Poster  https://arxiv.org/pdf/2512.21514     GitHub
Authors: Henglin Liu, Huijuan Huang, Jing Wang, Chang Liu, Xiu Li, Xiangyang Ji
Title: DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO
Abstract: Reinforcement learning (RL), particularly GRPO, improves image generation quality significantly by comparing the relative performance of images generated within the same group. However, in the later stages of training, the model tends to produce homogenized outputs, lacking creativity and visual diversity, restricting the application scenarios of the model. This issue can be analyzed from both reward modeling and generation dynamics perspectives. First, traditional GRPO relies on single-sample quality as the reward signal, driving the model to converge toward a few high-reward generation modes while neglecting distribution-level diversity. Second, conventional GRPO regularization neglects the dominant role of early-stage denoising in preserving diversity, causing a misaligned regularization budget that limits the achievable quality–diversity trade-off. Motivated by these insights, we revisit the diversity degradation problem from both reward modeling and generation dynamics. At the reward level, we propose a distributional creativity bonus based on semantic grouping. Specifically, we construct a distribution-level representation via spectral clustering over samples generated from the same caption, and adaptively allocate exploratory rewards according to group sizes to encourage the discovery of novel visual modes. At the generation level, we introduce a structure-aware regularization, which enforces stronger early-stage constraints to preserve diversity without compromising reward optimization efficiency. Experiments demonstrate that our method achieves a 13% to 18% improvement in semantic diversity under matched quality scores, establishing a new Pareto frontier between image quality and diversity for GRPO-based image generation.
PaperID: 688,   Poster  https://arxiv.org/pdf/2603.08018     GitHub
Authors: Yafei Zhang, Meng Ma, Huafeng Li, Yu Liu
Title: Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared
Abstract: Infrared–visible (IR–VIS) image fusion is vital for perception and security, yet most methods rely on the availability of both modalities during training and inference. When the infrared modality is absent, pixel-space generative substitutes become hard to control and inherently lack interpretability. We address missing-IR fusion by proposing a dictionary-guided, coefficient-domain framework built upon a shared convolutional dictionary. The pipeline comprises three key components: (1) Joint Shared-dictionary Representation Learning (JSRL) learns a unified and interpretable atom space shared by both IR and VIS modalities; (2) VIS-Guided IR Inference (VGII) transfers VIS coefficients to pseudo-IR coefficients in the coefficient domain and performs a one-step closed-loop refinement guided by a frozen large language model as a weak semantic prior; and (3) Adaptive Fusion via Representation Inference (AFRI) merges VIS structures and inferred IR cues at the atom level through window attention and convolutional mixing, followed by reconstruction with the shared dictionary. This encode → transfer → fuse → reconstruct pipeline avoids uncontrolled pixel-space generation while ensuring prior preservation within interpretable dictionary–coefficient representation. Experiments under missing-IR settings demonstrate consistent improvements in perceptual quality and downstream detection performance. To our knowledge, this represents the first framework that jointly learns a shared dictionary and performs coefficient-domain inference–fusion to tackle missing-IR fusion.
PaperID: 689,   Poster  https://arxiv.org/pdf/2603.25108     GitHub
Authors: Chenglong Wang, Yifu Huo, Yang Gan, Qiaozhi He, Qi Meng, Bei Li, Yan Wang, Junfu Liu, Tian Hua Zhou, JingBo Zhu, Tong Xiao
Title: MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning
Abstract: Recent advances in multimodal reward modeling have been largely driven by a paradigm shift from discriminative to generative approaches. Building on this progress, recent studies have further employed reinforcement learning with verifiable rewards (RLVR) to enhance multimodal reward models (MRMs). Despite their success, RLVR-based training typically depends on labeled multimodal preference data, which are costly and labor-intensive to obtain, making it difficult to scale the training of MRMs. To overcome this limitation, we propose a Multi-Stage Reinforcement Learning (MSRL) approach, which can achieve scalable reinforcement learning for MRMs with limited multimodal data. MSRL redefines the conventional RLVR-based training paradigm by first learning a generalizable reward reasoning capability from large-scale textual preference data and then progressively transferring this capability to multimodal tasks through caption-based and fully multimodal reinforcement learning stages. Furthermore, we introduce a cross-modal knowledge distillation approach to improve preference generalization within MSRL. Extensive experiments demonstrate that MSRL effectively scales the RLVR-based training of generative MRMs and substantially improves their performance across both visual understanding and visual generation tasks (e.g., 68.5% → 74.8% on VLReward Bench, 69.2% → 75.4% on GenAI-Bench), without requiring additional multimodal preference annotations.
PaperID: 690,   Poster  https://arxiv.org/pdf/2604.02719     GitHub
Authors: Mirali Purohit, Bimal Gajera, Irish Mehta, Bhanu Tokas, Jacob Adler, Steven Lu, Scott Dickenshied, Serina Diniega, Brian Bue, Umaa Rebbapragada, Hannah Kerner
Title: MOMO: Mars Orbital MOdel Foundation Model for Mars Orbital Applications
Abstract: We introduce MOMO, the first multi-sensor foundation model for Mars remote sensing. MOMO uses model merging to integrate representations learned independently from three key Martian sensors (HiRISE, CTX, and THEMIS), spanning resolutions from 0.25 m/pixel to 100 m/pixel. Central to our method is our novel Equal Validation Loss (EVL) strategy, which aligns checkpoints across sensors based on validation loss similarity before fusion via task arithmetic. This ensures models are merged at compatible convergence stages, leading to improved stability and generalization. We train MOMO on a large-scale, high-quality corpus of ~12 million samples curated from Mars orbital data and evaluate it on 9 downstream tasks from Mars-Bench. MOMO achieves better overall performance than ImageNet pre-trained, Earth observation foundation model, sensor-specific pre-training, and fully supervised baselines. Particularly on segmentation tasks, MOMO shows consistent and significant performance improvement. Our results demonstrate that model merging through an optimal checkpoint selection strategy provides an effective approach for building foundation models for multi-resolution data.
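Editor's illustration (not from the paper): a toy sketch of merging at matched validation loss followed by task arithmetic. Task arithmetic here means adding the scaled sum of per-model weight deltas to a shared base. The checkpoint selection rule below (target = the largest of the per-sensor best losses, then the closest checkpoint per sensor) is an assumption made for illustration, as are the function names and the use of plain floats as stand-ins for weight tensors.

```python
def select_equal_loss_checkpoints(histories):
    """Pick one checkpoint per sensor at a similar validation loss.

    histories: {sensor: [(val_loss, weights_dict), ...]}
    Assumed rule: the target is the largest of the per-sensor minimum
    losses, and each sensor contributes its checkpoint closest to it.
    """
    target = max(min(loss for loss, _ in h) for h in histories.values())
    chosen = {
        sensor: min(h, key=lambda item: abs(item[0] - target))
        for sensor, h in histories.items()
    }
    return target, chosen


def task_arithmetic_merge(base, finetuned_list, scale=1.0):
    """Merge: base + scale * sum_i (finetuned_i - base), per parameter."""
    merged = {}
    for name, base_val in base.items():
        delta = sum(ft[name] - base_val for ft in finetuned_list)
        merged[name] = base_val + scale * delta
    return merged
```

A possible usage, with made-up numbers: given `base = {"w": 1.0}` and two sensors whose selected checkpoints have `w = 2.0` and `w = 0.5`, the merge with `scale=1.0` gives `w = 1.0 + (1.0 - 0.5) = 1.5`.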
PaperID: 691,   Poster  https://arxiv.org/pdf/2602.19053     GitHub
Authors: Qingwen Zhang, Chenhan Jiang, Xiaomeng Zhu, Yunqi Miao, Yushan Zhang, Olov Andersson, Patric Jensfelt
Title: TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation
Abstract: Self-supervised feed-forward methods for scene flow estimation offer real-time efficiency, but their supervision from two-frame point correspondences is unreliable and often breaks down under occlusions. Multi-frame supervision has the potential to provide more stable guidance by incorporating motion cues from past frames, yet naive extensions of two-frame objectives are ineffective because point correspondences vary abruptly across frames, producing inconsistent signals. In this paper, we present TeFlow, enabling multi-frame supervision for feed-forward models by mining temporally consistent supervision. TeFlow introduces a temporal ensembling strategy that forms reliable supervisory signals by aggregating the most temporally consistent motion cues from a candidate pool built across multiple frames. Extensive evaluations demonstrate that TeFlow establishes a new state of the art for self-supervised feed-forward methods, achieving performance gains of up to 33% on the challenging Argoverse 2 and nuScenes datasets. Our method performs on par with leading optimization-based methods, yet is 150 times faster.
PaperID: 692,   Poster  https://arxiv.org/pdf/2601.04792     GitHub
Authors: Denis Korzhenkov, Adil Karjauv, Animesh Karnewar, Mohsen Ghafoorian, Amirhossein Habibian
Title: PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference
Abstract: Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at varying resolutions. These models handle inputs with higher noise levels at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform compared to state-of-the-art systems in terms of visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, achieving this transformation without degradation in quality of output videos. Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further enhance the inference efficiency.
PaperID: 693,   Poster  https://arxiv.org/pdf/2604.09817     GitHub
Authors: Weijian Mai, Mu Nan, Yu Zhu, Jiahang Cao, Rui Zhang, Yuqin Dai, Chunfeng Song, Andrew Luo, Jiamin Wu
Title: NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
Abstract: Visual encoding and decoding models act as gateways to understanding the neural mechanisms underlying human visual perception. Typically, visual encoding models that predict brain activity from stimuli and decoding models that reproduce stimuli from brain activity are treated as distinct tasks, requiring separate models and training procedures. This separation is inefficient and fails to model the consistency between encoding and decoding processes. To address this limitation, we propose NeuroFlow, the first unified framework that jointly models visual encoding and decoding from neural activity within a single flow model. NeuroFlow introduces two key components: (i) NeuroVAE is designed as a variational backbone to model neural variability and establish a compact, semantically structured latent space for bidirectional modeling across visual and neural modalities. (ii) Cross-modal Flow Matching (XFM) bypasses the typical paradigm of noise-to-data diffusion guided by a specific modality condition, instead learning a reversibly consistent flow model between visual and neural latent distributions. For the first time, visual encoding and decoding are reformulated as a time-dependent, reversible process within a shared latent space for unified modeling. Empirical results demonstrate that NeuroFlow achieves superior overall performance in visual encoding and decoding tasks with higher computational efficiency compared to isolated methods. We further analyze principal factors that steer the model toward encoding–decoding consistency and, through brain functional analyses, demonstrate that NeuroFlow captures consistent activation patterns underlying neural variability. NeuroFlow marks a major step toward unified visual encoding and decoding from neural activity, providing mechanistic insights that inform future bidirectional visual brain–computer interfaces. Code will be released to facilitate future research.
PaperID: 694,   Poster  https://arxiv.org/pdf/2601.02356     GitHub
Authors: Jing Tan, Zhaoyang Zhang, Yantao Shen, Jiarui Cai, Shuo Yang, Jiajun Wu, Wei Xia, Zhuowen Tu, Stefano Soatto
Title: Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
Abstract: We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations, such as translating, rotating, or resizing objects, due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial reward guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.
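The "group relative" part of GRPO named in the abstract above can be illustrated with a minimal sketch. It assumes the commonly used formulation in which each rollout's reward is standardized against the other rollouts sampled for the same input. The function name, the epsilon, and the example rewards are hypothetical; this is not the authors' implementation and does not model their spatial reward.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against its own group.

    rewards: scalar rewards for rollouts generated from the same input.
    eps: small constant (an assumption) that avoids division by zero
         when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four rollouts of one editing instruction.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive a positive advantage and are reinforced, and those below receive a negative one, so no separately learned value function is needed in this formulation.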
PaperID: 695,   Poster  https://arxiv.org/pdf/2506.00318     GitHub
Authors: SARA GHAZANFARI, Francesco Croce, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, Siddharth Garg
Title: Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning
Abstract: Recent work has shown that eliciting Large Language Models (LLMs) to generate reasoning traces in natural language before answering the user's request can significantly improve their performance across tasks. This approach has been extended to multimodal LLMs, where the models can produce chains-of-thought (CoT) about the content of input images and videos. For video inputs, prior works use complex multi-step pipelines that extract and include relevant frames from videos in the CoT, or produce simpler single-stage reasoning traces at the expense of poor temporal grounding. Here, we propose the first video LLMs with single-stage reasoning that includes explicit references to relevant frames, thereby reducing temporal inconsistencies in the reasoning process. Our approach is simple, unified, and self-contained, employing a single-stage inference to handle complex video understanding tasks without relying on auxiliary modules for frame selection or caption generation. For this, we first create CoF-Data, a large dataset of diverse questions, answers, and corresponding frame-grounded reasoning traces from both natural and synthetic videos, spanning various topics and tasks. Our models, obtained by fine-tuning video LLMs on this chain-of-frames (CoF) data, generate reasoning traces that accurately identify key frames to answer given questions. In turn, this consistently improves performance across multiple video understanding benchmarks. Surprisingly, we find that synthetic data alone, despite being out-of-distribution with respect to these real-world benchmarks, provides a significant boost in model accuracy.
PaperID: 696,   Poster  https://arxiv.org/pdf/2604.00503     GitHub
Authors: Weifu Fu, Jinyang Li, Bin-Bin Gao, Jialin Li, Yuhuan Lin, Hanqiu Deng, Wenbing Tao, Yong Liu, Chengjie Wang
Title: PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
Abstract: Open-Set Object Detection (OSOD) enables recognition of novel categories beyond fixed classes but faces challenges in aligning text representations with complex visual concepts and the scarcity of image-text paired samples for rare categories. This results in suboptimal performance in specialized domains or with complex objects. Recent visual-prompted methods partially address these issues but often involve complex multi-modal designs and multi-stage optimizations, extending the development cycle. Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal object detector supporting both text and visual prompts. Our visual prompt generation scheme builds on an advanced text-prompted detector, addressing the limitations of text representation guidance and reducing the development cycle. We introduce two prompt-enriched training strategies: Intra-Batch Parallel Prompting (IBP) at the iteration level and Dynamic Memory-Driven Prompting (DMD) at the overall training level. These strategies enable simultaneous modeling of multiple prompt routes, parallel alignment with diverse real-world usage scenarios, and improved classification. Extensive experiments demonstrate that our visual prompt generation scheme, based on text-prompt-based detection pretraining, achieves a higher performance ceiling compared to using visual prompts alone. Our method achieves significant zero-shot detection performance on COCO, LVIS, and ODinW, and excels across various prompt-based detection protocols. In-domain evaluations also demonstrate robust localization performance.
PaperID: 697,   Poster  https://arxiv.org/pdf/2603.04803     GitHub
Authors: Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Xilin Zhao, Qingming Huang
Title: Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation
Abstract: The limited understanding capacity of the visual encoder in Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D-Ability), which reflects class separability, and Detail Perceptual Ability (P-Ability), which focuses on fine-grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D-Ability and therefore fail to effectively address CLIP's representation limitations. To address this, we integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that the naive combination suffers from gradient conflict and yields suboptimal performance. To balance the optimization, we introduce the Diffusion Contrastive Reconstruction (DCR), which unifies the learning objective. The key idea is to inject contrastive signals derived from each reconstructed image, rather than from the original input, into the diffusion process. Our theoretical analysis shows that the DCR loss can jointly optimize D-Ability and P-Ability. Extensive experiments across various benchmarks and multi-modal large language models validate the effectiveness of our method. In particular, on OpenAI CLIP, DCR boosts D-Ability by 5% while preserving the gains in P-Ability.
PaperID: 698,   Poster  https://arxiv.org/pdf/2603.26597     GitHub
Authors: Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Xilin Zhao, Qingming Huang
Title: From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning
Abstract: Recent studies have made notable progress in video representation learning by transferring image-pretrained models to video tasks. This process typically introduces complex temporal processing modules with fine-tuning on video data. However, fine-tuning heavy modules may compromise inter-video semantic separability, i.e., the essential ability to distinguish objects across videos. Conversely, reducing the tunable parameters hinders intra-video temporal consistency, which is required to produce stable representations for the same object in a video. This dilemma indicates a potential trade-off between the intra-video temporal consistency and inter-video semantic separability during image-to-video transfer. To this end, we propose the Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework, which applies a lightweight projection layer on top of the frozen image-pretrained encoder to adjust representation space with a temporal cycle consistency objective and a semantic separability constraint. We further provide theoretical support showing that the optimized projection yields a better trade-off between the two properties under appropriate conditions. Experiments on eight image-pretrained models demonstrate consistent performance improvements across multiple levels of video tasks with only five epochs of self-supervised training. The code is available in the Supplemental Material.
PaperID: 699,   Poster  https://arxiv.org/pdf/2601.00664     GitHub
Authors: Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang
Title: Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Abstract: Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user’s audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (about 500ms), achieving a 6.8x speedup compared to the baseline, and produces reactive and expressive avatar motion, with a preference rate of over 80% against the baseline.
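For readers unfamiliar with direct preference optimization, the following minimal sketch shows the standard pairwise DPO objective on scalar log-likelihoods. It is an illustration under stated assumptions, not the method of the abstract above: the paper applies preference optimization to a diffusion-forcing model and builds its losing samples by dropping user conditions, neither of which is modeled here. The function name, argument names, and the default beta are hypothetical.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_*:     log-likelihoods under the model being trained.
    ref_logp_*: log-likelihoods under a frozen reference model.
    beta:       strength of the implicit KL regularization (assumed value).
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss is lower when the policy favors the preferred sample
# more strongly than the reference model does.
good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

When the policy and the reference agree on both samples the margin is zero and the loss equals log 2, which is the starting point of training if the policy is initialized from the reference.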
PaperID: 700,   Poster  https://arxiv.org/pdf/2601.10649     GitHub
Authors: Darshan Singh S, Arsha Nagrani, Kawshik Manikantan, Harman Singh, Dinesh Tewari, Tobias Weyand, Cordelia Schmid, Anelia Angelova, Shachi Dave
Title: CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
Abstract: Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE, a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. We will release CURVE to foster the development of more equitable and capable multimodal foundation models.
PaperID: 701,   Poster  https://arxiv.org/pdf/2511.21025     GitHub
Authors: Shijia Yang, Yunong Liu, Bohan Zhai, Ximeng Sun, Zicheng Liu, Emad Barsoum, Manling Li, Chenfeng Xu
Title: CaptionQA: Is Your Caption as Useful as the Image Itself?
Abstract: Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains, Natural, Document, E-commerce, and Embodied AI, each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the image and its caption utility. Notably, models that are nearly identical on traditional image-QA benchmarks score up to 32% lower in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains.
PaperID: 702,   Poster  https://arxiv.org/pdf/2601.05138     GitHub
Authors: Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, Yanwei Fu
Title: VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
Abstract: Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently operate dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Unfortunately, another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset. We will release our code, model, and dataset to support research in realistic and controllable video world models.
PaperID: 703,   Poster  https://arxiv.org/pdf/2507.08492     GitHub
Authors: Heng Li, Xiangping Wu, Qingcai Chen
Title: D2Dewarp: Dual Dimensions Geometric Representation Learning Based Document Image Dewarping
Abstract: Document image dewarping remains a challenging task in the deep learning era. While existing methods have improved by leveraging text line awareness, they typically focus only on a single horizontal dimension. In this paper, we propose D2Dewarp, a fine-grained deformation perception model that focuses on the Dual Dimensions of a document's horizontal and vertical lines to improve document Dewarping. It can perceive distortion trends in different directions across document details. To combine the horizontal and vertical granularity features, an effective fusion module based on X and Y coordinates is designed to facilitate interaction and constraint between the two dimensions for feature complementarity. Due to the lack of annotated line features in current public dewarping datasets, we also propose an automatic fine-grained annotation method using public document texture images and an automatic rendering engine to build a new large-scale distortion training dataset named DocDewarpHV. The code and dataset will be publicly released. On three public Chinese and English benchmarks, both quantitative and qualitative results show that our method achieves better rectification results compared with the state-of-the-art methods.
PaperID: 704,   Poster  https://arxiv.org/pdf/2604.01666     GitHub
Authors: Wonjoon Jin, Jiyun Won, Janghyeok Han, Qi Dai, Chong Luo, Seung-Hwan Baek, Sunghyun Cho
Title: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data
Abstract: Despite recent progress, video diffusion models still struggle to synthesize realistic videos involving highly dynamic motions or requiring fine-grained motion controllability. A central limitation lies in the scarcity of such examples in commonly used training datasets. To address this, we introduce DynaVid, a video synthesis framework that leverages synthetic motion data in training, which is represented as optical flow and rendered using computer graphics pipelines. This approach offers two key advantages. First, synthetic motion offers diverse motion patterns and precise control signals that are difficult to obtain from real data. Second, unlike rendered videos with artificial appearances, rendered optical flow encodes only motion and is decoupled from appearance, thereby preventing models from reproducing the unnatural look of synthetic videos. Building on this idea, DynaVid adopts a two-stage generation framework: a motion generator first synthesizes motion, and then a motion-guided video generator produces video frames conditioned on that motion. This decoupled formulation enables the model to learn dynamic motion patterns from synthetic data while preserving visual realism from real-world videos. We validate our framework on two challenging scenarios, vigorous human motion generation and extreme camera motion control, where existing datasets are particularly limited. Extensive experiments demonstrate that DynaVid improves the realism and controllability in dynamic motion generation and camera motion control. Codes and datasets will be publicly available.
PaperID: 705,   Poster  https://arxiv.org/pdf/2603.00149     GitHub
Authors: Zhihao LI, Shengwei Dong, Chuang Yi, Junxuan Gao, Zhilu Lai, Zhiqiang Liu, Wei Wang, Guangtao Zhang
Title: Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction
Abstract: Existing image SR and generic diffusion models transfer poorly to fluid SR: they are sampling-intensive, ignore physical constraints, and often yield spectral mismatch and spurious divergence. We address fluid super-resolution (SR) with ReMD (Residual-Multigrid Diffusion), a physics-consistent diffusion framework. At each reverse step, ReMD performs a multigrid residual correction: the update direction is obtained by coupling data consistency with lightweight physics cues and then correcting the residual across scales; the multiscale hierarchy is instantiated with a multi-wavelet basis to capture both large structures and fine vortical details. This coarse-to-fine design accelerates convergence and preserves fine structures while remaining equation-free. Across atmospheric and oceanic benchmarks, ReMD improves accuracy and spectral fidelity, reduces divergence, and reaches comparable quality with markedly fewer sampling steps than diffusion baselines. Our results show that enforcing physics consistency inside the diffusion process via multigrid residual correction and multi-wavelet multiscale modeling is an effective route to efficient fluid SR.
PaperID: 706,   Poster  https://arxiv.org/pdf/2512.13303     GitHub
Authors: Zhihang Liu, Xiaoyi Bao, Pandeng Li, Junjie Zhou, Zhaohe Liao, Yefei He, Kaixun Jiang, Chen-Wei Xie, Yun Zheng, Hongtao Xie
Title: ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement
Abstract: While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios. To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic that faithfully and aesthetically visualizes the data from a given table. To address this challenge, we propose ShowTable, a pipeline that synergizes MLLMs with diffusion models via a progressive self-correcting process. The MLLM acts as the central orchestrator, reasoning about the visual plan and judging visual errors to provide refined instructions, while the diffusion model executes the commands from the MLLM, achieving high-fidelity results. To support this task and our pipeline, we introduce three automated data construction pipelines for training different modules. Furthermore, we introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions, to assess performance on this task. Experiments demonstrate that our pipeline, instantiated with different models, significantly outperforms baselines, highlighting its effective multi-modal reasoning, generation, and error correction capabilities.
PaperID: 707,   Poster  https://arxiv.org/pdf/2511.20646     GitHub
Authors: Xiaoye Wang, Chen Tang, Xiangyu Yue, Wei-Hong Li
Title: 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding
Abstract: This paper addresses the challenge of training a single network to jointly perform multiple dense prediction tasks, such as segmentation and depth estimation, i.e., multi-task learning (MTL). Current approaches mainly capture cross-task relations in the 2D image space, often leading to unstructured features lacking 3D-awareness. We argue that 3D-awareness is vital for modeling cross-task correlations essential for comprehensive scene understanding. We propose to address this problem by integrating correlations across views, i.e., cost volume, as geometric consistency in the MTL network. Specifically, we introduce a lightweight Cross-view Module (CvM), shared across tasks, to exchange information across views and capture cross-view correlations, integrated with a feature from the MTL encoder for multi-task predictions. This module is architecture-agnostic and can be applied to both single and multi-view data. Extensive results on NYUv2 and PASCAL-Context demonstrate that our method effectively injects geometric consistency into existing MTL methods to improve performance.
PaperID: 708,   Poster  https://arxiv.org/pdf/2603.00906     GitHub
Authors: ZENG XIAOLONG, Yitong Yu, Shiyao Xiong, Jinhua Hao, Ming Sun, Chao Zhou, Bin Wang
Title: ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration
Abstract: Look-Up Table (LUT) based methods have emerged as a promising direction for efficient image restoration tasks. Recent LUT-based methods focus on improving their performance by expanding the receptive field. However, they inevitably introduce extra computational and storage overhead, which hinders their deployment in edge devices. To address this issue, we propose ShiftLUT, a novel framework that attains the largest receptive field among all LUT-based methods while maintaining high efficiency. Our key insight lies in three complementary components. First, a Learnable Spatial Shift (LSS) module is introduced to expand the receptive field by applying learnable, channel-wise spatial offsets on feature maps. Second, we propose an asymmetric dual-branch architecture that allocates more computation to the information-dense branch, substantially reducing inference latency without compromising restoration quality. Finally, we incorporate a feature-level LUT compression strategy called Error-bounded Adaptive Sampling (EAS) to minimize the storage overhead. Compared to the previous state-of-the-art method TinyLUT, ShiftLUT achieves a 3.8× larger receptive field and improves average PSNR by over 0.21 dB across multiple standard benchmarks, while maintaining a small storage size and inference time.
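To put the PSNR figure in the abstract above in context, here is a minimal sketch of the standard PSNR definition on flat lists of pixel values. The function name and the 8-bit peak value of 255 are assumptions made for illustration; the paper's evaluation protocol (color space, border cropping, and so on) is not specified in the abstract and is not reproduced here.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 4-pixel example: each restored pixel is off by 2.
reference = [10, 20, 30, 40]
restored = [12, 18, 32, 38]
value = psnr(reference, restored)
```

Because PSNR is logarithmic in the mean squared error, a gain of 0.21 dB at a fixed peak value corresponds to an MSE roughly 4.7% lower.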
PaperID: 709,   Poster  https://arxiv.org/pdf/2512.00677     GitHub
Authors: Dong In Lee, Hyungjun Doh, Seunggeun Chi, Runlin Duan, Sangpil Kim, Karthik Ramani
Title: Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer
Abstract: Recent progress in 4D representations, such as Dynamic NeRF and 4D Gaussian Splatting (4DGS), has enabled dynamic 4D scene reconstruction. However, text-driven 4D scene editing remains under-explored due to the challenge of ensuring both multi-view and temporal consistency across space and time during editing. Existing studies rely on 2D diffusion models that edit frames independently, often causing motion distortion, geometric drift, and incomplete editing. We introduce Dynamic-eDiTor, a training-free text-driven 4D editing framework leveraging Multimodal Diffusion Transformer (MM-DiT) and 4DGS. This mechanism consists of Spatio-Temporal Sub-Grid Attention (STGA) for locally consistent cross-view and temporal fusion, and Context Token Propagation (CTP) for global propagation via token inheritance and optical-flow-guided token replacement. Together, these components allow Dynamic-eDiTor to perform seamless, globally consistent multi-view video editing without additional training and directly optimize pre-trained source 4DGS. Extensive experiments on the multi-view video dataset DyNeRF demonstrate that our method achieves superior editing fidelity and both multi-view and temporal consistency compared to prior approaches.
PaperID: 710,   Poster  https://arxiv.org/pdf/2602.19596     GitHub
Authors: Yihang Tao, Senkang Hu, Haonan An, Zhengru Fang, Hangcheng Cao, Yuguang Fang
Title: Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception
Abstract: Collaborative perception (CP) enables data sharing among connected and autonomous vehicles (CAVs) to enhance driving safety. However, CP systems are vulnerable to adversarial attacks where malicious agents forge false objects via feature-level perturbations. Current defensive systems use threshold-based consensus verification by comparing collaborative and ego detection results. Yet, these defenses remain vulnerable to more sophisticated attack strategies that could exploit two critical weaknesses: (i) lack of robustness against attacks with systematic timing and target region optimization, and (ii) inadvertent disclosure of vulnerability knowledge through implicit confidence information in shared collaboration data. In this paper, we propose MVIG attack, a novel adaptive adversarial CP framework learning to capture vulnerability knowledge disclosed by different defensive CP systems from a unified mutual view information graph (MVIG) representation. Our approach combines MVIG representation with temporal graph learning to generate evolving fabrication risk maps and employs entropy-aware vulnerability search to optimize attack location, timing and persistence, enabling adaptive attacks with generalizability across various defensive configurations. Extensive evaluations on OPV2V and Adv-OPV2V datasets demonstrate that MVIG attack reduces defense success rates by up to 62% against state-of-the-art defenses while achieving 47% lower detection for persistent attacks at 29.9 FPS, exposing critical security gaps in CP systems.
PaperID: 711,   Poster  https://arxiv.org/pdf/2511.19836     GitHub
Authors: Yiting Lu, Wei Luo, Peiyan Tu, Haoran Li, Hanxin Zhu, Zihao Yu, Xingrui Wang, Xinyi Chen, Xinge Peng, Xin Li, Zhibo Chen
Title: 4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models
Abstract: World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, Embodied Intelligence, and content creation. However, prior benchmarks each emphasize different evaluation dimensions and lack a unified assessment of world-realism capability. To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition–4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as Image-to-3D/4D, Video-to-4D, and Text-to-3D/4D. Beyond these, we innovatively introduce adaptive conditioning across multiple modalities, which not only integrates but also extends traditional evaluation paradigms. To accommodate different modality-conditioned inputs, we map all modality conditions into a unified textual space during evaluation, and further integrate LLM-as-judge, MLLM-as-judge, and traditional network-based methods. This unified and adaptive design enables more comprehensive and consistent evaluation of alignment, physical realism, and cross-modal coherence. Preliminary human studies further demonstrate that our adaptive tool selection achieves closer agreement with subjective human judgments. We hope this benchmark will serve as a foundation for objective comparisons and improvements, accelerating the transition from "visual generation" to "world generation."
PaperID: 712,   Poster  https://arxiv.org/pdf/2602.20903     GitHub
Authors: Hanshen Zhu, Yuliang Liu, Xuecheng Wu, An-Lan Wang, Hao Feng, Dingkang Yang, ChaoFeng ChaoFeng, Can Huang, Jingqun Tang, Xiang Bai
Title: TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
Abstract: Visual Text Rendering (VTR) remains a critical challenge in text-to-image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and Reinforcement Learning (RL)-based optimization. As a result, even state-of-the-art generators (e.g., SeedDream4.0, Qwen-Image) still struggle to render structurally faithful text. To address this, we propose TextPecker, a plug-and-play structural anomaly perceptive RL strategy that mitigates noisy reward signals and works with any text-to-image generator. To enable this capability, we construct a recognition dataset with character-level structural-anomaly annotations and develop a stroke-editing synthesis engine to expand structural-error coverage. Experiments show that TextPecker consistently improves diverse text-to-image models; even on the well-optimized Qwen-Image, it yields significant average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing a new state-of-the-art in high-fidelity VTR. Our work fills a gap in VTR optimization, providing a foundational step towards reliable and structurally faithful visual text generation.
PaperID: 713,   Poster  https://arxiv.org/pdf/2512.15701     GitHub
Authors: Kyle Sargent, Ruiqi Gao, Philipp Henzler, Charles Herrmann, Aleksander Holynski, Li Fei-Fei, Jiajun Wu, Jason Y. Zhang
Title: VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
Abstract: Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception. In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights.
PaperID: 714,   Poster  https://arxiv.org/pdf/2512.02787     GitHub
Authors: xianchao Zeng, Xinyu Zhou, Youcheng Li, Jiayou Shi, Tianle Li, Liangming Chen, Lei Ren, Yonglu Li
Title: Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
Abstract: Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of these limitations, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,126 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we built the ViFailback-8B VLM, which not only achieves significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Code and data will be publicly available.
PaperID: 715,   Poster  https://arxiv.org/pdf/2603.23153     GitHub
Authors: August Leander Høeg, Sophia Bardenfleth, Hans Martin Kjer, Tim Dyrby, Vedrana Dahl, Anders Dahl
Title: VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution
Abstract: Recent advances in volumetric super-resolution (SR) have demonstrated great performance in medical and scientific imaging, with transformer- and CNN-based approaches achieving impressive results even at extreme scaling factors. We show that this impressive performance largely stems from training on downsampled data rather than real low-resolution scans. Such a training setup arises partly from the limited availability of paired high- and low-resolution volumetric datasets. To address this gap, we introduce VoDaSuRe, a large-scale volumetric dataset containing paired high- and low-resolution scans. When training models on VoDaSuRe, we reveal a significant discrepancy: models trained on downscaled data produce substantially sharper predictions than those trained on real low-resolution scans, which smooth fine structures. Conversely, applying models trained on downscaled data to real scans preserves more structure but is inaccurate. Our findings suggest that the performance of current SR methods is overstated: when applied to real data, they do not recover structures lost in low-resolution scans but instead predict a smoothed average. We argue that progress in deep learning-based volumetric SR requires datasets with paired real scans of high complexity, such as VoDaSuRe. Our dataset and code are publicly available at linkwhenpublished.
PaperID: 716,   Poster  https://arxiv.org/pdf/2603.02210     GitHub
Authors: Yi Chen Liu, Donghao Zhou, Jie Wang, Xin Gao, Guisheng Liu, Jiatong Li, Quanwei Zhang, Qiang Lyu, Lanqing Guo, Shilei Wen, Weiqiang Wang, Pheng-Ann Heng
Title: HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
Abstract: Human-product images, which showcase the integration of humans and products, play a vital role in advertising, e-commerce, and digital marketing. The essential challenge of generating such images lies in ensuring the high-fidelity preservation of product details. Among existing paradigms, reference-based inpainting offers a targeted solution by leveraging product reference images to guide the inpainting process. However, limitations remain in three key aspects: the lack of diverse large-scale training data, the struggle of current models to focus on product detail preservation, and the inability of coarse supervision to provide precise guidance. To address these issues, we propose HiFi-Inpaint, a novel high-fidelity reference-based inpainting framework tailored for generating human-product images. HiFi-Inpaint introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps. Additionally, we construct a new dataset, HP-Image-40K, with samples curated from self-synthesis data and processed with automatic filtering. Experimental results show that HiFi-Inpaint achieves state-of-the-art performance, delivering detail-preserving human-product images. Our data, model, and code will be publicly available.
PaperID: 717,   Poster  https://arxiv.org/pdf/2511.22396     GitHub
Authors: Run Shao, Ziyu Li, Zhaoyang Zhang, Linrui Xu, Xinran He, Hongyuan Yuan, Bolei He, Yongxing Dai, Yan Yiming, Chen Yijun, Wang Guo, Haifeng Li
Title: Asking like Socrates: Socrates helps VLMs understand remote sensing images
Abstract: Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision–language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models will be available.
PaperID: 718,   Poster  https://arxiv.org/pdf/2603.21626     GitHub
Authors: Jiacheng Lu, Hui Ding, Shiyu Zhang, Guoping Huo
Title: PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation
Abstract: Brain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks often overlook clinically observed spatial priors of tumor occurrence, leading to redundant feature computation over extensive background regions. To address this issue, we propose PGR-Net (Prior-Guided Region Network), an explicit ROI-aware framework that incorporates a data-driven spatial prior set to capture the distribution and scale characteristics of tumor lesions, providing global guidance for more stable segmentation. Leveraging these priors, PGR-Net introduces a hierarchical Top-K ROI decision mechanism that progressively selects the most confident lesion candidate regions across encoder layers to improve localization precision. We further develop the WinGS-ROI (Windowed Gaussian–Spatial Decay ROI) module, which uses multi-window Gaussian templates with a spatial decay function to produce center-enhanced guidance maps, thus directing feature learning throughout the network. With these ROI features, a windowed RetNet backbone is adopted to enhance localization reliability. Experiments on BraTS 2019, BraTS 2023, and MSD Task01 show that PGR-Net consistently outperforms existing approaches, achieving Dice scores of 89.02%, 91.82%, and 89.67% on the Whole Tumor (WT) region, respectively.
PaperID: 719,   Poster  https://arxiv.org/pdf/2603.28120     GitHub
Authors: Yang Guangjing, Ziyuan Qin, Chaoran Zhang, Chenlin Du, Jinglin Wang, Wanran Sun, Zhenyu Zhang, Bing Ji, Qicheng Lao
Title: MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding
Abstract: Medical visual grounding serves as a crucial foundation for fine-grained multimodal reasoning and interpretable clinical decision support. Despite recent advances in reinforcement learning (RL) for grounding tasks, existing approaches such as Group Relative Policy Optimization (GRPO) suffer from severe reward sparsity when directly applied to medical images, primarily due to the inherent difficulty of localizing small or ambiguous regions of interest, which is further exacerbated by the rigid and suboptimal nature of fixed IoU-based reward schemes in RL. This leads to vanishing policy gradients and stagnated optimization, particularly during early training. To address this challenge, we propose MedLoc-R1, a performance-aware reward scheduling framework that progressively tightens the reward criterion in accordance with model readiness. MedLoc-R1 introduces a sliding-window performance tracker and a multi-condition update rule that automatically adjust the reward schedule from dense, easily obtainable signals to stricter, fine-grained localization requirements, while preserving the favorable properties of GRPO without introducing auxiliary networks or additional gradient paths. Experiments on three medical visual grounding benchmarks demonstrate that MedLoc-R1 consistently improves both localization accuracy and training stability over GRPO-based baselines. Our framework offers a general, lightweight, and effective solution for RL-based grounding in high-stakes medical applications. Code and checkpoints will be released after acceptance.
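Illustrative note (not from the paper): the idea of tightening an IoU-based reward as recent performance improves might be sketched as below. The class name `CurriculumIoUReward`, every numeric default, and the single promotion condition are assumptions made for illustration; the abstract describes a multi-condition update rule whose details are not given here.

```python
from collections import deque


class CurriculumIoUReward:
    """Binary IoU reward whose pass threshold tightens as recent rewards improve.

    Hypothetical sketch: the thresholds, window size, and promotion rule are
    illustrative assumptions, not values or rules taken from MedLoc-R1.
    """

    def __init__(self, start=0.3, end=0.75, step=0.05, window=4, promote_at=0.6):
        self.threshold = start        # current IoU needed to earn a reward
        self.end = end                # strictest threshold the schedule may reach
        self.step = step              # how much to tighten on each promotion
        self.promote_at = promote_at  # mean windowed reward that triggers tightening
        self.recent = deque(maxlen=window)  # sliding window of recent rewards

    def __call__(self, iou):
        reward = 1.0 if iou >= self.threshold else 0.0
        self.recent.append(reward)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and sum(self.recent) / len(self.recent) >= self.promote_at:
            # The model is "ready": tighten the criterion and restart tracking.
            self.threshold = min(self.end, round(self.threshold + self.step, 10))
            self.recent.clear()
        return reward
```

Early in training the loose threshold yields dense rewards; once the windowed success rate is high enough, the same prediction quality stops being rewarded and the policy has to localize more precisely.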
PaperID: 720,   Poster  https://arxiv.org/pdf/2603.10929     GitHub
Authors: Yu Fanqi, Matteo Tiezzi, Tommaso Apicella, Cigdem Beyan, Vittorio Murino
Title: Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment
Abstract: We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot-state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness. Our method establishes a new state of the art in the LIBERO benchmarks, achieving 10–17 point gains in AUC and up to 65% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code will be made publicly available upon acceptance of the paper.
PaperID: 721,   Poster  https://arxiv.org/pdf/2604.08810     GitHub
Authors: ZEWEI ZHOU, Jiajun Zou, Jiajia Zhang, Ao Yang, Ruichao He, Haozheng Zhou, Ao Liu, Jiawei Liu, Leilei Jin, Shan Shen, Daying Sun
Title: R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII
Abstract: Progress in machine learning for electronic design automation (EDA) is constrained by the lack of open, multi-view graph datasets that coherently represent the same circuits across late physical-design stages. We present R2G (RTL-to-GDSII), a standardized benchmark and framework that converts DEF files into typed, heterogeneous, information-preserving circuit graphs and supports node- and edge-level tasks in placement and routing. R2G provides five stage-aware views with information parity and includes loaders, unified splits, domain-specific metrics, and reproducible baselines, enabling fair cross-view comparison and isolating representation from modeling. In systematic studies with classic GNNs (GIN, GAT, GatedGCN), we show that view choice strongly affects performance, varies with stage and supervision, and that decoder-head depth (3–4 layers) improves accuracy and stability; these findings connect view semantics to objectives and message passing and offer practical guidance. By bridging EDA semantics and graph learning, R2G releases large-scale datasets and an end-to-end pipeline, creating an open testbed for principled representation design. Datasets, loaders, and evaluation scripts will be released on GitHub.
PaperID: 722,   Poster  https://arxiv.org/pdf/2512.01236     GitHub
Authors: Shulei Wang, Longhui Wei, XIN HE, Jianbo Ouyang, Hui Lu, Zhou Zhao, Qi Tian
Title: PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards
Abstract: Personalized generation models for a single subject have demonstrated remarkable effectiveness, highlighting their significant potential. However, when extended to multiple subjects, existing models often exhibit degraded performance, particularly in maintaining subject consistency and adhering to textual prompts. We attribute these limitations to the absence of high-quality multi-subject datasets and refined post-training strategies. To address these challenges, we propose a scalable multi-subject data generation pipeline that leverages powerful single-subject generation models to construct diverse and high-quality multi-subject training data. Through this dataset, we first enable single-subject personalization models to acquire knowledge of synthesizing multi-image and multi-subject scenarios. Furthermore, to enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards and general-purpose rewards, which are incorporated into a refined reinforcement learning stage. To comprehensively evaluate multi-subject personalization, we introduce a new benchmark that assesses model performance using seven subsets across three dimensions. Extensive experiments demonstrate the effectiveness of our approach in advancing multi-subject personalized image generation.
PaperID: 723,   Poster  https://arxiv.org/pdf/2504.16930     GitHub
Authors: David Yan, Alexander Raistrick, Jia Deng
Title: What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?
Abstract: Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stereo matching performance using standard benchmarks. We validate our findings by collecting the best settings and creating a large-scale dataset. Training only on this dataset achieves better performance than training on a mixture of widely used datasets, and is competitive with training on the FoundationStereo dataset, with the additional benefit of open-source generation code and an accompanying parameter analysis. We open-source our system to enable further research on procedural stereo datasets.
PaperID: 724,   Poster  https://arxiv.org/pdf/2603.04291     GitHub
Authors: Lingen Li, Guangzhi Wang, Xiaoyu Li, Zhaoyang Zhang, Qi Dou, Jinwei Gu, Tianfan Xue, Ying Shan
Title: CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video
Abstract: Generating high-quality 360° panoramic videos from perspective input is one of the crucial applications for virtual reality (VR), whereby high-resolution videos are especially important for immersive experience. Existing methods are constrained by computational limitations of vanilla diffusion models, only supporting native generation at ≤1K resolution and relying on suboptimal post super-resolution to increase resolution. We introduce CubeComposer, a novel spatio-temporal autoregressive diffusion model that natively generates 4K-resolution 360° videos. By decomposing videos into cubemap representations with six faces, CubeComposer autoregressively synthesizes content in a well-planned spatio-temporal order, reducing memory demands while enabling high-resolution output. Specifically, to address challenges in multi-dimensional autoregression, we propose: (1) a spatio-temporal autoregressive strategy that orchestrates 360° video generation across cube faces and time windows for coherent synthesis; (2) a cube face context management mechanism, equipped with a sparse context attention design to improve efficiency; and (3) continuity-aware techniques, including cube-aware positional encoding, padding, and blending to eliminate boundary seams. Extensive experiments on benchmark datasets demonstrate that CubeComposer outperforms state-of-the-art methods in native resolution and visual quality, supporting practical VR application scenarios.
PaperID: 725,   Poster  https://arxiv.org/pdf/2603.11346     GitHub
Authors: Yuto Shibata, Kashu Yamazaki, Lalit Jayanti, Yoshimitsu Aoki, Mariko Isogawa, Katerina Fragkiadaki
Title: Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning
Abstract: Humanoid robotics has strong potential to transform daily service and caregiving applications. Although recent advances in general motion tracking (GMT) within physics engines have enabled virtual characters and humanoid robots to reproduce a broad range of human motions, these behaviors remain largely isolated and non-interactive. Assistive scenarios, by contrast, require continuous awareness of a human partner and rapid adaptation to their evolving posture and dynamics. In this paper, we formulate the imitation of closely interacting, force-exchanging human–human motion sequences as a multi-agent reinforcement learning problem. We jointly train partner-aware policies for both the supporter (assistant) and the recipient in a physics simulator to track assistive motion references. To make this problem tractable, we introduce a partner-policy initialization scheme that transfers priors from single-human motion-tracking controllers, greatly improving exploration. We further propose dynamic reference retargeting and a contact-promoting reward, which adapt the assistant's reference motion to the recipient's real-time pose and encourage physically meaningful support. We show that our method, AssistMimic, is the first capable of successfully tracking assistive interaction motions on established benchmarks, demonstrating the benefits of a multi-agent RL formulation for physically grounded and socially aware humanoid control. We will make our code available upon acceptance.
PaperID: 726,   Poster  https://arxiv.org/pdf/2603.03827     GitHub
Authors: Qianrui Zhou, Hua Xu, Yunjin Gu, Yifan Wang, Songze Li, Hanlei Zhang
Title: Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition
Abstract: Multimodal intent recognition aims to infer human intents by jointly modeling various modalities, playing a pivotal role in real-world dialogue systems. However, current methods struggle to model hierarchical semantics underlying complex intents and lack the capacity for self-evolving reasoning over multimodal representations. To address these issues, we propose HIER, a novel method that integrates HIerarchical semantic representation with Evolutionary Reasoning based on a Multimodal Large Language Model (MLLM). Inspired by human cognition, HIER introduces a structured reasoning paradigm that organizes multimodal semantics into three progressively abstracted levels. It starts with modality-specific tokens extracted by encoders, capturing localized semantic cues. These tokens are then abstracted into semantic concepts via a label-guided clustering strategy, yielding mid-level intent-aware patterns. To capture higher-order structure, inter-concept relations are selected through a JS-divergence-based mechanism to highlight salient dependencies across concepts. These hierarchical representations are then injected into the MLLM via CoT-driven prompting, enabling step-wise reasoning. In addition, HIER utilizes a self-evolution mechanism that refines semantic representations through MLLM feedback, allowing dynamic adaptation during inference. Experiments on three challenging benchmarks show that HIER not only consistently outperforms state-of-the-art methods and MLLMs with 1–3% gains across all metrics, but also exhibits strong generalization across diverse backbones.
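Illustrative note (not from the paper): the Jensen-Shannon (JS) divergence mentioned above is a symmetric, bounded measure of how far apart two probability distributions are. The sketch below computes it for discrete distributions and ranks concept pairs by it. The helper `rank_concept_pairs`, the representation of a concept as a probability vector, and the descending sort order are all assumptions; the abstract does not state how HIER represents concepts or which pairs its mechanism selects.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions; terms with p_i == 0 add 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def rank_concept_pairs(concepts):
    """Score every pair of concepts and sort by JS divergence, largest first.

    `concepts` maps a concept name to a probability vector (hypothetical input
    format). Whether large or small divergence marks a "salient" relation is
    left to the caller.
    """
    scored = [
        (a, b, js_divergence(concepts[a], concepts[b]))
        for a, b in combinations(sorted(concepts), 2)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)
```

With natural logarithms the score lies between 0 (identical distributions) and ln 2 (disjoint support), which makes a fixed selection threshold easy to interpret.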
PaperID: 727,   Poster  https://arxiv.org/pdf/2603.27599     GitHub
Authors: Yixing Zhu, Qing Zhang, Wenju Xu, Wei-Shi Zheng
Title: You Only Erase Once: Erasing Anything without Bringing Unexpected Content
Abstract: We present YOEO, an approach for object erasure. Unlike recent diffusion-based methods, which struggle to erase target objects without generating unexpected content within the masked regions due to a lack of sufficient paired training data and explicit constraints on content generation, our method produces high-quality object erasure results free of unwanted objects or artifacts while faithfully preserving contextual coherence with the surrounding content. We achieve this goal by training an object erasure diffusion model on unpaired data containing only large-scale real-world images, under the supervision of a sundries detector and a context coherence loss that are built upon an entity segmentation model. To enable more efficient training and inference, a diffusion distillation strategy is employed to train a few-step erasure diffusion model. Extensive experiments show that our method outperforms the state-of-the-art object erasure methods. Our code and trained model will be publicly released.
PaperID: 728,   Poster  https://arxiv.org/pdf/2604.00940     GitHub
Authors: Miro Miranda, Deepak Pathak, Patrick Helber, Benjamin Bischke, Hiba Najjar, Francisco Mena, Cristhian Sanchez, Akshay Pai, Diego Arenas, Matias Valdenegro, Marcela Charfuelan, Marlon Nuske, Andreas Dengel
Title: YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction
Abstract: Crop yield prediction requires substantial data to train data-driven models. However, creating yield prediction datasets is constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations. Consequently, existing datasets are scarce, low in quality, or limited to regional levels or single crop types. In this work, we release YieldSAT, a large, high-quality, and multimodal dataset for high-resolution crop yield prediction. YieldSAT spans various climate zones across two continents and multiple countries, including Argentina, Brazil, Uruguay, and Germany. The dataset was collected between 2016 and 2024 and comprises four major crop types (corn, rapeseed, soybeans, and wheat) across 2,176 curated fields. In total, over 12.2 million yield samples are available, each with a spatial resolution of 10 m. Each field is paired with multispectral satellite imagery, resulting in 113,630 labeled satellite images, complemented by auxiliary environmental data. We demonstrate the potential of large-scale and high-resolution crop yield prediction as an image regression task by comparing various deep learning models and data fusion architectures. Furthermore, we highlight open challenges arising from severe distribution shifts in the ground truth data. To mitigate this, we explore a domain-informed Deep Ensemble that exhibits greater diversity in the weight space.
PaperID: 729,   Poster  https://arxiv.org/pdf/2603.07659     GitHub
Authors: Kaihua Tang, JIAXIN QI, Jinliou Jinliou, Yuhua Zheng, Jianqiang Huang
Title: Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework
Abstract: The emergence of Large Language Models (LLMs) has driven rapid progress in multimodal learning, particularly in the development of Large Vision-Language Models (LVLMs). However, existing LVLM training paradigms rely excessively on the LLM component, giving rise to two critical robustness challenges: language bias and language sensitivity. To address both issues simultaneously, we propose a novel Self-Critical Inference (SCI) framework that extends Visual Contrastive Decoding by conducting multi-round counterfactual reasoning through both textual and visual perturbations. This process further introduces a new strategy for improving robustness by scaling the number of counterfactual rounds. We also observe that failure cases of LVLMs differ significantly across models, indicating that fixed robustness benchmarks may not capture the true reliability of LVLMs. To this end, we propose the Dynamic Robustness Benchmark (DRBench), a model-specific evaluation framework targeting both language bias and sensitivity issues. Extensive experiments show that SCI consistently outperforms baseline methods on DRBench, and that increasing the number of inference rounds further boosts robustness beyond existing single-step counterfactual reasoning methods.
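Illustrative note (not from the paper): Visual Contrastive Decoding, which SCI is said to extend, contrasts next-token logits computed from the clean input against logits computed from a perturbed input, commonly as (1 + alpha) * clean - alpha * perturbed. The sketch below applies that contrast against several perturbed rounds. Averaging the perturbed logits across rounds is an assumption made here for illustration, not SCI's actual aggregation rule, and the function name and `alpha` default are hypothetical. The plausibility filtering used in some contrastive decoding variants is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_next_token_probs(clean_logits, perturbed_logits_per_round, alpha=1.0):
    """Contrast clean logits against the mean logits of several perturbed rounds.

    clean_logits: logits from the original image and prompt.
    perturbed_logits_per_round: one logit list per counterfactual round
        (e.g. a distorted image or a rephrased prompt); must be non-empty.
    alpha: contrast strength; alpha = 0 recovers ordinary decoding.
    """
    rounds = len(perturbed_logits_per_round)
    mean_perturbed = [sum(col) / rounds for col in zip(*perturbed_logits_per_round)]
    contrasted = [
        (1 + alpha) * c - alpha * p for c, p in zip(clean_logits, mean_perturbed)
    ]
    return softmax(contrasted)
```

Tokens that stay likely even when the visual or textual evidence is perturbed are presumably driven by language priors, so the contrast lowers their probability relative to tokens that depend on the actual input.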
PaperID: 730,   Poster  https://arxiv.org/pdf/2603.04839     GitHub
Authors: Yuanbo Li, Tianyang Xu, Cong Hu, Tao Zhou, Xiaojun Wu, Josef Kittler
Title: Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction
Abstract: With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. Adversarial examples can typically be designed to be transferable, attacking not only different models but also diverse tasks. However, existing attacks on vision-language models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. It accomplishes this by establishing a contrastive learning mechanism involving adversarial, positive, and negative samples to reinforce the semantic inconsistency of the obtained perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit attacks on VLP models, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code will be released.
PaperID: 731,   Poster  https://arxiv.org/pdf/2603.02286     GitHub
Authors: Yaoteng Zhang, Qing Zhou, Junyu Gao, Qi Wang
Title: Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection
Abstract: Incremental Object Detection (IOD) aims to continuously learn new object categories without forgetting previously learned ones. Recently, prompt-based methods have gained popularity for their replay-free design and parameter efficiency. However, due to prompt coupling and prompt drift, these methods often suffer from prompt degradation during continual adaptation. To address these issues, we propose a novel prompt-decoupled framework called PDP. PDP introduces a dual-pool prompt decoupling paradigm, which consists of a shared pool used to capture task-general knowledge for forward transfer, and a private pool used to learn task-specific discriminative features. This paradigm explicitly separates task-general and task-specific prompts, preventing interference between prompts and mitigating prompt coupling. In addition, to counteract prompt drift resulting from inconsistent supervision where old foreground objects are treated as background in subsequent tasks, PDP introduces a Prototypical Pseudo-Label Generation (PPG) module. PPG dynamically updates the class prototype space during training and uses the class prototypes to further filter valuable pseudo-labels, maintaining supervisory signal consistency throughout the incremental process. PDP achieves state-of-the-art performance on MS-COCO (with a 9.2% AP improvement) and PASCAL VOC (with a 3.3% AP improvement) benchmarks, highlighting its potential in balancing stability and plasticity.
PaperID: 732,   Poster  https://arxiv.org/pdf/2603.09418     GitHub
Authors: Bohao Li, Zhicheng Cao, Huixian Li, Yangming Guo
Title: CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation
Abstract: State-of-the-art whole-body pose estimators often lack robustness, producing anatomically implausible predictions in challenging scenes. We posit this failure stems from spurious correlations learned from visual context, a problem we formalize using a Structural Causal Model (SCM). The SCM identifies visual context as a confounder that creates a non-causal backdoor path, corrupting the model's reasoning. We introduce the Causal Intervention Graph Pose (CIGPose) framework to address this by approximating the true causal effect between visual evidence and pose. The core of CIGPose is a novel Causal Intervention Module: it first identifies confounded keypoint representations via predictive uncertainty and then replaces them with learned, context-invariant canonical embeddings. These deconfounded embeddings are processed by a hierarchical graph neural network that reasons over the human skeleton at both local and global semantic levels to enforce anatomical plausibility. Extensive experiments show CIGPose achieves a new state of the art on COCO-WholeBody. Notably, our CIGPose-x model achieves 67.0% AP, surpassing prior methods that rely on extra training data. With the additional UBody dataset, CIGPose-x is further boosted to 67.5% AP, demonstrating superior robustness and data efficiency.
PaperID: 733,   Poster  https://arxiv.org/pdf/2603.19776     GitHub
Authors: Chengzhi Hong, Bijun Li
Title: ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection
Abstract: Monocular 3D lane detection remains challenging due to depth ambiguity and weak geometric constraints. Mainstream methods rely on depth guidance, BEV projection, and anchor- or curve-based heads with simplified physical assumptions, remapping high-dimensional image features while only weakly encoding road geometry. Lacking an invariant geometric–topological coupling between lanes and the underlying road surface, 2D-to-3D lifting is ill-posed and brittle, often degenerating into concavities, bulges, and twists. To address this, we propose the Road-Manifold Assumption: the road is a smooth 2D manifold in \mathbb{R}^3, lanes are embedded 1D submanifolds, and sampled lane points are dense observations, coupling metric and topology across surfaces, curves, and samples. Building on this, we propose ReManNet: it first produces initial lane predictions with an image backbone and detection heads, then encodes geometry as Riemannian Gaussian descriptors on the symmetric positive-definite (SPD) manifold, and fuses these descriptors with visual features via a lightweight gate to maintain coherent 3D reasoning. We also propose the 3D Tunnel Lane IoU (3D-TLIoU) loss, a joint point–curve objective that computes slice-wise overlap of tubular neighborhoods along each lane to improve shape-level alignment. Extensive experiments on standard benchmarks demonstrate that ReManNet achieves state-of-the-art (SOTA) or competitive performance, and on OpenLane it improves F1 by +8.2% over the baseline and by +1.8% over the previous SOTA, with scenario-level F1 gains of up to +6.6%.
PaperID: 734,   Poster  https://arxiv.org/pdf/2604.05393     GitHub
Authors: Yuxin Yang, Yinan Zhou, Yuxin Chen, Ziqi Zhang, Zongyang Ma, Chunfeng Yuan, Bing Li, Jun Gao, Weiming Hu
Title: Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval
Abstract: Composed Image Retrieval (CIR) has demonstrated significant potential by enabling flexible, multimodal queries that combine a reference image and modification text. However, CIR inherently prioritizes semantic matching, struggling to reliably retrieve a user-specified instance across contexts. In practice, emphasizing concrete instance fidelity over broad semantics is often more consequential. In this work, we propose Object-Anchored Composed Image Retrieval (OACIR), a novel fine-grained retrieval task that mandates strict instance-level consistency. To advance research on this task, we construct OACIRR (OACIR on Real-world images), the first large-scale, multi-domain benchmark comprising over 160K quadruples and four challenging candidate galleries enriched with hard-negative instance distractors. Each quadruple augments the compositional query with a bounding box that visually anchors the object in the reference image, providing a precise and flexible way to ensure instance preservation. To perform the OACIR task, we propose AdaFocal, a framework featuring a Context-Aware Attention Modulator that adaptively intensifies attention within the specified instance region, dynamically balancing focus between the anchored instance and the broader compositional context. Extensive experiments demonstrate that AdaFocal substantially outperforms existing compositional retrieval models, particularly in maintaining instance-level fidelity, thereby establishing a robust baseline for this challenging task while opening new directions for more flexible, instance-aware retrieval systems.
PaperID: 735,   Poster  https://arxiv.org/pdf/2603.24912     GitHub
Authors: Jing Yang, Krithika Dharanikota, Emily Yue-ting Jia, Haiwei Chen, Yajie Zhao
Title: A Polarized Reflection and Material Dataset of Real World Objects
Abstract: Accurately modeling how real-world materials reflect light remains a core challenge in inverse rendering, largely due to the scarcity of real measured reflectance data. Existing approaches rely heavily on synthetic datasets with simplified illumination and limited material realism, preventing models from generalizing to real-world images. We introduce a large-scale polarized reflection and material dataset of real-world objects, captured with an 8-camera, 346-light Light Stage equipped with cross/parallel polarization. Our dataset spans 218 everyday objects across five acquisition dimensions—multiview, multi-illumination, polarization, reflectance separation, and material attributes—yielding over 1.2M high-resolution images with diffuse–specular separation and analytically derived diffuse albedo, specular albedo, and surface normals. Using this dataset, we train and evaluate state-of-the-art inverse and forward rendering models on intrinsic decomposition, relighting, and sparse-view 3D reconstruction, demonstrating significant improvements in material separation, illumination fidelity, and geometric consistency. We hope that our work can establish a new foundation for physically grounded material understanding and enable real-world generalization beyond synthetic training regimes.
PaperID: 736,   Poster  https://arxiv.org/pdf/2512.07410     GitHub
Authors: Bin Li, Ruichi Zhang, Han Liang, Jingyan Zhang, Juze Zhang, Xin Chen, Lan Xu, Jingyi Yu, Jingya Wang
Title: InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs
Abstract: Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential for multi-agent interactions. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven physics-based multi-agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi-stream blocks, which decouples proprioception, exteroception, and action to mitigate cross-modal interference while enabling synergistic coordination. We further propose a novel interaction graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies to facilitate network learning. Additionally, within it we devise a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines, achieving state-of-the-art performance. It produces coherent, physically plausible, and semantically faithful multi-agent behaviors from only text prompts. Our code and data will be released to facilitate future research.
PaperID: 737,   Poster  https://arxiv.org/pdf/2511.20785     GitHub
Authors: Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Bo Li, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing
Title: LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Abstract: Large multimodal models (LMMs) have shown great potential in video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucination, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos—by first skimming globally and then examining relevant clips for details—we introduce LongVT, an end-to-end agentic framework that sparks "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs’ inherent temporal grounding ability as a native video cropping tool to zoom in on specific video clips and resample finer-grained frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering data for long-video reasoning, we curated and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.7K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark contains 1,280 QA pairs carefully verified through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validations, LongVT consistently outperforms strong existing baselines across four challenging long-video understanding and reasoning benchmarks.
PaperID: 738,   Poster  https://arxiv.org/pdf/2512.24176     GitHub
Authors: Xingyu Zhou, Qifan Li, Xiaobin Hu, Hai Chen, Shuhang Gu
Title: Guiding a Diffusion Transformer with the Internal Dynamics of Itself
Abstract: The diffusion model presents a powerful ability to capture the entire (conditional) data distribution. However, due to the lack of sufficient training and data to learn from, the model will be penalized for failing to cover low-probability areas. To achieve better generation quality, guidance strategies such as classifier-free guidance (CFG) can guide the samples to the high-probability areas during the sampling stage. However, standard CFG often leads to over-simplified or distorted samples, and the alternative line of guiding a diffusion model with a bad version of itself is limited by carefully designed degradation strategies, extra training, and additional sampling steps. In this paper, we propose a simple yet effective strategy, Internal Guidance (IG), which introduces auxiliary supervision on an intermediate layer during training and extrapolates the outputs of the intermediate and deep layers to obtain generative results during sampling. This simple strategy yields significant improvements in both training efficiency and generation quality on DiTs and SiTs. On ImageNet 256×256, SiT-XL/2+IG achieves FID=5.31 and FID=1.88, which already exceeds the FID of the vanilla SiT-XL and REPA. More impressively, LightningDiT-XL/1+IG achieves FID=1.41, a large margin over all of these methods. Combined with classifier-free guidance, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.23.
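The two guidance rules named in this abstract can be illustrated with a minimal sketch. The CFG combination is the commonly used one (with scale `w = 1` returning the conditional prediction). The internal-guidance function is only an assumed analogue that extrapolates from an intermediate-layer output toward the deep-layer output; the function names and the scale `w` are hypothetical and not taken from the paper.

```python
def cfg(eps_uncond, eps_cond, w):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for w > 1, past) the conditional prediction."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def internal_guidance(out_mid, out_deep, w):
    """Assumed analogue of CFG: extrapolate from an intermediate-layer
    output toward the deep-layer output of the same network."""
    return [m + w * (d - m) for m, d in zip(out_mid, out_deep)]
```

With `w = 1` both functions return the second argument unchanged; larger values push the result further along the same direction.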
PaperID: 739,   Poster  https://arxiv.org/pdf/2508.01617     GitHub
Authors: XUANZHAO DONG, Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Peijie Qiu, Shao Tang, Xin Li, Yalin Wang
Title: LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding
Abstract: Autoregressive models (ARMs) have long dominated the landscape of biomedical vision-language models (VLMs). Recently, masked diffusion models such as LLaDA have emerged as promising alternatives, yet their application in the biomedical domain remains largely underexplored. To bridge this gap, we introduce LLaDA-MedV, the first large language diffusion model tailored for biomedical image understanding through vision instruction tuning. LLaDA-MedV achieves relative performance gains of 7.855% over LLaVA-Med and 1.867% over LLaDA-V in the open-ended biomedical visual conversation task, and sets new state-of-the-art accuracy on the closed-form subset of three VQA benchmarks: 84.93% on VQA-RAD, 92.31% on SLAKE, and 95.15% on PathVQA. Furthermore, a detailed comparison with LLaVA-Med suggests that LLaDA-MedV is capable of generating reasonably longer responses by explicitly controlling response length, which can lead to more informative outputs. We also conduct an in-depth analysis of both the training and inference stages, highlighting the critical roles of initialization weight selection, fine-tuning strategies, and the interplay between sampling steps and response repetition. All code and model weights will be released publicly to support future research.
PaperID: 740,   Poster  https://arxiv.org/pdf/2601.05116     GitHub
Authors: Zirui Wu, Zeren Jiang, Martin R. Oswald, Jie Song
Title: From Rays to Projections: Better Inputs for Feed-Forward View Synthesis
Abstract: Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on our view-consistency benchmark. It also achieves state-of-the-art quality on standard novel view synthesis benchmarks.
PaperID: 741,   Poster  https://arxiv.org/pdf/2508.14036     GitHub
Authors: Ken Deng, Yunhan Yang, Jingxiang Sun, Xihui Liu, Yebin Liu, Ding Liang, Yan-Pei Cao
Title: GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation
Abstract: We introduce GeoSAM2, a prompt-controllable framework for 3D part segmentation that casts the task as multi-view 2D mask prediction. Given a textureless object, we render normal and point maps from predefined viewpoints and accept simple 2D prompts—clicks or boxes—to guide part selection. These prompts are processed by a shared SAM2 backbone augmented with LoRA and residual geometry fusion, enabling view-specific reasoning while preserving pretrained priors. The predicted masks are back-projected to the object and aggregated across views. Our method enables fine-grained, part-specific control without requiring text prompts, per-shape optimization, or full 3D labels. In contrast to global clustering or scale-based methods, prompts are explicit, spatially grounded, and interpretable. We achieve state-of-the-art class-agnostic performance on PartObjaverse-Tiny and PartNetE, outperforming both slow optimization-based pipelines and fast but coarse feedforward approaches. Our results highlight a new paradigm: aligning 3D segmentation with SAM2, leveraging interactive 2D inputs to unlock controllability and precision in object-level part understanding.
PaperID: 742,   Poster  https://arxiv.org/pdf/2603.24139     GitHub
Authors: Zhanhe Lei, Zhongyuan Wang, Jikang Cheng, Baojin Huang, Yuhong Yang, Zhen Han, Chao Liang, Dengpan Ye
Title: Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection
Abstract: Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a "Tutor" agent learns to guide a "Student" (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample's loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student's immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student's generalization capabilities against unseen manipulation techniques compared to traditional training methods.
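The re-weighting and reward described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the batch loss is assumed to be a plain mean of weighted per-sample losses, and the reward is assumed to count incorrect-to-correct flips; the abstract does not specify the exact forms.

```python
def weighted_batch_loss(losses, weights):
    """Mean of per-sample losses scaled by Tutor weights clipped to [0, 1]."""
    clipped = [min(1.0, max(0.0, w)) for w in weights]
    return sum(l * w for l, w in zip(losses, clipped)) / len(losses)


def tutor_reward(was_correct, is_correct):
    """Assumed reward: +1 for each sample whose prediction flips from
    incorrect (before the update) to correct (after the update)."""
    return sum(1.0 for before, after in zip(was_correct, is_correct)
               if not before and after)
```

For example, `weighted_batch_loss([2.0, 4.0], [0.5, 1.5])` clips the second weight to 1 and averages the two scaled losses.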
PaperID: 743,   Poster  https://arxiv.org/pdf/2603.23637     GitHub
Authors: Peiyu Xu, Shuang Zhao, Xin Sun, Krishna Mullia, Raymond Fei, Iliyan Georgiev
Title: Stochastic Ray Tracing for the Reconstruction of 3D Gaussian Splatting
Abstract: Ray-tracing-based 3D Gaussian splatting (3DGS) enjoys the generality of supporting non-pinhole camera models and relightable formulations. However, such methods usually lag in performance, partially due to the need for depth-based sorting of all intersecting Gaussians along the traced rays. In this paper, we introduce a sorting-free differentiable stochastic formulation for ray-traced 3DGS, enabling efficient reconstruction and rendering of both standard and relightable 3DGS scenes. For standard 3DGS, our method offers performance comparable to rasterization-based 3DGS and outperforms sorting-based ray tracing. For relightable 3DGS, our technique provides higher-quality reconstructions and renderings thanks to the accurate shadow and shading computation provided by fully ray-traced shadow and light rays.
PaperID: 744,   Poster  https://arxiv.org/pdf/2512.04786     GitHub
Authors: Chia-Hao Chen, Yuanchen Guo, Zi-Xin Zou, Ze Yuan, Guan Luo, Xiaojuan Qi, Ding Liang, Yan-Pei Cao, Song-Hai Zhang
Title: Lafite: A Generative Latent Field for 3D Native Texturing
Abstract: Generating detailed and seamless textures for 3D meshes remains an open challenge. Recent image and video generation models, empowered by large-scale visual priors, are capable of producing highly detailed images and are thus promising for multi-view texture synthesis. However, evaluating texture quality involves multiple dimensions beyond visual fidelity. Multi-view back-projection often introduces seams and inconsistencies between different views or near occluded regions, while direct generation on UV-unwrapped maps suffers from UV distortions and ambiguities. Generating textures directly in 3D space offers an inherent advantage in ensuring continuity and spatial coherence, making it a critical and worthwhile research direction. Therefore, we systematically investigate 3D-native texture generation from the perspectives of representation and generation, and present current best practices for this approach. To this end, we employ a local vector field with a structured latent representation to model the joint distribution of texture and geometry. This design enables texture generation conditioned on high-fidelity geometric features within a unified latent space. Crucially, our approach is inherently free from occlusion artifacts, multi-view inconsistencies, and UV-related distortions caused by fragmented surface parameterizations. Extensive experiments demonstrate that our method produces high-quality, seamless textures and supports flexible downstream tasks such as editing and inpainting, marking a significant step forward in 3D-native texture generation.
PaperID: 745,   Poster  https://arxiv.org/pdf/2603.18101     GitHub
Authors: Mohammed Rahman Sherif Khan Mohammad, Ardhendu Behera, Sandip Pradhan, Swagat Kumar, Amr Ahmed
Title: Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
Abstract: Recent adapter-based CLIP tuning (e.g., Tip-Adapter) is a strong few-shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni-modal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to distill this relational knowledge directly into the Tip-Adapter’s key–value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter with zero extra latency or memory. Across standard 1-16-shot benchmarks, our method consistently establishes a new state-of-the-art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation.
PaperID: 746,   Poster  https://arxiv.org/pdf/2603.23194     GitHub
Authors: Yuanhang Lei, Tao Cheng, Xingxuan Li, Boming Zhao, Siyuan Huang, Ruizhen Hu, Peter Yichen Chen, Hujun Bao, Zhaopeng Cui
Title: PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning
Abstract: Achieving real-time physics-based animation that generalizes across diverse 3D shapes and discretizations remains a fundamental challenge. We introduce PhysSkin, a physics-informed framework that addresses this challenge. In the spirit of Linear Blend Skinning, we learn continuous skinning fields as basis functions lifting motion subspace coordinates to full-space deformation, with subspace defined by handle transformations. To generate mesh-free, discretization-agnostic, and physically consistent skinning fields that generalize well across diverse 3D shapes, PhysSkin employs a new neural skinning fields autoencoder which consists of a transformer-based encoder and a cross-attention decoder. Furthermore, we develop a novel physics-informed self-supervised learning strategy that incorporates on-the-fly skinning-field normalization and conflict-aware gradient correction, enabling effective balancing of energy minimization, spatial smoothness, and orthogonality constraints. PhysSkin shows outstanding performance on generalizable neural skinning and enables real-time physics-based animation.
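For readers unfamiliar with Linear Blend Skinning, which the abstract takes as its starting point, the classical formula deforms a point as a weighted sum of handle transformations applied to it. The sketch below shows only that classical 2D case with fixed weights; the function names are hypothetical, and the learned, continuous skinning fields of PhysSkin are not modeled here.

```python
def apply_affine(T, p):
    """Apply a 2x3 affine transform T = [[a, b, tx], [c, d, ty]] to a 2D point."""
    x, y = p
    return (T[0][0] * x + T[0][1] * y + T[0][2],
            T[1][0] * x + T[1][1] * y + T[1][2])


def linear_blend_skinning(p, weights, transforms):
    """Blend the handle-transformed copies of p with skinning weights
    (assumed non-negative and summing to 1)."""
    x = y = 0.0
    for w, T in zip(weights, transforms):
        tx, ty = apply_affine(T, p)
        x += w * tx
        y += w * ty
    return (x, y)
```

A point influenced equally by an identity handle and a handle translated by 2 along x moves by 1 along x.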
PaperID: 747,   Poster  https://arxiv.org/pdf/2603.01412     GitHub
Authors: Ben Kang, Jie Zhao, Xin Chen, Wanting Geng, Bin Zhang, Lu Zhang, Dong Wang, Huchuan Lu
Title: UETrack: A Unified and Efficient Framework for Single Object Tracking
Abstract: With growing real-world demands, efficient tracking has received increasing attention. However, most existing methods are limited to RGB inputs and struggle in multi-modal scenarios. Moreover, current multi-modal tracking approaches typically use complex designs, making them too heavy and slow for resource-constrained deployment. To tackle these limitations, we propose UETrack, a unified and efficient framework for single object tracking. UETrack demonstrates high practicality and versatility, efficiently handling multiple modalities including RGB, Depth, Thermal, Event, and Language, and addresses the gap in efficient multi-modal tracking. It introduces two key components: a Token-Pooling-based Mixture-of-Experts mechanism that enhances modeling capacity through feature aggregation and expert specialization, and a Target-aware Adaptive Distillation strategy that selectively performs distillation based on sample characteristics, reducing redundant supervision and improving performance. Extensive experiments on 12 benchmarks across 3 hardware platforms show that UETrack achieves a superior speed–accuracy trade-off compared to previous methods. For instance, UETrack-B achieves 69.2% AUC on LaSOT and runs at 163/56/60 FPS on GPU/CPU/AGX, demonstrating strong practicality and versatility. Code will be made available.
PaperID: 748,   Poster  https://arxiv.org/pdf/2603.22687     GitHub
Authors: Jiayin Sun, Caixia Sun, Boyu Yang, hailin li, Xiao Chen, Yi Zhang, Errui Ding, Liang Li, Chao Deng, Junlan Feng
Title: GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning
Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities. However, they struggle to perceive fine-grained geometric structures, constraining their capacity for geometric understanding and visual reasoning. To address this, we propose GeoTikzBridge, a framework that enhances local geometric perception and visual reasoning through tikz-based code generation. Within this framework, we build two models supported by two complementary datasets. The GeoTikzBridge-Base model is trained on the GeoTikz-Base dataset, the largest image-to-tikz dataset to date with 2.5M pairs (16× larger than existing open-sourced datasets). This process is achieved via iterative data expansion and a localized geometric transformation strategy. Subsequently, GeoTikzBridge-Instruct is fine-tuned on the GeoTikz-Instruct dataset, which is the first instruction-augmented tikz dataset supporting visual reasoning. Extensive experimental results demonstrate that our models achieve state-of-the-art performance among open-sourced MLLMs. Furthermore, GeoTikzBridge models can serve as plug-and-play reasoning modules for any MLLM (LLM), enhancing reasoning performance in geometric problem-solving.
PaperID: 749,   Poster  https://arxiv.org/pdf/2603.27969     GitHub
Authors: Pei An, Junfeng Ding, Jiaqi Yang, Yulong Wang, Jie Ma, Liangliang Nan
Title: Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs
Abstract: Image-to-point-cloud (I2P) registration aims to align 2D images with 3D point clouds by establishing reliable 2D-3D correspondences. The drastic modality gap between images and point clouds makes it challenging to learn features that are both discriminative and generalizable, leading to severe performance drops in unseen scenarios. We address this challenge by introducing a heterogeneous graph framework that jointly refines cross-modal features and correspondences within a unified architecture. The proposed graph represents a mapping between segmented 2D and 3D regions, which enhances cross-modal feature interaction and thus improves feature discriminability. In addition, modeling the consistency among vertices and edges within the graph enables pruning of unreliable correspondences. Building on these insights, we propose a heterogeneous graph embedded I2P registration method, termed Hg-I2P. It learns a heterogeneous graph by mining multi-path feature relationships, adapts features under the guidance of heterogeneous edges, and prunes correspondences using graph-based projection consistency. Experiments on six indoor and outdoor benchmarks under cross-domain setups demonstrate that Hg-I2P significantly outperforms existing methods in both generalization and accuracy. Code will be released upon publication.
PaperID: 750,   Poster  https://arxiv.org/pdf/2603.16645     GitHub
Authors: Melissa Schween, Mathis Kruse, Bodo Rosenhahn
Title: BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection
Abstract: We propose Bijective Universal Scene-Specific Anomalous Relationship Detection (BUSSARD), a normalizing flow-based model for detecting anomalous relations in scene graphs generated from images. Our work follows a multimodal approach, embedding object and relationship tokens from scene graphs with a language model to leverage semantic knowledge from the real world. A normalizing flow model is used to learn bijective transformations that map object-relation-object triplets from scene graphs to a simple base distribution (typically Gaussian), allowing anomaly detection through likelihood estimation. We evaluate our approach on the SARD dataset containing office and dining room scenes. Our method achieves more than 10% better AUROC results compared to the current state-of-the-art model, while simultaneously being five times faster. Through ablation studies, we demonstrate superior robustness and universality, particularly regarding the use of synonyms, with our model maintaining stable performance while the baseline shows 17.5% deviation. This work demonstrates the strong potential of learning-based methods for relationship anomaly detection in scene graphs. Our code will be published upon acceptance on https://github.com/author/BUSSARD.
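Likelihood-based anomaly detection with a normalizing flow rests on the change-of-variables formula, log p(x) = log p_Z(f(x)) + log |df/dx|. The sketch below uses a single one-dimensional affine bijection with a standard normal base as a toy stand-in. The parameters `shift` and `log_scale`, the threshold, and the function names are hypothetical; BUSSARD's actual flow over triplet embeddings is not reproduced here.

```python
import math


def affine_flow_log_likelihood(x, shift, log_scale):
    """log p(x) under z = (x - shift) * exp(-log_scale), z ~ N(0, 1)."""
    z = (x - shift) * math.exp(-log_scale)
    log_base = -0.5 * (z * z + math.log(2.0 * math.pi))  # log N(z; 0, 1)
    log_det = -log_scale                                  # log |dz/dx|
    return log_base + log_det


def is_anomalous(x, shift, log_scale, threshold):
    """Flag x when its log-likelihood falls below a chosen threshold."""
    return affine_flow_log_likelihood(x, shift, log_scale) < threshold
```

Inputs far from the learned mode receive low likelihood and are flagged, while inputs near the mode are not.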
PaperID: 751,   Poster  https://arxiv.org/pdf/2601.05124     GitHub
Authors: Runze He, YIJI CHENG, Tiankai Hang, Zhimin Li, Yu Xu, Zijin Yin, Shiyi Zhang, Wenxun Dai, Penghui Du, Ao Ma, Chunyu Wang, qinglin lu, Jizhong Han, Jiao Dai
Title: Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
Abstract: In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between structured reasoning text and the generated image, thereby improving the model’s overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.
PaperID: 752,   Poster  https://arxiv.org/pdf/2602.23734     GitHub
Authors: Hao Wu, Xudong Wang, Jialiang Zhang, Junlong Tong, Xinghao Chen, Junyan Lin, Yunpu Ma, Xiaoyu Shen
Title: UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking
Abstract: One-stream Transformer-based trackers achieve advanced performance in visual object tracking but suffer from significant computational overhead that hinders real-time deployment. While token pruning offers a path to efficiency, a critical limitation persists: no existing work performs pruning jointly across all three critical components—the search region, dynamic template, and static template. This isolation overlooks interdependencies, yielding suboptimal pruning and degraded accuracy. To address this, we introduce UTPTrack, a simple and Unified Token Pruning framework that, for the first time, jointly compresses all three components. UTPTrack employs an attention-guided, token type-aware strategy to holistically model redundancy, a design that seamlessly supports unified tracking across multi-modal and language-guided tasks within a single model. Comprehensive evaluations on 10 benchmarks demonstrate that UTPTrack achieves a new state-of-the-art in the accuracy-efficiency trade-off for pruning-based trackers, pruning 65.4% of vision tokens in RGB-based tracking and 67.5% in unified tracking while preserving 99.7% and 100.5% of baseline performance, respectively. This strong performance across both RGB and multimodal scenarios underlines its potential as a robust foundation for future research in efficient visual tracking.
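Attention-guided token pruning can be illustrated generically as keeping the tokens with the highest attention scores. The sketch below is a plain top-k selection under assumed inputs: the function name, the `keep_ratio` parameter, and the per-token scores are hypothetical, and UTPTrack's joint, token type-aware strategy across search region and templates is not modeled.

```python
def prune_tokens(scores, keep_ratio):
    """Return the indices (in original order) of the highest-scoring tokens,
    keeping round(len(scores) * keep_ratio) of them and at least one."""
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

For instance, pruning 65.4% of tokens as reported above corresponds to `keep_ratio = 1 - 0.654` in this sketch.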
PaperID: 753,   Poster  https://arxiv.org/pdf/2602.23022     GitHub
Authors: Xinglong Luo, Ao Luo, Zhengning Wang, Yueqi Yang, Chaoyu Feng, Lei Lei, Bing Zeng, Shuaicheng Liu
Title: DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis
Abstract: Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded alignment visual quality and compromised accuracy in downstream tasks. In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. DMAligner is crafted to tackle the challenges in image alignment from a new perspective, employing a generation-based solution that showcases strong capabilities and avoids the problems associated with flow-based image warping. Specifically, we propose a Dynamics-aware Diffusion Training approach for learning conditional image generation, synthesizing a novel view for image alignment. This incorporates a Dynamics-aware Mask Producing (DMP) module to adaptively distinguish dynamic foreground regions from static backgrounds, enabling the diffusion model to more effectively handle challenges that classical methods struggle to solve. Furthermore, we develop the Dynamic Scene Image Alignment (DSIA) dataset using Blender, which includes 1,033 indoor and outdoor scenes with over 30K image pairs tailored for image alignment. Extensive experimental results demonstrate the superiority of the proposed approach on DSIA benchmarks, as well as on a series of widely-used video datasets for qualitative comparisons. Code and dataset will be released.
PaperID: 754,   Poster  https://arxiv.org/pdf/2603.14209     GitHub
Authors: Shishi Xiao, Tongyu Zhou, David H. Laidlaw, Gromit Yeuk-Yin Chan
Title: ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control
Abstract: A pictorial chart is an effective medium for visual storytelling, seamlessly integrating visual elements with data charts. However, creating such images is challenging because the flexibility of visual elements often conflicts with the rigidity of chart structures. This process thus requires a creative deformation that maintains both data faithfulness and visual aesthetics. Current methods that extract dense structural cues from natural images (e.g., edge or depth maps) are ill-suited as conditioning signals for pictorial chart generation. We present ChArtist, a domain-specific method for generating pictorial charts automatically, offering two distinct types of control: 1) spatial control that aligns well with the chart structure, and 2) subject-driven control that respects the visual characteristics of a reference image. To achieve this, we introduce a skeleton-based spatial control representation. This representation encodes only the data-encoding information of the chart, allowing for the easy incorporation of reference visuals without a rigid outline constraint. We implement our method based on the Diffusion Transformer (DiT) and leverage an adaptive position encoding mechanism to manage these two controls. We further introduce Spatially Gated Attention to modulate the interaction between spatial control and subject control. To support the fine-tuning of pre-trained models for this task, we created a large-scale dataset of 30,000 triplets (skeleton, reference image, pictorial chart). We also propose a unified data accuracy metric to evaluate the data faithfulness of the generated charts. We believe this work demonstrates that current generative models can achieve data-driven visual storytelling by moving beyond general-purpose conditions to task-specific representations. The code and dataset will be released.
PaperID: 755,   Poster  https://arxiv.org/pdf/2603.09921     GitHub
Authors: Shan Ning, Longtian Qiu, Jiaxuan Sun, Xuming He
Title: WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
Abstract: Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER.
PaperID: 756,   Poster  https://arxiv.org/pdf/2603.10597     GitHub
Authors: Hao Zhou, Lu Qi, Xiangtai Li, Jie Zhang, Yi Liu, Xu Yang, Mingyu Fan, Fei Luo
Title: Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction
Abstract: Trajectory prediction is critical for autonomous driving, enabling safe and efficient planning in dense, dynamic traffic. Most existing methods optimize prediction accuracy under fixed-length observations. However, real-world driving often yields variable-length, incomplete observations, posing a challenge to these methods. A common strategy is to directly map features from incomplete observations to those from complete ones. This one-shot mapping, however, struggles to learn accurate representations for short trajectories due to significant information gaps. To address this issue, we propose a Progressive Retrospective Framework (PRF), which gradually aligns features from incomplete observations with those from complete ones via a cascade of retrospective units. Each unit consists of a Retrospective Distillation Module (RDM) and a Retrospective Prediction Module (RPM), where RDM distills features and RPM recovers previous timesteps using the distilled features. Moreover, we propose a Rolling-Start Training Strategy (RSTS) that enhances data efficiency during PRF training. PRF is plug-and-play with existing methods. Extensive experiments on the Argoverse 2 and Argoverse 1 datasets demonstrate the effectiveness of PRF. Code will be released.
PaperID: 757,   Poster  https://arxiv.org/pdf/2603.16769     GitHub
Authors: Qiaosi Yi, Shuai Li, Rongyuan Wu, Lingchen Sun, Zhengqiang ZHANG, Lei Zhang
Title: GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution
Abstract: Recently, reinforcement learning (RL) has been employed for improving generative image super-resolution (ISR) performance. However, the current efforts are focused on multi-step generative ISR, while one-step generative ISR remains underexplored due to its limited stochasticity. In addition, RL methods such as Direct Preference Optimization (DPO) require the generation of positive and negative sample pairs offline, leading to a limited number of samples, while Group Relative Policy Optimization (GRPO) only calculates the likelihood of the entire image, ignoring local details that are crucial for ISR. In this paper, we propose Group Direct Preference Optimization (GDPO), a novel approach to integrate RL into one-step generative ISR model training. First, we introduce a noise-aware one-step diffusion model that can generate diverse ISR outputs. To prevent performance degradation caused by noise injection, we introduce an unequal-timestep strategy to decouple the timestep of noise addition from that of diffusion. We then present the GDPO strategy, which integrates the principle of GRPO into DPO, to calculate the group-relative advantage of each online generated sample for model optimization. Meanwhile, an attribute-aware reward function is designed to dynamically evaluate the score of each sample based on its statistics of smooth and texture areas. Experiments demonstrate the effectiveness of GDPO in enhancing the performance of one-step generative ISR models.
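The group-relative advantage mentioned in the abstract above is commonly computed, in GRPO-style methods, by standardizing each sample's reward against the other samples generated for the same input. The sketch below shows only that normalization step; how GDPO combines it with the DPO objective is not specified in the abstract and is not modeled here. The function name, the use of the population standard deviation, and the `eps` term are assumptions of this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples.

    rewards -- reward scores for the samples generated from one input
    eps     -- small constant (an assumption) that avoids division by zero
               when every sample in the group receives the same reward
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Samples that score above the group mean receive a positive advantage and those below it a negative one, so the signal depends only on relative quality within the group.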
PaperID: 758,   Poster  https://arxiv.org/pdf/2601.01222     GitHub
Authors: Mengfei Li, Peng Li, Zheng Zhang, Jiahao Lu, Chengfeng Zhao, Wei Xue, Qifeng Liu, Sida Peng, Wenxiao ZHANG, Wenhan Luo, Yuan Liu, Yike Guo
Title: UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass
Abstract: We present UniSH, a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. A key challenge in this domain is the scarcity of large-scale, annotated real-world data, forcing a reliance on synthetic datasets. This reliance introduces a significant sim-to-real domain gap, leading to poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos. To address this, we propose an innovative training paradigm that effectively leverages unlabeled in-the-wild data. Our framework bridges strong, disparate priors from scene reconstruction and HMR, and is trained with two core components: (1) a robust distillation strategy to refine human surface details by distilling high-frequency details from an expert depth model, and (2) a two-stage supervision scheme, which first learns coarse localization on synthetic data, then fine-tunes on real data by directly optimizing the geometric correspondence between the SMPL mesh and the human point cloud. This approach enables our feed-forward model to jointly recover high-fidelity scene geometry, human point clouds, camera parameters, and coherent, metric-scale SMPL bodies, all in a single forward pass. Extensive experiments demonstrate that our model achieves state-of-the-art performance on human-centric scene reconstruction and delivers highly competitive results on global human motion estimation, comparing favorably against both optimization-based frameworks and HMR-only methods.
PaperID: 759,   Poster  https://arxiv.org/pdf/2604.03653     GitHub
Authors: Jun Li, Xuhang Lou, Jinpeng Wang, Yuting Wang, Yaowei Wang, Shu-Tao Xia, Bin Chen
Title: Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval
Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos based on text queries that describe only partial events. Existing methods suffer from incomplete global contextual perception, struggling with query ambiguity and local noise induced by spurious responses. To address these challenges, we propose DreamPRVR, which adopts a coarse-to-fine representation learning paradigm. The model first generates global contextual semantic registers as coarse-grained highlights spanning the entire video and then concentrates on fine-grained similarity optimization for precise cross-modal matching. Concretely, these registers are generated by initializing from the video-centric distribution produced by a probabilistic variational sampler and then iteratively refined via a text-supervised truncated diffusion model. During this process, textual semantic structure learning constructs a well-formed textual latent space, enhancing the reliability of global perturbation. The refined registers are then adaptively fused with video tokens through register-augmented Gaussian attention blocks, enabling context-aware feature learning. Extensive experiments show that DreamPRVR outperforms state-of-the-art methods. The code will be released publicly.
PaperID: 760,   Poster  https://arxiv.org/pdf/2603.02554     GitHub
Authors: Chonghua Lv, Dong Zhao, Shuang Wang, Dou Quan, Ning Huyan, Nicu Sebe, Zhun Zhong
Title: Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation
Abstract: Knowledge distillation (KD) has been widely applied in semantic segmentation to compress large models, but conventional approaches primarily preserve in-domain accuracy while neglecting out-of-domain generalization, which is essential under distribution shifts. This limitation becomes more severe with the emergence of vision foundation models (VFMs): although VFMs exhibit strong robustness on unseen data, distilling them with conventional KD often compromises this ability. We propose Generalizable Knowledge Distillation (GKD), a multi-stage framework that explicitly enhances generalization. GKD decouples representation learning from task learning. In the first stage, the student acquires domain-agnostic representations through selective feature distillation, and in the second stage, these representations are frozen for task adaptation, thereby mitigating overfitting to visible domains. To further support transfer, we introduce a query-based soft distillation mechanism, where student features act as queries to teacher representations to selectively retrieve transferable spatial knowledge from VFMs. Extensive experiments on five domain generalization benchmarks demonstrate that GKD consistently outperforms existing KD methods, achieving average gains of +1.9% in foundation-to-foundation (F2F) and +10.6% in foundation-to-local (F2L) distillation.
PaperID: 761,   Poster  https://arxiv.org/pdf/2603.11460     GitHub
Authors: SeungHee Choi, minju Jeon, Hyunwoo Oh, Jihwan Lee, Dong-Jin Kim
Title: Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning
Abstract: Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries. The proposed framework, STaRC, overcomes this limitation by supervising frame-level saliency through a highlight detection module. Note that the highlight detection module is trained on binary labels derived directly from DVC ground truth annotations without the need for additional annotation. We also propose to utilize the saliency scores as a unified temporal signal that drives retrieval via saliency-guided segmentation and informs caption generation through explicit Saliency Prompts injected into the decoder. By enforcing saliency-constrained segmentation, our method produces temporally coherent segments that align closely with actual event transitions, leading to more accurate retrieval and contextually grounded caption generation. We conduct comprehensive evaluations on the YouCook2 and ViTT benchmarks, where STaRC achieves state-of-the-art performance across most of the metrics.
PaperID: 762,   Poster  https://arxiv.org/pdf/2604.00507     GitHub
Authors: Jihwan Park, Chanhyeong Yang, Jinyoung Park, Taehoon Song, Hyunwoo J. Kim
Title: RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection
Abstract: Weakly supervised Human–Object Interaction (HOI) detection is vital for scalable scene understanding by learning interactions from only image-level annotations, i.e., no labels specifying which human–object instances are engaged in the interaction. Due to the lack of localization signals, prior works typically propose candidate pairs using an external object detector and then infer their interactions through pairwise reasoning. However, this framework often struggles to scale due to the substantial computational cost incurred by enumerating numerous instance pairs. In addition, it exhibits suboptimal performance due to false positives arising from non-interactive combinations, hindering its capability of instance-level HOI reasoning. To this end, we introduce Relational Grounding Transformer (RegFormer), a versatile interaction recognition module that enables efficient and accurate HOI reasoning. Under image-level supervision, RegFormer leverages spatially grounded implicit signals as guidance for the reasoning process, facilitating effective locality elicitation. Benefiting from the implicitly learned local interactions, our module can accurately distinguish humans, objects, and their interactions within their corresponding regions, enabling precise and efficient instance-level HOI reasoning without any additional training. Our extensive experiments and analysis demonstrate that RegFormer effectively learns spatial cues for instance-level interaction reasoning, operates with high efficiency, and even achieves performance comparable to fully supervised models.
PaperID: 763,   Poster  https://arxiv.org/pdf/2601.20524     GitHub
Authors: Matic Fučka, Vitjan Zavrtanik, Danijel Skočaj
Title: AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors
Abstract: Zero-shot anomaly detection aims to detect and localise abnormal regions in the image without access to any in-domain training images. While recent approaches leverage vision–language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based on purely vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as a backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by a significant 3.3 percentage points. Code will be released upon acceptance.
PaperID: 764,   Poster  https://arxiv.org/pdf/2512.07951     GitHub
Authors: Zekai Luo, Zongze Du, Zhouhang Zhu, Hao Zhong, Muzhi Zhu, Wen Wang, Yuling Xi, Chenchen Jing, Hao Chen, Chunhua Shen
Title: Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
Abstract: Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. This work presents LivingSwap, the first video reference guided face swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video’s expressions, lighting, and motion, while significantly reducing manual effort in production workflows.
PaperID: 765,   Poster  https://arxiv.org/pdf/2512.01329     GitHub
Authors: Hanzhi Guo, dongdong weng, SUMO SUMO, Yixiao Chen, Xiaonuo Dongye, Chenyu Xu
Title: TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking
Abstract: Topology-consistent dynamic model sequences are essential for applications such as animation and model editing. However, existing 4D reconstruction methods face challenges in generating high-quality topology-consistent meshes. To address this, we propose a topology-aware dynamic reconstruction framework based on Gaussian Splatting. We introduce a Gaussian topological structure that explicitly encodes spatial connectivity. This structure enables topology-aware densification and pruning, preserving the manifold consistency of the Gaussian representation. Temporal regularization terms further ensure topological coherence over time, while differentiable mesh rasterization improves mesh quality. Experimental results demonstrate that our method reconstructs topology-consistent mesh sequences with significantly higher accuracy than existing approaches. Moreover, the resulting meshes enable precise 3D keypoint tracking.
PaperID: 766,   Poster  https://arxiv.org/pdf/2603.29410     GitHub
Authors: Yubo Cui, Xianchao Guan, Zijun Xiong, Zheng Zhang
Title: AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models
Abstract: Pre-trained vision-language models (VLMs) have exhibited exceptional generalization capabilities in zero-shot tasks, yet remain vulnerable to adversarial examples. Conventional classification-guided adversarial fine-tuning often compromises the pre-trained cross-modal alignment, undermining the intricate visual-textual correspondence essential for zero-shot performance. To mitigate this, we introduce Alignment-Guided Fine-Tuning (AGFT), a novel framework that preserves semantic integrity while enhancing robustness. AGFT leverages the output distribution of pre-trained VLMs as the fine-tuning objective, thereby maintaining cross-modal semantic correspondence. Recognizing the divergence in feature alignment objectives between pre-trained and robust models, we further calibrate the output distribution by attenuating cross-modal feature similarity of robust models, all while safeguarding correspondence across images and diverse textual descriptions. This calibration ensures compatibility with robust feature representation without sacrificing generalization. Comprehensive experiments across diverse zero-shot datasets and settings demonstrate that AGFT achieves state-of-the-art performance, significantly improving the zero-shot adversarial robustness.
PaperID: 767,   Poster  https://arxiv.org/pdf/2505.12702     GitHub
Authors: Tianming Liang, Haichao Jiang, Yuting Yang, Chaolei Tan, Shuai Li, Wei-Shi Zheng, Jian-Fang Hu
Title: Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation
Abstract: Referring video object segmentation (RVOS) aims to identify, track and segment the objects in a video based on language descriptions, which has received great attention in recent years. However, existing datasets remain focused on short video clips within several seconds, with salient objects visible in most frames. To advance the task towards more practical scenarios, we introduce Long-RVOS, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos of an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance-reappearance and shot changing. The objects are manually annotated with three different types of descriptions to individually evaluate the understanding of static attributes, motion patterns and spatiotemporal relationships. Moreover, unlike previous benchmarks that rely solely on the per-frame spatial evaluation, we introduce two new metrics to assess the temporal and spatiotemporal consistency. We benchmark 7 state-of-the-art methods on Long-RVOS. The results show that current approaches struggle severely with the long-video challenges. To address this, we further propose ReferMo, a promising baseline method that integrates motion information to expand the temporal receptive field, and employs a local-to-global architecture to capture both short-term dynamics and long-term dependencies. Despite its simplicity, ReferMo achieves significant improvements over current methods in long-term scenarios. We hope that Long-RVOS and our baseline can drive future RVOS research towards tackling more realistic and long-form videos. Our dataset and code will be released.
PaperID: 768,   Poster  https://arxiv.org/pdf/2603.08536     GitHub
Authors: Chao Wang, Zijin Yang, Yaofei Wang, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen
Title: SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
Abstract: Recent advancements in video generation technologies have been significant, resulting in their widespread application across multiple domains. However, concerns have been mounting over the potential misuse of generated content. Tracing the origin of generated videos has become crucial to mitigate potential misuse and identify responsible parties. Existing video attribution methods require additional operations or the training of source attribution models, which may degrade video quality or necessitate large amounts of training samples. To address these challenges, we define for the first time the "few-shot training-free generated video attribution" task and propose SWIFT, which is tightly integrated with the temporal characteristics of the video. By leveraging the "Pixel Frames (many) ↔ Latent Frame (one)" temporal mapping within each video chunk, SWIFT applies a fixed-length sliding window to perform two distinct reconstructions: normal and corrupted. The variation in the losses between the two reconstructions is then used as an attribution signal. We conducted an extensive evaluation of five state-of-the-art (SOTA) video generation models. Experimental results show that SWIFT achieves over 90% average attribution accuracy with merely 20 video samples across all models and even enables zero-shot attribution for HunyuanVideo, EasyAnimate, and Wan2.2.
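The sliding-window loss comparison described in the SWIFT abstract above can be pictured with a small sketch. Everything specific here is an assumption: the abstract does not say how the corrupted reconstruction is built, how losses are aggregated, or which sign convention is used, so `normal_loss` and `corrupted_loss` are hypothetical callables supplied by the caller and the signal is simply the mean loss gap over windows.

```python
def sliding_windows(num_frames, window, stride):
    """Return (start, end) index pairs for fixed-length windows over the frames."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


def attribution_signal(frames, window, stride, normal_loss, corrupted_loss):
    """Average gap between corrupted and normal reconstruction losses.

    normal_loss / corrupted_loss -- hypothetical callables that take a chunk
    of frames and return a scalar reconstruction loss for a candidate model.
    """
    spans = sliding_windows(len(frames), window, stride)
    if not spans:
        raise ValueError("video is shorter than one window")
    gaps = []
    for start, end in spans:
        chunk = frames[start:end]
        gaps.append(corrupted_loss(chunk) - normal_loss(chunk))
    return sum(gaps) / len(gaps)


# Toy usage with stand-in loss functions (not real reconstructions).
frames = list(range(8))
signal = attribution_signal(frames, 4, 4, lambda c: 0.0, lambda c: 1.0)
```

In practice the two callables would wrap a candidate generator's encoder and decoder, and the resulting signal would be compared across candidate models to decide the source.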
PaperID: 769,   Poster  https://arxiv.org/pdf/2511.11134     GitHub
Authors: Jingxuan Wei, Caijun Jia, Xi Bai, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Lijun Wu, Cheng Tan
Title: GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models
Abstract: Unified Multimodal Models (UMMs) are redefining the landscape of artificial intelligence by coupling perception and generation across language, vision, and structured reasoning. Yet, despite their growing sophistication, a critical gap persists in evaluation: existing benchmarks largely measure discriminative understanding or unconstrained generation in isolation, overlooking the integrated generative reasoning required for genuine multimodal intelligence. To address this, we introduce GGBench, a benchmark explicitly designed to evaluate geometric generative reasoning: the ability of a model to understand, reason about, and construct a solution within a unified framework. Each instance in GGBench contains precisely aligned natural-language instructions, executable GeoGebra code, and rendered diagrams, enabling deterministic and interpretable verification of a model’s reasoning and constructive fidelity. The benchmark comprises 1,411 rigorously curated problems covering eight categories and multiple difficulty levels, resulting in over 7,000 aligned visualizations. We propose a comprehensive tri-modal evaluation protocol that jointly assesses textual planning quality, code executability, and geometric accuracy of generated diagrams through both automated and human-in-the-loop judging. Extensive experiments on both state-of-the-art UMMs and general Large Language Models (LLMs) reveal a large performance gap between end-to-end generation and reasoning-grounded construction. GGBench establishes a new standard for testing multimodal systems that must not only understand but also build, marking a crucial step toward grounded, verifiable generative intelligence.
PaperID: 770,   Poster  https://arxiv.org/pdf/2604.02320     GitHub
Authors: Junxuan Li, Rawal Khirodkar, Egor Zakharov, Jihyun Lee, Zhaoen Su, Yuan Dong, Julieta Martinez, Kai Li, Qingyang Tan, Takaaki Shiratori, Matthew Hu, Peihong Guo, Xuhua Huang, Zhongshi Jiang, LINGCHEN YANG, Ariyan Zarei, Marco Pesavento, Yichen Xu, Chengan He, He Wen, Giljoo Nam, Teng Deng, Wyatt Borsos, Anjali Thakrar, Jean-Charles Bazin, Rinat Abdrashitov, Carsten Stoll, Ginés Hidalgo, James Booth, Lucy Wang, Xiaowen Ma, Yu Rong, Sairanjith Thalanki, Chen Cao, Christian Häne, Abhishek Kar, Sofien Bouaziz, Jason Saragih, Yaser Sheikh, Shunsuke Saito
Title: Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
Abstract: High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feed-forward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.
PaperID: 771,   Poster  https://arxiv.org/pdf/2512.19918     GitHub
Authors: Houston Zhang, TAO ZHANG, Baoze Lin, Yuanqi Xue, Yincheng Zhu, Huan Liu, Li Gu, Linfeng Ye, Ziqiang Wang, Xinxin Zuo, Yang Wang, YUANHAO YU, Zhixiang Chi
Title: Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs
Abstract: User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with rich hierarchical context, widgets are compact, context-free micro-interfaces that summarize key information through dense layouts and iconography under strict spatial constraints. Moreover, while (image, code) pairs are widely available for web or mobile UIs, widget designs are proprietary and lack accessible markup. We formalize this setting as the Widget-to-Code (Widget2Code) task and introduce an image-only widget benchmark with fine-grained, multi-dimensional evaluation metrics. Benchmarking shows that although generalized multimodal large language models (MLLMs) outperform specialized UI2Code methods, they still produce unreliable and visually inconsistent code. To address these limitations, we develop a baseline that jointly advances perceptual understanding and structured code generation. At the perceptual level, we follow widget design principles to assemble atomic components into complete layouts, equipped with icon retrieval and UI composition modules. At the system level, we design an end-to-end infrastructure, WidgetFactory, which includes a framework-agnostic widget-tailored domain-specific language (WidgetDSL) and a compiler that translates it into multiple front-end implementations (e.g., React, HTML). An adaptive rendering module further refines spatial dimensions to satisfy compactness constraints. Together, these contributions substantially enhance visual fidelity, establishing a strong baseline and unified infrastructure for future Widget2Code research.
PaperID: 772,   Poster  https://arxiv.org/pdf/2604.01421     GitHub
Authors: Abhishek Saroha, Huajian Zeng, Xingxing Zuo, Daniel Cremers, Xi Wang
Title: EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation
Abstract: Understanding and predicting object motion from egocentric video is fundamental to embodied perception and interaction. However, generating physically consistent 6DoF trajectories remains challenging due to occlusions, fast motion, and the lack of explicit physical reasoning in existing generative models. We present EgoFlow, a flow-matching framework that synthesizes realistic and physically plausible trajectories conditioned on multimodal egocentric observations. EgoFlow employs a hybrid Mamba–Transformer–Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, while a gradient-guided inference process enforces differentiable physical constraints such as collision avoidance and motion smoothness. This combination yields coherent and controllable motion generation without post-hoc filtering or additional supervision. Experiments on HD-EPIC, EgoExo4D, and HOT3D show that EgoFlow outperforms diffusion-based and transformer baselines in accuracy, generalization, and physical realism, reducing collision rates by up to 79% and generalizing strongly to unseen scenes. Our results highlight the promise of flow-based generative modeling for scalable and physically grounded egocentric motion understanding.
PaperID: 773,   Poster  https://arxiv.org/pdf/2511.19117     GitHub
Authors: Minchong Chen, Xiaoyun Yuan, Junzhe Wan, Jianing Zhang, Jun Zhang
Title: 3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion
Abstract: The miniaturization of thermal sensors for mobile platforms inherently limits their spatial resolution and textural fidelity, leading to blurry and less informative images. Existing thermal super-resolution (SR) methods can be grouped into single-image and RGB-guided approaches: the former struggles to recover fine structures from limited information, while the latter relies on accurate and laborious cross-camera calibration, which hinders practical deployment and robustness. Here, we propose 3M-TI, a calibration-free Multi-camera cross-Modality diffusion framework for Mobile Thermal Imaging. At its core, 3M-TI integrates a cross-modal self-attention module (CSM) into the diffusion UNet, replacing the original self-attention layers to adaptively align thermal and RGB features throughout the denoising process, without requiring explicit camera calibration. This design enables the diffusion network to leverage its generative prior to enhance spatial resolution, structural fidelity, and texture detail in the super-resolved thermal images. Extensive evaluations on real-world mobile thermal cameras and public benchmarks validate our superior performance, achieving state-of-the-art results in both visual quality and quantitative metrics. More importantly, the thermal images enhanced by 3M-TI lead to substantial gains in critical downstream tasks like object detection and segmentation, underscoring its practical value for robust mobile thermal perception systems.
PaperID: 774,   Poster  https://arxiv.org/pdf/2509.22652     GitHub
Authors: E-Ro Nguyen, Yichi Zhang, Kanchana Ranasinghe, Xiang Li, Michael Ryoo
Title: Pixel Motion Diffusion is What We Need for Robot Control
Abstract: We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Visualization page: https://anonymous.4open.science/w/DAWN
PaperID: 775,   Poster  https://arxiv.org/pdf/2602.23790     GitHub
Authors: Changyu Gu, Linwei Chen, Lin Gu, Ying Fu
Title: Fourier Angle Alignment for Oriented Object Detection in Remote Sensing
Abstract: In remote sensing rotated object detection, mainstream methods suffer from two bottlenecks: directional incoherence at the detector neck and task conflict at the detection head. Utilising Fourier rotation equivariance, we introduce Fourier Angle Alignment, which analyses angle information through the frequency spectrum and aligns the main direction to a certain orientation. We then propose two plug-and-play modules: FAAFusion and FAA Head. FAAFusion works at the detector neck, aligning the main direction of higher-level features to the lower-level features and then fusing them. FAA Head serves as a new detection head, which pre-aligns RoI features to a canonical angle and adds them to the original features before classification and regression. Experiments on DOTA-v1.0, DOTA-v1.5 and HRSC2016 show that our method greatly improves upon previous work. In particular, our method achieves new state-of-the-art results of 78.72% mAP on DOTA-v1.0 and 72.28% mAP on DOTA-v1.5 with single-scale training and testing, validating the efficacy of our approach in remote sensing object detection. The code will be public upon acceptance.
PaperID: 776,   Poster  https://arxiv.org/pdf/2603.10470     GitHub
Authors: Hamidreza Dastmalchi, Aijun An, Ali Cheraghian, Hamed Barzamini
Title: Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression
Abstract: While large vision–language models (LVLMs) achieve strong performance on multimodal tasks, they frequently generate hallucinations: unfaithful outputs misaligned with the visual input. To address this issue, we introduce CIPHER (Counterfactual Image Perturbations for Hallucination Extraction and Removal), a training-free method that suppresses vision-induced hallucinations via lightweight feature-level correction. Unlike prior training-free approaches that primarily focus on text-induced hallucinations, CIPHER explicitly targets hallucinations arising from the visual modality. CIPHER operates in two phases. In the offline phase, we construct OHC-25K (Object-Hallucinated Counterfactuals, 25,000 samples), a counterfactual dataset consisting of diffusion-edited images that intentionally contradict the original ground-truth captions. We pair these edited images with the unchanged ground-truth captions and process them through an LVLM to extract hallucination-related representations. Contrasting these representations with those from authentic (image, caption) pairs reveals structured, systematic shifts spanning a low-rank subspace characterizing vision-induced hallucination. In the inference phase, CIPHER suppresses hallucinations by projecting intermediate hidden states away from this subspace. Experiments across multiple benchmarks show that CIPHER significantly reduces hallucination rates while preserving task performance, demonstrating the effectiveness of counterfactual visual perturbations for improving LVLM faithfulness.
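The inference-phase step in the CIPHER abstract above, projecting hidden states away from a low-rank subspace, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the subspace is spanned by a handful of shift directions, orthonormalized here with Gram-Schmidt, and that a scalar `alpha` controls how much of the component is removed. The names `orthonormalize`, `project_away`, `alpha`, and the toy vectors are all hypothetical.

```python
import math

def orthonormalize(directions):
    """Gram-Schmidt: turn direction vectors into an orthonormal basis."""
    basis = []
    for v in directions:
        w = list(v)
        for b in basis:
            coeff = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > 1e-12:  # skip directions already inside the span
            basis.append([wi / norm for wi in w])
    return basis

def project_away(hidden, basis, alpha=1.0):
    """Subtract alpha times the component of `hidden` lying in span(basis).

    `basis` must be orthonormal; alpha=1.0 removes the component entirely.
    """
    out = list(hidden)
    for b in basis:
        coeff = sum(hi * bi for hi, bi in zip(hidden, b))
        out = [oi - alpha * coeff * bi for oi, bi in zip(out, b)]
    return out

# Toy stand-ins for the shifts between counterfactual and authentic pairs.
shift_directions = [[1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
basis = orthonormalize(shift_directions)

h = [0.5, -1.0, 2.0, 3.0]         # a toy hidden state
h_clean = project_away(h, basis)  # no component left along the shift directions
```

With `alpha=1.0` the corrected state is orthogonal to every shift direction, while components outside the subspace (the last coordinate here) are left untouched.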
PaperID: 777,   Poster  https://arxiv.org/pdf/2512.08826     GitHub
Authors: Shahar Sarfaty, Adi Haviv, Uri Y. Hacohen, Niva Elkin-Koren, Roi Livni, Amit H. Bermano
Title: CARLoS: Retrieval via Concise Assessment Representation of LoRAs at Scale
Abstract: The rapid proliferation of generative components, such as LoRAs, has created a vast but unstructured ecosystem. Existing discovery methods depend on unreliable user descriptions or biased popularity metrics, hindering usability. We present CARLoS, a large-scale framework for characterizing LoRAs without requiring additional metadata. Analyzing over 650 LoRAs, we employ them in image generation over a variety of prompts and seeds, as a credible way to assess their behavior. Using CLIP embeddings and their difference to a base-model generation, we concisely define a three-part representation: Directions, defining semantic shift; Strength, quantifying the significance of the effect; and Consistency, quantifying how stable the effect is. Using these representations, we develop an efficient retrieval framework that semantically matches textual queries to relevant LoRAs while filtering overly strong or unstable ones, outperforming textual baselines in automated and human evaluations. While retrieval is our primary focus, the same representation also supports analyses linking Strength and Consistency to legal notions of substantiality and volition, key considerations in copyright, positioning CARLoS as a practical system with broader relevance for LoRA analysis.
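To make the three-part representation in the CARLoS abstract above more concrete, here is one plausible reading as a sketch. The abstract does not give formulas, so the definitions below are assumptions: direction is the normalized mean embedding difference, strength is the mean magnitude of the differences, and consistency is the mean cosine similarity between each difference and the direction. The function `summarize_lora` and the toy embeddings are hypothetical. Real inputs would be CLIP embeddings of paired base-model and LoRA generations for the same prompt and seed.

```python
import math

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def summarize_lora(base_embs, lora_embs):
    """Return (direction, strength, consistency) from paired embeddings."""
    # Per-sample shift caused by the LoRA relative to the base model.
    diffs = [[l - b for l, b in zip(le, be)]
             for le, be in zip(lora_embs, base_embs)]
    dim = len(diffs[0])
    mean = [sum(d[i] for d in diffs) / len(diffs) for i in range(dim)]
    mean_norm = _norm(mean)
    direction = [m / mean_norm for m in mean] if mean_norm > 0 else mean
    # Assumed definition: average size of the shift.
    strength = sum(_norm(d) for d in diffs) / len(diffs)
    # Assumed definition: how well each shift agrees with the mean direction.
    cosines = [sum(di * ui for di, ui in zip(d, direction)) / _norm(d)
               for d in diffs if _norm(d) > 0]
    consistency = sum(cosines) / len(cosines) if cosines else 0.0
    return direction, strength, consistency

# Toy 2-D "embeddings" for two prompt/seed pairs.
base = [[0.0, 0.0], [0.0, 0.0]]
lora = [[1.0, 0.0], [2.0, 0.0]]
direction, strength, consistency = summarize_lora(base, lora)
```

Under these assumed definitions, a retrieval system could match a text query against `direction` and filter out LoRAs whose `strength` is too high or whose `consistency` is too low.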
PaperID: 778,   Poster  https://arxiv.org/pdf/2603.24821     GitHub
Authors: Alabi Mehzabin Anisha, Guangjing Wang, Sriram Chellappan
Title: Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting
Abstract: State-of-the-art crowd counting and localization are primarily modeled using two paradigms: density maps and point regression. Given the field's security ramifications, there is active interest in model robustness against adversarial attacks. Recent studies have demonstrated transferability across density-map-based approaches via adversarial patches, but cross-paradigm attacks (e.g., from density map-based models to point regression-based models) remain unexplored. We introduce a novel adversarial framework that compromises both density map and point regression architectural paradigms through a comprehensive multi-task loss optimization. For point-regression models, we employ scene-density-specific high-confidence logit suppression; for density-map approaches, we use peak-targeted density map suppression. Both are combined with model-agnostic perceptual constraints to ensure that perturbations are effective and imperceptible to the human eye. Extensive experiments demonstrate the effectiveness of our attack, achieving on average a 7× increase in Mean Absolute Error compared to clean images while maintaining competitive visual quality, and successfully transferring across seven state-of-the-art crowd models with transfer ratios ranging from 0.55 to 1.69. Our approach strikes a balance between attack effectiveness and imperceptibility compared to state-of-the-art transferable attack strategies.
PaperID: 779,   Poster  https://arxiv.org/pdf/2603.07918     GitHub
Authors: Yingkai Zhang, Tao Zhang, Jing Nie, Ying Fu
Title: Enhancing Unregistered Hyperspectral Image Super-Resolution via Unmixing-based Abundance Fusion Learning
Abstract: Unregistered hyperspectral image (HSI) super-resolution (SR) typically aims to enhance a low-resolution HSI using an unregistered high-resolution reference image. In this paper, we propose an unmixing-based fusion framework that decouples spatial-spectral information to simultaneously mitigate the impact of unregistered fusion and enhance the learnability of SR models. Specifically, we first utilize singular value decomposition for initial spectral unmixing, preserving the original endmembers while dedicating the subsequent network to enhancing the initial abundance map. To leverage the spatial texture of the unregistered reference, we introduce a coarse-to-fine deformable aggregation module, which first estimates a pixel-level flow and a similarity map using a coarse pyramid predictor. It further performs fine sub-pixel refinement to achieve deformable aggregation of the reference features. The aggregated features are then refined via a series of spatial-channel abundance cross-attention blocks. Furthermore, a spatial-channel modulated fusion module is presented to merge encoder-decoder features using dynamic gating weights, yielding a high-quality, high-resolution HSI. Experimental results on simulated and real datasets confirm that our proposed method achieves state-of-the-art super-resolution performance.
PaperID: 780,   Poster  https://arxiv.org/pdf/2509.24979     GitHub
Authors: Haotian Dong, Wenjing Wang, Chen Li, Jing LYU, Di Lin
Title: Video Generation with Stable Transparency via Shiftable RGB-A Distribution Learner
Abstract: Generating RGBA videos, which include alpha channels for transparency, has wide applications. However, current methods often suffer from low quality due to confusion between RGB and alpha. In this paper, we address this problem by learning shiftable RGB‑A distributions. We adjust both the latent space and noise space, shifting the alpha distribution outward while preserving the RGB distribution, thereby enabling stable transparency generation without compromising RGB quality. Specifically, for the latent space, we propose a transparency‑aware bidirectional diffusion loss during VAE training, which shifts the RGB‑A distribution according to likelihood. For the noise space, we propose shifting the mean of diffusion noise sampling and applying a Gaussian ellipse mask to provide transparency guidance and controllability. Additionally, we construct a high‑quality RGB‑A video dataset. Compared to state‑of‑the‑art methods, our model excels in visual quality, naturalness, transparency rendering, inference convenience, and controllability.
PaperID: 781,   Poster  https://arxiv.org/pdf/2603.01028     GitHub
Authors: Junbo Ke, Yangyang Xu, Chao Wang, You-Wei Wen
Title: Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
Abstract: Implicit Neural Representations (INRs) have emerged as a powerful paradigm for various signal processing tasks, but their inherent spectral bias limits the ability to capture high-frequency details. Existing methods partially mitigate this issue by using Fourier-based features, which usually rely on fixed frequency bases. This forces multi-layer perceptrons (MLPs) to inefficiently compose the required frequencies, thereby constraining their representational capacity. To address this limitation, we propose Content-Aware Frequency Encoding (CAFE), which builds upon Fourier features through multiple parallel linear layers combined via a Hadamard product. CAFE can explicitly and efficiently synthesize a broader range of frequency bases, while the learned weights enable the selection of task-relevant frequencies. Furthermore, we extend this framework to CAFE+, which incorporates Chebyshev features as a complementary component to Fourier bases. This combination provides a stronger and more stable frequency representation. Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of our approach, consistently achieving superior performance over existing methods.
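A small sketch may help show why a Hadamard product of linear layers over Fourier features, as the CAFE abstract above describes, can synthesize frequencies outside the fixed base set. It relies on the product-to-sum identity sin(a)·sin(b) = ½[cos(a − b) − cos(a + b)], so multiplying two sinusoids yields sum and difference frequencies. The code is an illustration under assumptions, not the paper's architecture. The helper names, the hand-picked weights, and the base frequencies 1 and 3 are hypothetical.

```python
import math

def fourier_features(x, freqs):
    """[sin(2*pi*w*x), cos(2*pi*w*x)] for each base frequency w."""
    return [f(2 * math.pi * w * x) for w in freqs for f in (math.sin, math.cos)]

def linear(features, weights):
    """Apply a bias-free linear layer given as a list of weight rows."""
    return [sum(wi * fi for wi, fi in zip(row, features)) for row in weights]

def hadamard_encode(x, freqs, weight_sets):
    """Run parallel linear layers on the same features and multiply elementwise."""
    feats = fourier_features(x, freqs)
    out = None
    for weights in weight_sets:
        branch = linear(feats, weights)
        out = branch if out is None else [o * b for o, b in zip(out, branch)]
    return out

freqs = [1.0, 3.0]                # fixed base frequencies (toy choice)
select_sin1 = [[1, 0, 0, 0]]      # branch 1 picks sin(2*pi*1*x)
select_sin3 = [[0, 0, 1, 0]]      # branch 2 picks sin(2*pi*3*x)

x = 0.13
encoded = hadamard_encode(x, freqs, [select_sin1, select_sin3])[0]
# By the identity above, the product contains frequencies 2 and 4 only.
expected = 0.5 * (math.cos(2 * math.pi * 2 * x) - math.cos(2 * math.pi * 4 * x))
```

Neither frequency 2 nor frequency 4 is in the base set, yet both appear after a single elementwise product. With learned rather than hand-picked weights, the mix of synthesized frequencies would be adjustable during training.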
PaperID: 782,   Poster  https://arxiv.org/pdf/2604.05721     GitHub
Authors: Weiqi Zhang, Junsheng Zhou, Haotian Geng, Kanle Shi, Shenkun Xu, Yi Fang, Yu-Shen Liu
Title: GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance
Abstract: 3D Gaussian Splatting has demonstrated superior performance in rendering efficiency and quality, yet generating 3D Gaussians remains a challenge without proper geometric priors. Existing methods have explored predicting point maps as geometric references for inferring Gaussian primitives, but the unreliable estimated geometries may lead to poor generations. In this work, we introduce GaussianGrow, a novel approach that generates 3D Gaussians by learning to grow them from easily accessible 3D point clouds, naturally enforcing geometric accuracy in Gaussian generation. Specifically, we design a text-guided Gaussian growing scheme that leverages a multi-view diffusion model to synthesize consistent appearances from input point clouds for supervision. To mitigate artifacts caused by fusing neighboring views, we add constraints on novel views generated at non-preset camera poses identified in overlapping regions across different views. To complete the hard-to-observe regions, we propose to iteratively detect the camera pose observing the largest un-grown regions in point clouds and inpaint them by inpainting the rendered view with a pretrained 2D diffusion model. The process continues until complete Gaussians are generated. We extensively evaluate GaussianGrow on text-guided Gaussian generation from synthetic and even real-scanned point clouds.
PaperID: 783,   Poster  https://arxiv.org/pdf/2603.25740     GitHub
Authors: Zehao Wang, Huaide Jiang, Shuaiwu Dong, Yuping Wang, Hang Qiu, Jiachen Li
Title: Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving
Abstract: Human driving behavior is inherently personal, shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users' long-term driving habits and adapts to real-time user instructions. DMW learns a user embedding from our personalized driving dataset collected across multiple real drivers and conditions the policy on this embedding during planning, while natural language instructions provide additional short-term guidance. Closed-loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver’s own style, highlighting personalization as a key capability for human-centered autonomous driving.
PaperID: 784,   Poster  https://arxiv.org/pdf/2603.28091     GitHub
Authors: Alexander Prutsch, Christian Fruhwirth-Reisinger, David Schinagl, Horst Possegger
Title: SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting
Abstract: In dynamic traffic environments, motion forecasting models must be able to accurately estimate future trajectories continuously. Streaming-based methods are a promising solution, but despite recent advances, their performance often degrades when exposed to heterogeneous observation lengths. To address this, we propose a novel streaming-based motion forecasting framework that explicitly focuses on evolving scenes. Our method incrementally processes incoming observation windows and leverages an instance-aware context streaming to maintain and update latent agent representations across inference steps. A dual training objective further enables consistent forecasting accuracy across diverse observation horizons. Extensive experiments on Argoverse 2, nuScenes, and Argoverse 1 demonstrate the robustness of our approach under evolving scene conditions and also on the single-agent benchmarks. Moreover, our model achieves state-of-the-art performance in streaming inference on the Argoverse 2 multi-agent benchmark, while maintaining minimal latency, highlighting its suitability for real-world deployment.
PaperID: 785,   Poster  https://arxiv.org/pdf/2602.20423     GitHub
Authors: Taha Koleilat, Hojat Asgariandehkordi, Omid Nejatimanzari, Berardino Barile, Yiming Xiao, Hassan Rivaz
Title: MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
Abstract: Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision-language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight local reliability of segmentation results. This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation. Code and text prompts will be made publicly available upon acceptance.
PaperID: 786,   Poster  https://arxiv.org/pdf/2511.14582     GitHub
Authors: Keda Tao, Kele Shao, Bohan Yu, Weiqiang Wang, Jian liu, Huan Wang
Title: OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
Abstract: Omnimodal large language models (OmniLLMs) have recently attracted increasing research attention for unified audio-video understanding; however, processing audio-video token sequences creates a significant computational bottleneck. Existing token compression methods have yet to accommodate this emerging need to jointly compress multimodal tokens. To bridge this gap, we present OmniZip, a training-free, audio-guided audio-visual token-compression framework that optimizes multimodal token representation and accelerates inference. Specifically, OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, thereby dynamically guiding video token pruning and preserving cues from audio anchors enhanced by cross-modal similarity. For each time window, OmniZip compresses the video tokens using an interleaved spatio-temporal scheme. Extensive empirical results demonstrate the merits of OmniZip: it achieves 3.42× inference speedup and 1.4× memory reduction over other top-performing counterparts, while maintaining performance with no training.
PaperID: 787,   Poster  https://arxiv.org/pdf/2604.20319     GitHub
Authors: Gui Wang, YongSong Zhou, Kaijun Deng, Wooi Ping Cheah, Rong Qu, Jianfeng Ren, Linlin Shen
Title: SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
Abstract: Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue–Action Alignment, Affordance Mapping, Micro‑Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (Question → Option → Knowledge → Clue → Answer), where the Knowledge field provides essential background context and Clue provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps exist in surgical CoT reasoning; 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT provides a reproducible testbed to narrow the gap between MLLM capabilities and clinical reasoning demands. Code and data will be released.
PaperID: 788,   Poster  https://arxiv.org/pdf/2603.05929     GitHub
Authors: Hongwei Fang, Jiahang Cai, Xun Wang, Wenwu Yang
Title: Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation
Abstract: Vision Transformers (ViTs) have recently achieved state-of-the-art performance in 2D human pose estimation due to their strong global modeling capability. However, existing ViT-based pose estimators are designed for static images and process each frame independently, thereby ignoring the temporal coherence that exists in video sequences. This limitation often results in unstable predictions, especially in challenging scenes involving motion blur, occlusion, or defocus. In this paper, we propose TAR-ViTPose, a novel Temporal Aggregate-and-Restore Vision Transformer tailored for video-based 2D human pose estimation. TAR-ViTPose enhances static ViT representations by aggregating temporal cues across frames in a plug-and-play manner, leading to more robust and accurate pose estimation. To effectively aggregate joint-specific features that are temporally aligned across frames, we introduce a joint-centric temporal aggregation (JTA) that assigns each joint a learnable query token to selectively attend to its corresponding regions from neighboring frames. Furthermore, we develop a global restoring attention (GRA) to restore the aggregated temporal features back into the token sequence of the current frame, enriching its pose representation while fully preserving global context for precise keypoint localization. Extensive experiments demonstrate that TAR-ViTPose substantially improves upon the single-frame baseline ViTPose, achieving a +2.3 mAP gain on the PoseTrack2017 benchmark. Moreover, our approach outperforms existing state-of-the-art video-based methods, while also achieving a noticeably higher runtime frame rate in real-world applications. Source code will be released for research purposes.
PaperID: 789,   Poster  https://arxiv.org/pdf/2603.27455     GitHub
Authors: Ranran Huang, Weixun Luo, Ye Mao, Krystian Mikolajczyk
Title: From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis
Abstract: In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors. Given uncalibrated and unposed multi-view images, NAS3R reconstructs 3D Gaussian primitives from context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision. To ensure stable convergence, NAS3R integrates scene reconstruction and camera estimation within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization. The framework is compatible with state-of-the-art architectures and can incorporate pretrained priors or intrinsic information when available. Extensive experiments show that NAS3R achieves superior results to other self-supervised methods, establishing a scalable and geometry-aware paradigm for 3D learning from unconstrained data.
PaperID: 790,   Poster  https://arxiv.org/pdf/2510.18632     GitHub
Authors: Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Xiang An, Yan Feng, Peng Pei, Xunliang Cai, Ruqi Huang
Title: Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
Abstract: Though recent advances in vision–language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that can effectively exploit the rich geometric information embedded within images while reasoning, like humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning.
PaperID: 791,   Poster  https://arxiv.org/pdf/2509.25934     GitHub
Authors: Yuan Zhao, Youwei Pang, Lihe Zhang, Hanqi Liu, Jiaming Zuo, Huchuan Lu, Xiaoqi Zhao
Title: UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression
Abstract: Existing anomaly detection methods often treat the modality and class as independent factors. Although this paradigm has enriched the development of AD research branches and produced many specialized models, it has also led to fragmented solutions and excessive memory overhead. Moreover, reconstruction-based multi-class approaches typically rely on shared decoding paths, which struggle to handle large variations across domains, resulting in distorted normality boundaries, domain interference, and high false alarm rates. To address these limitations, we propose UniMMAD, a unified framework for multi-modal and multi-class anomaly detection. At the core of UniMMAD is a Mixture-of-Experts (MoE)-driven feature decompression mechanism, which enables adaptive and disentangled reconstruction tailored to specific domains. This process is guided by a "general → specific" paradigm. In the encoding stage, multi-modal inputs of varying combinations are compressed into compact, general-purpose features. The encoder incorporates a feature compression module to suppress latent anomalies, encourage cross-modal interaction, and avoid shortcut learning. In the decoding stage, the general features are decompressed into modality-specific and class-specific forms via a sparsely-gated cross MoE, which dynamically selects expert pathways based on input modality and class. To further improve efficiency, we design a grouped dynamic filtering mechanism and an MoE-in-MoE structure, reducing MoE parameter usage by approximately 75% while maintaining sparse activation and fast inference. UniMMAD achieves state-of-the-art performance on 9 anomaly detection datasets, spanning 3 fields, 12 modalities, and 66 classes. The code will be released upon publication.
PaperID: 792,   Poster  https://arxiv.org/pdf/2603.08309     GitHub
Authors: Yehonatan Elisha, Oren Barkan, Noam Koenigstein
Title: Concept-Guided Fine-tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
Abstract: Vision Transformers (ViTs) often fail under distribution shifts because they learn spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods, typically relying on simple foreground/background masks, overlook the fine-grained semantic concepts that truly define an object (e.g., "long beak" and "wings" for a "bird"). To address this, we introduce a novel finetuning framework that steers model reasoning toward concept-level semantics. Our approach optimizes the model's internal relevance maps (via AttnLRP) to align with spatially-grounded concept masks. These guidance masks are generated automatically and without manual annotation: class-relevant concepts are first proposed using an LLM-driven, label-free method, and then segmented using a Vision-Language Model (GroundingSAM). The finetuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas and preserving classifier confidence via a dedicated loss term. This process requires only a minimal set of images and uses half of the dataset classes. Extensive experiments on five out-of-distribution (OOD) benchmarks show that our method significantly enhances model robustness across multiple ViT-based models and an additional CNN model. Furthermore, we validate that the resulting relevance maps exhibit improved alignment with semantic object parts, providing a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept-guided masks provide more effective guidance for model robustness than conventional segmentation maps, validating our hypothesis.
PaperID: 793,   Poster  https://arxiv.org/pdf/2603.07911     GitHub
Authors: Hui Liu, Kecheng Chen, Jialiang Wang, Xianming Liu, Wenya Wang, Haoliang Li
Title: Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition
Abstract: Vision-Language Models (VLMs), such as CLIP, have significantly advanced zero-shot image recognition. However, their performance remains limited by suboptimal prompt engineering and poor adaptability to target classes. While recent methods attempt to improve prompts through diverse class descriptions, they often rely on heuristic designs, lack versatility, and are vulnerable to outlier prompts. This paper enhances prompts by incorporating class-specific concepts. By treating concepts as latent variables, we rethink zero-shot image classification from a Bayesian perspective, casting prediction as marginalization over the concept space, where each concept is weighted by a prior and a test-image conditioned likelihood. This formulation underscores the importance of both a well-structured concept proposal distribution and the refinement of concept priors. To construct an expressive and efficient proposal distribution, we introduce a multi-stage concept synthesis pipeline driven by LLMs to generate discriminative and compositional concepts, followed by a Determinantal Point Process to enforce diversity. To mitigate the influence of outlier concepts, we propose a training-free, adaptive soft-trim likelihood, which attenuates their impact in a single forward pass. We further provide robustness guarantees and derive multi-class excess risk bounds for our framework. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, validating its effectiveness in zero-shot image classification.
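The marginalization described in the abstract above, where each concept is weighted by a prior and an image-conditioned likelihood, can be sketched in a few lines. This is only one plausible reading. It assumes a uniform prior over classes and treats the likelihoods as already computed, for example from image-text similarity. It does not model the paper's soft-trim likelihood or the Determinantal Point Process. The function `class_posterior`, the concept names, and all numbers are hypothetical.

```python
def class_posterior(priors, likelihoods):
    """Marginalize over each class's concepts and normalize across classes.

    priors[y][c]:      prior weight of concept c under class y (sums to 1 per class)
    likelihoods[y][c]: likelihood of the test image given concept c
    """
    scores = {
        y: sum(priors[y][c] * likelihoods[y][c] for c in priors[y])
        for y in priors
    }
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

# Toy example with made-up concepts and values.
priors = {
    "bird": {"long beak": 0.5, "wings": 0.5},
    "cat": {"whiskers": 1.0},
}
likelihoods = {
    "bird": {"long beak": 0.8, "wings": 0.6},
    "cat": {"whiskers": 0.3},
}
posterior = class_posterior(priors, likelihoods)
```

Here "bird" receives an unnormalized score of 0.5·0.8 + 0.5·0.6 = 0.7 and "cat" receives 0.3, so the normalized scores are 0.7 and 0.3. Refining the concept priors, which the abstract emphasizes, would correspond to changing the `priors` weights in this sketch.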
PaperID: 794,   Poster  https://arxiv.org/pdf/2603.24146     GitHub
Authors: Jaehun Bang, Jinhyeok Kim, Minji Kim, Seungheon Jeong, Kyungdon Joo
Title: LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds
Abstract: Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain impractically slow, memory-intensive, and overly complex due to iterative optimization and dense feature assignments for every Gaussian. To address these limitations, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantics only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. To further streamline inference and ensure semantic consistency, we cluster Gaussians in a single step by linking geometrically and semantically related masks in 3D. In evaluation, we assess our method on diverse benchmarks, including DL3DV-OVS with large and complex indoor-outdoor scenes. As a result, LightSplat achieves state-of-the-art performance with up to 50× faster speed and 64× lower memory, offering a scalable foundation for real-time language-driven 3D understanding.
PaperID: 795,   Poster  https://arxiv.org/pdf/2603.22042     GitHub
Authors: Hayeon Kim, Ji Ha Jang, Junghun James Kim, Se Young Chun
Title: Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
Abstract: While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., whole scene and its part images) through entailment. However, existing approaches do not model that each part has a different level of semantic representativeness to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized with an entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks.
PaperID: 796,   Poster  https://arxiv.org/pdf/2509.21953     GitHub
Authors: Tao Wu, Yibo Jiang, Yehao Lu, Zhizhong Wang, Zeyi Huang, Zequn Qin, Xi Li
Title: MultiCrafter: High-Fidelity Multi-Subject Generation via Disentangled Attention and Identity-Aware Preference Alignment
Abstract: Multi-subject image generation aims to synthesize user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. Existing In-Context-Learning based methods are limited by their highly coupled training paradigm. These methods attempt to achieve both high subject fidelity and multi-dimensional human preference alignment within a single training stage, relying on a single, indirect reconstruction loss, which makes it difficult to satisfy both goals simultaneously. To address this, we propose MultiCrafter, a framework that decouples this task into two distinct training stages. First, in a pre-training stage, we introduce an explicit positional supervision mechanism that effectively resolves attention bleeding and drastically enhances subject fidelity. Second, in a post-training stage, we propose Identity-Preserving Preference Optimization, a novel online reinforcement learning framework. We feature a scoring mechanism to accurately assess multi-subject fidelity based on the Hungarian matching algorithm, which allows the model to optimize for aesthetics and prompt alignment while preserving the subject fidelity achieved in the first stage. Experiments validate that our decoupling framework significantly improves subject fidelity while better aligning with human preferences.
PaperID: 797,   Poster  https://arxiv.org/pdf/2603.10990     GitHub
Authors: Zhengyao Fang, Zexi Jia, Yijia Zhong, Pengcheng Luo, Jinchao Zhang, Guangming Lu, Jun Yu, Wenjie Pei
Title: Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity
Abstract: Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which often makes generations too vivid to be real even when prompted for realistic-style images. To address this issue, we present Color Fidelity Dataset (CFD) and Color Fidelity Metric (CFM) for objective evaluation of color fidelity in realistic-style generations. CFD contains over 1.3M real and synthetic images with ordered levels of color realism, while CFM employs a multimodal encoder to learn perceptual color fidelity. In addition, we propose a training-free Color Fidelity Refinement (CFR) that adaptively modulates spatial–temporal guidance scale in generation, thereby enhancing color authenticity. Together, CFD supports CFM for assessment, whose learned attention further guides CFR to refine T2I fidelity, forming a progressive framework for assessing and improving color fidelity in realistic-style T2I generation. All datasets and code will be publicly released.
PaperID: 798,   Poster  https://arxiv.org/pdf/2603.20818     GitHub
Authors: Hanqiao Ye, Yuzhou Liu, Yangdong Liu, Shuhan Shen
Title: PlanaReLoc: Camera Relocalization in 3D Planar Primitives via Region-based Structure Matching
Abstract: While structure-based relocalizers have long strived for point correspondences when establishing or regressing query-map associations, in this paper, we pioneer the use of planar primitives and planar 3D maps for lightweight 6-DoF camera relocalization in structured environments. Planar primitives, beyond being fundamental entities in projective geometry, also serve as region-based representations that encapsulate both structural and semantic richness. This motivates us to introduce PlanaReLoc, a streamlined "plane-centric" paradigm where a deep matcher associates planar primitives across the query image and the map within a learned unified embedding space, after which the 6-DoF pose is solved and refined under a robust framework. Through extensive experiments on the ScanNet and 12Scenes datasets across hundreds of scenes, our method demonstrates the superiority of planar primitives in facilitating reliable cross-modality structural correspondences and achieving effective camera relocalization without requiring realistically textured/colored maps, pose priors, or per-scene training. The code and data will be released.
PaperID: 799,   Poster  https://arxiv.org/pdf/2602.22654     GitHub
Authors: Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang, Hao Jiang, Zhengzheng Jin, Xu Liu, Pipei Huang
Title: Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache
Abstract: Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by +0.031 ImageReward at 4.87× speedup and even surpassing the full-step baseline by +0.028 ImageReward at 3.54× speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code is provided in supplementary materials and will be released upon acceptance, with support for mainstream models.
PaperID: 800,   Poster  https://arxiv.org/pdf/2603.07789     GitHub
Authors: Zixuan Pan, Kaiyuan Tang, Jun Xia, Yifan Qin, Lin Gu, Chaoli Wang, Jianxu Chen, Yiyu Shi
Title: SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation
Abstract: 2D Gaussian Splatting has emerged as a novel image representation technique that can support efficient rendering on low-end devices. However, scaling to high-resolution images requires optimizing and storing millions of unstructured Gaussian primitives independently, leading to slow convergence and redundant parameters. To address this, we propose Structured Gaussian Image (SGI), a compact and efficient framework for representing high-resolution images. SGI decomposes a complex image into multi-scale local spaces defined by a set of seeds. Each seed corresponds to a spatially coherent region and, together with lightweight multi-layer perceptrons (MLPs), generates structured implicit 2D neural Gaussians. This seed-based formulation imposes structural regularity on otherwise unstructured Gaussian primitives, which facilitates entropy-based compression at the seed level to reduce the total storage. However, optimizing seed parameters directly on high-resolution images is a challenging and non-trivial task. Therefore, we designed a multi-scale fitting strategy that refines the seed representation in a coarse-to-fine manner, substantially accelerating convergence. Quantitative and qualitative evaluations demonstrate that SGI achieves up to 7.5× compression over prior non-quantized 2D Gaussian methods and 1.6× over quantized ones, while also delivering 1.6× and 6.5× faster optimization, respectively, without degrading, and often improving, image fidelity. Uploaded code will be released upon acceptance.
PaperID: 801,   Poster  https://arxiv.org/pdf/2603.00431     GitHub
Authors: Hulingxiao He, Zhi Tan, Yuxin Peng
Title: Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models
Abstract: A high-performing, general-purpose visual understanding model should map visual inputs to a taxonomic tree of labels and identify novel categories beyond the training set for which few or no publicly available images exist. Large Multimodal Models (LMMs) have achieved remarkable progress in fine-grained visual recognition (FGVR) for known categories. However, they remain limited in hierarchical visual recognition (HVR) that aims at predicting consistent label paths from coarse to fine categories, especially for novel categories. To tackle these challenges, we propose Taxonomy-Aware Representation Alignment (TARA), a simple yet effective strategy to inject taxonomic knowledge into LMMs. TARA leverages representations from biology foundation models (BFMs) that encode rich biological relationships through hierarchical contrastive learning. By aligning the intermediate representations of visual features with those of BFMs, LMMs are encouraged to extract discriminative visual cues well structured in the taxonomy tree. Additionally, we align the representations of the first answer token with the ground-truth label, flexibly bridging the gap between contextualized visual features and categories of varying granularity according to user intent. Experiments demonstrate that TARA consistently enhances LMMs’ hierarchical consistency and leaf node accuracy, enabling reliable recognition of both known and novel categories within complex biological taxonomies.
PaperID: 802,   Poster  https://arxiv.org/pdf/2603.23324     GitHub
Authors: Chuanqing Zhuang, Xin Lu, Zehui Deng, Zhengda Lu, Yiqun Wang, Junqi Diao, Jun Xiao
Title: Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors
Abstract: Omnidirectional 3D Gaussian Splatting with panoramas is a key technique for 3D scene representation, and existing methods typically rely on slow SfM to provide camera poses and sparse point priors. In this work, we propose a pose-free omnidirectional 3DGS method, named PFGS360, that reconstructs 3D Gaussians from unposed omnidirectional videos. To achieve accurate camera pose estimation, we first construct a spherical consistency-aware pose estimation module, which recovers poses by establishing consistent 2D–3D correspondences between the reconstructed Gaussians and the unposed images using Gaussians' internal depth priors. In addition, to enhance the fidelity of novel view synthesis, we introduce a depth-inlier-aware densification module to extract depth inliers and Gaussian outliers with consistent monocular depth priors, enabling efficient Gaussian densification and achieving photorealistic novel view synthesis. Experiments show that our method significantly outperforms existing pose-free and pose-aware 3DGS methods on both real-world and synthetic 360-degree videos.
PaperID: 803,   Poster  https://arxiv.org/pdf/2602.06959     GitHub
Authors: Kaiyi Huang, Yukun Huang, Yu Li, Jianhong Bai, Xintao Wang, Zinan Lin, Xuefei Ning, Jiwen Yu, Yu Wang, Xihui Liu
Title: CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation
Abstract: Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring a dynamic subject while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: by encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model by additional context concatenation, enabling camera-controlled video synthesis with consistent scenes and dynamic subjects. To further enhance the model's robustness, we introduce a simple yet effective random-shuffling strategy for the input scene images during training. To address the lack of training data, we construct a scene-decoupled dataset with Unreal Engine 5, containing paired videos of scenes with and without dynamic subjects, panoramic images representing the underlying static scene, along with their camera trajectories. Experiments show that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handling large camera movements and demonstrating generalization across diverse environments.
PaperID: 804,   Poster  https://arxiv.org/pdf/2603.18461     GitHub
Authors: Kazuya Nishimura, Ryoma Bise, Shinnosuke Matsuo, Haruka Hirose, Yasuhiro Kojima
Title: Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images
Abstract: Estimating slide- and patch-level gene expression profiles from pathology images enables rapid and low-cost molecular analysis with broad clinical impact. Despite strong results, existing approaches treat gene expression as a mere slide- or spot-level signal and do not incorporate the fact that the measured expression arises from the aggregation of underlying cell-level expression. To explicitly introduce this missing cell-resolved guidance, we propose a Cell-type Prototype-informed Neural Network (CPNN) that leverages publicly available single-cell RNA-sequencing datasets. Since single-cell measurements are noisy and not paired with histology images, we first estimate cell-type prototypes, i.e., mean expression profiles that capture stable gene–gene co-variation patterns. CPNN then learns cell-type compositional weights directly from images and models the relationship between prototypes and observed bulk or spatial expression, providing a biologically grounded and structurally regularized prediction framework. We evaluate CPNN on three slide-level datasets and three patch-level spatial transcriptomics datasets. Across all settings, CPNN achieves the highest performance in terms of Spearman correlation. Moreover, by visualizing the inferred compositional weights, our framework provides interpretable insights into which cell types drive the predicted expression. The code will be made publicly available upon acceptance.
PaperID: 805,   Poster  https://arxiv.org/pdf/2512.13495     GitHub
Authors: Jiangning Zhang, junwei zhu, Zhenye Gan, Donghao Luo, Chuming Lin, FeiFan Xu, Xu Peng, Jianlong Hu, Yuansen Liu, Yijia Hong, Weijian Cao, Han Feng, Xu Chen, Chencan Fu, Keke He, Xiaobin Hu, Chengjie Wang
Title: Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation
Abstract: We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed Soul, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE are used to optimize inference efficiency, achieving an 11.4× speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models on video quality, video–text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production.
PaperID: 806,   Poster  https://arxiv.org/pdf/2603.24484     GitHub
Authors: SIQI LIU, Xinyang Li, Bochao Zou, Junbao Zhuo, Huimin Ma, Jiansheng Chen
Title: Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models
Abstract: As large language models (LLMs) continue to advance, there is increasing interest in their ability to infer human mental states and demonstrate a human-like Theory of Mind (ToM). Most existing ToM evaluations, however, are centered on text-based inputs, while scenarios relying solely on visual information receive far less attention. This leaves a gap, since real-world human–AI interaction typically requires multimodal understanding. In addition, many current methods regard the model as a black box and rarely probe how its internal attention behaves in multiple-choice question answering (QA). The impact of LLM hallucinations on such tasks is also underexplored from an interpretability perspective. To address these issues, we introduce VisionToM, a vision-oriented intervention framework designed to strengthen task-aware reasoning. The core idea is to compute intervention vectors that align visual representations with the correct semantic targets, thereby steering the model's attention through different layers of visual features. This guidance reduces the model's reliance on spurious linguistic priors, leading to more reliable multimodal large language model (MLLM) outputs and better QA performance. Experiments on the EgoToM benchmark (an egocentric, real-world video dataset for ToM with three multiple-choice QA settings) demonstrate that our method substantially improves the ToM abilities of MLLMs. Furthermore, results on an additional open-ended generation task show that VisionToM enables MLLMs to produce free-form explanations that more accurately capture agents' mental states, pushing machine–human collaboration toward greater alignment.
PaperID: 807,   Poster  https://arxiv.org/pdf/2604.06824     GitHub
Authors: Subin Park, Jung Uk Kim
Title: Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning
Abstract: Audio–Visual Sound Source Localization (SSL) aims to identify the locations of sound-emitting objects by leveraging correlations between audio and visual modalities. Existing SSL methods often rely on contrastive learning–based feature matching but lack explicit reasoning and verification stages, limiting their effectiveness in complex acoustic scenes. Inspired by human metacognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies audio–visual consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source (VGGSound-Single, MUSIC-Solo) and multi-source (VGGSound-Duet, MUSIC-Duet) benchmarks demonstrate competitive performance. The source code will be publicly available.
PaperID: 808,   Poster  https://arxiv.org/pdf/2603.04975     GitHub
Authors: Zishu Yao, Xiang-Xiang Su, Shengning Zhou, Guang-Yong Chen, Guodong Fan, Xing Chen
Title: BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement
Abstract: Event cameras, with their high dynamic range, show great promise for Low-light Image Enhancement (LLIE). Existing works primarily focus on designing effective modal fusion strategies. However, a key challenge is the dual degradation from intrinsic background activity (BA) noise in events and low signal-to-noise ratio (SNR) in images, which causes severe noise coupling during modal fusion, creating a critical performance bottleneck. We therefore posit that precise event denoising is the prerequisite to unlocking the full potential of event-based fusion. To this end, we propose BiEvLight, a hierarchical and task-aware framework that collaboratively optimizes enhancement and denoising by exploiting their intrinsic interdependence. Specifically, BiEvLight exploits the strong gradient correlation between images and events to build a gradient-guided event denoising prior that alleviates insufficient denoising in heavily noisy regions. Moreover, instead of treating event denoising as a static pre-processing stage (which inevitably incurs a trade-off between over- and under-denoising and cannot adapt to the requirements of a specific enhancement objective), we recast it as a bilevel optimization problem constrained by the enhancement task. Through cross-task interaction, the upper-level denoising problem learns event representations tailored to the lower-level enhancement objective, thereby substantially improving overall enhancement quality. Extensive experiments on the Real-world noise Dataset SED demonstrate that our method significantly outperforms state-of-the-art (SOTA) approaches, with average improvements of 1.30dB in PSNR, 2.03dB in PSNR and 0.047 in SSIM, respectively.
PaperID: 809,   Poster  https://arxiv.org/pdf/2602.21760     GitHub
Authors: Euisoo Jung, Byunghyun Kim, Hyunjin Kim, Seonghye Cho, Jae-Gil Lee
Title: Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling
Abstract: Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Nevertheless, current diffusion acceleration methods based on distributed parallelism suffer from noticeable generation artifacts and fail to achieve substantial acceleration proportional to the number of GPUs. Therefore, we propose a hybrid parallelism framework that combines a novel data parallel strategy, condition-based partitioning, with an optimal pipeline scheduling method, adaptive parallelism switching, to reduce generation latency and achieve high generation quality in conditional diffusion models. The key ideas are to (i) leverage the conditional and unconditional denoising paths as a new data-partitioning perspective and (ii) adaptively enable optimal pipeline parallelism according to the denoising discrepancy between these two paths. Our framework achieves 2.31× and 2.07× latency reductions on SDXL and SD3, respectively, using two NVIDIA RTX 3090 GPUs, while preserving image quality. This result confirms the generality of our approach across U-Net-based diffusion models and DiT-based flow-matching architectures. Our approach also outperforms existing methods in acceleration under high-resolution synthesis settings. Code is available at https://anonymous.4open.science/r/hybrid-diffusion/.
PaperID: 810,   Poster  https://arxiv.org/pdf/2506.20380     GitHub
Authors: Zhengpeng Feng, Clement Atzberger, Sadiq Jaffer, Jovana Knezevic, Silja Sormunen, Robin Young, Madeline Lisaius, Markus Immitzer, Toby Jackson, James Ball, David Coomes, Anil Madhavapeddy, Andrew Blake, Srinivasan Keshav
Title: TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis
Abstract: Satellite Earth-observation (EO) time series in the optical and microwave ranges are often irregular due to orbital patterns and cloud obstruction, and while compositing addresses these issues, it loses critical phenological information. To overcome this, we present TESSERA, a pixel-wise foundation model for multi-modal (Sentinel-1/2) EO time series that learns robust, label-efficient embeddings. During training, TESSERA uses Barlow Twins and sparse random temporal sampling to enforce invariance to the selection of valid observations, aided by two key regularizers: global shuffling to decorrelate spatial neighborhoods and mix-based regulation for invariance under extreme sparsity. We find that for diverse classification, segmentation, and regression tasks, TESSERA embeddings deliver state-of-the-art accuracy with high label efficiency, often requiring only a small task head and minimal computation. To democratize access, adhere to FAIR principles, and simplify use, we release global, annual, 10m, pixel-wise int8 embeddings together with open weights/code and lightweight adaptation heads, providing practical tooling for large-scale retrieval and inference at planetary scale.
PaperID: 811,   Poster  https://arxiv.org/pdf/2510.26865     GitHub
Authors: Fenfen Lin, Yesheng Liu, Haiyu Xu, Chen Yue, Zheqi He, Mingxuan Zhao, Miguel Chen, Jin-Ge Yao, Xi Yang
Title: Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
Abstract: Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs) as we find in preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measurements, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation on popular proprietary and open-weight VLMs shows that even the strongest frontier VLMs struggle with measurement reading in general. We have also conducted preliminary experiments with reinforcement finetuning (RFT) over synthetic data, and find a significant improvement on both the in-domain synthetic subset and real-world images. Our analysis highlights a fundamental limitation of current VLMs in fine-grained spatial grounding. We hope this resource and our code releases can help future advances on visually grounded numeracy and precise spatial perception of VLMs, bridging the gap between recognizing numbers and measuring the world.
PaperID: 812,   Poster  https://arxiv.org/pdf/2512.09928     GitHub
Authors: Minghui Lin, Pengxiang Ding, Shu Wang, Zifeng Zhuang, Yang Liu, Xinyang Tong, Wenxuan Song, Shangke Lyu, Siteng Huang, Donglin Wang
Title: HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action models
Abstract: Vision-Language-Action (VLA) models have recently enabled robotic manipulation by grounding visual and linguistic cues into actions. However, most VLAs assume the Markov property, relying only on the current observation and thus suffering from temporal myopia that degrades long-horizon coherence. Existing attempts to incorporate history by stacking frames are computationally expensive and redundant. We argue that motion provides a more compact and informative representation of temporal context, capturing inter-state dynamics while filtering static noise. Building on this idea, we propose HiF-VLA (Hindsight, Insight, and Foresight for VLAs), a unified framework that leverages motion for bidirectional temporal reasoning. HiF-VLA encodes past dynamics through hindsight priors, anticipates future motion via foresight reasoning, and integrates both through a hindsight-modulated joint expert to enable “think-while-acting” control. Extensive experiments show that HiF-VLA improves performance from 94.0% to 96.4% on LIBERO-Long and from 4.10 to 4.35 on CALVIN ABC-D, surpassing strong baselines. Furthermore, HiF-VLA achieves substantial improvements in real-world long-horizon manipulation tasks, demonstrating its broad effectiveness.
PaperID: 813,   Poster  https://arxiv.org/pdf/2506.13723     GitHub
Authors: Zhanxuan Hu, Xu Qiyu, Yu Duan, Yonghang Tai, Huafeng Li
Title: SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models
Abstract: Foundation models have attracted widespread attention across domains due to their powerful zero-shot classification capabilities. This work is motivated by two key observations: (1) Vision-Language Models (VLMs), such as CLIP, often over-rely on class-level textual priors and struggle to capture fine-grained visual cues, whereas Vision-only Foundation Models (VFMs), such as DINO, provide rich and discriminative visual features but lack semantic alignment; (2) the performance of different VLMs varies considerably across datasets owing to differences in pre-training. To address these challenges, we propose SOTA (Self-adaptive Optimal TrAnsport), a training-free ensemble framework that integrates the outputs of multiple foundation models (VFMs or VLMs) by learning a self-adaptive transport plan. Notably, SOTA requires no hyperparameter tuning and automatically balances model contributions. Extensive experiments across diverse domains, including natural images, medical pathology, and remote sensing, validate the generalizability of SOTA. The results consistently show that it effectively leverages the complementary strengths of different foundation models and achieves substantial improvements over individual models. All code is provided in the supplementary materials and will be released upon the acceptance of this paper.
PaperID: 814,   Poster  https://arxiv.org/pdf/2603.28224     GitHub
Authors: Kazuma Ikeda, Ryosei Hara, Rokuto Nagata, Ozora Sako, Zihao Ding, Takahiro Kado, Ibuki Fujioka, Taro Beppu, Mariko Isogawa, Kentaro Yoshioka
Title: Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal
Abstract: LiDAR has become an essential sensing modality in autonomous driving, robotics, and smart-city applications. However, ghost points (or ghosts), which are false reflections caused by multi-path laser returns from glass and reflective surfaces, severely degrade 3D mapping and localization accuracy. Prior ghost removal methods rely on geometric consistency in dense point clouds, failing on mobile LiDAR's sparse, dynamic data. We address this by exploiting full-waveform LiDAR (FWL), which captures complete temporal intensity profiles rather than just peak distances, providing crucial cues for distinguishing ghosts from genuine reflections in mobile scenarios. As this is a new task, we present Ghost-FWL, the first and largest annotated mobile FWL dataset for ghost detection and removal. Ghost-FWL comprises 24K frames across 10 diverse scenes with 7.5 billion peak-level annotations, which is 100× larger than existing annotated FWL datasets. Experiments show that our baseline outperforms existing methods in ghost removal accuracy, and our ghost removal further enhances downstream tasks such as LiDAR-based SLAM (66% trajectory error reduction) and 3D object detection (50× false positive reduction).
PaperID: 815,   Poster  https://arxiv.org/pdf/2509.18154     GitHub
Authors: Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Ranchi Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning Ding, Xu Han, Yuan Yao, Zhiyuan Liu, Maosong Sun
Title: MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Abstract: Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency has emerged as a core bottleneck in making MLLMs more accessible and scalable. To address these challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method: a unified 3D-Resampler model architecture for highly compact encoding over images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results in OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest, and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, the strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B size, using just 46.7% GPU memory cost and 8.7% inference time of Qwen2.5-VL 7B.
PaperID: 816,   Poster  https://arxiv.org/pdf/2602.19248     GitHub
Authors: Zunkai Dai, Ke Li, JIAJIA LIU, Jie Yang, Yuanyuan Qiao
Title: No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection
Abstract: The collection and detection of video anomaly data has long been a challenging problem due to its rare occurrence and spatiotemporal scarcity. Existing video anomaly detection (VAD) methods underperform in open-world scenarios. Key contributing factors include limited dataset diversity and inadequate understanding of context-dependent anomalous semantics. To address these issues, i) we propose LAVIDA, an end-to-end zero-shot video anomaly detection framework. ii) LAVIDA employs an Anomaly Exposure Sampler that transforms segmented objects into pseudo-anomalies to enhance model adaptability to unseen anomaly categories. It further integrates a Multimodal Large Language Model (MLLM) to bolster semantic comprehension capabilities. Additionally, iii) we design a token compression approach based on reverse attention to handle the spatiotemporal scarcity of anomalous patterns and decrease computational cost. The training process is conducted solely on pseudo-anomalies without any VAD data. Evaluations across four benchmark VAD datasets demonstrate that LAVIDA achieves SOTA performance in both frame-level and pixel-level anomaly detection under the zero-shot setting. Our code is available in the supplementary materials.
PaperID: 817,   Poster  https://arxiv.org/pdf/2509.00789     GitHub
Authors: Pei Liu, Qingtian Ning, Xinyan Lu, Haipeng LIU, Weiliang Ma, Dangen She, XianPeng Lang, Jun Ma
Title: CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving
Abstract: The pursuit of autonomous agents with predictive cognitive world models is hindered by a fundamental flaw in current vision-language models (VLMs): they lack cognitive inertia. Operating on isolated snapshots, these models cannot form a temporally coherent world view, leading to erratic decision jitter and a failure to execute complex, multi-step maneuvers. To remedy this, we introduce CogDriver, a framework designed to build a coherent world model by instilling this crucial cognitive property. Our work makes two key contributions: (1) We present CogDriver-Data, a large-scale vision-language-action dataset whose narrative annotations provide the supervisory signal for learning the temporal dynamics of a world model. (2) We develop the CogDriver-Agent, an architecture featuring a sparse temporal memory to maintain a stable internal state, the foundation of a world model. This is enabled by a spatiotemporal knowledge distillation approach that explicitly teaches decision coherence. Comprehensive experiments validate our paradigm: CogDriver-Agent achieves a 22% increase in the closed-loop Driving Score on Bench2Drive and a 21% reduction in mean L2 error on nuScenes, establishing a new state-of-the-art. These significant gains in both long-term decision-making and imitation accuracy provide strong evidence that our agent is developing a more stable internal world model.
PaperID: 818,   Poster  https://arxiv.org/pdf/2604.04348     GitHub
Authors: Weiguo Pian, Saksham Singh Kushwaha, Zhimin Chen, Shijian Deng, Kai Wang, Yunhui Guo, Yapeng Tian
Title: OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
Abstract: In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. While recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sound, they are limited to non-speech sounds, lacking the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching–based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech–environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Our data, code, and pre-trained models will be released.
PaperID: 819,   Poster  https://arxiv.org/pdf/2512.05076     GitHub
Authors: Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, Gordon Wetzstein
Title: BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
Abstract: Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional embedding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability.
PaperID: 820,   Poster  https://arxiv.org/pdf/2602.20901     GitHub
Authors: Yuechen Xie, Xiaoyan Zhang, Yicheng Shan, Hao Zhu, Rui Tang, Rong Wei, Mingli Song, Yuanyu Wan, Jie Song
Title: SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models
Abstract: Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which requires understanding not only the spatial relationships among objects in complex scenes, but also the logical dependencies between steps in multi-step tasks. To bridge this gap, we introduce Spatial Logical Question Answering (SpatiaLQA), a benchmark designed to evaluate the spatial logical reasoning capabilities of VLMs. SpatiaLQA consists of 9,605 question-answer pairs derived from 241 real-world indoor scenes. We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. To address this issue, we propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs, thereby enhancing the spatial logical reasoning ability of VLMs and outperforming all previous methods. We will release our code and dataset soon.
PaperID: 821,   Poster  https://arxiv.org/pdf/2604.09989     GitHub
Authors: yuchen zou, Huikai Shao, Lihuang Fang, Zhipeng Xiong, Dexing Zhong
Title: FlowPalm: Optical Flow Driven Non-Rigid Deformation for Geometrically Diverse Palmprint Generation
Abstract: Recently, synthetic palmprints have been increasingly used as substitutes for real data to train recognition models. To be effective, such synthetic data must reflect the diversity of real palmprints, including both style variation and geometric variation. However, existing palmprint generation methods mainly focus on style translation, while geometric variation is either ignored or approximated by simple handcrafted augmentations. In this work, we propose FlowPalm, an optical-flow-driven palmprint generation framework capable of simulating the complex non-rigid deformations observed in real palms. Specifically, FlowPalm estimates optical flows between real palmprint pairs to capture the statistical patterns of geometric deformations. Building on these priors, we design a progressive sampling process that gradually introduces the geometric deformations during diffusion while maintaining identity consistency. Extensive experiments on six benchmark datasets demonstrate that FlowPalm significantly outperforms state-of-the-art palmprint generation approaches in downstream recognition tasks. Notably, FlowPalm achieves a higher TAR at FAR=1e-4 than the best generative model does at FAR=1e-3.
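For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a true accept rate (TAR) at a fixed false accept rate (FAR) is commonly computed from verification scores. It is not taken from the paper: the function name `tar_at_far`, the score lists, and the strict-threshold convention are all assumptions made for illustration.

```python
def tar_at_far(genuine_scores, impostor_scores, far):
    """Fraction of genuine pairs accepted when at most `far` of the
    impostor pairs are accepted (higher score = more similar).

    Hypothetical helper for illustration; real evaluation protocols
    may handle ties and interpolation differently.
    """
    ranked = sorted(impostor_scores, reverse=True)
    # Number of impostor pairs we are allowed to accept at this FAR.
    allowed = int(far * len(ranked))
    # Accept only scores strictly above the (allowed + 1)-th highest
    # impostor score, so at most `allowed` impostors pass.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for s in genuine_scores if s > threshold)
    return accepted / len(genuine_scores)


# Toy scores, invented purely to exercise the function.
impostor = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
genuine = [0.92, 0.85, 0.99, 0.5]
tar_loose = tar_at_far(genuine, impostor, far=0.1)
tar_strict = tar_at_far(genuine, impostor, far=0.0)
```

A lower FAR forces a higher threshold, so TAR can only stay the same or fall; this is why a higher TAR at FAR=1e-4 than a competitor reaches at FAR=1e-3 is a strong result.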
PaperID: 822,   Poster  https://arxiv.org/pdf/2602.21877     GitHub
Authors: Francesco Laiti, Davide Talon, Jacopo Staiano, Elisa Ricci
Title: How to Take a Memorable Picture? Empowering Users with Actionable Feedback
Abstract: Image memorability, i.e., how likely an image is to be remembered, has traditionally been studied in computer vision either as a passive prediction task, with models regressing a scalar score, or with generative methods altering the visual input to boost the image's likelihood of being remembered. Yet, none of these paradigms supports users at capture time, when the crucial question is how to improve a photo's memorability. We introduce the task of Memorability Feedback (MemFeed), where an automated model should provide actionable, human-interpretable guidance to users with the goal of enhancing an image's future recall. We also present MemCoach, the first approach designed to provide concrete suggestions in natural language for memorability improvement (e.g., “emphasize facial expression,” “bring the subject forward”). Our method, based on Multimodal Large Language Models (MLLMs), is training-free and employs a teacher-student steering strategy, aligning the model's internal activations toward more memorable patterns learned from a teacher model progressing along least-to-most memorable samples. To enable systematic evaluation on this novel task, we further introduce MemBench, a new benchmark featuring sequence-aligned photoshoots with annotated memorability scores. Our experiments, considering multiple MLLMs, demonstrate the effectiveness of MemCoach, showing consistently improved performance over several zero-shot models. The results indicate that memorability can not only be predicted but also taught and instructed, shifting the focus from mere prediction to actionable feedback for human creators. Dataset and code will be publicly released upon publication.
PaperID: 823,   Poster  https://arxiv.org/pdf/2506.17629     GitHub
Authors: Kailing Li, Qi'ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao, Xiaoling Wang
Title: CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
Abstract: Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS. It is a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code will be available after review.
PaperID: 824,   Poster  https://arxiv.org/pdf/2604.15756     GitHub
Authors: Jinlun Ye, Jiang Liao, Runhe Lai, Xinhua Lu, Jia-Xin ZHUANG, Zhiyong Gan, Ruixuan Wang
Title: TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models
Abstract: Vision-language models (VLMs) such as CLIP exhibit strong out-of-distribution (OOD) detection capabilities by aligning visual and textual representations. Recent CLIP-based test-time adaptation methods further improve detection performance by incorporating external OOD labels. However, such labels are finite and fixed, while the real OOD semantic space is inherently open-ended. Consequently, fixed labels fail to represent the diverse and evolving OOD semantics encountered in test streams. To address this limitation, we introduce Test-time Textual Learning (TTL), a framework that dynamically learns OOD textual semantics from unlabeled test streams, without relying on external OOD labels. TTL updates learnable prompts using pseudo-labeled test samples to capture emerging OOD knowledge. To suppress noise introduced by pseudo-labels, we introduce an OOD knowledge purification strategy that selects reliable OOD samples for adaptation. In addition, TTL maintains an OOD Textual Knowledge Bank that stores high-quality textual features, providing stable score calibration across batches. Extensive experiments on two standard benchmarks with nine OOD datasets demonstrate that TTL consistently achieves state-of-the-art performance, highlighting the value of textual adaptation for robust test-time OOD detection. Code will be released publicly upon acceptance.
PaperID: 825,   Poster  https://arxiv.org/pdf/2511.21136     GitHub
Authors: Changlin Li, Jiawei Zhang, Shuhao Liu, Sihao Lin, Zeyi Shi, Zhihui Li, Xiaojun Chang
Title: Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning
Abstract: Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption associated with training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components on the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that increases computational complexity during training by measuring the convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model performance. Extensive experiments across three datasets demonstrate the effectiveness of Ent-Prog, achieving up to 2.2× training speedup and 2.4× GPU memory reduction without compromising generative performance.
PaperID: 826,   Poster  https://arxiv.org/pdf/2604.20350     GitHub
Authors: Gui Wang, Zehao Zhong, YongSong Zhou, Yudong Li, Ende Wu, Wooi Ping Cheah, Rong Qu, Jianfeng Ren, Linlin Shen
Title: X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
Abstract: Despite significant progress in Multimodal Large Language Models (MLLMs), their clinical reasoning capacity in complex multi-modal diagnostic scenarios remains largely unexamined. Current benchmarks, predominantly limited to single-modality data, lack the capacity to evaluate progressive reasoning and cross-modal integration essential for clinical practice. To bridge this gap, we introduce the Cross-Modality Progressive Clinical Reasoning (X-PCR) benchmark, the first comprehensive evaluation framework for MLLMs spanning the complete ophthalmology diagnostic workflow. X-PCR incorporates two core reasoning tasks: 1) a six-stage progressive reasoning chain spanning image quality assessment to clinical decision-making, and 2) a cross-modality reasoning task integrating six ophthalmic imaging modalities. The benchmark comprises 26,415 images and 177,868 expert-verified VQA pairs curated from 51 public datasets, covering 52 ophthalmic diseases. Our evaluation of 21 leading MLLMs reveals critical gaps in progressive reasoning and cross-modal integration. X-PCR establishes a unified benchmark to advance MLLMs from task-specific performance to comprehensive diagnostic capability through aligned multi-modal clinical data. Dataset and code will be publicly released.
PaperID: 827,   Poster  https://arxiv.org/pdf/2509.15695     GitHub
Authors: Zhaoyang Li, Zhan Ling, Yuchen Zhou, Litian Gong, Erdem Biyik, Hao Su
Title: ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models
Abstract: Large Vision-Language Models (LVLMs) excel at captioning, visual question answering, and robotics by combining vision and language, yet they often miss obvious objects or hallucinate nonexistent ones in atypical scenes. We examine these failures through the lens of uncertainty, focusing on contextual incongruity, where objects appear unexpectedly or fail to appear in expected contexts, and show that such cases increase recognition difficulty for state-of-the-art LVLMs. To study this regime, we introduce the Object Recognition in Incongruous Context (ORIC) framework, which constructs incongruous object-context pairs through two complementary strategies: (1) LLM-guided sampling to identify hard-to-recognize objects present in the image and (2) CLIP-guided sampling to mine plausible but absent ones. Applied to MSCOCO, ORIC produces ORIC-Bench and ORIC-style training data. Evaluating 18 LVLMs and 2 open-vocabulary detectors reveals substantial performance drops and bias patterns under incongruous contexts. Fine-tuning Qwen3-VL-8B-Instruct with Visual Reinforcement Fine-Tuning on 600 ORIC-style samples improves results on ORIC-Bench, AMBER, and HallusionBench. Overall, we show that contextual incongruity is a key source of uncertainty and provide tools for more reliable LVLMs.
PaperID: 828,   Poster  https://arxiv.org/pdf/2603.09506     GitHub
Authors: Won Shik Jang, Ue-Hwan Kim
Title: Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation
Abstract: Text-goal instance navigation (TGIN) asks an agent to resolve a single, free-form description into actions that reach the correct object instance among same-category distractors. We present Context-Nav, which elevates long, contextual captions from a local matching cue to a global exploration prior and verifies candidates through 3D spatial reasoning. First, we compute dense text-image alignments for a value map that ranks frontiers, guiding exploration toward regions consistent with the entire description rather than early detections. Second, upon observing a candidate, we perform a viewpoint-aware relation check: the agent samples plausible observer poses, aligns local frames, and accepts a target only if the spatial relations can be satisfied from at least one viewpoint. The pipeline requires no task-specific training or fine-tuning; we attain state-of-the-art performance on InstanceNav and CoIN-Bench. Ablations show that (i) encoding full captions into the value map avoids wasted motion and (ii) explicit, viewpoint-aware 3D verification prevents semantically plausible but incorrect stops. These results support geometry-grounded spatial reasoning as a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes.
PaperID: 829,   Poster  https://arxiv.org/pdf/2602.06965     GitHub
Authors: Ankan Deria, Komal Kumar, Adinath Dukre, Eran Segal, Salman Khan, Imran Razzak
Title: MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
Abstract: Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we introduce MedMO, a medical foundation model built upon a generalized MLLM architecture and trained exclusively on large-scale, domain-specific data. MedMO follows a multi-stage training recipe: (i) cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning on multi-task supervision that spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU reward to strengthen spatial grounding and step-by-step reasoning in complex clinical scenarios. MedMO consistently outperforms strong open-source medical MLLMs across multiple modalities and tasks. On VQA benchmarks, MedMO achieves an average accuracy improvement of +13.8% over the baseline and performs within 1.8% of the SOTA Fleming-VL. For text-based QA, it attains +7.0% over the baseline and +14.6% over Fleming-VL. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy. Moreover, it exhibits strong grounding capability, achieving an IoU improvement of +40.4 over the baseline and +37.0% over Fleming-VL, underscoring its robust spatial reasoning and localization performance. Evaluations across radiology, ophthalmology, pathology, and emergency care confirm MedMO’s broad cross-modality generalization and reliable spatial reasoning. Our code, data, and models will be publicly available.
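The abstract does not spell out how the box-level GIoU reward is computed. As a hedged illustration only, the sketch below implements the standard Generalized IoU between two axis-aligned boxes; the `(x1, y1, x2, y2)` box format and the function name `giou` are assumptions, and how MedMO turns this value into a reward is not stated in the source.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Returns a value in [-1, 1]: 1 for identical boxes, negative when the
    boxes are far apart relative to their enclosing box.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)

    # Intersection rectangle (zero area if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    if union <= 0.0 or enclose <= 0.0:
        return 0.0  # degenerate boxes: assumption, treated as no signal

    iou = inter / union
    # Penalize the part of the enclosing box not covered by the union.
    return iou - (enclose - union) / enclose


same = giou((0, 0, 2, 2), (0, 0, 2, 2))
disjoint = giou((0, 0, 1, 1), (2, 0, 3, 1))
partial = giou((0, 0, 2, 2), (1, 1, 3, 3))
```

Unlike plain IoU, GIoU still varies when the predicted and reference boxes do not overlap, which is one common reason it is preferred as a training signal for localization.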
PaperID: 830,   Poster  https://arxiv.org/pdf/2601.09575     GitHub
Authors: Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang, Cheng Sun
Title: OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
Abstract: We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel successfully builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks. The code will be open.
PaperID: 831,   Poster  https://arxiv.org/pdf/2603.24097     GitHub
Authors: Haoyu Ji, Xueting Liu, Yu Gao, Wenze Huang, Zhihao Yang, Weihong Ren, Zhiyong Wang, Honghai LIU
Title: LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation
Abstract: Skeleton-based Temporal Action Segmentation (STAS) aims to densely parse untrimmed skeletal sequences into frame-level action categories. However, existing methods, while proficient at capturing spatio-temporal kinematics, neglect the underlying physical dynamics that govern human motion. This oversight limits inter-class discriminability between actions with similar kinematics but distinct dynamic intents, and hinders precise boundary localization where dynamic force profiles shift. To address these limitations, we propose the Lagrangian-Dynamic Informed Network (LaDy), a framework integrating principles of Lagrangian dynamics into the segmentation process. Specifically, LaDy first computes generalized coordinates from joint positions and then estimates Lagrangian terms under physical constraints to explicitly synthesize the generalized forces. To further ensure physical coherence, our Energy Consistency Loss enforces the work-energy theorem, aligning kinetic energy change with the work done by the net force. The learned dynamics then drive a Spatio-Temporal Modulation module: spatially, generalized forces are fused with spatial representations to provide more discriminative semantics; temporally, salient dynamic signals are constructed for temporal gating, thereby significantly enhancing boundary awareness. Experiments on challenging datasets show LaDy achieves state-of-the-art performance, validating the integration of physical dynamics for action segmentation.
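As a hedged illustration of the work-energy theorem that the Energy Consistency Loss is said to enforce, one possible form in generalized coordinates is written out below. The symbols $q$ (generalized coordinates), $M(q)$ (inertia matrix), $\tau_{\mathrm{net}}$ (net generalized force), and the squared-residual loss are assumptions for illustration; the paper's own notation and loss may differ.

```latex
% Kinetic energy in generalized coordinates (assumed notation)
T(t) = \tfrac{1}{2}\, \dot{q}(t)^{\top} M\big(q(t)\big)\, \dot{q}(t)

% Work-energy theorem over a window [t_1, t_2]:
% change in kinetic energy equals the work done by the net generalized force
T(t_2) - T(t_1) = \int_{t_1}^{t_2} \tau_{\mathrm{net}}(t)^{\top} \dot{q}(t)\, \mathrm{d}t

% A hypothetical consistency penalty built from the residual of that identity
\mathcal{L}_{\mathrm{energy}} =
  \Big( T(t_2) - T(t_1) - \int_{t_1}^{t_2} \tau_{\mathrm{net}}^{\top} \dot{q}\, \mathrm{d}t \Big)^{2}
```

The residual is zero for physically consistent estimates, so minimizing it pushes the estimated forces and velocities to agree with each other.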
PaperID: 832,   Poster  https://arxiv.org/pdf/2512.10881     GitHub
Authors: Kehong Gong, Zhengyu Wen, Weixia He, Xu Mingxi, Qi WANG, ning Zhang, Zhengyu Li, Dongze Lian, Wei Zhao, He Xiaoyu, Mingyuan Zhang
Title: MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
Abstract: Motion capture now underpins content creation far beyond digital humans, yet most pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation (e.g., BVH) that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware Inverse Kinematics (IK) Fitting. MoCapAnything comprises three learnable modules and a lightweight IK stage: a Reference Prompt Encoder that distills per-joint queries from the asset’s skeleton, mesh, and rendered image set; a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the modality gap between RGB tokens and the point-cloud–like joint space; and a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1,038 motion clips, each providing a standardized skeleton–mesh–rendered-video triad. Experiments on in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits non-trivial cross-species retargeting across heterogeneous rigs, offering a scalable path toward prompt-based 3D motion capture for arbitrary assets.
PaperID: 833,   Poster  https://arxiv.org/pdf/2506.04764     GitHub
Authors: Suhan Woo, Seongwon Lee, jinwoo jang, Euntai Kim
Title: HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition
Abstract: Visual environments are inherently hierarchical, as a panoramic view naturally encompasses and organizes multiple perspective views within its field. Capturing this hierarchy is crucial for effective perspective-to-equirectangular (P2E) visual place recognition. In this work, we introduce HypeVPR, a hierarchical embedding framework in hyperbolic space specifically designed to address the challenges of P2E matching. HypeVPR leverages the intrinsic ability of hyperbolic space to represent hierarchical structures, allowing panoramic descriptors to encode both broad contextual information and fine-grained local details. To this end, we propose a hierarchical feature aggregation mechanism that organizes local-to-global feature representations within hyperbolic space. Furthermore, HypeVPR’s hierarchical organization inherently enables flexible control over the accuracy–efficiency trade-off without additional training, while maintaining robust matching across different image types. This approach allows HypeVPR to outperform existing methods while significantly accelerating retrieval and reducing database storage requirements. The codes and models are available: TBD.
PaperID: 834,   Poster  https://arxiv.org/pdf/2603.18671     GitHub
Authors: J. Miguel Valverde, Dim Papadopoulos, Rasmus Larsen, Anders Dahl
Title: Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels
Abstract: Standard deep learning models for image segmentation cannot guarantee topology accuracy, failing to preserve the correct number of connected components or structures. This, in turn, affects the quality of the segmentations and compromises the reliability of the subsequent quantification analyses. Previous works have proposed to enhance topology accuracy with specialized frameworks, architectures, and loss functions. However, these methods are often cumbersome to integrate into existing training pipelines, they are computationally very expensive, or they are restricted to structures with tubular morphology. We present SCNP, an efficient method that improves topology accuracy by penalizing the logits with their poorest-classified neighbor, forcing the model to improve the prediction at the pixels' neighbors before allowing it to improve the pixels themselves. We show the effectiveness of SCNP across 13 datasets, covering different structure morphologies and image modalities, and integrate it into three frameworks for semantic and instance segmentation. Additionally, we show that SCNP can be integrated into several loss functions, making them improve topology accuracy. Our code can be found at https://github.com/Anonymous.
PaperID: 835,   Poster  https://arxiv.org/pdf/2603.23276     GitHub
Authors: Yuchen Wu, Kun Wang, Yining Pan, Na Zhao
Title: CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection
Abstract: Multimodal approaches have emerged as a promising paradigm for accurate 3D object detection. However, performance degrades precipitously when deployed in target domains divergent from the training distribution. In this work, we pinpoint two primary culprits: 1) In certain domains, such as nighttime or rainy conditions, one modality experiences significant degradation; 2) The LiDAR branch tends to dominate the detection process, resulting in systematic under-exploitation of visual cues and vulnerability when point clouds are compromised. To surmount these impediments, we propose three synergistic innovations. First, Query-Decoupled Loss imparts independent supervisory signals to 2D-only, 3D-only, and fused queries, ensuring equitable gradient propagation and mitigating the image branch's supervisory starvation. Second, LiDAR-Guided Depth Prior augments 2D queries with instance-aware geometric priors via probabilistic depth distribution fusion from the point cloud, enhancing their spatial reasoning. Third, Inconsistent Cross-Modal Masking applies complementary spatial masks to the image and point cloud, simulating modality-specific failures and compelling queries from both modalities to compete within the fused decoder, thereby promoting adaptive fusion and preventing over-reliance on a single sensor. Extensive experiments reveal substantial gains over state-of-the-art baselines, achieving mAP improvements of 2.8, 1.3, and 3.2 on the Rain, Night, and Boston domains, respectively, while preserving competitive source-domain efficacy.
PaperID: 836,   Poster  https://arxiv.org/pdf/2511.22235     GitHub
Authors: Zehao Deng, Tianjie Ju, Zheng Wu, Zhuosheng Zhang, Gongshen Liu
Title: Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation
Abstract: The rapid development of large vision-language models (VLMs) has greatly advanced research on GUI agents. However, GUI agents still face significant challenges in handling long-horizon tasks. First, single-agent models struggle to balance high-level capabilities and low-level execution capability, facing prevalent issues of responsibility coupling and capability conflicts. Second, agents lack awareness of the task state, leading to progress loss in long-horizon tasks. To address these challenges, we propose a staged execution-feedback reinforcement learning algorithm. Unlike training a unified policy model, we focus on training high-level scheduling models. Specifically, we propose and train two agents: a Coordinator, responsible for the strategic planning and task decomposition; and a State Tracker, responsible for context compression and information management to maintain the task's state and coherence. Based on this, we built the Coordinator-Executor-State Tracker (CES) multi-agent framework, which can be integrated with any low-level Executor model, assisting the Executor in solving long-horizon tasks through task scheduling and state management. Experiments on long-horizon task benchmarks demonstrate that CES significantly enhances the system's planning and state management capabilities. Furthermore, analysis confirms that our trained high-level scheduling module is a generalizable, plug-and-play module that significantly enhances the long-horizon capabilities of various Executors.
PaperID: 837,   Poster  https://arxiv.org/pdf/2603.25778     GitHub
Authors: Yuan Zhang, Sihao Dou, Kai Hu, Shuhua Deng, Chunhong Cao, Fen Xiao, Xieping Gao
Title: Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis
Abstract: Endoscopic video analysis is crucial for early gastrointestinal screening, but its progress is constrained by limited high-quality annotations. While self-supervised video pre-training shows promise, existing methods designed for natural videos tend to prioritize dense spatio-temporal modeling and exhibit motion bias, neglecting the static, structured semantics that are critical for clinical decision-making. To address this challenge, we propose Focus-to-Perceive Representation Learning (FPRL), a cognition-inspired hierarchical framework that emulates the clinical examination process of endoscopic videos. FPRL first focuses on intra-frame lesion-centric regions to learn static semantics, and then perceives their evolution across frames to model contextual semantics. To achieve this, FPRL employs a hierarchical semantic modeling mechanism that explicitly distinguishes and collaboratively learns both types of semantics. Specifically, it begins by capturing static semantics through the application of teacher-prior adaptive masking (TPAM) combined with multi-view sparse sampling. This approach mitigates redundant temporal dependencies and enables the model to concentrate on lesion-related local semantics. Following this, contextual semantics are derived through cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP). These processes establish cross-view correspondences and effectively model structured inter-frame evolution, thereby reinforcing temporal semantic continuity while preserving global contextual integrity. Extensive experiments on 11 endoscopic video datasets show that FPRL achieves state-of-the-art performance across diverse downstream tasks, demonstrating its effectiveness and strong generalization in endoscopic video representation learning.
PaperID: 838,   Poster  https://arxiv.org/pdf/2512.06281     GitHub
Authors: Hengzhuang Li, Xinsong Zhang, Qiming Peng, Bin Luo, Han Hu, Dengyang Jiang, Han-Jia Ye, Teng Zhang, Hai Jin
Title: Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in multimodal tasks. Despite their impressive performance, MLLMs suffer from the modality imbalance issue, where visual information is often underutilized compared to textual representations in deeper layers, leading to degraded visual performance or hallucinations. This issue stems from the predominant reliance on next-text-token-prediction during training, which fails to provide direct visual supervisory signals, resulting in progressive homogenization of visual representations throughout the layers. To this end, we propose Latent Visual Reconstruction (LaVer), a novel training framework that facilitates MLLMs in learning more discriminative visual representations via masked image modeling in the joint latent semantic space of LLM. Our method offers direct visual activation to MLLMs, which exhibit increased visual attention allocation, indicating enhanced utilization of visual information. Extensive experiments across diverse benchmarks prove the superiority of our approach in various scenarios, especially those requiring dense visual capabilities. Code will be publicly available upon publication.
PaperID: 839,   Poster  https://arxiv.org/pdf/2603.17370     GitHub
Authors: Umangi Jain, Vladimir G. Kim, Matheus Gadelha, Igor Gilitschenski, Zhiqin Chen
Title: Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes
Abstract: We introduce the problem of material-aware part grouping in untextured meshes. Many real-world shapes, such as scales of pinecones or windows of buildings, contain repeated structures that share the same material but exhibit geometric variations. When assigning materials to such meshes, these repeated parts often require piece-by-piece manual identification and selection, which is tedious and time-consuming. To address this, we propose Material Magic Wand, a tool that allows artists to select part groups based on their estimated material properties -- when one part is selected, our algorithm automatically retrieves all other parts likely to share the same material. The key component of our approach is a part encoder that generates a material-aware embedding for each 3D part, accounting for both local geometry and global context. We train our model with a supervised contrastive loss that brings embeddings of material-consistent parts closer while separating those of different materials; therefore, part grouping can be achieved by retrieving embeddings that are close to the embedding of the selected part. To benchmark this task, we introduce a curated dataset of 100 shapes with 241 part-level queries. We verify the effectiveness of our method through extensive experiments and demonstrate its practical value in an interactive material assignment application.
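The retrieval step described above (select one part, then fetch every part whose embedding is close to it) can be sketched as a simple similarity search. This is only an illustration: the use of cosine similarity, the fixed `threshold`, and the function names are assumptions, since the abstract does not specify the distance measure or the selection rule. Embeddings are assumed to be non-zero vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_similar_parts(embeddings, selected_index, threshold=0.9):
    """Return indices of parts whose embedding is close to the selected part's.

    `threshold` is a hypothetical cut-off chosen for illustration.
    """
    query = embeddings[selected_index]
    return [
        i for i, emb in enumerate(embeddings)
        if cosine_similarity(query, emb) >= threshold
    ]
```

With part embeddings `[[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]`, selecting part 0 returns parts 0 and 1, while the orthogonal part 2 is left out.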
PaperID: 840,   Poster  https://arxiv.org/pdf/2603.07625     GitHub
Authors: Shumeng Li, Jintao Guo, Jian Zhang, Yulin Zhou, Luyang Cao, Yinghuan Shi
Title: Duala: Dual-level Alignment of Subjects and Stimuli for Cross-Subject fMRI Decoding
Abstract: Cross-subject visual decoding aims to reconstruct visual experiences from brain activity across individuals, enabling more scalable and practical brain-computer interfaces. However, existing methods often suffer from degraded performance when adapting to new subjects with limited data, as they struggle to preserve both the semantic consistency of stimuli and the alignment of brain responses. To address these challenges, we propose Duala, a dual-level alignment framework designed to achieve stimulus-level consistency and subject-level alignment in fMRI-based cross-subject visual decoding. (1) At the stimulus level, Duala introduces a semantic alignment and relational consistency strategy that preserves intra-class similarity and inter-class separability, maintaining clear semantic boundaries during adaptation. (2) At the subject level, a distribution-based feature perturbation mechanism is developed to capture both global and subject-specific variations, enabling adaptation to individual neural representations without overfitting. Experiments on the Natural Scenes Dataset (NSD) demonstrate that Duala effectively improves alignment across subjects. Remarkably, even when fine-tuned with only about one hour of fMRI data, Duala achieves over 81.1% image-to-brain retrieval accuracy and consistently outperforms existing fine-tuning strategies in both retrieval and reconstruction.
PaperID: 841,   Poster  https://arxiv.org/pdf/2603.16001     GitHub
Authors: Sijie Li, Biao Qian, Jungong Han
Title: Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models
Abstract: Network pruning is an effective technique for enabling lightweight Large Vision-Language Models (LVLMs), which primarily incorporates both weights and activations into the importance metric. However, existing efforts typically process calibration data from different modalities in a unified manner, overlooking modality-specific behaviors. This raises a critical challenge: how to address the divergent behaviors of textual and visual tokens for accurate pruning of LVLMs. To this end, we systematically investigate the sensitivity of visual and textual tokens to the pruning operation by decoupling their corresponding weights, revealing that: (i) the textual pathway should be calibrated via text tokens, since it exhibits higher sensitivity than the visual pathway; (ii) the visual pathway exhibits high redundancy, permitting even 50% sparsity. Motivated by these insights, we propose a simple yet effective Asymmetric Text-Visual Weight Pruning method for LVLMs, dubbed ATV-Pruning, which establishes the importance metric for accurate weight pruning by selecting the informative tokens from both textual and visual pathways. Specifically, ATV-Pruning integrates two primary innovations: first, a calibration pool is adaptively constructed by drawing on all textual tokens and a subset of visual tokens; second, we devise a layer-adaptive selection strategy to yield important visual tokens. Finally, extensive experiments across standard multimodal benchmarks verify the superiority of our ATV-Pruning over state-of-the-art methods.
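To make the idea of an importance metric built from both weights and activations concrete, here is a minimal sketch of a Wanda-style score, |weight| times the L2 norm of the matching input feature over a calibration pool. Whether ATV-Pruning uses exactly this form is an assumption; the abstract only says that weights and activations are combined and that the pool draws on all textual tokens plus a subset of visual tokens. The function names and the toy numbers are hypothetical.

```python
import math

def activation_norms(calibration_tokens):
    """L2 norm of each input feature across all tokens in the calibration pool."""
    num_features = len(calibration_tokens[0])
    return [
        math.sqrt(sum(tok[j] ** 2 for tok in calibration_tokens))
        for j in range(num_features)
    ]

def prune_row(weights, norms, sparsity=0.5):
    """Zero the lowest-scoring fraction of weights in one output row.

    Score of weight j is |w_j| * norms[j] (an assumed, Wanda-style metric).
    """
    scores = [abs(w) * n for w, n in zip(weights, norms)]
    num_pruned = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda j: scores[j])
    pruned = set(order[:num_pruned])
    return [0.0 if j in pruned else w for j, w in enumerate(weights)]

# Toy calibration pool: every text token plus a chosen subset of visual tokens.
text_tokens = [[1.0, 0.0, 2.0, 0.0]]
visual_subset = [[0.0, 1.0, 2.0, 0.0]]
pool = text_tokens + visual_subset
```

Note that in this toy pool the fourth input feature is never active, so even a large weight on it scores zero and is pruned first; this is the sense in which the choice of calibration tokens shapes which weights survive.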
PaperID: 842,   Poster  https://arxiv.org/pdf/2512.01677     GitHub
Authors: Haodong Yan, Hang Yu, Zhide Zhong, Weilin Yuan, Xin Gong, Zehang Luo, Chengxi Heyu, Junfeng Li, Wenxuan Song, Shunbo Zhou, Haoang Li
Title: Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
Abstract: Generating realistic hand-object interaction (HOI) videos is a significant challenge due to the difficulty of modeling physical constraints (e.g., contact and occlusion between hands and manipulated objects). Current methods utilize HOI representation as an auxiliary generative objective to guide video synthesis. However, there is a dilemma between 2D and 3D representations that cannot simultaneously guarantee scalability and interaction fidelity. To address this limitation, we propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations. This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios. To fully exploit the proposed representation, we introduce a joint-generation paradigm with a share-and-specialization strategy that generates interaction-oriented representations and videos. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two real-world datasets in generating physics-realistic and temporally coherent HOI videos. Furthermore, our approach exhibits strong generalization to challenging open-world scenarios, highlighting the benefit of our scalable design.
PaperID: 843,   Poster  https://arxiv.org/pdf/2512.02392     GitHub
Authors: Yuqing Shao, Yuchen Yang, Rui Yu, Weilong Li, Xu Guo, Huaicheng Yan, Wei Wang, Xiao Sun
Title: From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking
Abstract: End-to-end multi-object tracking (MOT) methods have recently achieved remarkable progress by unifying detection and association within a single framework. Despite their strong detection performance, these methods suffer from relatively low association accuracy. Through detailed analysis, we observe that object embeddings produced by the shared DETR architecture display excessively high inter-object similarity, as it emphasizes only category-level discrimination within single frames. In contrast, tracking requires instance-level distinction across frames with spatial and temporal continuity, for which current end-to-end approaches insufficiently optimize object embeddings. To address this, we introduce FDTA (From Detection to Association), an explicit feature refinement framework that enhances object discriminativeness across three complementary perspectives. Specifically, we introduce a Spatial Adapter (SA) to integrate depth-aware cues for spatial continuity, a Temporal Adapter (TA) to aggregate historical information for temporal dependencies, and an Identity Adapter (IA) to leverage quality-aware contrastive learning for instance-level separability. Extensive experiments demonstrate that FDTA achieves state-of-the-art performance on multiple challenging MOT benchmarks including DanceTrack, SportsMOT, and BFT, highlighting the effectiveness of our proposed discriminative embedding enhancement strategy. The source code will be made publicly available upon publication.
PaperID: 844,   Poster  https://arxiv.org/pdf/2512.06065     GitHub
Authors: Runjia Li, Moayed Haji Ali, Ashkan Mirzaei, Chaoyang Wang, Arpit Sahni, Ivan Skorokhodov, Aliaksandr Siarohin, Tomas Jakab, Junlin Han, Sergey Tulyakov, Philip H.S. Torr, Willi Menapace
Title: EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing
Abstract: We study instruction-guided editing of egocentric videos for interactive AR applications. While recent AI video editors perform well on third-person footage, egocentric views present unique challenges, including rapid egomotion and frequent hand–object interactions, that create a significant domain gap. Moreover, existing offline editing pipelines suffer from high latency, limiting real-time interaction. To address these issues, we present a complete ecosystem for egocentric video editing. First, we construct EgoEditData, a carefully designed and manually curated dataset for egocentric editing scenarios, featuring rich hand-object interactions while explicitly preserving hands. Second, we develop EgoEdit, an instruction-following egocentric video editor that supports real-time streaming inference on a single GPU. Finally, we introduce EgoEditBench, an evaluation suite targeting instruction faithfulness, hand and interaction preservation, and temporal stability under egomotion. Across both egocentric and general editing tasks, EgoEdit produces temporally stable, instruction-faithful results with interactive latency. It achieves clear gains on egocentric editing benchmarks, where existing methods struggle, while maintaining performance comparable to the strongest baselines on general editing tasks. EgoEditData and EgoEditBench will be made public for the research community.
PaperID: 845,   Poster  https://arxiv.org/pdf/2604.13994     GitHub
Authors: Enzhuo Zhang, Sijie Zhao, Dilxat Muhtar, Zhenshi Li, Xueliang Zhang, Pengfeng Xiao
Title: Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework
Abstract: Generative diffusion priors have recently achieved state-of-the-art performance in Natural Image Super-Resolution, demonstrating a powerful capability to synthesize photorealistic details. However, their direct application to Remote Sensing Image Super-Resolution (RSISR) reveals significant shortcomings. Remote sensing images present a unique challenge: ground objects often exhibit globally stochastic yet locally clustered patterns. This characteristic leads to highly imbalanced texture distributions, posing a significant hurdle to the model's spatial perception. To address this, we propose TexADiff, a novel framework that begins by estimating a Relative Texture Density Map (RTDM) that reflects the underlying texture distribution. TexADiff then leverages this RTDM in three synergistic ways: as an explicit spatial conditioning to guide the diffusion process, as a loss modulation term to prioritize texture-rich regions, and as a dynamic adapter for the sampling schedule. These modifications are designed to endow the model with explicit texture-aware capabilities. Experiments demonstrate that TexADiff achieves superior quantitative metrics. Furthermore, qualitative results show that our model generates faithful high-frequency details while effectively suppressing texture hallucinations. This improved reconstruction quality also results in significant gains in downstream task performance.
PaperID: 846,   Poster  https://arxiv.org/pdf/2511.19172     GitHub
Authors: Kehua Chen, Tianlu Mao, Xinzhu Ma, Hao Jiang, Zehao Li, Zihan Liu, Shuqin Gao, Honglong Zhao, Feng Dai, Yucheng Zhang, Zhaoqi Wang
Title: MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes
Abstract: Recently, 3D Gaussian Splatting and its derivatives have achieved significant breakthroughs in large-scale scene reconstruction. However, how to efficiently and stably achieve high-quality geometric fidelity remains a core challenge. To address this issue, we introduce MetroGS, a novel Gaussian Splatting framework for efficient and robust reconstruction in complex urban environments. Our method is built upon a distributed 2D Gaussian Splatting representation as the core foundation, serving as a unified backbone for subsequent modules. To handle potential sparse regions in complex scenes, we propose a structured dense enhancement scheme that utilizes SfM priors and a pointmap model to achieve a denser initialization, while incorporating a sparsity compensation mechanism to improve reconstruction completeness. Furthermore, we design a progressive hybrid geometric optimization strategy that organically integrates monocular and multi-view optimization to achieve efficient and accurate geometric refinement. Finally, to address the appearance inconsistency commonly observed in large-scale scenes, we introduce a depth-guided appearance modeling approach that learns spatial features with 3D consistency, facilitating effective decoupling between geometry and appearance and further enhancing reconstruction stability. Experiments on large-scale urban datasets demonstrate that MetroGS achieves superior geometric accuracy and rendering quality, offering a unified solution for high-fidelity large-scale scene reconstruction. The code will be publicly released upon acceptance.
PaperID: 847,   Poster  https://arxiv.org/pdf/2604.00479     GitHub
Authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Peter Henry Tu, Jing Zhang
Title: All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models
Abstract: Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite the promise, the underlying mechanisms that drive the effectiveness of RL models as well as their limitations remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models, where the former engage in deeper yet narrower reasoning, while base models, despite being less refined along individual paths, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks.
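As background for the GRPO discussion above, the sketch below shows the group-relative advantage normalization commonly described for GRPO: each sampled response's reward is standardized against the other responses drawn for the same prompt. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions (implementations differ on these details). MUPO's multi-group variant is not specified in the abstract and is not shown here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Uses the population standard deviation; `eps` avoids division by zero
    when every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal, which is one intuition for why a policy that has converged on a single reasoning strategy receives little pressure to explore alternatives.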
PaperID: 848,   Poster  https://arxiv.org/pdf/2602.15031     GitHub
Authors: Yehonathan Litman, Shikun Liu, Dario Seyb, Nicholas Milef, Yang Zhou, Carl Marshall, Shubham Tulsiani, Caleb Leak
Title: EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing
Abstract: High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck, as they are often designed to inefficiently process the full video context regardless of the inpainting mask's size, even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide context consistency with minimal overhead. Not only is EditCtrl 10× more compute efficient than state-of-the-art generative editing methods, it even improves editing quality compared to methods designed with full-attention. Finally, we showcase how EditCtrl unlocks new capabilities, including multi-region editing with text prompts and real-time autoregressive content propagation.
PaperID: 849,   Poster  https://arxiv.org/pdf/2604.19318     GitHub
Authors: Qi Zhang, Jixuan Chen, Zhang Kaiyi, Xinquan Yu, Antoni B. Chan, Hui Huang
Title: Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes
Abstract: Multi-view crowd tracking estimates each person's trajectory on the ground plane of the scene. Recent research works mainly rely on CNN-based multi-view crowd tracking architectures, and most of them are evaluated and compared on relatively small datasets, such as Wildtrack and MultiviewX. Since these two datasets are collected in small scenes and only contain tens of frames in the evaluation stage, it is difficult for the current methods to be applied to real-world applications where scene size and occlusion are more complicated. In this paper, we propose a Transformer-based multi-view crowd tracking model, MVTrackTrans, which adopts interactions between camera views and the ground plane for enhanced multi-view tracking performance. Besides, for better evaluation, we collect and label two large real-world multi-view tracking datasets, MVCrowdTrack and CityTrack, which contain a much larger scene size over a longer time period. Compared with existing methods on the two large and new datasets, the proposed MVTrackTrans model achieves better performance, demonstrating the advantages of the model design in dealing with large scenes. We believe the proposed datasets and model will push the frontiers of the task to more practical scenarios, and the datasets and code will be made public upon paper acceptance.
PaperID: 850,   Poster  https://arxiv.org/pdf/2512.01540     GitHub
Authors: Zipeng Wang, Dan Xu
Title: FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention
Abstract: 3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounding Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences. In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of descriptor tokens. Global attention is then computed as cross-attention between the full set of image tokens and this smaller descriptor set, significantly reducing computational overhead. Moreover, the compactness of the descriptors enables online inference over long sequences via a chunk-recursive mechanism that reuses cached descriptors from previous chunks. Experimental results show that FlashVGGT achieves reconstruction accuracy competitive with VGGT while reducing inference time to just 9.3% for 1,000 images, and scaling efficiently to sequences exceeding 3,000 images.
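The scaling argument above can be illustrated with a rough count of query-key score pairs: full self-attention compares every image token with every other, while descriptor-based cross-attention compares every image token only with the much smaller descriptor set. The function name and the per-frame token and descriptor counts below are hypothetical, and the count ignores the cost of producing the descriptors and all non-attention layers, so it is not meant to reproduce the reported 9.3% figure.

```python
def attention_pair_counts(num_frames, tokens_per_frame, descriptors_per_frame):
    """Count query-key score pairs for full vs. descriptor-based global attention.

    Returns (full_self_attention_pairs, descriptor_cross_attention_pairs).
    """
    total_tokens = num_frames * tokens_per_frame
    total_descriptors = num_frames * descriptors_per_frame
    full = total_tokens * total_tokens            # every token attends to every token
    compressed = total_tokens * total_descriptors  # every token attends to descriptors only
    return full, compressed
```

For example, `attention_pair_counts(10, 100, 4)` gives 1,000,000 pairs for full self-attention versus 40,000 for the descriptor variant; the ratio equals `descriptors_per_frame / tokens_per_frame` regardless of sequence length, although both counts still grow quadratically in the number of frames.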
PaperID: 851,   Poster  https://arxiv.org/pdf/2511.20415     GitHub
Authors: Zilong Huang, Jun He, Xiaobin Huang, Ziyi Xiong, Yang Luo, Junyan Ye, Weijia Li, Yiping Chen, Ting Han
Title: MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
Abstract: Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must combine stylistic diversity with fine-grained controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language–driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. Meanwhile, we develop a practical set of evaluation metrics, covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% over CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results confirm MajutsuCity as a new state-of-the-art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect our framework can inspire new avenues of research in 3D city generation. Our dataset and code will be made publicly available.
PaperID: 852,   Poster  https://arxiv.org/pdf/2511.20223     GitHub
Authors: Sen Nie, Jie Zhang, Jianxin Yan, Shiguang Shan, Xilin Chen
Title: V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs
Abstract: Adversarial attacks have evolved from simply disrupting predictions on conventional task-specific models to the more complex goal of manipulating image semantics on Large Vision-Language Models (LVLMs). However, existing methods struggle with controllability and fail to precisely manipulate the semantics of specific concepts in the image. We attribute this limitation to semantic entanglement in the patch-token representations on which adversarial attacks typically operate: global context aggregated by self-attention in the vision encoder dominates individual patch features, making them unreliable handles for precise local semantic manipulation. Our systematic investigation reveals a key insight: value features (V) computed within the transformer attention block serve as much more precise handles for manipulation. We show that V suppresses global-context channels, allowing it to retain high-entropy, disentangled local semantic information. Building on this discovery, we propose V-Attack, a novel method designed for precise local semantic attacks. V-Attack targets the value features and introduces two core components: (1) a Self-Value Enhancement module to refine V's intrinsic semantic richness, and (2) a Text-Guided Value Manipulation module that leverages text prompts to locate the source concept and optimize it toward a target concept. By bypassing the entangled patch features, V-Attack achieves highly effective semantic control. Extensive experiments across diverse LVLMs, including LLaVA, InternVL, DeepseekVL and GPT-4o, show that V-Attack improves the attack success rate by an average of 36% over state-of-the-art methods, exposing critical vulnerabilities in modern visual-language understanding.
PaperID: 853,   Poster  https://arxiv.org/pdf/2508.04416     GitHub
Authors: Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, bowen zhang, zhichao zhou, Dongliang He, Yansong Tang
Title: Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning
Abstract: The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets: MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on eleven challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, outperforming existing methods in video question answering and temporal grounding tasks, especially in long video scenarios.
PaperID: 854,   Poster  https://arxiv.org/pdf/2604.14563     GitHub
Authors: Mingqian Ji, Shanshan Zhang, Jian Yang
Title: Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
Abstract: Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated scenes to reduce computation cost. To further mitigate potential detail loss, Informative Patch Selection (IPS) selects the informative patches for feature refinement, and Cross-Granularity Feature Enhancement (CGFE) injects fine-grained details into selected coarse patches, enriching semantic features. Experiments on the nuScenes and Argoverse 2 validation sets show that SEPatch3D achieves up to 57% faster inference than the StreamPETR baseline and 20% higher efficiency than the state-of-the-art ToC3D-faster, while preserving comparable detection accuracy.
PaperID: 855,   Poster  https://arxiv.org/pdf/2603.19026     GitHub
Authors: Anqi Zhang, Xiaokang Ji, Guangyu Gao, Jianbo Jiao, Chi Harold Liu, Yunchao Wei
Title: Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token
Abstract: Recent segmentation methods leveraging Multimodal Large Language Models (MLLMs) have shown reliable object-level segmentation and enhanced spatial perception. However, almost all previous methods rely on specialist mask decoders to interpret masks from generated segmentation-related embeddings and visual features, or incorporate multiple additional tokens to assist. This paper investigates whether and how we can unlock segmentation from the MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, eliminating the need for external decoders. To this end, our approach targets the fundamental limitation of resolution reduction in pixel-shuffled image features from MLLMs. First, we retain image features at their original uncompressed resolution, and refill them with residual features extracted from MLLM-processed compressed features, thereby improving feature precision. Subsequently, we integrate pixel-unshuffle operations on image features with and without MLLM processing, respectively, to unleash the details of compressed features and simulate the residual features under uncompressed resolution, which further enhances the resolution of refilled features. Moreover, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token. Comprehensive experiments across multiple segmentation tasks validate that SELF1E achieves performance competitive with specialist mask decoder-based methods, demonstrating the feasibility of decoder-free segmentation in MLLMs.
PaperID: 856,   Poster  https://arxiv.org/pdf/2510.18457     GitHub
Authors: Tianci Bi, Xiaoyi Zhang, Yan Lu, Nanning Zheng
Title: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
Abstract: The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizers. While recent works have explored incorporating Vision Foundation Models (VFMs) into tokenizer training via distillation, we empirically find that this approach inevitably weakens the robustness of the representations learnt from the original VFM. In this paper, we bypass distillation with a more direct approach that uses the frozen VFM as the LDM tokenizer, named VFM Variational Autoencoder (VFM-VAE). To fully exploit the frozen VFM as an LDM tokenizer, we design a new decoder to reconstruct realistic images from the semantic-rich representation of the VFM. With the proposed VFM-VAE, we conduct a systematic study on how the representations from different tokenizers impact the representation learning process throughout diffusion training, enabling synergistic benefits of dual-side alignment on both tokenizers and diffusion models. Our efforts in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.22 in merely 80 epochs (a 10× speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62. These results offer solid evidence for the substantial potential of VFMs to serve as visual tokenizers that accelerate LDM training.
PaperID: 857,   Poster  https://arxiv.org/pdf/2512.18766     GitHub
Authors: Guohui Zhang, Hu Yu, Xiaoxiao Ma, Yaning Pan, Hang Xu, Jie Huang, Feng Zhao
Title: MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation
Abstract: Reinforcement learning (RL) has demonstrated significant potential for post-training language models and autoregressive visual generative models, but adapting RL to masked generative models remains challenging. The core difficulty is that, because generation is a multi-step, iterative refinement process, policy optimization must account for the likelihood of each step. This reliance on entire sampling trajectories introduces high computational cost, whereas naively optimizing random steps often yields suboptimal results. In this paper, we present MaskFocus, a novel RL framework that achieves effective policy optimization for masked generative models by focusing on critical steps. Specifically, we determine the step-level information gain by measuring the similarity between the intermediate images at each sampling step and the final generated image. Crucially, we leverage this to identify the most critical and valuable steps and execute focused policy optimization on them. Furthermore, we design a dynamic routing sampling mechanism based on entropy to encourage the model to explore more valuable masking strategies for samples with low entropy. Extensive experiments on multiple text-to-image benchmarks validate the effectiveness of our method.
PaperID: 858,   Poster  https://arxiv.org/pdf/2601.14959     GitHub
Authors: Xinyu Peng, Han Li, Yuyang Huang, Ziyang Zheng, Yaoming Wang, Xin Chen, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong
Title: Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers
Abstract: Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. To mitigate error accumulation inherent in auto-regressive generation, we introduce a novel skip-concatenate sampling strategy that effectively maintains temporal stability. Furthermore, LDF-VFI incorporates sparse, local attention and tiled VAE encoding, a combination that not only enables efficient processing of long sequences but also allows generalization to arbitrary spatial resolutions (e.g., 4K) at inference without retraining. An enhanced conditional VAE decoder, which leverages multi-scale features from the input video, further improves reconstruction fidelity. Empirically, LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per-frame quality and temporal consistency, especially in scenes with large motion.
PaperID: 859,   Poster  https://arxiv.org/pdf/2604.01082     GitHub
Authors: Yaoqin Ye, Yiteng Xu, Qin Sun, Xinge Zhu, YUJING SUN, Yuexin Ma
Title: ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data
Abstract: Human behaviors in real-world environments are inherently interactive, with an individual’s motion shaped by surrounding agents and the scene. Such capabilities are essential for applications in virtual avatars, interactive animation, and human–robot collaboration. We target real-time human interaction-to-reaction generation, which generates the ego’s future motion from dynamic multi-source cues, including others’ actions, scene geometry, and semantic inputs. This task is fundamentally challenging due to (i) limited and fragmented interaction data distributed across heterogeneous single-person, human–human, and human–scene domains, and (ii) the need to produce low-latency yet high-fidelity motion responses during continuous online interaction. To address these challenges, we propose ReMoGen (Reaction Motion Generation), a modular learning framework for real-time interaction-to-reaction generation. ReMoGen leverages a universal motion prior learned from large-scale single-person motion datasets and adapts it to target interaction domains through independently trained Meta-Interaction modules, enabling robust generalization under data-scarce and heterogeneous supervision. During online rollout, ReMoGen performs generation in short temporal segments and employs a lightweight Frame-wise Segment Refinement module that incorporates freshly observed interaction cues, achieving responsive and temporally coherent motion without heavy full-sequence inference. Extensive experiments across human–human, human–scene, and composite interaction settings demonstrate that ReMoGen delivers superior motion fidelity, responsiveness, and cross-domain generalization.
PaperID: 860,   Poster  https://arxiv.org/pdf/2603.17312     GitHub
Authors: Yuelin Zhang, Sijie Cheng, Chen Li, Zongzhao Li, Yuxin Huang, Yang Liu, Wenbing Huang
Title: Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress
Abstract: Accurately estimating task progress is critical for embodied agents to plan and execute long-horizon, multi-step tasks. Despite promising advances, existing Vision-Language Model (VLM)-based methods primarily leverage their video understanding capabilities, while neglecting their complex reasoning potential. Furthermore, processing long video trajectories with VLMs is computationally prohibitive for real-world deployment. To address these challenges, we propose the Recurrent Reasoning Vision-Language Model (R²VLM). Our model features a recurrent reasoning framework that processes local video snippets iteratively, maintaining a global context through an evolving Chain of Thought (CoT). This CoT explicitly records task decomposition, key steps, and their completion status, enabling the model to reason about complex temporal dependencies. This design avoids the high cost of processing long videos while preserving essential reasoning capabilities. We train R²VLM on large-scale, automatically generated datasets from ALFRED and Ego4D, enhanced with advanced post-training techniques. Extensive experiments on progress estimation and downstream applications (including progress-enhanced policy learning, reward modeling for reinforcement learning, and proactive assistance) demonstrate that R²VLM delivers strong performance and generalization, setting a new state of the art in long-horizon task progress estimation.
PaperID: 861,   Poster  https://arxiv.org/pdf/2603.01194     GitHub
Authors: Mochu Xiang, Zhelun Shen, Xuesong li, Jiahui Ren, Jing Zhang, Chen Zhao, Shanshan Liu, Haocheng Feng, Jingdong Wang, Yuchao Dai
Title: RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations
Abstract: Humans perceive the 3D world through 2D observations from limited viewpoints. While recent feed-forward generalizable 3D reconstruction models excel at recovering 3D structures from sparse images, their representations are often confined to observed regions, leaving unseen geometry unmodeled. This raises a fundamental challenge: Can we infer a complete 3D structure from partial 2D observations? We present RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies these two tasks by predicting an implicit, complete 3D representation. At the core of RnG, we propose a reconstruction-guided causal attention mechanism that separates reconstruction and generation at the attention level, and treats the KV-cache as an implicit 3D representation. Then, arbitrary poses can efficiently query this cache to render high-fidelity, novel-view RGBD outputs. As a result, RnG not only accurately reconstructs visible geometry but also generates plausible, coherent unseen geometry and appearance. Our method achieves state-of-the-art performance in both generalizable 3D reconstruction and novel view generation, while operating efficiently enough for real-time interactive applications.
PaperID: 862,   Poster  https://arxiv.org/pdf/2603.12254     GitHub
Authors: Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Jan Kautz, Boyi Li, David Chan, Trevor Darrell, Pavlo Molchanov, Danny Yin
Title: Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
Abstract: Multimodal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos, because they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that reconstructs the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x to 100x and accelerates ViTs and MLLMs by up to 19x, enabling scaling MLLMs to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 66.5% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with multi-minute 4K videos, where an MLLM scaled with AutoGaze outperforms the previous SOTA MLLM by 6.3%.
PaperID: 863,   Poster  https://arxiv.org/pdf/2603.19718     GitHub
Authors: Phuong Nguyen, Tien Anh Pham, Duc-Trong Le, Van Nguyen
Title: BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates
Abstract: Learning from multiple modalities often suffers from imbalance, where information-rich modalities dominate optimization while weaker or partially missing modalities contribute less. This imbalance becomes severe in realistic settings with imbalanced missing rates (IMR), where each modality is absent with a different probability, distorting representation learning and gradient dynamics. We revisit this issue from a training-process perspective and propose BALM, a model-agnostic plug-in framework to achieve balanced multimodal learning under IMR. The framework consists of two complementary modules. The Feature Calibration Module (FCM) operates at the representation level, recalibrating unimodal features through global contextual information to build a shared representation basis across heterogeneous missing patterns. The Gradient Rebalancing Module (GRM) works at the optimization level, equalizing learning dynamics across modalities by modulating gradient magnitudes and directions from distributional and spatial perspectives. BALM can be seamlessly integrated into diverse backbones, including multimodal emotion recognition (MER) models, without altering their architectures. Experimental results across multiple MER benchmarks confirm that BALM consistently enhances robustness and improves performance under diverse missing and imbalance settings.
PaperID: 864,   Poster  https://arxiv.org/pdf/2602.22212     GitHub
Authors: Julian Kaltheuner, Hannah Dröge, Markus Plack, Patrick Stotko, Reinhard Klein
Title: Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences
Abstract: Temporally consistent surface reconstruction of dynamic 3D objects from unstructured point cloud data remains challenging, especially for very long sequences. Existing methods either optimize deformations incrementally, risking drift and requiring long runtimes, or rely on complex learned models that demand category-specific training. We present Neu-PiG, a fast optimization method based on a novel preconditioned surface encoding that estimates coherent non-rigid deformations without sacrificing temporal stability or accuracy. Our method encodes entire deformations across all time steps at various spatial scales into a multi-resolution latent grid, parameterized by the position and normal direction of a reference surface from a single keyframe. This latent representation is then augmented for time modulation and decoded into per-frame 6-DoF deformations via a lightweight multi-layer perceptron (MLP). To achieve high-fidelity, drift-free surface reconstructions in seconds, we employ Sobolev preconditioning during gradient-based training of the latent space, completely avoiding the need for any explicit correspondences or further priors. Experiments across diverse human and animal datasets demonstrate that Neu-PiG outperforms state-of-the-art approaches, offering both superior accuracy and scalability to long sequences while running at least 60× faster than existing training-free methods and even matching the inference-time performance of heavy pretrained models.
PaperID: 865,   Poster  https://arxiv.org/pdf/2511.13269     GitHub
Authors: Lingfeng Zhang, Yuchen Zhang, Hongsheng Li, Haoxiang Fu, Yingbo Tang, Hangjun Ye, Long Chen, Xiaojun Liang, Xiaoshuai Hao, Wenbo Ding
Title: Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation
Abstract: Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks. However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored, raising concerns about their effectiveness in navigating and interpreting dynamic environments. To bridge this gap, we introduce SpatialSky-Bench, a comprehensive benchmark specifically designed to evaluate the spatial intelligence capabilities of VLMs in UAV navigation. Our benchmark comprises two categories, Environmental Perception and Scene Understanding, divided into 13 subcategories, including bounding boxes, color, distance, height, and landing safety analysis, among others. Extensive evaluations of various mainstream open-source and closed-source VLMs reveal unsatisfactory performance in complex UAV navigation scenarios, highlighting significant gaps in their spatial capabilities. To address this challenge, we develop the SpatialSky-Dataset, a comprehensive dataset containing 1M samples with diverse annotations across various scenarios. Leveraging this dataset, we introduce Sky-VLM, a specialized VLM designed for UAV spatial reasoning across multiple granularities and contexts. Extensive experimental results demonstrate that Sky-VLM achieves state-of-the-art performance across all benchmark tasks, paving the way for the development of VLMs suitable for drone scenarios. The dataset, benchmark toolkit, and associated code and model checkpoints will be publicly accessible.
PaperID: 866,   Poster  https://arxiv.org/pdf/2512.05131     GitHub
Authors: Tianling Xu, Shengzhe GAN, Leslie Gu, Yuelei Li, Fangneng Zhan, Hanspeter Pfister
Title: AREA3D: Active Reconstruction Agent with Unified Feed-Forward 3D Perception and Vision-Language Guidance
Abstract: Active 3D reconstruction enables an agent to autonomously select viewpoints to build accurate and complete scene geometry efficiently, rather than passively reconstructing scenes from pre-collected images. Existing active reconstruction methods often rely on geometric heuristics, which may result in redundant observations without improving reconstruction quality. To address this, we propose AREA3D, an active reconstruction agent that leverages feed-forward 3D models and vision-language guidance. The framework decouples view uncertainty modeling from feed-forward reconstruction, enabling precise uncertainty estimation without online optimization. Moreover, the integrated Vision-Language Model provides high-level semantic guidance that steers exploration beyond purely geometric cues. Extensive experiments on both scene-level and object-level benchmarks (Replica and OmniObject3D) demonstrate that AREA3D achieves state-of-the-art reconstruction accuracy, especially in sparse views.
PaperID: 867,   Poster  https://arxiv.org/pdf/2603.01000     GitHub
Authors: Li Yuze, Dong Gong, Xiao Cao, Junchao Yuan, Dongsheng Li, Lei Zhou, Yun Sing Koh, Cheng Yan, Xinyu Zhang
Title: Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer
Abstract: Motion transfer has emerged as a promising direction for controllable video generation, yet existing methods largely focus on single-object scenarios and struggle when multiple objects require distinct motion patterns. In this work, we present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that explicitly enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects, supporting flexible recombination and arbitrary motion-to-object mappings. To address the core challenge of cross-object motion entanglement, we introduce a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention, ensuring that motion and text tokens only influence their designated regions. We further propose a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and progressively propagates them across frames efficiently. Extensive experiments demonstrate that FlexiMMT achieves precise, compositional, and state-of-the-art performance in I2V-based multi-object multi-motion transfer. All code, models, and benchmarks will be publicly available.
PaperID: 868,   Poster  https://arxiv.org/pdf/2603.19609     GitHub
Authors: Shuaibang Peng, Juelin Zhu, Xia Li, Kun Yang, Yu Liu, Maojun Zhang, Shen Yan
Title: LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
Abstract: We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD-Loc v2 [89] achieves localization through semantic building silhouette alignment with low-detail city models, it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc, the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise instance-level building annotations. This enables trained models to exhibit remarkable zero-shot generalization capability. Second, we reformulate the localization paradigm by shifting from semantic to instance-level silhouette alignment, which significantly reduces pose estimation ambiguity in dense scenes. Extensive experiments demonstrate that LoD-Loc v3 outperforms existing state-of-the-art (SOTA) baselines, achieving superior performance in both cross-scene and dense urban scenarios by a large margin.
PaperID: 869,   Poster  https://arxiv.org/pdf/2604.12221     GitHub
Authors: Qingyuan Cai, Saihui Hou, Xuecai Hu, Yongzhen Huang
Title: BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition
Abstract: Gait recognition, as a reliable biometric technology, has seen rapid development in recent years, yet it faces significant challenges caused by diverse clothing styles in the real world. This paper introduces BarbieGait, a synthetic gait dataset where real-world subjects are uniquely mapped into a virtual engine to simulate extensive clothing changes while preserving their gait identity information. As a pioneering work, BarbieGait provides a controllable gait data generation method, enabling the production of large datasets to validate cross-clothing issues that are difficult to verify with real-world data. However, the diversity of clothing increases intra-class variance and poses one of the biggest challenges to learning cloth-invariant features under varying clothing conditions. Therefore, we propose GaitCLIF (Gait-oriented CLoth-Invariant Feature) as a robust baseline model for cross-clothing gait recognition. Through extensive experiments, we validate that our method significantly improves cross-clothing performance on BarbieGait and the existing popular gait benchmarks. We believe that BarbieGait, with its extensive cross-clothing gait data, will further advance the capabilities of gait recognition in cross-clothing scenarios and promote progress in related research. The dataset, code, and models will be released upon acceptance.
PaperID: 870,   Poster  https://arxiv.org/pdf/2601.06874     GitHub
Authors: Changli Wu, Haodong Wang, Jiayi Ji, Yutian Yao, Chunsai Du, Jihua Kang, Yanwei Fu, Liujuan Cao
Title: MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation
Abstract: Most existing 3D referring expression segmentation (3DRES) methods rely on dense, high-quality point clouds, while real-world agents such as robots and mobile phones operate with only a few sparse RGB views and strict latency constraints. We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. Traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation, often yield low-quality geometry, produce coarse or degraded target regions, and run slowly. We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. Training in this setting exposes a critical optimization barrier, termed Foreground Gradient Dilution (FGD), where sparse 3D signals lead to weak supervision. To resolve this, we introduce Per-view No-target Suppression Optimization (PVSO), which provides stronger and more balanced gradients across views, enabling stable and efficient learning. To support consistent evaluation, we build MVRefer, a benchmark that defines standardized settings and metrics for MV-3DRES. Experiments show that MVGGT establishes the first strong baseline and achieves both high accuracy and fast inference, outperforming existing alternatives.
PaperID: 871,   Poster  https://arxiv.org/pdf/2512.19486     GitHub
Authors: Shaochen Bi, Yuting He, Weiming Wang, Hao Chen
Title: Dynamic Stream Network for Combinatorial Explosion Problem in Deformable Medical Image Registration
Abstract: The combinatorial explosion problem caused by dual inputs presents a critical challenge in Deformable Medical Image Registration (DMIR). Since DMIR processes two images simultaneously as input, the number of feature combinations grows exponentially, so the model must consider more interfering features during feature modeling. Introducing dynamics in the receptive fields and weights of the network enables the model to eliminate interfering feature combinations and model the potential feature combination relationships. In this paper, we propose the Dynamic Stream Network (DySNet), which enables the receptive fields and weights to be dynamically adjusted. This ultimately enables the model to ignore interfering feature combinations and model the potential feature relationships. DySNet has two key innovations: 1) the Adaptive Stream Basin (AdSB) module dynamically adjusts the shape of the receptive field, thereby enabling the model to focus on the feature relationships with greater correlation; 2) the Dynamic Stream Attention (DySA) mechanism generates dynamic weights to search for more valuable feature relationships. Extensive experiments have shown that DySNet consistently outperforms the most advanced DMIR methods, highlighting its outstanding generalization ability. Our code will be released on the website.
PaperID: 872,   Poster  https://arxiv.org/pdf/2508.07341     GitHub
Authors: Fangtai Wu, Mushui Liu, Weijie He, Zhao Wang, Yunlong Yu
Title: DCoAR: Deep Concept Injection into Unified Autoregressive Models for Personalized Text-to-Image Generation
Abstract: The unified autoregressive (AR) model excels at multimodal understanding and generation. However, its full potential in the domain of customized image generation has yet to be fully realized. Existing customization approaches for unified AR models face a fundamental dilemma: adaptation-based methods suffer from overfitting and scalability bottlenecks, while concept-injection paradigms are constrained by a shallow injection strategy that leads to poor visual fidelity and impaired re-contextualization. To address this, we propose DCoAR, a novel deep concept injection framework that maintains a completely frozen pre-trained model. DCoAR deeply integrates new concepts through a Layer-wise Multimodal Context Learning (LMCL) strategy, which is stabilized by a multi-faceted regularization scheme: a Dual Prior Preservation (DPP) loss to mitigate semantic drift and a Context-Aware Self-Regularization (CASR) loss to enhance re-contextualization. The framework also enables training-free subject customization in user-provided styles. Experiments demonstrate that DCoAR significantly outperforms previous injection-based methods and achieves performance competitive with adaptation-based approaches while requiring substantially fewer trainable parameters.
PaperID: 873,   Poster  https://arxiv.org/pdf/2512.15524     GitHub
Authors: Yuxiang Shi, Zhe Li, Yanwen Wang, Hao Zhu, Xun Cao, Ligang Liu
Title: DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations
Abstract: Portrait animation from a single source image and a driving video is a longstanding problem. Recent approaches tend to adopt diffusion-based image/video generation models for realistic and expressive animation. However, none of these diffusion models realizes high-fidelity disentangled control between the head pose and facial expression, hindering applications like expression-only or pose-only editing and animation. To address this, we propose DeX-Portrait, a novel approach capable of generating expressive portrait animation driven by disentangled pose and expression signals. Specifically, we represent the pose as an explicit global transformation and the expression as an implicit latent code. First, we design a powerful motion trainer to learn both pose and expression encoders for extracting precise and decomposed driving signals. Then we propose to inject the pose transformation into the diffusion model through a dual-branch conditioning mechanism, and the expression latent through cross attention. Finally, we design a progressive hybrid classifier-free guidance for more faithful identity consistency. Experiments show that our method outperforms state-of-the-art baselines on both animation quality and disentangled controllability.
PaperID: 874,   Poster  https://arxiv.org/pdf/2604.09480     GitHub
Authors: Shunkai Zhou, Zike Yan, Fei Xue, Dong Wu, Yuchen Deng, Hongbin Zha
Title: Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model
Abstract: We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To address the lack of ground truth and the need for high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy that enforces local and global consistency constraints on predictions. The local consistency constraints are applied to intermediate results and previously locally fused results, enabling the model to be trained with high-quality pseudo ground-truth signals; the global consistency constraints are applied to sparse keyframes spanning long distances rather than to every frame, allowing the model to learn efficiently from a consistent prediction over a long trajectory. Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks.
PaperID: 875,   Poster  https://arxiv.org/pdf/2512.12751     GitHub
Authors: Zhenya Yang, Zhe Liu, Yuxiang Lu, Liping Hou, Chenxuan Miao, Peng Siyi, Bailan Feng, Xiang Bai, Hengshuang Zhao
Title: GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation
Abstract: A physics-aware driving world model is essential for drive planning, out-of-distribution data synthesis, and closed-loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We further introduce Mutual Control Attention (MCA) to accurately model the influence of control on occupancy evolution, and we jointly train the VAE and the subsequent prediction module in an end-to-end manner to maximize forecasting accuracy. Together, these designs yield a 7.2% improvement in forecasting mIoU at an inference speed of 41 FPS, while using only 3.47M parameters. Additionally, a Normalized Multi-View Attention is introduced in the video generation model to generate multi-view driving videos with guidance from our 4D occupancy, significantly improving video quality with a 20.7% reduction in FVD. Experiments demonstrate that GenieDrive enables highly controllable, multi-view consistent, and physics-aware driving video generation.
PaperID: 876,   Poster  https://arxiv.org/pdf/2509.22496     GitHub
Authors: Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu, Qunli Zhang, Laiyuan Wang, Hua Zhang, Xiaochun Cao
Title: Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs.
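Note: the greedy search over image regions mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not define the insight or necessity scores, so the definitions below (score of the kept regions, and score drop when those regions are removed), the additive objective, the toy `score` function, and all names are assumptions made for illustration, not the authors' formulation.

```python
from typing import Callable, FrozenSet, List

def greedy_attribution(
    n_regions: int,
    score: Callable[[FrozenSet[int]], float],
    budget: int,
) -> List[int]:
    """Greedily rank regions by an assumed sufficiency + necessity objective.

    `score(visible)` is a hypothetical black-box call returning the model's
    confidence in the target token when only `visible` regions are shown.
    """
    all_regions = frozenset(range(n_regions))
    full = score(all_regions)
    chosen: FrozenSet[int] = frozenset()
    ranking: List[int] = []
    for _ in range(budget):
        def objective(region: int) -> float:
            candidate = chosen | {region}
            insight = score(candidate)                         # assumed sufficiency term
            necessity = full - score(all_regions - candidate)  # assumed indispensability term
            return insight + necessity
        best = max(sorted(all_regions - chosen), key=objective)
        ranking.append(best)
        chosen = chosen | {best}
    return ranking

# Toy black-box model: confidence is the summed evidence of visible regions.
EVIDENCE = [0.10, 0.60, 0.05, 0.25]

def toy_score(visible: FrozenSet[int]) -> float:
    return sum(EVIDENCE[r] for r in visible)

ranking = greedy_attribution(len(EVIDENCE), toy_score, budget=2)
print(ranking)
```

Because the procedure only queries `score`, it needs no gradients or internal activations, which is the sense in which such a search is black-box.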
PaperID: 877,   Poster  https://arxiv.org/pdf/2603.23227     GitHub
Authors: Qinglun Zhang, Shen Cheng, Tian Dan, Haoqiang Fan, Guanghui Liu, Shuaicheng Liu
Title: Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation
Abstract: While existing equivariant methods enhance data efficiency, they suffer from high computational intensity, reliance on single-modality inputs, and instability when combined with fast-sampling methods. In this work, we propose E3Flow, a novel framework that addresses the critical limitations of equivariant diffusion policies. E3Flow overcomes these challenges, successfully unifying efficient rectified flow with stable, multi-modal equivariant learning for the first time. Our framework is built upon spherical harmonic representations to ensure rigorous SO(3) equivariance. We introduce a novel invariant Feature Enhancement Module (FEM) that dynamically fuses hybrid visual modalities (point clouds and images), injecting rich visual cues into the spherical harmonic features. We evaluate E3Flow on 8 manipulation tasks from the MimicGen benchmark and further conduct 4 real-world experiments to validate its effectiveness in physical environments. Simulation results show that E3Flow achieves a 3.12% improvement in average success rate over the state-of-the-art Spherical Diffusion Policy (SDP) while simultaneously delivering a 7× inference speedup. E3Flow thus demonstrates a new and highly effective trade-off between performance, efficiency, and data efficiency for robotic policy learning. Code and videos will be released.
PaperID: 878,   Poster  https://arxiv.org/pdf/2602.19418     GitHub
Authors: Hefei Mei, Zirui Wang, Chang Xu, Jianyuan Guo, Minjing Dong
Title: PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention
Abstract: Large Vision–Language Models (LVLMs) are foundational to modern multimodal applications, yet their susceptibility to adversarial attacks remains a critical concern. Prior white-box attacks rarely generalize across tasks, and black-box methods depend on expensive transfer, which limits efficiency. The vision encoder, standardized and often shared across LVLMs, provides a stable gray-box pivot with strong cross-model transfer. Building on this premise, we introduce PA-Attack (Prototype-Anchored Attentive Attack). PA-Attack begins with a prototype-anchored guidance that provides a stable attack direction towards a general and dissimilar prototype, tackling the attribute-restricted issue and limited task generalization of vanilla attacks. Building on this, we propose a two-stage attention enhancement mechanism: (i) leverage token-level attention scores to concentrate perturbations on critical visual tokens, and (ii) adaptively recalibrate attention weights to track the evolving attention during the adversarial process. Extensive experiments across diverse downstream tasks and LVLM architectures show that PA-Attack achieves an average 75.1% score reduction rate (SRR), demonstrating strong attack effectiveness, efficiency, and task generalization in LVLMs.
PaperID: 879,   Poster  https://arxiv.org/pdf/2603.16616     GitHub
Authors: Weiqin Jiao, Hao Cheng, George Vosselman, Claudio Persello
Title: ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery
Abstract: We tackle the problem of generating a complete vector map representation from aerial imagery in a single run: producing polygons for all land-cover classes with shared boundaries and no gaps or overlaps. Existing polygonization methods are typically class-specific; extending them to multiple classes via per-class runs commonly leads to topological inconsistencies, such as duplicated edges, gaps, and overlaps. We formalize this new task as All-Class Polygonal Vectorization (ACPV) and release the first public benchmark, Deventer-512, with standardized metrics jointly evaluating semantic fidelity, geometric accuracy, vertex efficiency, per-class topological fidelity and global topological consistency. To realize ACPV, we propose ACPV-Net, a unified framework introducing a novel Semantically Supervised Conditioning (SSC) mechanism coupling semantic perception with geometric primitive generation, along with a topological reconstruction that guarantees shared-edge consistency by design. While enforcing such strict topological constraints, ACPV-Net surpasses all class-specific baselines in polygon quality across classes on Deventer-512, e.g., compared to TopDiG on vegetation, +9.9 IoU (semantic fidelity), -45% PoLiS (geometric error), -59% N-ratio (vertex redundancy). It also applies to single-class polygonal vectorization without any architectural modification, achieving the best-reported results on WHU-Building. Data, code, and models will be publicly released.
PaperID: 880,   Poster  https://arxiv.org/pdf/2603.12912     GitHub
Authors: Xin Xu, Weilong Li, Wei Liu, Wenke Huang, Zhixi Yu, Bin Yang, Xiaoying Liao, Kui Jiang
Title: FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts
Abstract: Federated Domain Generalization for Person Re-Identification (FedDG-ReID) aims to learn domain-invariant representations from decentralized data. Although Vision Transformers (ViTs) are widely adopted, their global attention often fails to distinguish pedestrians from high-similarity backgrounds or diverse viewpoints, a challenge further amplified by cross-client distribution shifts in FedDG-ReID. To address this, we propose Federated Body Distribution Aware Visual Prompt (FedBPrompt), which introduces learnable visual prompts to explicitly guide Transformer attention toward pedestrian-centric regions. FedBPrompt employs a Body Distribution Aware Visual Prompts Mechanism (BAPM) that divides prompts into two groups: Holistic Full Body Prompts suppress cross-client background noise, while Body Part Alignment Prompts capture fine-grained details robust to pose and viewpoint variations. To mitigate the high communication cost of large Transformer models, we further design a Prompt-based Fine-Tuning Strategy (PFTS) that freezes the ViT backbone and updates only lightweight prompts, significantly reducing communication overhead while maintaining adaptability. Extensive experiments demonstrate that BAPM effectively enhances feature discrimination and cross-domain generalization, while PFTS achieves notable performance gains within only a few aggregation rounds. Moreover, both BAPM and PFTS can be easily integrated into existing ViT-based FedDG-ReID frameworks, making FedBPrompt a flexible and effective solution for federated person re-identification. The code will be released.
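Note: a minimal sketch of why exchanging only prompts is cheap, assuming the server combines client prompts by a FedAvg-style weighted average. The abstract states only that the backbone is frozen and the lightweight prompts are updated; the aggregation rule, the function name, and the toy values below are illustrative assumptions.

```python
from typing import List, Sequence

def aggregate_prompts(
    client_prompts: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Weighted average of flattened prompt vectors (FedAvg-style, assumed).

    Only these vectors travel between clients and server; the frozen
    backbone weights are never communicated.
    """
    total = sum(client_sizes)
    dim = len(client_prompts[0])
    return [
        sum(prompt[d] * size for prompt, size in zip(client_prompts, client_sizes)) / total
        for d in range(dim)
    ]

# Two hypothetical clients holding 1 and 3 local samples respectively.
global_prompt = aggregate_prompts([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_prompt)
```

Per-round traffic in this sketch scales with the prompt length rather than with the size of the ViT backbone.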
PaperID: 881,   Poster  https://arxiv.org/pdf/2603.17809     GitHub
Authors: Ziwei Xiang, Fanhu Zeng, Hongjian Fang, Rui-Qi Wang, Renxing Chen, Yanan Zhu, Yi Chen, Peipei Yang, Xu-Yao Zhang
Title: Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients
Abstract: Large Vision Language Models (LVLMs) have achieved remarkable success in a wide range of downstream tasks that require multimodal interaction, but their powerful capabilities come with substantial computational and memory overhead, which hinders practical deployment. Among numerous acceleration techniques, post-training quantization is a popular and effective strategy for reducing memory cost and accelerating inference. However, existing LVLM quantization methods typically measure token sensitivity at the modality level, which fails to capture the rich and complex cross-token interactions and falls short in quantitatively measuring the quantization error at the token level. As tokens interact within the model, the distinction between modalities gradually diminishes, suggesting the need for fine-grained calibration. Inspired by axiomatic attribution in mechanistic interpretability, we introduce a fine-grained quantization strategy based on Quantization-aware Integrated Gradients (QIG), which leverages integrated gradients to quantitatively evaluate token sensitivity and push the granularity from modality level to token level, reflecting both inter-modality and intra-modality dynamics. Extensive experiments on multiple LVLMs under both W4A8 and W3A16 settings show that our method consistently improves accuracy across models and benchmarks with negligible latency overhead. For example, under 3-bit weight-only quantization, our method improves the average accuracy of LLaVA-onevision-7B by 1.60%, reducing the gap to its full-precision counterpart to only 1.33%. The code will be released upon acceptance.
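Note: for readers unfamiliar with integrated gradients, the sketch below shows the standard attribution rule on a toy function. It is plain integrated gradients, not the authors' QIG; the quadratic `toy_error` stands in for a quantization-error measure, each "token" is reduced to a single scalar, and all names are hypothetical.

```python
from typing import Callable, List

def integrated_gradients(
    grad_fn: Callable[[List[float]], List[float]],
    x: List[float],
    baseline: List[float],
    steps: int = 200,
) -> List[float]:
    """Approximate integrated gradients with a midpoint Riemann sum.

    attribution_i = (x_i - baseline_i) * mean over the straight-line path
    from baseline to x of the partial derivative with respect to input i.
    """
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad_fn(point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

# Toy stand-in for a quantization error: f(x) = sum_i w_i * x_i^2.
WEIGHTS = [0.5, 2.0, 1.0]

def toy_error(x: List[float]) -> float:
    return sum(w * xi * xi for w, xi in zip(WEIGHTS, x))

def toy_error_grad(x: List[float]) -> List[float]:
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]

tokens = [1.0, -1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_error_grad, tokens, baseline)

# Completeness axiom: attributions sum to f(x) - f(baseline).
print(attributions, sum(attributions), toy_error(tokens) - toy_error(baseline))
```

The completeness property checked in the last line is what makes the per-token scores comparable across tokens, whichever modality they came from.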
PaperID: 882,   Poster  https://arxiv.org/pdf/2512.02697     GitHub
Authors: Zixuan Song, Jing Zhang, Di Wang, Zidie Zhou, Wenbin Liu, Haonan Guo, En Wang, Bo Du
Title: GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization
Abstract: Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. In support of this task, we construct GeoLoc, the first large-scale, cross-modal, and multi-view aligned dataset comprising over 50,000 pairs of drone, street-view panorama, and satellite images as well as their textual descriptions collected from 36 countries, ensuring both geographic and semantic alignment. We performed broad evaluations across multiple tasks. Experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. The dataset, source code, and pretrained models will be released.
PaperID: 883,   Poster  https://arxiv.org/pdf/2303.08250     GitHub
Authors: Chinmay Savadikar, Michelle Dai, Tianfu Wu
Title: Continual Learning by Reuse, New, Adapt and Skip: A Hierarchical Exploration-Exploitation Approach
Abstract: To effectively manage the complexities of real-world dynamic environments, continual learning must incrementally acquire, update, and accumulate knowledge from a stream of tasks of different nature without suffering from catastrophic forgetting of prior knowledge. While this capability is innate to human cognition, it remains a significant challenge for modern deep learning systems. At the heart of this challenge lies the stability-plasticity dilemma: the need to balance leveraging prior knowledge, integrating novel information, and allocating model capacity adaptively based on task complexity and synergy. In this paper, we propose a novel exemplar-free class-incremental continual learning (ExfCCL) framework that addresses these issues through a Hierarchical Exploration-Exploitation (HEE) approach. The core of our method is a HEE-guided efficient neural architecture search (HEE-NAS) that enables a learning-to-adapt backbone via four primitive operations (reuse, new, adapt, and skip), thereby serving as an internal memory that dynamically updates selected components across streaming tasks. To address the task ID inference problem in ExfCCL, we exploit an external memory of task centroids proposed in the prior art. We term our method CHEEM (Continual Hierarchical-Exploration-Exploitation Memory). CHEEM is evaluated on the challenging MTIL and VDD benchmarks using both Tiny and Base Vision Transformers and a proposed holistic Figure-of-Merit (FoM) metric. It significantly outperforms state-of-the-art prompting-based continual learning methods, closely approaching full fine-tuning upper bounds. Furthermore, it learns adaptive model structures tailored to individual tasks in a semantically meaningful way.
PaperID: 884,   Poster  https://arxiv.org/pdf/2603.24295     GitHub
Authors: Kai Zhu, Zhenyu Cui, Zehua Zang, Jiahuan Zhou
Title: RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation
Abstract: Recently, state space models have demonstrated efficient video segmentation through linear-complexity state space compression. However, Video Semantic Segmentation requires pixel-level spatiotemporal modeling capabilities to maintain temporal consistency in segmentation of semantic objects. While state space models can preserve common semantic information during state space compression, the fixed-size state space inevitably forgets specific information, which limits the models' capability for pixel-level segmentation. To tackle the above issue, we propose a Refining Specifics State Space Model approach (RS-SSM) for video semantic segmentation, which performs complementary refining of forgotten spatiotemporal specifics. Specifically, a Channel-wise Amplitude Perceptron (CwAP) is designed to extract and align the distribution characteristics of specific information in the state space. Besides, a Forgetting Gate Information Refiner (FGIR) is proposed to adaptively invert and refine the forgetting gate matrix in the state space model based on the specific information distribution. Consequently, our RS-SSM leverages the inverted forgetting gate to complementarily refine the specific information forgotten during state space compression, thereby enhancing the model's capability for spatiotemporal pixel-level segmentation. Extensive experiments on four VSS benchmarks demonstrate that our RS-SSM achieves state-of-the-art performance while maintaining high computational efficiency.
PaperID: 885,   Poster  https://arxiv.org/pdf/2602.19497     GitHub
Authors: Mingrui Wu, Hang Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji
Title: MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models
Abstract: Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related images, existing benchmarks rarely address the challenges of multi-image context generation, focusing mainly on text-to-image or single-image editing tasks. In this work, we introduce MICON-Bench, a comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation. We further propose an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency, where a multimodal large language model (MLLM) serves as the verifier. Additionally, we present Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that dynamically adjusts attention during inference to enhance coherence and reduce hallucinations. Extensive experiments on various state-of-the-art open-source models demonstrate both the rigor of MICON-Bench in exposing multi-image reasoning challenges and the efficacy of DAR in improving generation quality and cross-image coherence.
PaperID: 886,   Poster  https://arxiv.org/pdf/2604.09088     GitHub
Authors: Yutong Zhang, Jiaxin Chen, Honglin Chen, Kaiqi Zheng, Shengcai Liao, Hanwen Zhong, Weixin Li, Yunhong Wang
Title: Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation
Abstract: Memory-efficient transfer learning (METL) approaches have recently achieved promising performance in adapting pre-trained models to downstream tasks. They avoid applying gradient backpropagation in large backbones, thus significantly reducing the number of trainable parameters and the high memory consumption during fine-tuning. However, since they typically employ a lightweight and learnable side network, these methods inevitably introduce additional memory and time overhead during inference, which contradicts the ultimate goal of efficient transfer learning. To address the above issue, we propose a novel approach dubbed Masked Dual Path Distillation (MDPD) to accelerate inference while retaining parameter and memory efficiency in fine-tuning with fading side networks. Specifically, MDPD develops a framework that enhances the performance by mutually distilling the frozen backbones and learnable side networks in fine-tuning, and discards the side network during inference without sacrificing accuracy. Moreover, we design a novel feature-based knowledge distillation method for the encoder structure with multiple layers. Extensive experiments on distinct backbones across vision/language-only and vision-and-language tasks demonstrate that our method not only accelerates inference by at least 25.2% while keeping parameter and memory consumption comparable, but also remarkably improves accuracy compared to SOTA approaches. The source code will be released upon acceptance.
PaperID: 887,   Poster  https://arxiv.org/pdf/2603.11492     GitHub
Authors: Xiaogang Du, Jiawei Zhang, Tongfei Liu, Tao Lei, Yingbo Wang
Title: SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation
Abstract: In medical image segmentation tasks, the domain gap caused by the difference in data collection between training and testing data seriously hinders the deployment of pre-trained models in clinical practice. Continual Test-Time Adaptation (CTTA) aims to enable pre-trained models to adapt to continuously changing unlabeled domains, providing an effective approach to solving this problem. However, existing CTTA methods often rely on unreliable supervisory signals, igniting a self-reinforcing cycle of error accumulation that culminates in catastrophic performance degradation. To overcome these challenges, we propose a CTTA method via Semantic-Prompt-Enhanced Graph Clustering (SPEGC) for medical image segmentation. First, we design a semantic prompt feature enhancement mechanism that utilizes decoupled commonality and heterogeneity prompt pools to inject global contextual information into local features, alleviating their susceptibility to noise interference under domain shift. Second, based on these enhanced features, we design a differentiable graph clustering solver. This solver reframes graph partitioning as an optimal transport problem, allowing it to distill a raw similarity matrix into a refined and high-order structural representation in an end-to-end manner. Finally, this robust structural representation is used to guide model adaptation, ensuring predictions are consistent at the cluster level and dynamically adjusting decision boundaries. Extensive experiments demonstrate that SPEGC outperforms other state-of-the-art CTTA methods on two medical image segmentation benchmarks. The source code will be released.
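Note: the idea of treating cluster assignment as optimal transport can be sketched with the standard Sinkhorn iteration for entropy-regularized transport. This is a generic solver swapped in for illustration; the abstract does not say which solver SPEGC uses, and the cost matrix, marginals, and names below are assumptions.

```python
import math
from typing import List, Sequence

def sinkhorn(
    cost: Sequence[Sequence[float]],
    row_marg: Sequence[float],
    col_marg: Sequence[float],
    eps: float = 0.1,
    iters: int = 500,
) -> List[List[float]]:
    """Entropy-regularized optimal transport via Sinkhorn scaling.

    Returns a plan whose rows (nodes) and columns (clusters) match the
    requested marginals; every step is differentiable in `cost`.
    """
    n, m = len(row_marg), len(col_marg)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [row_marg[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_marg[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy problem: 4 nodes, 2 clusters, cost = 1 - similarity to each cluster.
cost = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
plan = sinkhorn(cost, row_marg=[0.25] * 4, col_marg=[0.5, 0.5])
assignments = [max(range(2), key=lambda j: row[j]) for row in plan]
print(assignments)
```

The column marginals force balanced clusters, which is one common reason to prefer a transport formulation over independent per-node argmax assignment.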
PaperID: 888,   Poster  https://arxiv.org/pdf/2512.19683     GitHub
Authors: Mingrui Wu, Zhaozhi Wang, Fangjinhua Wang, Jiaolong Yang, Marc Pollefeys, Tong Zhang
Title: From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs
Abstract: While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence, crucial for robust and grounded AI systems, remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum, from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.
PaperID: 889,   Poster  https://arxiv.org/pdf/2512.08500     GitHub
Authors: Jianan Li, Xiao Chen, Tao Huang, Tien-Tsin Wong
Title: Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions
Abstract: Video data is more cost-effective than motion capture data for learning 3D character motion controllers, yet synthesizing realistic and diverse behaviors directly from videos remains challenging. Previous approaches typically rely on off-the-shelf motion reconstruction techniques to obtain 3D trajectories for physics-based imitation. These reconstruction methods struggle with generalizability, as they either require 3D training data (potentially scarce) or fail to produce physically plausible poses, hindering their application to challenging scenarios like human-object interaction (HOI) or non-human characters. We tackle this challenge by introducing Mimic2DM, a novel motion imitation framework that learns the control policy directly and solely from widely available 2D keypoint trajectories extracted from videos. By minimizing the reprojection error, we train a general single-view 2D motion tracking policy capable of following arbitrary 2D reference motions in physics simulation, using only 2D motion data. The policy, when trained on diverse 2D motions captured from different or slightly different viewpoints, can further acquire 3D motion tracking capabilities by aggregating multiple views. Moreover, we develop a transformer-based autoregressive 2D motion generator and integrate it into a hierarchical control framework, where the generator produces high-quality 2D reference trajectories to guide the tracking policy. We show that the proposed approach is versatile and can effectively learn to synthesize physically plausible and diverse motions across a range of domains, including dancing, soccer dribbling, and animal movements, without any reliance on explicit 3D motion data.
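Note: the reprojection error named in this abstract can be illustrated with a minimal sketch that projects simulated 3D joints through a pinhole camera and compares them with 2D reference keypoints. The camera model (at the origin, looking along +z, with a single focal length), the mean-squared form of the error, and all names are assumptions; the abstract does not specify them.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]

def project(joints_3d: Sequence[Point3D], focal: float = 1.0) -> List[Point2D]:
    """Pinhole projection for a camera at the origin looking along +z."""
    return [(focal * x / z, focal * y / z) for x, y, z in joints_3d]

def reprojection_error(
    joints_3d: Sequence[Point3D],
    keypoints_2d: Sequence[Point2D],
    focal: float = 1.0,
) -> float:
    """Mean squared distance between projected joints and 2D references."""
    projected = project(joints_3d, focal)
    squared = [
        (u - u_ref) ** 2 + (v - v_ref) ** 2
        for (u, v), (u_ref, v_ref) in zip(projected, keypoints_2d)
    ]
    return sum(squared) / len(squared)

# Two simulated joints, both 2 units in front of the camera.
simulated = [(0.0, 0.0, 2.0), (1.0, 1.0, 2.0)]
matching_reference = [(0.0, 0.0), (0.5, 0.5)]
shifted_reference = [(0.0, 0.0), (0.5, 1.5)]

print(reprojection_error(simulated, matching_reference))
print(reprojection_error(simulated, shifted_reference))
```

A tracking reward could be any decreasing function of this error; because depth along the viewing ray is unconstrained by a single view, combining several viewpoints is what pins down the 3D pose.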
PaperID: 890,   Poster  https://arxiv.org/pdf/2603.26362     GitHub
Authors: MD Khalequzzaman Chowdhury Sayem, Mubarrat Chowdhury, Yihalem Tiruneh, Muneeb Ahmed Khan, Muhammad Salman Ali, Binod Bhattarai, Seungryul Baek
Title: HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models
Abstract: Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human–AI interaction. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially in interpreting complex, articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs' understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek and Qwen-VL) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes these critical reasoning gaps but also provides a validated path to improvement. We demonstrate that the 3D-grounded spatial knowledge learned from our benchmark transfers in a zero-shot setting, significantly improving model accuracy on novel downstream tasks like hand gesture recognition (+10.33%) and hand-object interaction (+2.63%). Code and dataset will be released upon acceptance.
PaperID: 891,   Poster  https://arxiv.org/pdf/2603.10125     GitHub
Authors: Jin Lyu, Liang An, Pujin Cheng, Yebin Liu, Xiaoying Tang
Title: 4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video
Abstract: 4D reconstruction of the equine family (e.g., horses) from monocular video is important for animal welfare. Previous mainstream 4D animal reconstruction methods require joint optimization of motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observation. In this work, we propose a novel framework called 4DEquine by disentangling the 4D reconstruction problem into two sub-problems: dynamic motion reconstruction and static appearance reconstruction. For motion, we introduce a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth and pixel-aligned pose and shape sequences from video. For appearance, we design a novel feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as few as a single image. To assist training, we create a large-scale synthetic motion dataset, VarenPoser, which features high-quality surface motions and diverse camera trajectories, as well as a synthetic appearance dataset, VarenTex, comprising realistic multi-view images generated through multi-view diffusion. While training only on synthetic datasets, 4DEquine achieves state-of-the-art performance on real-world APT36K and AiM datasets, demonstrating the superiority of 4DEquine and our new datasets for both geometry and appearance reconstruction. Comprehensive ablation studies validate the effectiveness of both the motion and appearance reconstruction networks. Code and data will be released.
PaperID: 892,   Poster  https://arxiv.org/pdf/2603.08147     GitHub
Authors: Hunor Laczko, Libang Jia, Phat Truong, Diego Hernández, Sergio Escalera, Jordi Gonzàlez, Meysam Madadi
Title: MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data
Abstract: Existing 4D human datasets fall short for fashion-specific research, lacking either realistic garment dynamics or task-specific annotations. Synthetic datasets suffer from a realism gap, whereas real-world captures lack the detailed annotations and paired data required for virtual try-on (VTON) and size estimation tasks. To bridge this gap, we introduce MV-Fashion, a large-scale, multi-view video dataset engineered for domain-specific fashion analysis. MV-Fashion features 3,273 sequences (72.5 million frames) from 80 diverse subjects wearing 3-10 outfits each. It is designed to capture complex, real-world garment dynamics, including multiple layers and varied styling (e.g., tucked shirts, rolled sleeves). A core contribution is a rich data representation that includes pixel-level semantic annotations, ground-truth material properties like elasticity, and 3D point clouds. Crucially for VTON applications, MV-Fashion provides paired data: multi-view synchronized captures of worn garments alongside their corresponding flat, catalogue images. We leverage this dataset to establish baselines for fashion-centric tasks, including virtual try-on, clothing size estimation, and novel view synthesis.
PaperID: 893,   Poster  https://arxiv.org/pdf/2512.06750     GitHub
Authors: Weiqi Li, Xuanyu Zhang, Bin Chen, Jingfen Xie, Yan Wang, Kexin Zhang, Junlin Li, Li Zhang, Jian Zhang, Shijie Zhao
Title: UARE: A Unified Vision-Language Model for Image Quality Assessment, Restoration, and Enhancement
Abstract: Image quality assessment (IQA) and image restoration are fundamental problems in low-level vision. Although IQA and restoration are closely connected conceptually, most existing work treats them in isolation. Recent advances in unified multimodal understanding-generation models demonstrate promising results and indicate that stronger understanding can improve generative performance. This motivates a single model that unifies IQA and restoration and explicitly studies how IQA can guide restoration, a setting that remains largely underexplored yet highly valuable. In this paper, we propose UARE, to our knowledge the first Unified vision-language model for image quality Assessment, Restoration, and Enhancement. Built on pretrained unified understanding and generation models, we introduce a two-stage training framework. First, a progressive, easy-to-hard schedule expands from single-type distortions to higher-order mixed degradations, enabling UARE to handle multiple degradations. Second, we perform unified fine-tuning of quality understanding and restoration with interleaved text-image data, aligning IQA signals with restoration objectives. Through multi-task co-training, UARE leverages IQA to boost restoration and enhancement performance. Extensive experiments across IQA, restoration, and enhancement tasks demonstrate the effectiveness of UARE. The code and models will be made available.
PaperID: 894,   Poster  https://arxiv.org/pdf/2603.23030     GitHub
Authors: ByeongCheol Lee, Hyun Seok Seong, Sangeek Hyun, Gilhan Park, WonJun Moon, Jae-Pil Heo
Title: Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation
Abstract: A sliding-window inference strategy is commonly adopted in recent training-free open-vocabulary semantic segmentation methods to overcome the limitations of CLIP in processing high-resolution images. However, this approach introduces a new challenge: each window is processed independently, leading to semantic discrepancy across windows. To address this issue, we propose Global-Local Aligned CLIP (GLA-CLIP), a framework that facilitates comprehensive information exchange across windows. Rather than limiting attention to tokens within individual windows, GLA-CLIP extends key-value tokens to incorporate contextual cues from all windows. Nevertheless, we observe a window bias: outer-window tokens are less likely to be attended to, since query features are produced through interactions within the inner-window patches and thus lack semantic grounding beyond their local context. To mitigate this, we introduce a proxy anchor, constructed by aggregating tokens highly similar to the given query from all windows, which provides a unified semantic reference for measuring similarity across both inner- and outer-window patches. Furthermore, we propose a dynamic normalization scheme that adjusts attention strength according to object scale by dynamically scaling and thresholding the attention map to cope with small-object scenarios. Moreover, GLA-CLIP can be plugged into existing methods to broaden their receptive field. Extensive experiments validate the effectiveness of GLA-CLIP in enhancing training-free open-vocabulary semantic segmentation performance.
PaperID: 895,   Poster  https://arxiv.org/pdf/2512.04784     GitHub
Authors: Bowen Ping, Chengyou Jia, Minnan Luo, Changliang Xia, Xin Shen, Zhuohang Dang, Hangwei Qian
Title: PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling
Abstract: Consistent image generation requires faithfully preserving identities, styles, and logical coherence across multiple images, which is essential for applications such as storytelling and character design. Supervised training approaches struggle with this task due to the lack of large-scale datasets capturing visual consistency and the complexity of modeling human perceptual preferences. In this paper, we argue that reinforcement learning (RL) offers a promising alternative by enabling models to learn complex and subjective visual criteria in a data-free manner. To achieve this, we introduce PaCo-RL, a comprehensive framework that combines a specialized consistency reward model with an efficient RL algorithm. The first component, PaCo-Reward, is a pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing. It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and CoT reasons. The second component, PaCo-GRPO, leverages a novel resolution-decoupled optimization strategy to substantially reduce RL cost, alongside a log-tamed multi-reward aggregation mechanism that ensures balanced and stable reward optimization. Extensive experiments across the two representative subtasks show that PaCo-Reward significantly improves alignment with human perceptions of visual consistency, and PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability. Together, these results highlight the promise of PaCo-RL as a practical and scalable solution for consistent image generation. We will release the code, dataset, and models.
PaperID: 896,   Poster  https://arxiv.org/pdf/2603.10598     GitHub
Authors: Yawen Yang, Feng Li, Shuqi Kong, Yunfeng Diao, Xinjian Gao, Zenglin Shi, Meng Wang
Title: Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection
Abstract: The recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these synthetic images makes them increasingly indistinguishable from authentic photographs, posing serious security risks, such as threats to media credibility and content manipulation. Although extensive efforts have been dedicated to detecting synthetic images, most existing approaches suffer from poor generalization to unseen data due to their reliance on model-specific artifacts or low-level statistical cues. In this work, we identify a previously unexplored distinction: real images maintain consistent semantic attention and structural coherence in their latent representations, exhibiting more stable feature transitions across network layers, whereas synthetic ones present discernibly distinct patterns. Therefore, we propose a novel approach termed latent transition discrepancy (LTD), which captures the inter-layer consistency differences between real and synthetic images. LTD adaptively identifies the most discriminative layers and assesses the transition discrepancies across layers. Benefiting from the proposed inter-layer discriminative modeling, our approach exceeds the base model by 14.35% in mean Acc across three datasets containing diverse GANs and DMs. Extensive experiments demonstrate that LTD outperforms recent state-of-the-art methods, achieving superior detection accuracy, generalizability, and robustness.
PaperID: 897,   Poster  https://arxiv.org/pdf/2512.09663     GitHub
Authors: Tao Zhang, Yuyang Hong, Yang Xia, Kun Ding, Zeyu Zhang, Ying Wang, Shiming Xiang, Chunhong Pan
Title: IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting
Abstract: Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce IF-Bench, the first high-quality benchmark designed for evaluating multimodal understanding of infrared images. IF-Bench consists of 499 images sourced from 23 infrared datasets and 680 carefully curated visual question-answer pairs, covering 10 essential dimensions of image understanding. Based on this benchmark, we systematically evaluate over 40 open-source and closed-source MLLMs, employing cyclic evaluation, bilingual assessment, and hybrid judgment strategies to enhance the reliability of the results. Our analysis reveals how model scale, architecture, and inference paradigms affect infrared image comprehension, providing valuable insights for this area. Furthermore, we propose a training-free generative visual prompting (GenViP) method, which leverages advanced image editing models to translate infrared images into semantically and spatially aligned RGB counterparts, thereby mitigating domain distribution shifts. Extensive experiments demonstrate that our method consistently yields significant performance improvements across a wide range of MLLMs.
PaperID: 898,   Poster  https://arxiv.org/pdf/2601.02256     GitHub
Authors: Shikun Sun, Liao Qu, Huichao Zhang, Yiheng Liu, Yangyang Song, Xian Li, Yi Jiang, Xu Wang, Jia Jia, Daniel Kang Du, Xinglong Wu
Title: VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
Abstract: Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.
PaperID: 899,   Poster  https://arxiv.org/pdf/2603.24373     GitHub
Authors: Cheng Cui, Yubo Zhang, Ting Sun, Xueqing Wang, Hongen Liu, Lin Manhui, Yue Zhang, Tingquan Gao, Changda Zhou, Jiaxuan Liu, Zelun Zhang, Jing Zhang, Jun Zhang, Yi Liu
Title: PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks
Abstract: The advent of “OCR 2.0” and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text localization within complex layouts, and a propensity for textual hallucinations. Revisiting the prevailing notion that model scale is the sole path to high accuracy, this paper introduces PP-OCRv5, a meticulously optimized, lightweight OCR system with merely 5 million parameters. We demonstrate that PP-OCRv5 achieves performance comparable to billion-parameter VLMs, while offering superior localization precision and reduced hallucinations. The cornerstone of our success lies not in architectural expansion but in a data-centric investigation. We systematically dissect the role of training data by quantifying three critical dimensions: data difficulty, data accuracy, and data diversity. Our extensive experiments reveal that with a sufficient volume of high-quality, accurately labeled, and diverse data, the performance ceiling for traditional, efficient two-stage OCR pipelines is far higher than commonly assumed. This work provides compelling evidence for the viability of lightweight, specialized models in the large-model era and offers practical insights into data curation for OCR.
PaperID: 900,   Poster  https://arxiv.org/pdf/2603.09277     GitHub
Authors: Jiaqi Liu, Zhizhong Han
Title: Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists
Abstract: 3D Gaussian splatting (3DGS) has become a vital tool for learning a radiance field from multiple posed images. Although 3DGS shows great advantages over NeRF in terms of rendering quality and efficiency, it remains a research challenge to further improve the efficiency of learning 3D Gaussians. To overcome this challenge, we propose novel training strategies and losses to shorten each Gaussian list used to render a pixel, which speeds up the splatting by involving fewer Gaussians along a ray. Specifically, we shrink the size of each Gaussian by resetting its scale regularly, encouraging smaller Gaussians to cover fewer nearby pixels, which shortens the Gaussian lists of pixels. Additionally, we introduce an entropy constraint on the alpha blending procedure to sharpen the weight distribution of Gaussians along each ray, which drives dominant weights larger while making minor weights smaller. As a result, each Gaussian becomes more focused on the pixels where it is dominant, which reduces its impact on nearby pixels, leading to even shorter Gaussian lists. Finally, we integrate our method into a rendering resolution scheduler, which further improves efficiency through progressive resolution increase. We evaluate our method by comparing it with state-of-the-art methods on widely used benchmarks. Our results show significant advantages over others in efficiency without sacrificing rendering quality.
PaperID: 901,   Poster  https://arxiv.org/pdf/2603.03967     GitHub
Authors: Qianfeng Yang, Qiyuan Guan, Xiang Chen, Jiyu Jin, Guiyue Jin, Jiangxin Dong
Title: UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization
Abstract: Although significant progress has been made in image deraining, we note that most existing methods are developed for only specific types of rain degradation and fail to generalize across diverse real-world rainy scenes. How to effectively model different rain degradations within a universal framework is important for real-world image deraining. In this paper, we propose UniRain, an effective unified image deraining framework capable of restoring images degraded by rain streaks and raindrops under both daytime and nighttime conditions. To better enhance unified model generalization, we construct an intelligent retrieval-augmented generation (RAG)-based dataset distillation pipeline that selects high-quality training samples from all public deraining datasets for better mixed training. Furthermore, we incorporate a simple yet effective multi-objective reweighted optimization strategy into the asymmetric mixture-of-experts (MoE) architecture to facilitate consistent performance and improve robustness across diverse scenes. Extensive experiments show that our framework performs consistently favorably against the state-of-the-art models on both our proposed benchmarks and multiple public datasets. Code and dataset will be available.
PaperID: 902,   Poster  https://arxiv.org/pdf/2503.04666     GitHub
Authors: Emanuele Bugliarello, Anurag Arnab, Roni Paiss, Christy Koh, Pieter-Jan Kindermans, Cordelia Schmid
Title: What Are You Doing? A Closer Look at Controllable Human Video Generation
Abstract: High-quality benchmarks are crucial for driving progress in machine learning research. However, despite the growing interest in video generation, there is no comprehensive dataset to evaluate human synthesis. Humans can perform a wide variety of actions and interactions, but existing datasets, like TikTok, TED-Talks, and HumanVid, lack the diversity and complexity to fully capture the capabilities of video generation models. We close this gap by introducing 'What Are You Doing?' (WYD): a new benchmark for fine-grained evaluation of controllable image-to-video generation of humans. WYD consists of 1,544 captioned videos that have been meticulously collected and annotated with fine-grained categories. These allow us to systematically measure performance across 9 aspects of human generation, including actions, interactions and motion. We also propose an evaluation framework, where we adapt existing metrics for better human- and video-level assessment, as shown by human preference. Equipped with our dataset and metrics, we perform in-depth analyses of state-of-the-art open-source models in controllable image-to-video generation, showing how WYD provides novel insights about their capabilities. We release data and code to drive progress in human video generation.
PaperID: 903,   Poster  https://arxiv.org/pdf/2603.00887     GitHub
Authors: Longmi Gao, Pan Gao
Title: VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba
Abstract: Volume Electron Microscopy (VEM) is crucial for 3D tissue imaging but often produces anisotropic data with poor axial resolution, hindering visualization and downstream analysis. Existing methods for isotropic reconstruction often suffer from neglecting abundant axial information and employing simple downsampling to simulate anisotropic data. To address these limitations, we propose VEMamba, an efficient framework for isotropic reconstruction. The core of VEMamba is a novel 3D Dependency Reordering paradigm, implemented via two key components: an Axial-Lateral Chunking Selective Scan Module (ALCSSM), which intelligently re-maps complex 3D spatial dependencies (both axial and lateral) into optimized 1D sequences for efficient Mamba-based modeling, explicitly enforcing axial-lateral consistency; and a Dynamic Weights Aggregation Module (DWAM) to adaptively aggregate these reordered sequence outputs for enhanced representational power. Furthermore, we introduce a realistic degradation simulation and then leverage Momentum Contrast (MoCo) to integrate this degradation-aware knowledge into the network for superior reconstruction. Extensive experiments on both simulated and real-world anisotropic VEM datasets demonstrate that VEMamba achieves state-of-the-art performance while maintaining a lower computational footprint.
PaperID: 904,   Poster  https://arxiv.org/pdf/2505.24840     GitHub
Authors: Yuwen Tan, Yuan Qing, Boqing Gong
Title: The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition
Abstract: This paper reveals that many open-source large language models (LLMs) lack hierarchical knowledge about our visual world, unaware of even well-established biology taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs' hierarchical visual recognition (e.g., recognizing Anemone Fish but not Vertebrate). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM using our VQA tasks reaffirms LLMs' bottleneck effect because the VQA tasks improve the LLMs' hierarchical consistency more than the vision LLMs'. We conjecture that one cannot make open-source vision LLMs understand visual concepts hierarchically until LLMs possess corresponding taxonomy knowledge.
PaperID: 905,   Poster  https://arxiv.org/pdf/2603.24653     GitHub
Authors: Francesco Gentile, Nicola Dall'Asen, Francesco Tonini, Massimiliano Mancini, Lorenzo Vaquero, Elisa Ricci
Title: From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition
Abstract: As vision-language models are deployed at scale, understanding their internal mechanisms becomes increasingly critical. Existing interpretability methods predominantly rely on activations, making them dataset-dependent, vulnerable to data bias, and often restricted to coarse head-level explanations. We introduce SITH (Semantic Inspection of Transformer Heads), a fully data-free, training-free framework that directly analyzes CLIP’s vision transformer in weight space. For each attention head, we decompose its value-output matrix into singular vectors and interpret each one via COMP (Coherent Orthogonal Matching Pursuit), a new algorithm that explains them as sparse, semantically coherent combinations of human-interpretable concepts. We show that SITH yields monosemantic, faithful intra-head explanations, validated through reconstruction fidelity and interpretability experiments. This allows us to use SITH for precise, interpretable weight-space model edits that amplify or suppress specific concepts, improving downstream performance without retraining. Furthermore, we use SITH to study model adaptation, showing how fine-tuning primarily reweights a stable semantic basis rather than learning entirely new features. Code will be released.
PaperID: 906,   Poster  https://arxiv.org/pdf/2512.16899     GitHub
Authors: Yushi Hu, Reyhane Askari, Melissa Hall, Emily Dinan, Luke Zettlemoyer, Marjan Ghazvininejad
Title: Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
Abstract: Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning (“thinking-with-images”), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. Top judges like GPT-5 and Gemini-2.5-Pro reach 66–75% accuracy, compared to >90% for humans, and outperform the commonly used GPT-4o (59% accuracy). The best-performing open-source model, Qwen3-VL-32B, achieves accuracy similar to that of Gemini-2.5-Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success, and conduct an in-depth analysis that identifies key areas in which to improve reward models going forward.
PaperID: 907,   Poster  https://arxiv.org/pdf/2603.27151     GitHub
Authors: Kenji Tojo, Bernd Bickel, Nobuyuki Umetani
Title: DiffSoup: Direct Differentiable Rasterization of Triangle Soup for Extreme Radiance Field Simplification
Abstract: Radiance field reconstruction aims to recover high-quality 3D representations from multi-view RGB images. Recent advances, such as 3D Gaussian splatting, have achieved real-time rendering with high visual fidelity, given sufficiently powerful graphics hardware. However, drastic model simplification (i.e., reducing the number of primitives by several orders of magnitude) is required to enable efficient online transmission and rendering across diverse hardware platforms. We introduce DiffSoup, a radiance field representation that employs a soup (i.e., a highly unstructured set of primitives) of a small number of triangles with neural textures that have binary opacity. We show that the binary opacity representation is directly differentiable via stochastic opacity masking, enabling stable training without a mollifier (i.e., smooth rasterization). DiffSoup can be rasterized with a traditional depth-testing framework, allowing the optimized scenes to be seamlessly integrated into conventional graphics pipelines and rendered interactively on consumer-grade laptops and mobile devices.
PaperID: 908,   Poster  https://arxiv.org/pdf/2602.20873     GitHub
Authors: Jiahao Xu, Sheng Huang, Xin Zhang, Zhixiong Nan, Jiajun Dong, Nankun Mu
Title: MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification
Abstract: In computational pathology, few-shot whole slide image classification is primarily driven by the extreme scarcity of expert-labeled slides. Recent vision-language methods incorporate textual semantics generated by large language models, but treat these descriptions as static class-level priors that are shared across all samples and lack sample-wise refinement. This limits both the diversity and precision of visual-semantic alignment, hindering generalization under limited supervision. To overcome this, we propose the stochastic MUlti-view Semantic Enhancement (MUSE), a framework that first refines semantic precision via sample-wise adaptation and then enhances semantic richness through retrieval-augmented multi-view generation. Specifically, MUSE introduces Sample-wise Fine-grained Semantic Enhancement (SFSE), which yields a fine-grained semantic prior for each sample through MoE-based adaptive visual-semantic interaction. Guided by this prior, Stochastic Multi-view Model Optimization (SMMO) constructs an LLM-generated knowledge base of diverse pathological descriptions per class, then retrieves and stochastically integrates multiple matched textual views during training. These dynamically selected texts serve as enriched semantic supervisions to stochastically optimize the vision-language model, promoting robustness and mitigating overfitting. Experiments on three benchmark WSI datasets show that MUSE consistently outperforms existing vision-language baselines in few-shot settings, demonstrating that effective few-shot pathology learning requires not only richer semantic sources but also their active and sample-aware semantic optimization.
PaperID: 909,   Poster  https://arxiv.org/pdf/2512.13874     GitHub
Authors: Jitesh Jain, Jialuo Li, Zixian Ma, Jieyu Zhang, Chris Dongjoo Kim, Sangho Lee, Rohun Tripathi, Tanmay Gupta, Christopher Clark, Humphrey Shi
Title: SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning
Abstract: As humans, we are natural any-horizon reasoners, i.e., we can decide whether to iteratively skim long videos or watch short ones in full when necessary for a given task. With this in mind, one would expect video reasoning models to reason flexibly across different durations. However, SOTA models are still trained to predict answers in a single turn while processing a large number of frames, akin to watching an entire long video, requiring significant resources. This raises the question: Is it possible to develop performant any-horizon video reasoning systems? Inspired by human behavior, we first propose SAGE, an agent system that performs multi-turn reasoning on long videos while handling simpler problems in a single turn. Secondly, we introduce an easy synthetic data generation pipeline using Gemini-2.5-Flash to train the orchestrator, SAGE-MM, which lies at the core of SAGE. We further propose an effective RL post-training recipe essential for instilling any-horizon reasoning ability in SAGE-MM. Thirdly, we curate SAGE-Bench with an average duration of greater than 700 seconds for evaluating video reasoning ability in real-world entertainment use cases. Lastly, we empirically validate the effectiveness of our system, data, and RL recipe, observing notable improvements of up to 6.1% on open-ended video reasoning tasks, as well as an impressive 8.2% improvement on videos longer than 10 minutes. We will open-source our system code, data, and checkpoints upon publication.
PaperID: 910,   Poster  https://arxiv.org/pdf/2604.05731     GitHub
Authors: Mengtian Li, Kunyan Dai, Yi Ding, Ruobing Ni, Ying Zhang, Wenwu Wang, Zhifeng Xie
Title: FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips
Abstract: Foley art plays a pivotal role in enhancing immersive auditory experiences in film, yet manual creation of spatio-temporally aligned audio remains labor-intensive. We propose FoleyDesigner, a novel framework inspired by professional Foley workflows, integrating film clip analysis, spatio-temporal controllable Foley generation, and professional mixing capabilities. Technically, FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatio-temporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate film industry-grade post-production practices. To address the lack of high-quality stereo Foley datasets in film, we introduce FilmStereo, the first professional stereo Foley dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. For application, the framework supports interactive user control while maintaining seamless integration with professional pipelines, including 5.1-channel Dolby Atmos systems compliant with ITU-R BS.775 standards, thereby offering extensive creative flexibility. Extensive experiments demonstrate that our method achieves superior spatio-temporal alignment compared to existing baselines, with integration validated in film industrial-grade workflows.
PaperID: 911,   Poster  https://arxiv.org/pdf/2511.20614     GitHub
Authors: Ziheng Ouyang, Yiren Song, Yaoli Liu, Shihao Zhu, Qibin Hou, Ming-Ming Cheng, Mike Zheng Shou
Title: The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
Abstract: Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, we aim to solve the inconsistency problem of generated images by applying a reference-guided post-editing approach, and present ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model's attention mechanisms and intrinsic representations, we devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.
PaperID: 912,   Poster  https://arxiv.org/pdf/2602.23559     GitHub
Authors: Cho-Ying Wu, Zixun Huang, Xinyu Huang, Liu Ren
Title: No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency
Abstract: We present the first study of cross-sensor view synthesis across different modalities. We examine a practical, fundamental, yet widely overlooked problem: obtaining aligned RGB-X data. Most prior RGB-X work assumes such pairs exist and focuses on modality fusion, but obtaining them empirically requires huge engineering effort in calibration. We propose a match-densify-consolidate method. First, we perform RGB-X image matching followed by guided point densification. Using the proposed confidence-aware densification and self-matching filtering, we attain better view synthesis and later consolidate the results in 3D Gaussian Splatting (3DGS). Our method uses no 3D priors for the X-sensor and only assumes nearly no-cost COLMAP for RGB. We aim to remove the cumbersome calibration for various RGB-X sensors and advance the popularity of cross-sensor learning with a scalable solution that breaks through the bottleneck in large-scale real-world RGB-X data collection. Code will be released.
PaperID: 913,   Poster  https://arxiv.org/pdf/2603.21629     GitHub
Authors: Wen Guo, Pengfei Zhao, Zongmeng Wang, Yufan Hu, Junyu Gao
Title: Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition
Abstract: Multiple Object Tracking (MOT) has long been a fundamental task in computer vision, with broad applications in various real-world scenarios. However, due to distribution shifts in appearance, motion pattern, and category between the training and testing data, model performance degrades considerably during online inference in MOT. Test-Time Adaptation (TTA) has emerged as a promising paradigm to alleviate such distribution shifts. However, existing TTA methods often fail to deliver satisfactory results in MOT, as they focus primarily on frame-level adaptation while neglecting temporal consistency and identity association across frames and videos. Inspired by the human decision-making process, this paper proposes a Test-time Calibration from Experience and Intuition (TCEI) framework. In this framework, the Intuitive system utilizes transient memory to recall recently observed objects for rapid predictions, while the Experiential system leverages the accumulated experience from prior test videos to reassess and calibrate these intuitive predictions. Furthermore, both confident and uncertain objects during online testing are exploited as historical priors and reflective cases, respectively, enabling the model to adapt to the testing environment and alleviate performance degradation. Extensive experiments demonstrate that the proposed TCEI framework consistently achieves superior performance across multiple benchmark datasets and significantly enhances the model's adaptability under distribution shifts.
PaperID: 914,   Poster  https://arxiv.org/pdf/2603.01361     GitHub
Authors: Zilong Zhao, Zhengming Ding, Pei Niu, Wenhao Sun, Feng Guo
Title: MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention
Abstract: Feature encoders play a key role in pixel-level crack segmentation by shaping the representation of fine textures and thin structures. Existing CNN-, Transformer-, and Mamba-based models each capture only part of the required spatial or structural information, leaving clear gaps in modeling complex crack patterns. To address this, we present MixerCSeg, a mixer architecture designed like a coordinated team of specialists, where CNN-like pathways focus on local textures, Transformer-style paths capture global dependencies, and Mamba-inspired flows model sequential context within a single encoder. At the core of MixerCSeg is the TransMixer, which explores Mamba’s latent attention behavior while establishing dedicated pathways that naturally express both locality and global awareness. To further enhance structural fidelity, we introduce a multi-view processing strategy and a Direction-guided Edge Gated Convolution (DEGConv) that strengthens edge sensitivity under irregular crack geometries with minimal computational overhead. A Spatial Refinement Multi-Level Fusion (SRF) module is then employed to refine multi-scale details without increasing complexity. Extensive experiments on multiple crack segmentation benchmarks show that MixerCSeg achieves state-of-the-art performance with only 2.05 GFLOPs and 2.54 M parameters, demonstrating both efficiency and strong representational capability.
PaperID: 915,   Poster  https://arxiv.org/pdf/2603.00912     GitHub
Authors: Yang Cao, Feize Wu, Dave Zhenyu Chen, Yingji Zhong, Lanqing Hong, Dan Xu
Title: VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-view Indoor 3D Object Detection
Abstract: Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where no sensor-provided geometric inputs (multi-view poses or depth) are available. The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG), which exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; and (ii) Query-Driven Feature Aggregation (QD), in which a learnable See-Query interacts with object queries to ‘see’ what they need, then dynamically aggregates multi-level geometric features across VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det significantly surpasses SG-Free baselines by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. An ablation study shows that VGGT’s internally learned semantic and geometric priors can be effectively leveraged by our AG and QD. Code and pretrained models will be released.
PaperID: 916,   Poster  https://arxiv.org/pdf/2603.14214     GitHub
Authors: Xingyuan Li, Songcheng Du, Yang Zou, HaoYuan Xu, Zhiying Jiang, Jinyuan Liu
Title: UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation
Abstract: Image fusion aims to integrate complementary information from multiple source images to produce a more informative and visually consistent representation, benefiting both human perception and downstream vision tasks. Despite recent progress, most existing fusion methods are designed for specific tasks (i.e., multimodal, multi-exposure, or multi-focus fusion) and struggle to effectively preserve source information during the fusion process. This limitation primarily arises from task-specific architectures and the degradation of source information caused by deep-layer propagation. To overcome these issues, we propose UniFusion, a unified image fusion framework designed to achieve cross-task generalization. First, leveraging DINOv3 for modality-consistent feature extraction, UniFusion establishes a shared semantic space for diverse inputs. Second, to preserve the understanding of each source image, we introduce a reconstruction-alignment loss to maintain consistency between fused outputs and inputs. Finally, we employ a bilevel optimization strategy to decouple and jointly optimize reconstruction and fusion objectives, effectively balancing their coupling relationship and ensuring smooth convergence. Extensive experiments across multiple fusion tasks demonstrate UniFusion’s superior visual quality, generalization ability, and adaptability to real-world scenarios.
PaperID: 917,   Poster  https://arxiv.org/pdf/2512.03746     GitHub
Authors: Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin
Title: Thinking with Programming Vision: Towards a Unified View for Thinking with Images
Abstract: Multimodal large language models (MLLMs) that "think with images" can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable "code-as-tool" framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code and datasets will be available.
PaperID: 918,   Poster  https://arxiv.org/pdf/2602.22949     GitHub
Authors: Junuk Cha, Jihyeon Kim, Han-Mu Park
Title: OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis
Abstract: Fingerspelling is a component of sign languages in which words are spelled out letter by letter using specific hand poses. Automatic fingerspelling recognition plays a crucial role in bridging the communication gap between Deaf and hearing communities, yet it remains challenging due to the signing-hand ambiguity issue, the lack of appropriate training losses, and the out-of-vocabulary (OOV) problem. Prior fingerspelling recognition methods rely on explicit signing-hand detection, which often leads to recognition failures, and on a connectionist temporal classification (CTC) loss, which exhibits the peaky behavior problem. To address these issues, we develop OpenFS, an open-source approach for fingerspelling recognition and synthesis. We propose a multi-hand-capable fingerspelling recognizer that supports both single- and multi-hand inputs and performs implicit signing-hand detection by incorporating a dual-level positional encoding and a signing-hand focus (SF) loss. The SF loss encourages cross-attention to focus on the signing hand, enabling implicit signing-hand detection during recognition. Furthermore, without relying on the CTC loss, we introduce a monotonic alignment (MA) loss that enforces the output letter sequence to follow the temporal order of the input pose sequence through cross-attention regularization. In addition, we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectiveness of the proposed recognizer and generator. We will release the code and processed data.
PaperID: 919,   Poster  https://arxiv.org/pdf/2603.26179     GitHub
Authors: Bozhao Li, Shaocong Wu, Tong Shao, Senqiao Yang, Qiben Shan, Zhuotao Tian, Jingyong Su
Title: Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning
Abstract: Recent advances in open-vocabulary object detection focus primarily on two aspects: scaling up datasets and leveraging contrastive learning to align language and vision modalities. However, these approaches often neglect internal consistency within a single modality, particularly when background or environmental changes occur. This lack of consistency leads to a performance drop because the model struggles to detect the same object in different scenes, which reveals a robustness gap. To address this issue, we introduce Contextual Consistency Learning (CCL), a novel framework that integrates two key strategies: Contextual Bootstrapped Data Generation (CBDG) and Contextual Consistency Loss (CCLoss). CBDG functions as a data generation mechanism, producing images that contain the same objects across diverse backgrounds. This is essential because existing datasets alone do not support our CCL framework. The CCLoss further enforces the invariance of object features despite environmental changes, thereby improving the model's robustness in different scenes. These strategies collectively form a unified framework for ensuring contextual consistency within the same modality. Our method achieves state-of-the-art performance, surpassing previous approaches by +16.3 AP on OmniLabel and +14.9 AP on D^3. These results demonstrate the importance of enforcing intra-modal consistency, significantly enhancing model generalization in diverse environments. Data, code and models will be made publicly available.
PaperID: 920,   Poster  https://arxiv.org/pdf/2603.25058     GitHub
Authors: Xuankai Zhang, Junjin Xiao, Shangwei Huang, Wei-Shi Zheng, Qing Zhang
Title: Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos
Abstract: We present an approach for high-quality dynamic Gaussian Splatting from monocular videos. To this end, we go one step beyond previous methods and explicitly model the continuous position and orientation deformation of dynamic Gaussians, using SE(3) B-spline motion bases with a compact set of control points. To improve computational efficiency while enhancing the ability to model complex motions, an adaptive control mechanism is devised to dynamically adjust the number of motion bases and control points. In addition, we develop a soft segment reconstruction strategy to mitigate long-interval motion interference, and employ a multi-view diffusion model to provide multi-view cues for avoiding overfitting to training views. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in novel view synthesis. Our code and trained model will be made publicly available.
PaperID: 921,   Poster  https://arxiv.org/pdf/2604.03619     GitHub
Authors: Peter Yongho Kim, Juhyeon Park, Jungwoo Park, Jubin Choi, Jungwoo Seo, Jiook Cha, Taesup Moon
Title: Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?
Abstract: Modeling long-range spatiotemporal dynamics in functional Magnetic Resonance Imaging (fMRI) remains a key challenge due to the high dimensionality of the four-dimensional signals. Prior voxel-based models, although demonstrating excellent performance and interpretation capabilities, are constrained by prohibitive memory demands and thus can only capture limited temporal windows. To address this, we propose TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer), a novel approach that tokenizes fMRI volumes using a pre-trained 2D natural image autoencoder. Each 3D fMRI volume is compressed into a compact set of continuous tokens, enabling long-sequence modeling with a simple Transformer encoder with limited VRAM. Across large-scale benchmarks including the UK-Biobank (UKB), Human Connectome Project (HCP), and ADHD-200 datasets, TABLeT outperforms existing models in multiple tasks, while demonstrating substantial gains in computational and memory efficiency over the state-of-the-art voxel-based method given the same input. Furthermore, we develop a self-supervised masked token modeling approach to pre-train TABLeT, which improves the model's performance for various downstream tasks. Our findings suggest a promising approach for scalable and interpretable spatiotemporal modeling of brain activity.
PaperID: 922,   Poster  https://arxiv.org/pdf/2604.06052     GitHub
Authors: Katarzyna Zaleska, Łukasz Popek, Monika Wysoczańska, Kamil Deja
Title: Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models
Abstract: Text-to-image diffusion models exhibit remarkable generative capabilities, yet their internal operations remain opaque, particularly when handling prompts that are not fully descriptive. In such scenarios, models must make implicit decisions to generate details not explicitly specified in the text. This work investigates the hypothesis that this decision-making process is not diffuse but is computationally localized within the model's architecture. While existing localization techniques have focused on prompt-related interventions, we notice that such explicit conditioning might differ from the implicit decisions. Therefore, we introduce a probing-based localization technique to identify the layers causally responsible for making the decisions. Our findings indicate that the resolution of ambiguous concepts is governed principally by self-attention layers, identifying them as the most effective point for intervention. Based on this discovery, we propose ICM, a precise steering method that applies targeted interventions to a small subset of layers. Extensive experiments confirm that intervening on these specific self-attention layers yields superior debiasing performance compared to existing state-of-the-art methods, minimizing artifacts common to less precise approaches.
PaperID: 923,   Poster  https://arxiv.org/pdf/2603.04745     GitHub
Authors: Yang Zou, Jun Ma, Zhidong Jiao, Xingyuan Li, Zhiying Jiang, Jinyuan Liu
Title: Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset
Abstract: Infrared Image Super-Resolution (IISR) under real-world conditions is a practically significant yet rarely addressed task. Pioneering works are often trained and evaluated on simulated datasets or neglect the intrinsic differences between infrared and visible imaging. In practice, however, real infrared images are affected by coupled optical and sensing degradations that jointly deteriorate both structural sharpness and thermal fidelity. To address these challenges, we propose Real-IISR, a unified autoregressive framework for real-world IISR that progressively reconstructs fine-grained thermal structures and clear backgrounds in a scale-by-scale manner via thermal-structural guided visual autoregression. Specifically, a Thermal-Structural Guidance module encodes thermal priors to mitigate the mismatch between thermal radiation and structural edges. Since non-uniform degradations typically induce quantization bias, Real-IISR adopts a Condition-Adaptive Codebook that dynamically modulates discrete representations based on degradation-aware thermal priors. Also, a Thermal Order Consistency Loss enforces a monotonic relation between temperature and pixel intensity, ensuring relative brightness order rather than absolute values to maintain physical consistency under spatial misalignment and thermal drift. We build FLIR-IISR, a real-world IISR dataset with paired LR-HR infrared images acquired via automated focus variation and motion-induced blur. Extensive experiments on real and synthetic datasets demonstrate the promising performance of Real-IISR, providing a unified foundation for real-world IISR and benchmarking.
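Illustrative note (not from the paper): the idea of constraining relative brightness order rather than absolute values can be sketched as a pairwise hinge penalty. The function name `order_consistency_loss`, the `margin` parameter, and the hinge form are all hypothetical choices made for this sketch; the abstract does not state the actual formulation of the Thermal Order Consistency Loss.

```python
from itertools import combinations

def order_consistency_loss(pred, ref, margin=0.0):
    """Mean hinge penalty over pixel pairs whose brightness order in `pred`
    disagrees with the order in `ref` (hypothetical formulation)."""
    total, count = 0.0, 0
    for i, j in combinations(range(len(ref)), 2):
        if ref[i] == ref[j]:
            continue  # ties in the reference impose no order
        sign = 1.0 if ref[i] > ref[j] else -1.0
        # Non-zero only when the predicted difference has the wrong sign
        # (or falls short of the margin).
        total += max(0.0, margin - sign * (pred[i] - pred[j]))
        count += 1
    return total / count if count else 0.0
```

Because only differences between pixels enter the penalty, adding a constant offset to every predicted value leaves the loss unchanged, which is the property one would want under a global intensity drift.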
PaperID: 924,   Poster  https://arxiv.org/pdf/2602.21100     GitHub
Authors: Noé Artru, Rukhshanda Hussain, Emeline Got, Alexandre Messier, David B. Lindell, Abdallah Dib
Title: Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction
Abstract: Reconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face fundamental limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25-200+ views), substantial computation, and manual cleanup in challenging areas like facial hair. Recent alternatives present a fundamental trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while optimization-based methods achieve higher fidelity but require dense views and expensive computation. We bridge this gap with a hybrid approach that combines the strengths of both paradigms. Our method introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass. We then leverage these predictions as strong geometric priors within an inverse rendering optimization framework to recover high-frequency surface details. Our approach outperforms state-of-the-art single-image and multi-view methods, achieving high-fidelity reconstruction on par with dense-view photogrammetry while reducing camera requirements and computational cost. The code and model will be released.
PaperID: 925,   Poster  https://arxiv.org/pdf/2603.26188     GitHub
Authors: Rui Wang, Huisi Wu, Jing Qin
Title: OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement
Abstract: Accurate segmentation of cardiac chambers in echocardiography videos is essential for quantitative cardiac assessment. However, ultrasound noise, artifacts, and cardiac motion pose significant challenges to robust spatiotemporal modeling. Recent approaches such as Transformers, linear attention, and state-space models improve accuracy, yet Transformers often remain computationally expensive, whereas linear attention and state-space models typically lack geometric regularization, leading to unstable spatiotemporal interactions under complex cardiac motion. We introduce OSA, a lightweight linear sequence architecture designed for stable and efficient cardiac video segmentation. OSA incorporates an Anatomical Prior-aware Feature Enhancement (APFE) module that decouples and fuses complementary anatomical components to strengthen boundary–region discrimination. Orthogonalized State Update (OSU) enforces spectral-norm and orthogonality constraints during recurrent transitions, preserving spatiotemporal coherence. Evaluated on the CAMUS and EchoNet-Dynamic datasets, OSA consistently outperforms state-of-the-art methods in segmentation accuracy and temporal consistency, while maintaining real-time inference efficiency. This framework offers a principled and efficient solution for dynamic cardiac analysis in echocardiography. The code will be released upon publication.
PaperID: 926,   Poster  https://arxiv.org/pdf/2602.23980     GitHub
Authors: Tianxiang Du, Hulingxiao He, Yuxin Peng
Title: Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping
Abstract: The widespread use of smartphones has made photography ubiquitous, yet a clear gap remains between ordinary users and professional photographers, who can identify aesthetic issues and provide actionable shooting guidance during capture. We define this capability as aesthetic guidance (AG), an essential but largely underexplored domain in computational aesthetics. Existing multimodal large language models (MLLMs) primarily offer overly positive feedback, failing to identify issues or provide actionable guidance. Without AG capability, they cannot effectively identify distracting regions or optimize compositional balance, thus also struggling in aesthetic cropping, which aims to refine photo composition through reframing after capture. To address this, we introduce AesGuide, the first large-scale AG dataset and benchmark with 10,748 photos annotated with aesthetic scores, analyses, and guidance. Building upon it, we propose Venus, a two-stage framework that first empowers MLLMs with AG capability through progressively complex aesthetic questions and then activates their aesthetic cropping power via CoT-based rationales. Extensive experiments show that Venus substantially improves AG capability and achieves state-of-the-art (SOTA) performance in aesthetic cropping, enabling interpretable and interactive aesthetic refinement across both stages of photo creation.
PaperID: 927,   Poster  https://arxiv.org/pdf/2512.01390     GitHub
Authors: Seungho Choi, Jeahun Sung, Jihyong Oh
Title: FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution
Abstract: Real-image super-resolution (Real-ISR) seeks to recover HR images from LR inputs with mixed, unknown degradations. While diffusion models surpass GANs in perceptual quality, they under-reconstruct high-frequency (HF) details due to a low-frequency (LF) bias and a depth-wise “low-first, high-later” hierarchy. We introduce FRAMER, a plug-and-play training scheme that exploits diffusion priors without changing the backbone or inference. At each denoising step, the final-layer feature map teaches all intermediate layers. Teacher and student feature maps are decomposed into LF/HF bands via FFT masks to align supervision with the model’s internal frequency hierarchy. For LF, an Intra Contrastive Loss (IntraCL) stabilizes globally shared structure. For HF, an Inter Contrastive Loss (InterCL) sharpens instance-specific details using random-layer and in-batch negatives. Two adaptive modulators, Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM), reweight per-layer LF/HF signals and gate distillation by current similarity. Across U-Net and DiT backbones (e.g., Stable Diffusion 2, 3), FRAMER consistently improves PSNR/SSIM and perceptual metrics (LPIPS, NIQE, MANIQA, MUSIQ). Ablations validate the final-layer teacher and random-layer negatives. The code and project page will be publicly released.
PaperID: 928,   Poster  https://arxiv.org/pdf/2512.07580     GitHub
Authors: Yahong Wang, Juncheng Wu, Zhangkai Ni, Longzhen Yang, Yihang Liu, Chengmei Yang, Ying Wen, Lianghua He, Xianfeng Tang, Hui Liu, Yuyin Zhou
Title: When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs
Abstract: Vision Large Language Models (VLLMs) incur high computational costs due to their reliance on hundreds of visual tokens to represent images. While token pruning offers a promising solution for accelerating inference, this paper identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by "vanishing token information", where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token's information content by measuring the change in the model output probabilities upon its removal. Using this proposed metric, our analysis of the information of visual tokens across layers reveals three key findings: (1) As layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term the "information horizon", beyond which the visual tokens become redundant; (2) The position of this horizon is not static; it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), compared to more general tasks like Visual Question Answering (VQA); (3) This horizon is also strongly correlated with model capacity, as stronger VLLMs (e.g., Qwen2.5-VL) employ deeper visual tokens than weaker models (e.g., LLaVA-1.5). Based on our findings, we show that simple random pruning in deep layers efficiently balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods. Using DART with random pruning achieves state-of-the-art results, maintaining 93.9% of Qwen2.5-VL-7B performance while pruning 50% of visual tokens.
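Illustrative note (not from the paper): the removal-based measurement described in this abstract can be sketched as a leave-one-out comparison of output distributions. Everything named here is an assumption for illustration: `predict` is a hypothetical callable standing in for a model, `toy_predict` is a toy stand-in, and total variation distance is one possible way to measure "change in output probabilities"; the abstract does not say which distance the authors use.

```python
def token_information(predict, tokens, index):
    """Change in output probabilities when tokens[index] is removed,
    measured as total variation distance (an assumed choice)."""
    full = predict(tokens)
    ablated = predict(tokens[:index] + tokens[index + 1:])
    return 0.5 * sum(abs(p - q) for p, q in zip(full, ablated))

def toy_predict(tokens):
    """Toy stand-in for a model: a two-class distribution driven by the
    mean of the token values (purely illustrative)."""
    score = sum(tokens) / len(tokens) if tokens else 0.5
    return [score, 1.0 - score]

# Rank tokens by how much the output shifts when each one is dropped.
tokens = [0.2, 0.2, 0.8]
scores = [token_information(toy_predict, tokens, i) for i in range(len(tokens))]
```

In this toy setting the outlier token (0.8) receives the highest score, while a sequence of identical tokens would give every token a score of zero, mirroring the "uniform, then vanishing" behaviour the abstract describes for deep layers.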
PaperID: 929,   Poster  https://arxiv.org/pdf/2510.09110     GitHub
Authors: Weikai Huang, Jieyu Zhang, Taoyang Jia, Chenhao Zheng, Ziqi Gao, Jae Sung Park, Ranjay Krishna
Title: Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
Abstract: Visual grouping, operationalized through tasks such as instance segmentation, visual grounding, and object detection, enables applications ranging from robotic perception to photo editing. These fundamental problems in computer vision are powered by large-scale, painstakingly annotated datasets. Despite their impact, these datasets are costly to build, biased in coverage, and difficult to scale. Synthetic datasets offer a promising alternative but struggle with flexibility, accuracy, and compositional diversity. We introduce Synthetic Object Compositions (SOC), an accurate and scalable data synthesis pipeline via a novel object-centric composition strategy. It composes high-quality synthetic object segments into new images using 3D geometric layout augmentation and camera configuration augmentation with generative harmonization and mask-area-weighted blending, yielding accurate and diverse masks, boxes, and referring expressions. Models trained on just 100K of our synthetic images outperform those trained on larger real datasets (GRIT 20M, V3Det 200K) and synthetic pipelines (Copy-Paste, X-Paste, SynGround, SegGen) by +24–36%, achieving +10.9 AP on LVIS and +8.4 N_Acc on gRefCOCO. Beyond the general open-vocabulary setup, SOC also enables controllable dataset construction for different use cases and boosts performance in both low-data and closed-vocabulary scenarios. Augmenting LVIS and COCO with synthetic object segments delivers strong performance across different real-data scales and yields even greater improvements under extremely limited real-data conditions, including +6.59 AP on a 1% COCO data setup. Furthermore, this controllability enables targeted data generation for intra-class referring, a diagnostic grounding task we propose that requires fine-grained attribute discrimination.
PaperID: 930,   Poster  https://arxiv.org/pdf/2603.22794     GitHub
Authors: Lishen Qu, Shihao Zhou, Jie Liang, Hui Zeng, Lei Zhang, Jufeng Yang
Title: It Takes Two: A Duet of Periodicity and Directionality for Burst Flicker Removal
Abstract: Flicker artifacts, arising from unstable illumination and row-wise exposure inconsistencies, pose a significant challenge in short-exposure photography, severely degrading image quality. Unlike typical artifacts, e.g., noise and low-light, flicker is a structured degradation with specific spatial-temporal patterns, which are not accounted for in current generic restoration frameworks, leading to suboptimal flicker suppression and ghosting artifacts. In this work, we reveal that flicker artifacts exhibit two intrinsic characteristics, periodicity and directionality, and propose Flickerformer, a transformer-based architecture that effectively removes flicker without introducing ghosting. Specifically, Flickerformer comprises three key components: a phase-based fusion module (PFM), an autocorrelation feed-forward network (AFFN), and a wavelet-based directional attention module (WDAM). Based on the periodicity, PFM performs inter-frame phase correlation to adaptively aggregate burst features, while AFFN exploits intra-frame structural regularities through autocorrelation, jointly enhancing the network’s ability to perceive spatially recurring patterns. Moreover, motivated by the directionality of flicker artifacts, WDAM leverages high-frequency variations in the wavelet domain to guide the restoration of low-frequency dark regions, yielding precise localization of flicker artifacts. Extensive experiments demonstrate that Flickerformer outperforms state-of-the-art approaches in both quantitative metrics and visual quality. The source code is available in the supplementary materials.
PaperID: 931,   Poster  https://arxiv.org/pdf/2603.20970     GitHub
Authors: Uzair Shah, Marco Agus, Mahmoud Gamal, Mahmood Alzubaidi, Corrado Cali, PIERRE MAGISTRETTI, Abdesselam Bouzerdoum, Mowafa Househ
Title: GraPHFormer: a multimodal graph persistent homology transformer for the analysis of neuroscience morphologies
Abstract: Quantitative analysis of neural morphology is central to understanding how circuits develop, compute, and fail. Skeletonized reconstructions of neurons and glia enable systematic study of branching patterns, path lengths, tapering, and spatial organization, with implications for neurodevelopment, learning and memory, and neurodegenerative disease. Current learning pipelines often treat either topology (via persistent homology) or graph structure (via graph neural networks) in isolation. We argue that these views are complementary and introduce GraPHFormer, a multimodal architecture that fuses topological and graph representations for cell morphology analysis. Our vision branch operates on a novel three-channel persistence image derived from the morphological tree: an unweighted TMD-style density, a branch-length channel (persistence), and a branch-radius channel (mean radius along death-to-leaf paths). In parallel, a graph Transformer processes the original skeleton with geometric/radial attributes. We explore lightweight fusion strategies (late fusion and cross-attention) and train under both supervised and contrastive regimes. We extensively evaluate GraPHFormer on established morphology benchmarks and show that it consistently and significantly outperforms strong topology-only, graph-only, and morphometrics baselines. Beyond accuracy, we demonstrate practical relevance by discriminating neuronal and glial morphologies across cortical areas and species, and by detecting signatures associated with developmental trajectories and degenerative conditions.
PaperID: 932,   Poster  https://arxiv.org/pdf/2603.21904     GitHub
Authors: Linkuan Zhou, Yinghao Xia, Yufei Shen, Xiangyu Li, Wenjie Du, Cong Cong, leyi wei, Ran Su, Qiangguo Jin
Title: SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation
Abstract: Unsupervised Domain Adaptation (UDA) is essential for deploying medical segmentation models across diverse clinical environments. Existing methods are fundamentally limited, suffering from semantically unaware feature alignment that results in poor distributional fidelity and from pseudo-label validation that disregards global anatomical constraints, thus failing to prevent the formation of globally implausible structures. To address these issues, we propose SHAPE (Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation), a framework that reframes adaptation towards global anatomical plausibility. Built on a DINOv3 foundation, its Hierarchical Feature Modulation (HFM) module first generates features with both high fidelity and class-awareness. This shifts the core challenge to robustly validating pseudo-labels. To augment conventional pixel-level validation, we introduce Hypergraph Plausibility Estimation (HPE), which leverages hypergraphs to assess the global anatomical plausibility that standard graphs cannot capture. This is complemented by Structural Anomaly Pruning (SAP) to purge remaining artifacts via cross-view stability. SHAPE significantly outperforms prior methods on cardiac and abdominal cross-modality benchmarks, achieving state-of-the-art average Dice scores of 90.08% (MRI→CT) and 78.51% (CT→MRI) on cardiac data, and 87.48% (MRI→CT) and 86.89% (CT→MRI) on abdominal data.
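The Dice scores quoted above measure the overlap between a predicted mask and a reference mask. As a reminder of what the number means, here is a minimal sketch of the standard Dice coefficient on flattened binary masks. The function name and the toy masks are illustrative and not taken from the paper, whose evaluation protocol (for example per-class or per-volume averaging) may differ.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: the prediction marks two pixels, the reference marks one of them.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
print(f"Dice = {score:.4f}")
```

Reported percentages such as 90.08% correspond to this ratio multiplied by 100 and averaged over the evaluated structures.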
PaperID: 933,   Poster  https://arxiv.org/pdf/2603.22819     GitHub
Authors: Qin Chunxia, Chenyu Liu, Pengcheng Xia, Jun Du, Baocai Yin, Bing Yin, Cong Liu
Title: TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment
Abstract: Tables are pervasive in diverse documents, making table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines separately model table structure and content, leading to suboptimal integration and complex workflows. End-to-end approaches rely heavily on large-scale TR data and struggle in data-constrained scenarios. To address these issues, we propose TDATR (Table Detail-Aware Table Recognition), which improves end-to-end TR through table detail-aware learning and cell-level visual alignment. TDATR adopts a “perceive-then-fuse” strategy. The model first performs table detail-aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks designed under a language modeling paradigm. These tasks can naturally leverage document data from diverse scenarios to enhance model robustness. The model then integrates implicit table details to generate structured HTML outputs, enabling more efficient TR modeling when trained with limited data. Furthermore, we design a structure-guided cell localization module integrated into the end-to-end TR framework, which efficiently locates cells and strengthens vision–language alignment. It enhances the interpretability and accuracy of TR. We achieve state-of-the-art or highly competitive performance on seven benchmarks without dataset-specific fine-tuning.
PaperID: 934,   Poster  https://arxiv.org/pdf/2603.12533     GitHub
Authors: Yura Choi, Roy Miles, Rolandos Alexandros Potamias, Ismail Elezi, Jiankang Deng, Stefanos Zafeiriou
Title: Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
Abstract: Understanding and answering questions based on a user’s pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to infer fine-grained pointing intent from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos across multiple deictic reasoning tasks. Built upon it, we further propose Hand Intent Tokens (HINT), which encodes tokens derived from 3D hand keypoints using an off-the-shelf reconstruction model and interleaves them with the model input to provide explicit spatial and temporal context for interpreting pointing intent. We show that our model outperforms others across different backbones and model sizes. In particular, HINT-14B achieves 68.1% accuracy on average over 6 tasks, surpassing the state-of-the-art, InternVL3-14B, by 6.6%. To further facilitate open research, we will release the code, model, and dataset.
PaperID: 935,   Poster  https://arxiv.org/pdf/2512.22939     GitHub
Authors: Qihang Peng, Xuesong Chen, Chenye Yang, Shaoshuai Shi, Hongsheng Li
Title: ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving
Abstract: Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision–language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision–language–action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.
PaperID: 936,   Poster  https://arxiv.org/pdf/2512.17717     GitHub
Authors: Cheng Peng, Zhuo Su, Liao Wang, Chen Guo, Zhaohu Li, Chengjiang Long, Zheng Lv, Jingxiang Sun, Chenyangguang Zhang, Yebin Liu
Title: FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation
Abstract: We present FlexAvatar, a flexible large reconstruction model for high-fidelity 3D head avatars with detailed dynamic deformation from single or sparse images, without requiring camera poses or expression labels. It leverages a transformer-based reconstruction model with structured head query tokens as canonical anchors to aggregate flexible input-number-agnostic, camera-pose-free and expression-free inputs into a robust canonical 3D representation. For detailed dynamic deformation, we introduce a lightweight UNet decoder conditioned on UV-space position maps, which can produce detailed expression-dependent deformations in real time. To better capture rare but critical expressions like wrinkles and bared teeth, we also adopt a data distribution adjustment strategy during training to balance the distribution of these expressions in the training set. Moreover, a lightweight 10-second refinement can further enhance identity-specific details in extreme identities without affecting deformation quality. Extensive experiments demonstrate that our FlexAvatar achieves superior 3D consistency and detailed dynamic realism compared with previous methods, providing a practical solution for animatable 3D avatar creation.
PaperID: 937,   Poster  https://arxiv.org/pdf/2511.23151     GitHub
Authors: Jin-Seop Lee, Sungjoon Lee, SeongJun Jung, Boyang Li, Jee-Hyong Lee
Title: Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding
Abstract: Video Temporal Grounding (VTG) aims to localize a temporal segment in a video corresponding to a natural language query. However, existing VTG models assume that a relevant segment always exists, causing them to always predict a target segment even when the query is irrelevant to the video. While recent approaches attempt to handle irrelevant queries, they can only reject those that are entirely unrelated to the video and still fail to handle hard-irrelevant queries that are semantically similar but not actually relevant. To address this, we propose Refusal-Aware Reinforcement Fine-Tuning (RA-RFT) to effectively refuse hard-irrelevant queries in VTG. Our method is based on the Group Relative Policy Optimization (GRPO) framework and integrates four reward objectives (format, refuse-IoU, explain, and query correction) to improve both relevance discrimination and fine-grained semantic reasoning. In addition, to effectively support RA-RFT, we construct a Hard-Irrelevant VTG (HI-VTG) dataset, which includes hard-irrelevant queries and their refusal answers. We demonstrate the effectiveness of our method across various relevance-aware VTG scenarios, including hard-irrelevant VTG, simply-shuffled RA-VTG, and human-annotated RA-VTG settings. We also show that the proposed method is scalable by applying it to various LVLM-based VTG models.
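For readers unfamiliar with GRPO, the following is a minimal sketch of its group-relative advantage step applied to a reward that combines four terms named after the objectives in the abstract. The unweighted sum, the use of the population standard deviation, the function names, and the toy reward values are all assumptions made for illustration; the abstract does not specify how RA-RFT weights or computes its rewards.

```python
from statistics import mean, pstdev


def total_reward(format_r, refuse_iou_r, explain_r, correction_r):
    """Combine the four reward terms. Assumption: a plain unweighted sum."""
    return format_r + refuse_iou_r + explain_r + correction_r


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same video and query.
group = [
    total_reward(1.0, 0.8, 0.5, 1.0),
    total_reward(1.0, 0.2, 0.0, 0.0),
    total_reward(0.0, 0.0, 0.0, 0.0),
    total_reward(1.0, 0.6, 0.5, 0.0),
]
advantages = group_relative_advantages(group)
print([round(a, 3) for a in advantages])
```

Responses that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged, so no separate value network is needed.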
PaperID: 938,   Poster  https://arxiv.org/pdf/2604.03723     GitHub
Authors: Guiyu Zhang, Yabo Chen, Xunzhi Xiang, Junchao Huang, Zhongyu Wang, Li Jiang
Title: SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation
Abstract: Controlling both camera motion and object dynamics is essential for coherent and expressive video generation, yet current methods typically handle only one motion type or rely on ambiguous 2D cues that entangle camera-induced parallax with true object movement. We present SymphoMotion, a unified motion-control framework that jointly governs camera trajectories and object dynamics within a single model. SymphoMotion features a Camera Trajectory Control mechanism that integrates explicit camera paths with geometry-aware cues to ensure stable, structurally consistent viewpoint transitions, and an Object Dynamics Control mechanism that combines 2D visual guidance with 3D trajectory embeddings to enable depth-aware, spatially coherent object manipulation. To support large-scale training and evaluation, we further construct RealCOD-25K, a comprehensive real-world dataset containing paired camera poses and object-level 3D trajectories across diverse indoor and outdoor scenes, addressing a key data gap in unified motion control. Extensive experiments and user studies show that SymphoMotion significantly outperforms existing methods in visual fidelity, camera controllability, and object-motion accuracy, establishing a new benchmark for unified motion control in video generation.
PaperID: 939,   Poster  https://arxiv.org/pdf/2512.05422     GitHub
Authors: Jiangtong Tan, Lin Liu, Jie Huang, Xiaopeng Zhang, Qi Tian, Feng Zhao
Title: ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction
Abstract: Unified multimodal models significantly improve visual generation by combining vision-language models (VLMs) with diffusion models. However, existing methods struggle to fully balance sufficient interaction and flexible implementation due to vast representation differences. Considering the abundant and hierarchical information in a VLM's layers, from low-level details to high-level semantics, we propose ParaUni. It extracts features from various VLM layers in a Parallel way for comprehensive information interaction and retains a flexible separation architecture to enhance generation in Unified multimodal models. Concretely, visual features from all of the VLM's layers are fed in parallel into a Layer Integration Module (LIM), which efficiently integrates fine-grained details and semantic abstractions and provides the fused representation as a condition to the diffusion model. To further enhance performance, we reveal that these hierarchical layers respond unequally to different rewards in Reinforcement Learning (RL). Crucially, we design a Layer-wise Dynamic Adjustment Mechanism (LDAM) that aligns with the hierarchical properties of these layers during RL to facilitate improvements across multiple rewards. Extensive experiments show ParaUni leverages complementary multi-layer features to substantially improve generation quality and shows strong potential for multiple reward advances during RL stages.
PaperID: 940,   Poster  https://arxiv.org/pdf/2603.21055     GitHub
Authors: Pengchong Hu, Zhizhong Han
Title: SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM
Abstract: 3D Gaussian Splatting (3DGS) has made huge progress in RGBD SLAM. Current methods usually use 3D Gaussians or view-tied 3D Gaussians to represent radiance fields in tracking and mapping. However, these Gaussians are either too flexible or too limited in movements, resulting in slow convergence or limited rendering quality. To resolve this issue, we adopt pixel-aligned Gaussians but allow each Gaussian to adjust its position along its ray to maximize the rendering quality, even if Gaussians are simplified for improving scalability. To speed up the tracking, we model the depth distribution around each pixel as a Gaussian function, and then use these points to align each frame to the 3D scene quickly. We report our evaluations on widely used benchmarks, justify our designs, and show advantages over the latest methods in view rendering, camera tracking, runtime, and storage complexity.
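The idea of a pixel-aligned Gaussian that may only slide along its camera ray can be pictured with a small geometric sketch. The code below places a Gaussian center at the measured depth along a pixel's ray and shifts it by a scalar offset. It assumes depth is measured as distance along the ray, and the function name, arguments, and numbers are hypothetical; how the paper parameterizes and optimizes the adjustment is not stated in the abstract.

```python
def gaussian_center(origin, direction, depth, offset=0.0):
    """Return the 3D center of a pixel-aligned Gaussian on its camera ray.

    origin    -- camera center (x, y, z)
    direction -- ray direction through the pixel (need not be unit length)
    depth     -- sensor depth, assumed here to be distance along the ray
    offset    -- scalar adjustment along the ray (the only positional freedom)
    """
    norm = sum(c * c for c in direction) ** 0.5
    if norm == 0.0:
        raise ValueError("ray direction must be non-zero")
    unit = [c / norm for c in direction]
    distance = depth + offset
    return [o + distance * u for o, u in zip(origin, unit)]


# A pixel looking straight down the z-axis with a 1.5 m depth reading,
# pushed 0.25 m further along its ray.
center = gaussian_center([0.0, 0.0, 0.0], [0.0, 0.0, 2.0], depth=1.5, offset=0.25)
print(center)
```

Restricting each Gaussian to one degree of positional freedom keeps it tied to its pixel while still letting optimization correct noisy depth readings.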
PaperID: 941,   Poster  https://arxiv.org/pdf/2603.14153     GitHub
Authors: Junyao Hu, Zhongwei Cheng, Waikeung Wong, Xingxing Zou
Title: Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories
Abstract: Virtual try-on (VTON) has advanced single-garment visualization, yet real-world fashion centers on full outfits with multiple items, layering, fine-grained categories, and diverse styling, all beyond current systems. Existing datasets are category-limited and lack outfit diversity. We introduce Garments2Look, the first large-scale multimodal dataset for outfit-level VTON, comprising 80K many-garments-to-one-look pairs across 40 major categories and 361+ fine-grained subcategories. Each pair includes 3-12 item images, a model image in a complete outfit, and detailed item and try-on annotations. We further design a synthesis pipeline balancing authenticity and diversity: it maximizes use of raw images for realism, and explicitly injects diverse styles and specific styling techniques during outfit/look synthesis. To probe task difficulty, we adapt SOTA VTON methods and general-purpose image editors to establish baselines. Results show current methods struggle to try on full outfits seamlessly and to infer correct layering, leading to misalignment and artifacts. All data will be open-sourced.
PaperID: 942,   Poster  https://arxiv.org/pdf/2603.24134     GitHub
Authors: Haoyu Ji, Bowen Chen, Zhihao Yang, Wenze Huang, Yu Gao, Xueting Liu, Weihong Ren, Zhiyong Wang, Honghai LIU
Title: Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation
Abstract: Skeleton-based Temporal Action Segmentation (STAS) seeks to densely segment and classify diverse actions within long, untrimmed skeletal motion sequences. However, existing STAS methodologies face challenges of limited inter-class discriminability and blurred segmentation boundaries, primarily due to insufficient distinction of spatio-temporal patterns between adjacent actions. To address these limitations, we propose Spectral Scalpel, a frequency-selective filtering framework aimed at suppressing shared frequency components between adjacent distinct actions while amplifying their action-specific frequencies, thereby enhancing inter-action discrepancies and sharpening transition boundaries. Specifically, Spectral Scalpel employs adaptive multi-scale spectral filters as scalpels to edit frequency spectra, coupled with a discrepancy loss between adjacent actions serving as the surgical objective. This design amplifies representational disparities between neighboring actions, effectively mitigating boundary localization ambiguities and categorical confusions. Furthermore, complementing long-term temporal modeling, we introduce a frequency-aware channel mixer to strengthen channel evolution by aggregating spectra across channels. This work presents a novel paradigm for STAS that extends conventional spatio-temporal modeling by incorporating frequency-domain analysis. Extensive experiments on five public datasets demonstrate that Spectral Scalpel achieves state-of-the-art performance.
PaperID: 943,   Poster  https://arxiv.org/pdf/2603.10128     GitHub
Authors: Daichao Zhao, Qiupu Chen, Feng He, Xin Ning, Qiankun Li
Title: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation
Abstract: Lane detection is a crucial task in autonomous driving, which is conducive to ensuring the safe operation of vehicles. However, current datasets like CULane and TuSimple have relatively limited data under extreme weather conditions, such as rain, snow and fog, which makes detection models unreliable in extreme conditions, potentially leading to serious safety-critical failures on the road. In this direction, we propose HG-Lane, a High-fidelity Generation framework for Lane Scenes under adverse weather and lighting conditions, without the need for re-annotation and training. Based on our framework, we further propose a benchmark that includes adverse weather and lighting conditions, with 30,000 images. Experimental results demonstrate that our method consistently and significantly improves the detection performance of all the related lane detection networks. Taking the state-of-the-art CLRNet as an example, the overall mF1 on our benchmark increases by 20.87%. The F1@50 for the overall, normal, snow, rain, fog, night, and dusk categories increases by 19.75%, 8.63%, 38.8%, 14.96%, 26.84%, 21.5%, and 12.04%, respectively. Code and dataset are included in the supplementary materials.
PaperID: 944,   Poster  https://arxiv.org/pdf/2604.15311     GitHub
Authors: Zhanhao Liang, Tao Yang, Jie Wu, Chengjian Feng, Liang Zheng
Title: LeapAlign: Post-training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
Abstract: This paper focuses on the alignment of flow matching models with human preference. A promising way is fine-tuning by directly backpropagating reward signals through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early-step latents. Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of terms with large gradients, instead of completely removing them as done in previous works. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image–text alignment.
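To make the notion of a leap concrete, the sketch below replaces many small ODE sampling steps with two large single-step jumps, each computed as x + (t_to - t_from) * v(x, t_from). This is a plain Euler jump on a toy straight-line flow, offered only as an assumption-laden illustration: the time convention (t = 0 is noise, t = 1 is data), the function names, and the constant toy velocity are all hypothetical, and the abstract does not say how LeapAlign itself predicts the future latents.

```python
def leap(x, t_from, t_to, velocity):
    """Jump from t_from to t_to in one step using the velocity at t_from."""
    v = velocity(x, t_from)
    return [xi + (t_to - t_from) * vi for xi, vi in zip(x, v)]


def two_leap_trajectory(x_start, t_start, t_mid, t_end, velocity):
    """Shorten a long sampling trajectory into two consecutive leaps."""
    x_mid = leap(x_start, t_start, t_mid, velocity)
    x_end = leap(x_mid, t_mid, t_end, velocity)
    return x_mid, x_end


# Toy straight-line flow: x_t = (1 - t) * noise + t * data, so v = data - noise.
noise = [1.0, -2.0]
data = [3.0, 0.0]


def toy_velocity(x, t):
    # Stand-in for a learned velocity network; constant for a straight path.
    return [d - n for d, n in zip(data, noise)]


t_start, t_mid, t_end = 0.2, 0.6, 1.0
x_start = [(1 - t_start) * n + t_start * d for n, d in zip(noise, data)]
x_mid, x_end = two_leap_trajectory(x_start, t_start, t_mid, t_end, toy_velocity)
print(x_mid, x_end)
```

On this toy path the two leaps land on the data point up to floating-point error because the velocity is constant; with a learned, curved flow a single-step leap is only an approximation of the long trajectory, which is consistent with the abstract's choice to give higher weight to leaps that agree with the long generation path.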
PaperID: 945,   Poster  https://arxiv.org/pdf/2603.00611     GitHub
Authors: Lijing Cai, Zhan Shi, Chenglong Huang, Jinyao Wu, Qiping Li, Zikang Huo, Linsen Chen, Chongde Zi, Xun Cao
Title: Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark
Abstract: Recently, Spectral Compressive Imaging (SCI) has achieved remarkable success, unlocking significant potential for dynamic spectral vision. However, existing reconstruction methods, primarily image-based, suffer from two limitations: (i) the encoding process masks spatial-spectral features, leading to uncertainty in reconstructing missing information from single compressed measurements, and (ii) the frame-by-frame reconstruction paradigm fails to ensure temporal consistency, which is crucial in video perception. To address these challenges, this paper seeks to advance spectral reconstruction from the image level to the video level, leveraging the complementary features and temporal continuity across adjacent frames in dynamic scenes. Initially, we construct the first high-quality dynamic hyperspectral image dataset (DynaSpec), comprising 30 sequences obtained through frame-scanning acquisition. Subsequently, we propose the Propagation-Guided Spectral Video Reconstruction Transformer (PG-SVRT), which employs spatial-then-temporal attention to effectively reconstruct spectral features from abundant video information, while using a bridged token to reduce computational complexity. Finally, we conduct simulation experiments to assess the performance of four SCI systems, and construct a DD-CASSI prototype for real-world data collection and benchmarking. Extensive experiments demonstrate that PG-SVRT achieves superior performance in reconstruction quality, spectral fidelity, and temporal consistency, while maintaining minimal FLOPs.
PaperID: 946,   Poster  https://arxiv.org/pdf/2603.27383     GitHub
Authors: Nazia Tasnim, Shrimai Prabhumoye, Bryan A. Plummer
Title: Decompose, Mix, Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression
Abstract: Parameter Recombination (PR) methods aim to efficiently compose the weights of a neural network, and encompass tasks like Parameter-Efficient Fine-Tuning (PEFT) and Model Compression (MC), among others. Most methods typically focus on one application of PR, which can make composing them challenging. For example, when deploying a large model you may wish to compress the model and also quickly adapt to new settings. However, PEFT methods can often still contain millions of parameters. This may be small compared to the original model size, but can be problematic in resource-constrained deployments like edge devices, where they take a larger portion of the compressed model's parameters. To address this, we present Coefficient-gated weight Recombination by Interpolated Shared basis Projections (CRISP), a general approach that can address multiple PR tasks within the same framework, which can enable seamless integration. It accomplishes this by using a factorization process that decomposes pretrained weights into basis matrices and their component projections. Sharing these basis matrices across layers and adjusting their size enables us to perform MC, whereas the small size of the projection weights (fewer than 200 in some experiments) enables CRISP to support PEFT. Experiments on ViT models show CRISP outperforms methods from prior work capable of dual-task applications by 4-5% while also outperforming the state-of-the-art in PEFT by 1.5% and PEFT+MC combinations by almost 1%.
PaperID: 947,   Poster  https://arxiv.org/pdf/2603.00152     GitHub
Authors: Haoxiang Sun, Tao Wang, Chenwei Tang, Li Yuan, Jiancheng Lv
Title: Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design
Abstract: Following the success of Group Relative Policy Optimization (GRPO) in foundation LLMs, an increasing number of works have sought to adapt GRPO to Visual Large Language Models (VLLMs) for visual perception tasks (e.g., detection and segmentation). However, much of this line of research rests on a longstanding yet unexamined assumption: training paradigms developed for language reasoning can be transferred seamlessly to visual perception. Our experiments show that this assumption is not valid, revealing intrinsic differences between reasoning-oriented and perception-oriented settings. Using reasoning segmentation as a representative case, we surface two overlooked factors: (i) the need for a broader output space, and (ii) the importance of fine-grained, stable rewards. Building on these observations, we propose Dr. Seg, a simple, plug-and-play GRPO-based framework consisting of a Look-to-Confirm mechanism and a Distribution-Ranked Reward module, requiring no architectural modifications and integrating seamlessly with existing GRPO-based VLLMs. Extensive experiments demonstrate that Dr. Seg improves performance in complex visual scenarios while maintaining strong generalization. Code, data, and models will be publicly released upon acceptance.
PaperID: 948,   Poster  https://arxiv.org/pdf/2511.18050     GitHub
Authors: Tian Ye, Song Fei, Lei Zhu
Title: UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
Abstract: Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval@4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.
PaperID: 949,   Poster  https://arxiv.org/pdf/2511.21688     GitHub
Authors: Wenbo hu, JINGLI LIN, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, Jiangmiao Pang
Title: G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
Abstract: Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G^2VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G^2VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G^2VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G^2VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.
PaperID: 950,   Poster  https://arxiv.org/pdf/2506.08013     GitHub
Authors: Anh Quan Cao, Ivan Lopes, Raoul de Charette
Title: StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets
Abstract: Multi-task learning for dense prediction is limited by the need for extensive annotation for every task, although recent works have explored training with partial task labels. Leveraging the generalization power of diffusion models, we extend the partial learning setup to a zero-shot setting, training a multi-task model on multiple synthetic datasets, each labeled for only a subset of tasks. Our method, StableMTL, repurposes an image generator for latent regression, adapting a denoising framework with task encoding, task conditioning and a tailored training scheme. Instead of per-task losses requiring careful balancing, a unified latent loss is adopted, enabling seamless scaling to more tasks. To encourage inter-task synergy, we introduce a multi-stream model with a task-attention mechanism that converts N-to-N task interactions into efficient N-to-one attention, promoting effective cross-task sharing. StableMTL outperforms baselines on 7 tasks across 8 benchmarks.
PaperID: 951,   Poster  https://arxiv.org/pdf/2512.12799     GitHub
Authors: Zhe Liu, Runhui Huang, Rui Yang, Siming Yan, Zining Wang, Lu Hou, Di Lin, Xiang Bai, Hengshuang Zhao
Title: DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
Abstract: Although multimodal large language models (MLLMs) have shown remarkable capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs within a unified framework remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework for autonomous driving, performing spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through joint optimization. We term it 4D MLLM as it outputs both 3D occupancy and flow, capturing fine-grained spatial-temporal dynamics. Specifically, to capture both precise geometric information and rich appearance, our approach integrates point clouds, multi-view images and language instructions within a single MLLM architecture. Remarkably, despite utilizing only a 0.5B Qwen2.5 model as the MLLM, our proposed DrivePI still maintains promising textual scene understanding while achieving competitive performance in 3D perception, prediction, and planning tasks. Moreover, DrivePI even surpasses most specialized vision-based models across these tasks, highlighting the effectiveness of our unified approach. We hope this new VLA framework can inspire future research to enhance autonomous driving systems with improved interpretability and explainable decision-making through language reasoning and fine-grained 3D outputs. To facilitate future research, we will release the code and annotated datasets.
PaperID: 952,   Poster  https://arxiv.org/pdf/2512.04926     GitHub
Authors: Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng LIN, MINGYU GUO, Chong Luo, Nanning Zheng
Title: Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
Abstract: Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates that the preceding semantics can benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing this, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining the compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256×256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100× faster convergence than original DiT without guidance. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling.
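The idea of semantics preceding textures by a temporal offset can be pictured with a small sketch. The parameterization below is an assumption made for illustration, not the schedule used in the paper: `progress` is a hypothetical shared clock, each latent's denoising progress runs from 0 (pure noise) to 1 (clean), and `offset` is a hypothetical lag applied to the texture latent.

```python
def async_progress(progress: float, offset: float = 0.25) -> tuple:
    """Return (semantic_progress, texture_progress), each clamped to [0, 1].

    Hypothetical convention: the shared clock `progress` runs from 0 to
    1 + offset, and the texture latent lags the semantic latent by `offset`.
    """
    def clamp(x: float) -> float:
        return max(0.0, min(1.0, x))

    semantic = clamp(progress)
    texture = clamp(progress - offset)
    return semantic, texture


if __name__ == "__main__":
    # Walk the shared clock in a few coarse steps and show both progress values.
    for step in (0.0, 0.25, 0.5, 1.0, 1.25):
        print(step, async_progress(step))
```

Under this convention the semantic latent is always at least as clean as the texture latent, which is the ordering the abstract describes.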
PaperID: 953,   Poster  https://arxiv.org/pdf/2603.27645     GitHub
Authors: Qi Guo, Jue Wang, Yinhe Liu, Yanfei Zhong
Title: OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery
Abstract: Open-vocabulary change detection (OVCD) seeks to recognize arbitrary changes of interest by enabling generalization beyond a fixed set of predefined classes. We reformulate OVCD as a two-stage pipeline: first generate class-agnostic change proposals using visual foundation models (VFMs) such as SAM and DINOv2, and then perform category identification with vision-language models (VLMs) such as CLIP. We reveal that category identification errors are the primary bottleneck of OVCD, mainly due to the limited ability of VLMs based on image-text matching to represent fine-grained land-cover categories. To address this, we propose OpenDPR, a training-free vision-centric diffusion-guided prototype retrieval framework. OpenDPR leverages diffusion models to construct diverse prototypes for target categories offline, and performs similarity retrieval with change proposals in the visual space during inference. A secondary bottleneck lies in change localization, due to the lack of change priors in VFMs under unsupervised settings. To bridge this gap, we design a spatial-to-change weakly supervised change detection module named S2C to adapt their strong spatial modeling capabilities for binary change localization. Integrating the pretrained S2C into OpenDPR leads to a weakly supervised variant named OpenDPR-W, which further improves OVCD with minimal supervision. Experimental results on four benchmark datasets demonstrate that the proposed methods achieve state-of-the-art performance under both supervision modes. Code is available in the Supplementary Material.
PaperID: 954,   Poster  https://arxiv.org/pdf/2604.02093     GitHub
Authors: Rong Fan, Kaiyan Xiao, Minghao Zhu, Liuyi Wang, KAI DAI, Zhao Yang
Title: GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding
Abstract: Video Temporal Grounding (VTG) is a critical task in video understanding and a key capability for extending Video Large Language Models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Furthermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to model temporal dependencies and achieve precise video localization. We comprehensively evaluate our model on three standard VTG benchmarks, where GroundVTS outperforms state-of-the-art methods, achieving a +7.7% mIoU improvement on moment retrieval and +12.0% mAP on highlight detection. Code will be publicly available.
PaperID: 955,   Poster  https://arxiv.org/pdf/2602.20537     GitHub
Authors: Xinyong Cai, Changbin Sun, Yong Wang, Hongyu Yang, Yuankai Wu
Title: PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning
Abstract: Spatiotemporal predictive learning (STPL) aims to forecast future frames from past observations and is essential across a wide range of applications. Compared with recurrent or hybrid architectures, pure convolutional models offer superior efficiency and full parallelism, yet their fixed receptive fields limit their ability to adaptively capture spatially varying motion patterns. Inspired by biological center–surround organization and frequency-selective signal processing, we propose PFGNet, a fully convolutional framework that dynamically modulates receptive fields through pixel-wise frequency-guided gating. The core Peripheral Frequency Gating (PFG) block extracts localized spectral cues and adaptively fuses multi-scale large-kernel peripheral responses with learnable center suppression, effectively forming spatially adaptive band-pass filters. To maintain efficiency, all large kernels are decomposed into separable 1D convolutions (1×k followed by k×1), reducing per-channel computational cost from \mathcal{O}(k^2) to \mathcal{O}(2k). PFGNet enables structure-aware spatiotemporal modeling without recurrence or attention. Experiments on Moving MNIST, TaxiBJ, Human3.6M, and KTH show that PFGNet delivers SOTA or near-SOTA forecasting performance with substantially fewer parameters and FLOPs.
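The per-channel cost reduction from decomposing a k×k kernel into a 1×k pass followed by a k×1 pass can be checked with simple counting. The sketch below only counts multiply-accumulate operations per output pixel and channel; the function names are illustrative and not taken from the PFGNet code.

```python
def dense_kernel_macs(k: int) -> int:
    """Multiply-accumulates per output pixel and channel for one k x k kernel."""
    return k * k


def separable_kernel_macs(k: int) -> int:
    """Multiply-accumulates for a 1 x k pass followed by a k x 1 pass."""
    return k + k


if __name__ == "__main__":
    # Compare the two costs for a few kernel sizes.
    for k in (7, 15, 31):
        dense = dense_kernel_macs(k)
        separable = separable_kernel_macs(k)
        print(f"k={k}: dense={dense}, separable={separable}, ratio={dense / separable:.1f}")
```

The saving grows linearly with k (the ratio is k/2), which is why the decomposition matters most for the large peripheral kernels.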
PaperID: 956,   Poster  https://arxiv.org/pdf/2603.24721     GitHub
Authors: Shengli Zhou, Minghang Zheng, Feng Zheng, Yang Liu
Title: Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
Abstract: Spatial reasoning is the process of locating target objects based on spatial relations in 3D scenes, which plays a crucial role in developing intelligent embodied agents. Due to the limited availability of 3D scene-language paired data, it is challenging to train models with strong reasoning ability from scratch. Previous approaches have attempted to inject 3D scene representations into the input space of Large Language Models (LLMs) and leverage the pretrained comprehension and reasoning abilities for spatial reasoning. However, models encoding absolute positions struggle to extract spatial relations from prematurely fused features, while methods explicitly encoding all spatial relations (which is quadratic in the number of objects) as input tokens suffer from poor scalability. To address these limitations, we propose QuatRoPE, a novel positional embedding method whose input length is linear in the number of objects and which explicitly calculates pairwise spatial relations through the dot product in attention layers. QuatRoPE's holistic vector encoding of 3D coordinates guarantees a high degree of spatial consistency, maintaining fidelity to the scene's geometric integrity. Additionally, we introduce the Isolated Gated RoPE Extension (IGRE), which effectively limits QuatRoPE's influence to object-related tokens, thereby minimizing interference with the LLM's existing positional embeddings and maintaining the LLM's original capabilities. Extensive experiments demonstrate the effectiveness of our approaches.
PaperID: 957,   Poster  https://arxiv.org/pdf/2603.14507     GitHub
Authors: Zhuoxuan Peng, Boan Zhu, Xingjian Zhang, Wenying Li, S.-H. Gary Chan
Title: Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets
Abstract: Current mmWave datasets for human pose estimation (HPE) are scarce and lack diversity in both point cloud (PC) attributes and human poses, severely hampering the generalization ability of their trained models. On the other hand, unlabeled mmWave HPE data and diverse LiDAR HPE datasets are readily available. We propose EMDUL, a novel approach to expand the volume and diversity of an existing mmWave dataset using unlabeled mmWave data and a LiDAR dataset. EMDUL trains a pseudo-label estimator to annotate the unlabeled mmWave data and is able to convert, or translate, a given annotated LiDAR PC to its mmWave counterpart. Expanded with both LiDAR-converted and pseudo-labeled mmWave PCs, our mmWave dataset significantly boosts the performance and generalization ability of all our HPE models, with substantial 15.1% and 18.9% error reductions for in-domain and out-of-domain settings, respectively.
PaperID: 958,   Poster  https://arxiv.org/pdf/2602.10815     GitHub
Authors: Aojun Lu, Tao Feng, Hangjie Yuan, Wei Li, Yanan Sun
Title: Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training
Abstract: The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL’s generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization.
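Filtering a training set by sample difficulty can be sketched as follows. The abstract does not specify how difficulty is measured or where the cut-offs lie, so everything concrete here is an assumption: `pass_rate` is a hypothetical per-sample score (the fraction of attempts a reference model answers correctly), and the two thresholds are placeholder values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    prompt: str
    answer: str
    pass_rate: float  # hypothetical difficulty proxy: 1.0 = easy, 0.0 = hard


def curate_by_difficulty(
    samples: List[Sample],
    min_pass_rate: float = 0.2,  # placeholder threshold: drop the hardest samples
    max_pass_rate: float = 0.8,  # placeholder threshold: drop the easiest samples
) -> List[Sample]:
    """Keep only samples whose difficulty proxy falls inside the chosen band."""
    return [s for s in samples if min_pass_rate <= s.pass_rate <= max_pass_rate]


if __name__ == "__main__":
    data = [
        Sample("q-hard", "a", pass_rate=0.05),
        Sample("q-medium", "b", pass_rate=0.5),
        Sample("q-easy", "c", pass_rate=0.95),
    ]
    print([s.prompt for s in curate_by_difficulty(data)])
```

The curated list would then be passed to an ordinary SFT loop; whether easy samples should also be dropped is a design choice this sketch makes only for illustration.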
PaperID: 959,   Poster  https://arxiv.org/pdf/2604.07399     GitHub
Authors: Wonseon Lim, Jaesung Lee, Dae-Won Kim
Title: Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge
Abstract: Continual learning (CL) on edge devices requires not only high accuracy but also training-time efficiency to support on-device model adaptation under limited memory and compute resources. While prompt-based continual learning (PCL) achieves strong performance with few learnable parameters, existing studies primarily optimize accuracy or inference efficiency, overlooking the cost of on-device training. In this paper, we propose CPS-Prompt, a critical patch-aware sparse prompting framework that enhances training efficiency with minimal accuracy loss by combining Critical Patch Sampling (CPS) for task-aware token selection and Decoupled Prompt–Classifier Training (DPCT) for representation alignment. Extensive experiments across three public datasets demonstrate that CPS-Prompt reduces peak memory usage and training time by 36% and 35%, respectively, while maintaining accuracy within 2% of the state-of-the-art method, C-Prompt, and matching the balanced CODA-Prompt baseline.
PaperID: 960,   Poster  https://arxiv.org/pdf/2602.19206     GitHub
Authors: Zehao Deng, An Liu, Yan Wang
Title: GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning
Abstract: Zero-shot 3D Anomaly Detection (ZS3DAD) is an emerging task that aims to detect anomalies in a target dataset without any target training data, which is particularly important in scenarios constrained by sample scarcity and data privacy concerns. While current methods adapt CLIP by projecting 3D point clouds into 2D representations, they face challenges. The projection inherently loses some geometric details, and the reliance on a single 2D modality provides an incomplete visual understanding, limiting their ability to detect diverse anomaly types. To address these limitations, we propose the Geometry-Aware Prompt and Synergistic View Representation Learning (GS-CLIP) framework, which enables the model to identify geometric anomalies through a two-stage learning process. In stage 1, we dynamically generate text prompts embedded with 3D geometric priors. These prompts contain global shape context and local defect information distilled by our Geometric Defect Distillation Module (GDDM). In stage 2, we introduce a Synergistic View Representation Learning architecture that processes rendered and depth images in parallel. A Synergistic Refinement Module (SRM) subsequently fuses the features of both streams, capitalizing on their complementary strengths. Comprehensive experimental results on four large-scale public datasets show that GS-CLIP achieves superior performance in detection and segmentation. Code will be released upon acceptance.
PaperID: 961,   Poster  https://arxiv.org/pdf/2604.00549     GitHub
Authors: Zhijin He, Shuo Jin, Siyue Yu, Shuwei Wu, Bingfeng Zhang, Li Yu, Jimin Xiao
Title: TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object Detection
Abstract: Co-salient Object Detection (CoSOD) aims to segment salient objects that consistently appear across a group of related images. Despite the notable progress achieved by recent training-based approaches, they remain constrained by closed-set datasets and exhibit limited generalization. However, few studies explore the potential of Vision Foundation Models (VFMs), which demonstrate strong generalization ability and robust saliency understanding, to address CoSOD. In this paper, we investigate and leverage VFMs for CoSOD, and further propose a novel training-free method, TF-SSD, through the synergy between SAM and DINO. Specifically, we first utilize SAM to generate comprehensive raw proposals, which serve as a candidate mask pool. Then, we introduce a quality mask generator to filter out redundant masks, thereby acquiring a refined mask set. Since this generator is built upon SAM, it inherently lacks semantic understanding of saliency. To address this, we adopt an intra-image saliency filter that employs DINO's attention maps to identify visually salient masks within individual images. Moreover, to extend saliency understanding across group images, we propose an inter-image prototype selector, which computes similarity scores among cross-image prototypes to select masks with the highest score. These selected masks serve as final predictions for CoSOD. Extensive experiments show that our TF-SSD outperforms existing methods (e.g., 13.7% gains over the recent training-free method). Code will be released.
PaperID: 962,   Poster  https://arxiv.org/pdf/2603.12764     GitHub
Authors: Xiang Li, Heqian Qiu, Lanxiao Wang, Benliu Qiu, Fanman Meng, Linfeng Xu, Hongliang Li
Title: SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion
Abstract: Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego→Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align–Fuse–Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion. On the EgoMe benchmark, SAVA-X consistently improves AUPRC and mean tIoU over all baselines, and ablations confirm the complementary benefits of its components.
PaperID: 963,   Poster  https://arxiv.org/pdf/2509.23728     GitHub
Authors: Yiheng Zhang, Zhuojiang Cai, Mingdao Wang, Meitong Guo, Tianxiao Li, Li Lin, Yuwang Wang
Title: M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation
Abstract: In text-driven 3D scene generation, object layout serves as a crucial intermediate representation that bridges high-level language instructions with detailed geometric output. It not only provides a structural blueprint for ensuring physical plausibility but also supports semantic controllability and interactive editing. However, the learning capabilities of current 3D indoor layout generation models are constrained by the limited scale, diversity, and annotation quality of existing datasets. To address this, we introduce M3DLayout, a large-scale, multi-source dataset for 3D indoor layout generation. M3DLayout comprises 21,367 layouts and over 433k object instances, integrating three distinct sources: real-world scans, professional CAD designs, and procedurally generated scenes. Each layout is paired with detailed structured text describing global scene summaries, relational placements of large furniture, and fine-grained arrangements of smaller items. This diverse and richly annotated resource enables models to learn complex spatial and semantic patterns across a wide variety of indoor environments. To assess the potential of M3DLayout, we establish a benchmark using both a text-conditioned diffusion model and a text-conditioned autoregressive model. Experimental results demonstrate that our dataset provides a solid foundation for training layout generation models. Its multi-source composition enhances diversity, notably through the Inf3DLayout subset, which provides rich small-object information, enabling the generation of more complex and detailed scenes. We hope that M3DLayout can serve as a valuable resource for advancing research in text-driven 3D scene synthesis. The dataset and code will be made public upon acceptance.
PaperID: 964,   Poster  https://arxiv.org/pdf/2603.10526     GitHub
Authors: Pei Liu, xiangxiang Zeng, Tengfei Ma, Yucheng Xing, Xuanbai Ren, Yiping Liu
Title: Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in Whole-Slide Image Prognosis
Abstract: Whole-Slide Images (WSIs) are widely used for estimating the prognosis of cancer patients. Current studies generally follow a cancer-specific learning paradigm. However, the available training samples for one cancer type are usually scarce in pathology. Consequently, the model often struggles to learn generalizable knowledge, thus performing worse on tumor samples with inherently high heterogeneity. Although multi-cancer joint learning and knowledge transfer approaches have been explored recently to address this, they either rely on large-scale joint training or extensive inference across multiple models, posing new challenges in computational efficiency. To this end, this paper proposes a new scheme, Sparse Task Vector Mixup with Hypernetworks (STEPH). Unlike previous approaches, it efficiently absorbs generalizable knowledge from other cancers for the target via model merging: i) applying task vector mixup to each source-target pair and then ii) sparsely aggregating task vector mixtures to obtain an improved target model, driven by hypernetworks. Extensive experiments on 13 cancer datasets show that STEPH improves over cancer-specific learning and an existing knowledge transfer baseline by 5.14% and 2.01%, respectively. Moreover, it is a more efficient solution for learning prognostic knowledge from other cancers, without requiring large-scale joint training or extensive multi-model inference.
PaperID: 965,   Poster  https://arxiv.org/pdf/2603.05012     GitHub
Authors: Yulong Shi, Shijie Li, Ziyi Li, Lin Qi
Title: Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model
Abstract: Source Free Unsupervised Domain Adaptation (SFUDA) is critical for deploying deep learning models across diverse clinical settings. However, existing methods are typically designed for low-gap, specific domain shifts and lack the ability to generalize into a unified, multi-modality, and multi-target framework, which presents a major barrier to real-world application. To overcome this issue, we introduce Tell2Adapt, a novel SFUDA framework that harnesses the vast, generalizable knowledge of the Vision Foundation Model (VFM). Our approach ensures high-fidelity VFM prompts through Context-Aware Prompts Regularization (CAPR), which robustly translates varied text prompts into canonical instructions. This enables the generation of high-quality pseudo-labels for efficiently adapting the lightweight source model to the target domain. To guarantee clinical reliability, the framework incorporates Visual Plausibility Refinement (VPR), which leverages the VFM's anatomical knowledge to re-ground the adapted model's predictions in the original target image's low-level visual features, effectively removing noise and false positives. We conduct one of the most extensive SFUDA evaluations to date, validating our framework across 10 domain adaptation directions and 22 anatomical targets, including brain, cardiac, polyp, and abdominal targets. Our results demonstrate that Tell2Adapt consistently outperforms existing approaches, achieving SOTA for a unified SFUDA framework in medical image segmentation.
PaperID: 966,   Poster  https://arxiv.org/pdf/2602.23029     GitHub
Authors: Tianyue Wang, Leigang Qu, tianyu yang, xiangzhao hao, Yifan Xu, Haiyun Guo, Jinqiao Wang
Title: WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a “retrieve–verify–refine” pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released.
PaperID: 967,   Poster  https://arxiv.org/pdf/2604.13746     GitHub
Authors: Jie Liang, Jiahao Wu, Chao Wang, Jiayu Yang, Xiaoyun Zheng, Kaiqiang Xiong, Zhanke Wang, Jinbo Yan, Feng Gao, Ronggang Wang
Title: ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction
Abstract: Dynamic 3D scene reconstruction is essential for immersive media such as VR, MR, and XR, yet remains challenging for long multi-view sequences with large-scale motion. Existing dynamic Gaussian approaches are either Frame-Stream, offering scalability but poor temporal stability, or Clip, achieving local consistency at the cost of high memory and limited sequence length. We propose ClipGStream, a hybrid reconstruction framework that performs stream optimization at the clip level rather than the frame level. The sequence is divided into short clips, where dynamic motion is modeled using clip-independent spatio-temporal fields and residual anchor compensation to capture local variations efficiently, while inter-clip inherited anchors and decoders maintain structural consistency across clips. This Clip-Stream design enables scalable, flicker-free reconstruction of long dynamic videos with high temporal coherence and reduced memory overhead. Extensive experiments demonstrate that ClipGStream achieves state-of-the-art reconstruction quality and efficiency.
PaperID: 968,   Poster  https://arxiv.org/pdf/2604.00455     GitHub
Authors: Jiwoo Ha, Jongwoo Baek, Jinhyun So
Title: First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models
Abstract: Recent Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks that require understanding both visual and linguistic inputs. However, object hallucination (the generation of nonexistent objects in answers) remains a persistent challenge. Although several approaches such as retraining and external grounding methods have been proposed to mitigate this issue, they still suffer from high data costs or structural complexity. Training-free methods such as Contrastive Decoding (CD) are more cost-effective, avoiding additional training or external models, but still suffer from long-term decay, where visual grounding weakens and language priors dominate as the generation progresses. In this paper, we propose First Logit Boosting (FLB), a simple yet effective training-free technique designed to alleviate long-term decay in LVLMs. FLB stores the logit of the first generated token and adds it to subsequent token predictions, effectively mitigating long-term decay of visual information. We observe that FLB (1) sustains the visual information embedded in the first token throughout generation, and (2) suppresses hallucinated words through the stabilizing effect of the “The” token. Experimental results show that FLB significantly reduces object hallucination on AMBER and CHAIR benchmarks. Notably, it causes negligible inference overhead, making it highly applicable to real-time multimodal systems.
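The store-then-add mechanism can be sketched with a toy greedy decoder. Several details are assumptions, not taken from the paper: "the logit of the first generated token" is read here as the full logit vector from the first decoding step, `boost_weight` is a hypothetical scaling factor, and `next_logits` stands in for a real model's forward pass.

```python
from typing import Callable, List, Optional, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def decode_with_first_logit_boost(
    next_logits: Callable[[List[int]], Sequence[float]],  # stand-in for a model
    prompt: List[int],
    max_new_tokens: int,
    boost_weight: float = 1.0,  # hypothetical scaling of the stored logits
) -> List[int]:
    """Greedy decoding that adds the first step's logits to every later step."""
    tokens = list(prompt)
    first_logits: Optional[List[float]] = None
    for _ in range(max_new_tokens):
        logits = list(next_logits(tokens))
        if first_logits is None:
            # Store the logits that produce the first generated token.
            first_logits = logits
        else:
            # Boost later predictions with the stored first-step logits.
            logits = [l + boost_weight * f for l, f in zip(logits, first_logits)]
        tokens.append(argmax(logits))
    return tokens


def toy_model(tokens: List[int]) -> List[float]:
    """Toy 3-token vocabulary: prefers token 1 at the first step, token 0 later."""
    return [0.0, 1.0, 0.0] if len(tokens) == 1 else [0.5, 0.0, 0.0]


if __name__ == "__main__":
    print(decode_with_first_logit_boost(toy_model, [0], 3, boost_weight=1.0))
    print(decode_with_first_logit_boost(toy_model, [0], 3, boost_weight=0.0))
```

In the toy run, the boosted decoder keeps favoring the token that was strong at the first step, while the unboosted one drifts to the later-step preference. The only extra work per step is one vector addition.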
PaperID: 969,   Poster  https://arxiv.org/pdf/2604.14176     GitHub
Authors: Haiyang Zheng, Nan Pu, Yaqi Cai, Teng Long, Wenjing Li, Nicu Sebe, Zhun Zhong
Title: The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery
Abstract: Generalized Category Discovery (GCD) aims to categorize unlabeled samples that may belong to either known or unknown categories by leveraging the knowledge from labeled data. Most previous methods jointly optimize supervised and unsupervised objectives and achieve promising results. However, inherent optimization interference still limits their ability to improve further. Through quantitative analysis, we identify a key issue, i.e., gradient entanglement, which 1) distorts supervised gradients and weakens discrimination among known classes, and 2) induces representation-subspace overlap between known and novel classes, reducing the separability of novel categories. To address this issue, we propose the Energy-Aware Gradient Coordinator (EAGC), a plug-and-play gradient-level module that explicitly regulates the optimization process. EAGC comprises two components: Anchor-based Gradient Alignment (AGA) and Energy-aware Elastic Projection (EEP). AGA introduces a reference model to anchor the gradient directions of labeled samples, preserving the discriminative structure of known classes against the interference of unlabeled gradients. EEP softly projects unlabeled gradients onto the complement of the known-class subspace and derives an energy-based coefficient to adaptively scale the projection for each unlabeled sample according to its degree of alignment with the known subspace, thereby reducing subspace overlap without suppressing unlabeled samples that likely belong to known classes. EAGC can be seamlessly integrated with both parametric and non-parametric GCD methods. Experiments show that EAGC consistently boosts existing approaches and establishes new state-of-the-art results on multiple GCD benchmarks.
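A soft projection onto the complement of a subspace, scaled per sample, might look like the sketch below. The abstract gives no formula, so the specifics are assumptions: the known-class subspace is represented by a hypothetical orthonormal `basis`, the "energy" is taken as the fraction of the gradient's squared norm lying in that subspace, and the projection strength is set to one minus that fraction so that strongly aligned samples are suppressed less.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def soft_complement_projection(
    grad: Sequence[float],
    basis: Sequence[Sequence[float]],  # hypothetical orthonormal basis of the known subspace
) -> List[float]:
    """Remove part of the gradient's known-subspace component.

    Assumed rule: strength = 1 - energy, where energy is the share of the
    squared gradient norm inside the known subspace.
    """
    norm_sq = dot(grad, grad)
    if norm_sq == 0.0:
        return list(grad)

    # Component of the gradient that lies inside the known subspace.
    in_subspace = [0.0] * len(grad)
    for u in basis:
        coeff = dot(grad, u)
        in_subspace = [p + coeff * ui for p, ui in zip(in_subspace, u)]

    energy = dot(in_subspace, in_subspace) / norm_sq  # value in [0, 1]
    strength = 1.0 - energy  # assumed energy-based coefficient
    return [g - strength * p for g, p in zip(grad, in_subspace)]


if __name__ == "__main__":
    known_basis = [[1.0, 0.0]]
    print(soft_complement_projection([1.0, 1.0], known_basis))
    print(soft_complement_projection([1.0, 0.0], known_basis))
```

With strength fixed at 1 this reduces to an ordinary orthogonal projection onto the complement; the per-sample coefficient is what makes the projection "elastic" in this reading.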
PaperID: 970,   Poster  https://arxiv.org/pdf/2603.25135     GitHub
Authors: Taegyoon Yoon, Yegyu Han, Seojin Ji, Jaewoo Park, Sojeong Kim, Taein Kwon, Hyung-Sin Kim
Title: EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions
Abstract: Smart glasses are emerging as useful devices since they provide plenty of insights in hands-busy, eyes-on-task situations. To understand the context of the wearer, 6D object pose estimation in egocentric view is becoming essential. However, existing 6D object pose estimation benchmarks fail to capture the challenges of real-world egocentric applications, which are often dominated by severe motion blur, dynamic illumination, and visual obstructions. This discrepancy creates a significant gap between controlled lab data and chaotic real-world application. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. EgoXtreme features three challenging scenarios (industrial maintenance, sports, and emergency rescue) designed to introduce severe perceptual ambiguities through extreme lighting, heavy motion blur, and smoke. Evaluations of state-of-the-art generalizable pose estimators on EgoXtreme indicate that their generalization fails to hold in extreme conditions, especially under low light. We further demonstrate that simply applying image restoration (e.g., deblurring) offers no improvement under extreme conditions. In contrast, a tracking-based approach shows performance gains, implying that using temporal information in fast-motion scenarios is meaningful. We conclude that EgoXtreme is an essential resource for developing and evaluating the next generation of pose estimation models robust enough for real-world egocentric vision. The dataset will be publicly available.
PaperID: 971,   Poster  https://arxiv.org/pdf/2512.01686     GitHub
Authors: Patrick Kwon, Chen Chen
Title: DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models
Abstract: Current story visualization methods tend to position subjects solely by text and face challenges in maintaining artistic consistency. To address these limitations, we introduce DreamingComics, a layout-aware story visualization framework. We build upon a pretrained video diffusion-transformer (DiT) model, leveraging its spatiotemporal priors to enhance identity and style consistency. For layout-based position control, we propose RegionalRoPE, a region-aware positional encoding scheme that re-indexes embeddings based on the target layout. Additionally, we introduce a masked condition loss to further constrain each subject's visual features to their designated region. To infer layouts from natural language scripts, we integrate an LLM-based layout generator trained to produce comic-style layouts, enabling flexible and controllable layout conditioning. We present a comprehensive evaluation of our approach, showing a 29.2% increase in character consistency and 36.2% increase in style similarity compared to previous methods, while displaying high spatial accuracy.
PaperID: 972,   Poster  https://arxiv.org/pdf/2603.22763     GitHub
Authors: Ao Cheng, Xingming Li, Xuanyu Ji, Xixiang He, Qiyao Sun, Chunping Qiu, Runke Huang, Qingyong Hu
Title: ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding
Abstract: Electronic Navigational Charts (ENCs) are the safety-critical backbone of modern maritime navigation, yet it remains unclear whether multimodal large language models (MLLMs) can reliably interpret them. Unlike natural images or conventional charts, ENCs encode regulations, bathymetry, and route constraints via standardized vector symbols, scale-dependent rendering, and precise geometric structure, requiring specialized maritime expertise for interpretation. We introduce ENC-Bench, the first benchmark dedicated to professional ENC understanding. ENC-Bench contains 20,490 expert-validated samples from 840 authentic National Oceanic and Atmospheric Administration (NOAA) ENCs, organized into a three-level hierarchy: Perception (symbol and feature recognition), Spatial Reasoning (coordinate localization, bearing, distance), and Maritime Decision-Making (route legality, safety assessment, emergency planning under multiple constraints). All samples are generated from raw S-57 data through a calibrated vector-to-image pipeline with automated consistency checks and expert review. We evaluate 10 state-of-the-art MLLMs, including GPT-4o, Gemini 2.5, Qwen3-VL, InternVL-3, and GLM-4.5V, under a unified zero-shot protocol. The best model achieves only 47.88% accuracy, with systematic challenges in symbolic grounding, spatial computation, multi-constraint reasoning, and robustness to lighting and scale variations. By establishing the first rigorous ENC benchmark, we open a new research frontier at the intersection of specialized symbolic reasoning and safety-critical AI, providing essential infrastructure for advancing MLLMs toward professional maritime applications.
PaperID: 973,   Poster  https://arxiv.org/pdf/2603.21287     GitHub
Authors: Yuntian Bo, Yazhou Zhu, Piotr Koniusz, Haofeng Zhang
Title: Focus on Background: Exploring SAM's Potential in Few-Shot Medical Image Segmentation with Background-Centric Prompting
Abstract: Conventional few-shot medical image segmentation (FSMIS) approaches face performance bottlenecks that hinder broader clinical applicability. Although the Segment Anything Model (SAM) exhibits strong category-agnostic segmentation capabilities, its direct application to medical images often leads to over-segmentation due to ambiguous anatomical boundaries. In this paper, we reformulate SAM-based FSMIS as a prompt localization task and propose FoB (Focus on Background), a background-centric prompt generator that provides accurate background prompts to constrain SAM’s over-segmentation. Specifically, FoB bridges the gap between segmentation and prompt localization by category-agnostically generating support background prompts and localizing them directly in the query image. To address the challenge of prompt localization for novel categories, FoB models rich contextual information to capture foreground-background spatial dependencies. Moreover, inspired by the inherent structural patterns of background prompts in medical images, FoB models this structure as a constraint to progressively refine background prompt predictions. Experiments on three diverse medical image datasets demonstrate that FoB outperforms other baselines by large margins, achieves state-of-the-art performance on FSMIS, and further exhibits strong cross-domain generalization. All code will be released upon acceptance.
PaperID: 974,   Poster  https://arxiv.org/pdf/2602.20880     GitHub
Authors: Yongli Xiang, Ziming Hong, Zhaoqing Wang, Xiangyu Zhao, Bo Han, Tongliang Liu
Title: When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
Abstract: Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are averaged across multiple harmful categories based on predefined keywords. However, these approaches fail to capture the complex interplay among different harm categories, leading to "harmful conflicts" where mitigating one type of harm may inadvertently amplify another, thus increasing the overall harmful rate. To address this issue, we propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. CASG is composed of two components: (i) Conflict-aware Category Identification (CaCI), which identifies the harmful category most aligned with the model’s evolving generative state, and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety steering solely along the identified category to avoid multi-category interference. CASG can be applied to both latent-space and text-space safeguards. Experiments on T2I safety benchmarks demonstrate CASG's state-of-the-art performance, reducing the harmful rate by up to 15.4% compared to existing methods.
PaperID: 975,   Poster  https://arxiv.org/pdf/2602.20060     GitHub
Authors: Junli Wang, Yinan Zheng, Xueyi Liu, Zebin Xing, Pengfei Li, Kun Ma, Hangjun Ye, Guang Chen, Guang Li, Long Chen, Zhongpu Xia, Qichao Zhang
Title: MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Driving
Abstract: Generative models have shown great potential in trajectory planning. Recent studies demonstrate that anchor-guided generative models are effective in modeling the uncertainty of driving behaviors and improving overall performance. However, these methods rely on discrete anchor vocabularies that must sufficiently cover the trajectory distribution during testing to ensure robustness, inducing an inherent trade-off between vocabulary size and model performance. To overcome this limitation, we propose MeanFuser, an end-to-end autonomous driving method that enhances both efficiency and robustness through three key designs. (1) We introduce Gaussian Mixture Noise (GMN) to guide generative sampling, enabling a continuous representation of the trajectory space and eliminating the dependency on discrete anchor vocabularies. (2) We introduce "MeanFlow Identity", which models the mean velocity field between GMN and data distribution instead of the instantaneous velocity field used in naïve flow-matching methods, effectively eliminating numerical errors from ODE solvers and significantly accelerating inference. (3) We design a lightweight Adaptive Reconstruction Module (ARM) that enables the model to consider all sampled proposals and adaptively decide whether to reconstruct a trajectory when none of the proposals is satisfactory. Experiments on the NAVSIM closed-loop benchmark demonstrate that MeanFuser achieves outstanding performance and exceptional inference efficiency, offering a robust and efficient solution for end-to-end autonomous driving.
PaperID: 976,   Poster  https://arxiv.org/pdf/2511.22119     GitHub
Authors: Mingzhe Li, Renhao 'Norman' Zhang, Zhiyang Wen, Siqi Pan, Bruno da Silva, Juan Zhai, Shiqing Ma
Title: PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization
Abstract: Text-to-image (T2I) generative models such as Stable Diffusion and FLUX can synthesize realistic, high-quality images directly from textual prompts. The resulting image quality depends critically on well-crafted prompts that specify both subjects and stylistic modifiers, which have become valuable digital assets. However, the rising value and ubiquity of high-quality prompts expose them to security and intellectual-property risks. One key threat is the prompt stealing attack, i.e., the task of recovering the textual prompt that generated a given image. Prompt stealing enables unauthorized extraction and reuse of carefully engineered prompts, yet it can also support beneficial applications such as data attribution, model provenance analysis, and watermarking validation. Existing approaches often assume white-box gradient access, require large-scale labeled datasets for supervised training, or rely solely on captioning without explicit optimization, limiting their practicality and adaptability. To address these challenges, we propose PROMPTMINER, a black-box prompt stealing framework that decouples the task into two phases: (1) a reinforcement learning–based optimization phase to reconstruct the primary subject, and (2) a fuzzing-driven search phase to recover stylistic modifiers. Experiments across multiple datasets and diffusion backbones demonstrate that PROMPTMINER achieves superior results, with CLIP similarity up to 0.958 and textual alignment with SBERT up to 0.751, surpassing all baselines. Even when applied to in-the-wild images with unknown generators, it outperforms the strongest baseline by 7.5% in CLIP similarity, demonstrating better generalization. Finally, PROMPTMINER maintains strong performance under defensive perturbations, highlighting remarkable robustness.
PaperID: 977,   Poster  https://arxiv.org/pdf/2509.13615     GitHub
Authors: Zongru Wu, Rui Mao, Zhiyuan Tian, Pengzhou Cheng, Tianjie Ju, Zheng Wu, Lingzhong Dong, Haiyue Sheng, Zhuosheng Zhang, Gongshen Liu
Title: See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
Abstract: The advent of multimodal agents facilitates effective interaction with graphical user interfaces (GUIs), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions derived from public datasets. Evaluation results of existing agents demonstrate their notable unreliability, particularly when the current toggle state already matches the desired state. To address the challenge, we propose State-aware Reasoning (StaR), a multimodal reasoning method that enables agents to perceive the current toggle state, infer the desired state from the instruction, and act accordingly. Experiments on four multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public agentic benchmarks show that StaR also enhances general agentic task performance. Finally, evaluations on a dynamic environment highlight the potential of StaR for real-world applications. Code, benchmark, and StaR-enhanced agents are available at Anonymous.
PaperID: 978,   Poster  https://arxiv.org/pdf/2603.28182     GitHub
Authors: Xuanlong Yu, Youyang Sha, Longfei Liu, Xi Shen, Di Yang
Title: A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps
Abstract: Few-shot object detection (FSOD) is challenging due to unstable optimization and limited generalization arising from the scarcity of training samples. To address these issues, we propose a hybrid ensemble decoder that enhances generalization during fine-tuning. Inspired by ensemble learning, the decoder comprises a shared hierarchical layer followed by multiple parallel decoder branches, where each branch employs denoising queries either inherited from the shared layer or newly initialized to encourage prediction diversity. This design fully exploits pretrained weights without introducing additional parameters, and the resulting diverse predictions can be effectively ensembled to improve generalization. We further leverage a unified progressive fine-tuning framework with a plateau-aware learning rate schedule, which stabilizes optimization and achieves strong few-shot adaptation without complex data augmentations or extensive hyperparameter tuning. Extensive experiments on CD-FSOD, ODinW-13, and RF100-VL validate the effectiveness of our approach. Notably, on RF100-VL, which includes 100 datasets across diverse domains, our method achieves an average performance of 41.9 in the 10-shot setting, significantly outperforming the recent approach SAM3, which obtains 35.7. We further construct a mixed-domain test set from CD-FSOD to evaluate robustness to out-of-distribution (OOD) samples, showing that our proposed modules lead to clear improvements. These results highlight the effectiveness, generalization, and robustness of the proposed method. Our code will be released upon publication.
PaperID: 979,   Poster  https://arxiv.org/pdf/2603.09326     GitHub
Authors: Tengjin Weng, Wenhao Jiang, Jingyi Wang, Ming Li, Lin Ma, Zhong Ming
Title: OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models
Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision–language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, where a single element differs from all others by one or multiple visual attributes such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5, and proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human levels in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model’s fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence. All resources will be publicly released upon acceptance.
PaperID: 980,   Poster  https://arxiv.org/pdf/2603.29296     GitHub
Authors: Haoran Zhou, Gim Hee Lee
Title: MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting
Abstract: Realistic reconstruction of dynamic 4D scenes is essential for understanding the physical world. Despite recent progress in monocular view synthesis, existing methods still struggle to recover accurate 3D geometry and temporally consistent motion in complex environments. To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences, enabling faithful reconstruction of high-fidelity scene structures and coherent motion representation under complex dynamics. To handle motion with arbitrary flexibility and long-term variation, we introduce a scalable motion field built upon cluster-based bases that adaptively grow to capture diverse motion patterns over time. Moreover, we introduce a progressive optimization strategy that extends naturally to unseen frames. This strategy comprises two propagation modules: 1) a background module that adapts to newly appearing objects, refines camera poses, and accounts for shadows; 2) a foreground module that refines motion through a three-stage process. Extensive experiments on challenging real-world datasets demonstrate that our MotionScale achieves superior reconstruction quality and motion consistency that significantly outperform prior 4D scene reconstruction methods. Our code will be open-sourced upon paper acceptance.
PaperID: 981,   Poster  https://arxiv.org/pdf/2504.10018     GitHub
Authors: Xiao Wang, Haiyang Wang, Shiao Wang, Qiang Chen, Jiandong Jin, Haoyu Song, Bo Jiang, Chenglong Li
Title: RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework
Abstract: Existing pedestrian attribute recognition methods are generally developed based on RGB frame cameras. However, these approaches are constrained by the limitations of RGB cameras, such as sensitivity to lighting conditions and motion blur, which hinder their performance. Furthermore, current attribute recognition primarily focuses on analyzing pedestrians' external appearance and clothing, lacking an exploration of emotional dimensions. In this paper, we revisit these issues and propose a novel multimodal RGB-Event attribute recognition task, drawing on the advantages of event cameras in low-light and high-speed conditions and their low power consumption. Specifically, we introduce the first large-scale multi-modal pedestrian attribute recognition dataset, termed EventPAR, comprising 100K paired RGB-Event samples that cover 50 attributes related to both appearance and six human emotions, diverse scenes, and various seasons. By retraining and evaluating mainstream PAR models on this dataset, we establish a comprehensive benchmark and provide a solid foundation for future research in terms of data and algorithmic baselines. In addition, we propose a novel RWKV-based multi-modal pedestrian attribute recognition framework, featuring an RWKV visual encoder and an asymmetric RWKV fusion module. Extensive experiments are conducted on our proposed dataset as well as two simulated datasets (MARS-Attribute and DukeMTMC-VID-Attribute), achieving state-of-the-art results. The source code and dataset will be released upon acceptance.
PaperID: 982,   Poster  https://arxiv.org/pdf/2602.21698     GitHub
Authors: Meiqi Sun, Mingyu Li, Junxiong Zhu
Title: E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought
Abstract: Generative AI is widely used to create commercial posters. However, rapid advances in generation have outpaced automated quality assessment. Existing models emphasize generic esthetics or low-level distortions and lack the functional criteria required for e-commerce design. This is especially challenging for Chinese content, where complex characters often produce subtle but critical textual artifacts that are overlooked by existing methods. To address this, we introduce E-comIQ-ZH, a framework for evaluating Chinese e-commerce posters. We build E-comIQ-18k, the first dataset to feature multi-dimensional scores and expert-calibrated Chain-of-Thought (CoT) rationales. Using this dataset, we train E-comIQ-M, a specialized evaluation model that aligns with human expert judgment. Our framework enables E-comIQ-Bench, the first automated and scalable benchmark for the generation of Chinese e-commerce posters. Extensive experiments show our E-comIQ-M aligns more closely with expert standards and enables scalable automated assessment of e-commerce posters. All datasets, models, and evaluation tools will be released to support future research in this area.
PaperID: 983,   Poster  https://arxiv.org/pdf/2506.13387     GitHub
Authors: Beilei Cui, Yiming Huang, Long Bai, Hongliang Ren
Title: TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast
Abstract: This work presents a generalizable framework to transfer relative depth to metric depth. Current monocular depth estimation methods are mainly divided into metric depth estimation (MMDE) and relative depth estimation (MRDE). MMDEs estimate depth in metric scale but are often limited to a specific domain. MRDEs generalize well across different domains, but their uncertain scale hinders downstream applications. To this end, we aim to build a framework that resolves scale uncertainty and transfers relative depth to metric depth. Previous methods used language as input and estimated two factors for conducting rescaling. Our approach, TR2M, utilizes both text description and image as inputs and estimates two rescale maps to transfer relative depth to metric depth at pixel level. Features from the two modalities are fused with a cross-modality attention module to better capture scale information. A strategy is designed to construct and filter confident pseudo metric depth for more comprehensive supervision. We also develop dual-level scale-oriented contrastive learning, which uses the depth distribution as guidance to push the model to learn intrinsic knowledge aligned with the scale distribution. TR2M uses only a small number of trainable parameters to train on datasets in various domains. Experiments not only demonstrate TR2M’s strong performance on seen datasets but also reveal superior zero-shot capabilities on five unseen datasets. We show the huge potential of pixel-wise transfer from relative depth to metric depth with language assistance, instead of large metric depth models trained on large amounts of data. Code will be publicly available upon acceptance.
PaperID: 984,   Poster  https://arxiv.org/pdf/2603.23104     GitHub
Authors: Yik Cheng, Runkai Zhao, Weidong Cai
Title: NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization
Abstract: 2D visual foundation models, such as DINOv3, a self-supervised model trained on large-scale natural images, have demonstrated strong zero-shot generalization, capturing rich global context and fine-grained structural cues. However, an analogous 3D foundation model for downstream volumetric neuroimaging remains lacking, largely due to the challenges of 3D image acquisition and the scarcity of high-quality annotations. To address this gap, we propose to adapt the 2D visual representations learned by DINOv3 to a 3D biomedical segmentation model, enabling more data-efficient and morphologically faithful neuronal reconstruction. Specifically, we design an inflation-based adaptation strategy that inflates 2D filters into 3D operators, preserving semantic priors from DINOv3 while adapting to 3D neuronal volume patches. In addition, we introduce a topology-aware skeleton loss to explicitly enforce structural fidelity of graph-based neuronal arbor reconstruction. Extensive experiments on four neuronal imaging datasets, including two from BigNeuron and two public datasets, NeuroFly and CWMBS, demonstrate consistent improvements in reconstruction accuracy over SoTA methods, with average gains of 2.9% in Entire Structure Average, 2.8% in Different Structure Average, and 3.8% in Percentage of Different Structure.
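As a rough illustration of what inflating a 2D filter into a 3D operator can mean, the sketch below uses one common inflation scheme: replicate the 2D kernel along a new depth axis and divide by the depth. Whether the paper uses this exact scheme is not stated in the abstract; the function name and the nested-list kernel layout are assumptions made for the example.

```python
def inflate_kernel(kernel_2d, depth):
    """Inflate a 2D kernel (list of rows) into a 3D kernel of `depth` slices.

    Each slice is the 2D kernel scaled by 1/depth, so the slices sum back to
    the original kernel. On a volume that is constant along the depth axis,
    the inflated filter therefore gives the same response as the 2D filter.
    """
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    return [[[w / depth for w in row] for row in kernel_2d]
            for _ in range(depth)]


if __name__ == "__main__":
    k2d = [[1.0, 2.0], [3.0, 4.0]]
    k3d = inflate_kernel(k2d, depth=4)
    print(len(k3d), k3d[0])
```

The 1/depth rescaling is the design choice that keeps activation magnitudes comparable to the pretrained 2D model at initialization.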
PaperID: 985,   Poster  https://arxiv.org/pdf/2604.14710     GitHub
Authors: Jiyoung Lim, Heejae Yang, Jee-Hyong Lee
Title: G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval
Abstract: Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bimodal composition. Recent training-free zero-shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the fuzzy retrieval nature that requires considering diverse combinations of candidates. This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). G-MIXER constructs composed query features that reflect the implicit semantics of reference image-text pairs through geodesic mixup over a range of mixup ratios, and builds a diverse candidate set. The generated candidates are then re-ranked using explicit semantics derived from MLLMs, improving both retrieval diversity and accuracy. Our proposed G-MIXER achieves state-of-the-art performance across multiple ZS-CIR benchmarks, effectively handling both implicit and explicit semantics without additional training.
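Geodesic mixup over a range of ratios can be sketched as spherical interpolation between L2-normalized image and text features. This is a minimal sketch, not G-MIXER's code: the function names, the default ratios, and the choice of plain spherical interpolation are assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def geodesic_mixup(a, b, lam):
    """Interpolate along the great circle from a (lam=0) to b (lam=1)."""
    a, b = normalize(a), normalize(b)
    cos_theta = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(cos_theta)
    if theta < 1e-6:
        # Nearly identical directions: nothing to interpolate.
        return a
    if math.pi - theta < 1e-6:
        raise ValueError("antipodal inputs have no unique geodesic")
    wa = math.sin((1.0 - lam) * theta) / math.sin(theta)
    wb = math.sin(lam * theta) / math.sin(theta)
    return [wa * x + wb * y for x, y in zip(a, b)]


def composed_queries(image_feat, text_feat, ratios=(0.25, 0.5, 0.75)):
    """Build one composed query per mixup ratio (ratios are hypothetical)."""
    return [geodesic_mixup(image_feat, text_feat, r) for r in ratios]


if __name__ == "__main__":
    for q in composed_queries([1.0, 0.0], [0.0, 1.0]):
        print([round(x, 4) for x in q])
```

Unlike linear mixing, each interpolated query stays on the unit sphere, so cosine-similarity retrieval treats every ratio on an equal footing.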
PaperID: 986,   Poster  https://arxiv.org/pdf/2504.05296     GitHub
Authors: Gal Fiebelman, Hadar Averbuch-Elor, Sagie Benaim
Title: Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation
Abstract: 3D Gaussian Splatting has recently enabled fast and photorealistic reconstruction of static 3D scenes. However, dynamic editing of such scenes remains a significant challenge. We introduce a novel framework, Physics-Guided Score Distillation, to address a fundamental conflict: physics simulation provides a strong motion prior that is insufficient for photorealism, while video-based Score Distillation Sampling (SDS) alone cannot generate coherent motion for complex, multi-particle scenarios. We resolve this through a unified optimization framework where physics simulation guides Score Distillation to jointly refine the motion prior for photorealism while simultaneously optimizing appearance. Specifically, we learn a neural dynamics model that predicts particle motion and appearance, optimized end-to-end via a combined loss integrating Video-SDS for photorealism with our physics-guidance prior. This allows for photorealistic refinements while ensuring the dynamics remain plausible. Our framework enables scene-wide dynamic weather effects, including snowfall, rainfall, fog, and sandstorms, with physically plausible motion. Experiments demonstrate our physics-guided approach significantly outperforms baselines, with ablations confirming this joint refinement is essential for generating coherent, high-fidelity dynamics.
PaperID: 987,   Poster  https://arxiv.org/pdf/2602.19753     GitHub
Authors: Kaifa Yang, Qi Yang, Yiling Xu, Zhu Li
Title: RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a leading technology for high-quality 3D scene reconstruction. However, the iterative refinement and densification process leads to the generation of a large number of primitives, each contributing to the reconstruction to a substantially different extent. Estimating primitive importance is thus crucial, both for removing redundancy during reconstruction and for enabling efficient compression and transmission. Existing methods typically rely on rendering-based analyses, where each primitive is evaluated through its contribution across multiple camera viewpoints. However, such methods 1) are sensitive to the number and selection of views; 2) rely on specialized differentiable rasterizers; and 3) have long calculation times that grow linearly with view count, making them difficult to integrate as plug-and-play modules, as well as resulting in limited scalability and generalization. To address these issues, we propose RAP, a fast feedforward Rendering-free Attribute-guided method for efficient importance score Prediction in 3DGS. RAP infers primitive significance directly from intrinsic Gaussian attributes and local neighborhood statistics, avoiding any rendering-based or visibility-dependent computations. A compact MLP is trained to predict per-primitive importance scores using a combination of rendering loss, pruning-aware loss, and significance distribution regularization loss. After being trained on a small set of scenes, RAP generalizes effectively to unseen data and can be seamlessly integrated into reconstruction, compression, and transmission pipelines, providing a unified and efficient pruning solution.
PaperID: 988,   Poster  https://arxiv.org/pdf/2602.12370     GitHub
Authors: Zekun Li, Sizhe An, Chengcheng Tang, Chuan Guo, Ivan Shugurov, Linguang Zhang, Amy Zhao, Srinath Sridhar, Lingling Tao, Abhay Mittal
Title: LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
Abstract: Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can result in catastrophic forgetting of linguistic capabilities due to the limited scale of available text-motion pairs. Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation. We encode human motion into a causal continuous latent space and maintain the next-token prediction paradigm in the decoder-only backbone through a lightweight flow-matching head, allowing for streaming motion generation in real time (>30 FPS). Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, our experiments demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation, marking a significant step towards a general unified motion-language large model. We will release the code and model upon acceptance.
PaperID: 989,   Poster  https://arxiv.org/pdf/2604.12159     GitHub
Authors: Parth Parag Kulkarni, Rohit Gupta, Prakash Chandra Chhipa, Mubarak Shah
Title: VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale
Abstract: The task of video geolocalization aims to determine the precise GPS coordinates of a video’s origin and map its trajectory, with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale due to the need for extensive image galleries which are infeasible to compile. Comparatively, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features using said aligned frame embeddings. Evaluations on Mapillary (MSLS) and GAMa datasets demonstrate our model’s ability to generate temporally consistent trajectories and outperform baselines, achieving a 20% improvement at the 1 km threshold over GeoCLIP. We also beat the current state of the art by 25% on global coarse-grained video geolocalization (CityGuessr68k). Our approach enables fine-grained video geolocalization and lays a strong foundation for future research. Code and models will be released publicly.
PaperID: 990,   Poster  https://arxiv.org/pdf/2510.16410     GitHub
Authors: Changyue Shi, Minghao Chen, Yiping Mao, Chuxiao Yang, Xinyuan Hu, Jiajun Ding, Zhou Yu
Title: REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting
Abstract: Bridging the gap between complex human instructions and precise 3D object grounding remains a significant challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation without requiring extensive 3D-specific post-training. We perform segmentation directly on 3D Gaussian Splatting representations, capitalizing on their ability to render photorealistic novel views that are highly suitable for MLLM comprehension. As directly feeding one or more rendered views to the MLLM can lead to high sensitivity to viewpoint selection, we propose a novel Global-to-Local Spatial Grounding strategy. Specifically, multiple global views are first fed into the MLLM agent in parallel for coarse-level localization, aggregating responses to robustly identify the target object. Then, several close-up novel views of the object are synthesized to perform fine-grained local segmentation, yielding accurate and consistent 3D masks. Extensive experiments show that REALM achieves remarkable performance in interpreting both explicit and implicit instructions across LERF, 3D-OVS, and our newly introduced REALM3D benchmarks. Furthermore, our agent framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer, demonstrating its practical utility and versatility.
PaperID: 991,   Poster  https://arxiv.org/pdf/2601.10200     GitHub
Authors: Kim Youwang, Lee Hyoseok, Park Subin, Gerard Pons-Moll, Tae-Hyun Oh
Title: ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and Test-time Generative Adaptation
Abstract: We introduce ELITE, an Efficient Gaussian head avatar synthesis from a monocular video via Learned Initialization and TEst-time generative adaptation. Prior works rely either on a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular videos. However, 3D data prior methods often struggle to generalize in-the-wild, while 2D generative prior methods are computationally heavy and prone to identity hallucination. We identify a complementary synergy between these two priors and design an efficient system that achieves high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Specifically, we introduce a feed-forward Mesh2Gaussian Prior Model (MGPM) that enables fast initialization of a Gaussian avatar. To further bridge the domain gap at test time, we design a test-time generative adaptation stage, leveraging both real and synthetic images as supervision. Unlike previous full diffusion denoising strategies that are slow and hallucination-prone, we propose a rendering-guided single-step diffusion enhancer that restores missing visual details, grounded on Gaussian avatar renderings. Our experiments demonstrate that ELITE produces visually superior avatars to prior works, even for challenging expressions, while achieving 60x faster synthesis than the 2D generative prior method.
PaperID: 992,   Poster  https://arxiv.org/pdf/2602.23523     GitHub
Authors: Junjiang Wu, Liejun Wang, Zhiqing Guo
Title: All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark
Abstract: With the rapid advancement of deepfake technology, malicious face manipulations pose a significant threat to personal privacy and social security. However, existing proactive forensics methods typically treat deepfake detection, tampering localization, and source tracing as independent tasks, lacking a unified framework to address them jointly. To bridge this gap, we propose a unified proactive forensics framework that jointly addresses these three core tasks. Our core framework adopts an innovative 152-dimensional landmark-identity watermark termed LIDMark, which structurally interweaves facial landmarks with a unique source identifier. To robustly extract the LIDMark, we design a novel Factorized-Head Decoder (FHD). Its architecture factorizes the shared backbone features into two specialized heads (i.e., regression and classification), robustly reconstructing the embedded landmarks and identifier, respectively, even when subjected to severe distortion or tampering. This design realizes an "all-in-one" trifunctional forensic solution: the regression head underlies an "intrinsic-extrinsic" consistency check for detection and localization, while the classification head robustly decodes the source identifier for tracing. Extensive experiments demonstrate that the proposed LIDMark framework provides a unified, robust, and imperceptible solution for the detection, localization, and tracing of deepfake content.
PaperID: 993,   Poster  https://arxiv.org/pdf/2603.27494     GitHub
Authors: Xuanpu Zhao, Zhentao Tan, Dianmo Sheng, Tianxiang Chen, Yao Liu, Yue Wu, Tao Gong, Qi Chu, Nenghai Yu
Title: Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
Abstract: To enhance the perception and reasoning capabilities of multimodal large models (MLLMs) in complex visual scenes, recent research has introduced agent-based workflows. In these works, MLLMs autonomously utilize image cropping to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine-tuning (SFT) and reinforcement learning (RL), have made significant progress, our empirical analysis reveals a key limitation. By adding random noise to the cropped images, we find that they still maintain most of the performance, especially for models using only reinforcement learning, indicating a heavy reliance on the global input and a weak dependence on details within the cropped region. To address this issue, we propose a novel two-stage reinforcement learning framework that does not require trajectory supervision. In the first stage, we introduce the "Information Gap" mechanism by adjusting the granularity of the global image. This mechanism trains the model to answer questions by focusing on cropped key regions, driven by the information gain these regions provide. The second stage further enhances cropping precision by incorporating a grounding loss, using a small number of bounding box annotations. Experiments show that our method significantly enhances the model's attention to clipped regions, enabling it to achieve state-of-the-art performance on high-resolution visual question-answering benchmarks. Our method provides a more efficient approach for perceiving and reasoning fine-grained details in MLLMs.
PaperID: 994,   Poster  https://arxiv.org/pdf/2603.27481     GitHub
Authors: Chongyang Zhao, Mingsong Li, Haodong Lu, Dong Gong
Title: On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models
Abstract: Multimodal Continual Instruction Tuning aims to continually enhance Large Vision-Language Models by learning from new data without forgetting previously acquired knowledge. Mixture-of-Experts (MoE) architectures support this by adding new experts and expanding routers while keeping existing ones frozen. However, despite expert isolation, they still suffer from forgetting due to router drifting, where old-task tokens are mistakenly attracted to newly added experts, leading to performance degradation, i.e., forgetting. We propose a dynamic MoE approach with drift-aware token assignment to regularize router drifting and mitigate forgetting. We analyze the failure mode and identify its link to how different token types are assigned during training. In particular, tokens with ambiguous assignments between old and new experts tend to cause problems, although some can still be benign or even beneficial. Motivated by this, our proposed LLaVA-DyMoE incrementally expands the MoE and learns with a two-fold regularization strategy that regularizes token assignment and dispatching by representing token types through their routing scores, reducing router drift. Our drift-aware token assignment guidance provides conditional guidance for ambiguous tokens to preserve old patterns, complemented by a pair of synergistic routing losses that enforce separation and promote new expert specialization. Extensive experiments demonstrate that our LLaVA-DyMoE outperforms baselines, achieving over a 7% increase in average accuracy and a 12% reduction in forgetting by mitigating this router-drift–induced forgetting.
PaperID: 995,   Poster  https://arxiv.org/pdf/2503.17788     GitHub
Authors: Gaoge Han, Yongkang Cheng, Zhe Chen, Shaoli Huang, Tongliang Liu
Title: From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction
Abstract: Two-hand reconstruction from monocular images is hampered by complex poses and severe occlusions, which often cause interaction misalignment and two-hand penetration. We address this by decoupling the problem into 2D structural alignment and 3D spatial interaction alignment, each handled by a tailored component. For 2D alignment, we pioneer the attempt to unify heterogeneous structural priors (keypoints, segmentation, and depth) from vision foundation models as complementary structured guidance for two-hand recovery. Instead of extracting prior predictions as explicit inputs, we propose a fusion-alignment encoder that absorbs their structural knowledge implicitly, achieving foundation-level guidance without foundation-level cost. For 3D spatial alignment, we propose a two-hand diffusion model that learns a generative mapping from interpenetrated poses to realistic, collision-free configurations. Guided by collision gradients during denoising, the model converges toward the manifold of valid two-hand interactions, preserving geometric and kinematic coherence. This generative formulation enables physically credible reconstructions even under occlusion or ambiguous visual input. Extensive experiments on InterHand2.6M, HIC, and FreiHAND show state-of-the-art or leading performance in interaction alignment and penetration suppression. Code will be released publicly.
PaperID: 996,   Poster  https://arxiv.org/pdf/2603.12746     GitHub
Authors: Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yunlong Lin, Chenxin Li, Panwang Pan, Junbin Lu, Jingyan Jiang, Xinghao Ding, Yue Huang, Zhi Wang
Title: Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World
Abstract: Humans inhabit a physical 4D world, where spatial geometry and semantic content evolve over time, forming a dynamic reality. While current Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in understanding static visual inputs, it remains unclear whether they can effectively "think in dynamics," i.e., perceive, track, and reason about spatiotemporal evolution in complex scenes. To systematically evaluate these abilities, we introduce Dyn-Bench, a large-scale benchmark designed to assess spatio-temporal reasoning and localized dynamics perception. Constructed through multi-stage filtering over massive 2D and 4D data sources, Dyn-Bench provides a high-quality collection of diverse dynamic scenes, consisting of 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding samples. We comprehensively study general-purpose, spatial-aware, and region-level MLLMs to understand how they "think in dynamics" from both linguistic and visual perspectives. Our results reveal that existing models struggle to jointly excel in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction. Conventional prompting strategies (i.e., chain-of-thought or caption-based hints) provide only limited improvements. In contrast, structured integration approaches, including Mask-Guided Fusion and the Spatio-Temporal Textual Cognitive Map (ST-TCM), substantially enhance MLLMs' dynamic perception and spatio-temporal reasoning in an evolving 4D world. These findings underscore the importance of explicit spatio-temporal structural cues to bridge the gap between static perception and dynamic reasoning in MLLMs.
PaperID: 997,   Poster  https://arxiv.org/pdf/2604.03657     GitHub
Authors: Tianci Luo, Haohao Pan, Jinpeng Wang, Niu Lian, Xinrui Chen, Bin Chen, Shu-Tao Xia, Chun Yuan
Title: Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning
Abstract: Visual in-context learning (VICL) enables visual foundation models to handle multiple tasks by steering them with demonstrative prompts. The choice of such prompts largely influences VICL performance, standing out as a key challenge. Prior work has made substantial progress on prompt retrieval and reranking strategies; however, it has focused primarily on prompt images while often overlooking the labels. We reveal that these approaches sometimes retrieve visually similar but label-inconsistent prompts, which potentially degrade VICL performance. Conversely, higher label consistency between query and prompts tends to indicate stronger VICL results. Motivated by these findings, we develop a framework named LaPR (Label-aware Prompt Retrieval), which highlights the role of labels in prompt selection. Our framework first designs an image–label joint representation for prompts to incorporate label cues explicitly. In addition, to handle unavailable query labels at test time, we introduce a mixture-of-experts mechanism to the dual encoders with query-adaptive routing. Each expert is expected to capture a specific label mode, while the router infers query-adaptive mixture weights and helps to learn label-aware representation. We carefully design alternating optimization for experts and the router, with a VICL performance-guided contrastive loss and a label-guided contrastive loss, respectively. Extensive experiments show promising and consistent improvement of LaPR on in-context segmentation, detection, and colorization tasks. Moreover, LaPR generalizes well across feature extractors and cross-fold scenarios, suggesting the importance of label utilization in prompt retrieval for VICL. Faithful code will be released publicly.
PaperID: 998,   Poster  https://arxiv.org/pdf/2603.25722     GitHub
Authors: Hai X. Pham, David T. Hoffmann, Ricardo Guerrero, Brais Martinez
Title: No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
Abstract: Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard-negative samples. Hard-negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost.
PaperID: 999,   Poster  https://arxiv.org/pdf/2603.23383     GitHub
Authors: Feifan Luo, Hongyang Chen
Title: From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching
Abstract: Shape matching is a fundamental task in computer graphics and vision, with deep functional map methods emerging as a preferred solution. However, existing approaches primarily focus on learning informative feature representations by constraining both pointwise and functional maps, while overlooking the optimization of a crucial component: the spectral basis, which plays a key role in the (deep) functional maps pipeline. This oversight leads to suboptimal matching performance. Furthermore, these approaches mostly rely on conventional functional map techniques, such as time-consuming functional map solvers, which incur substantial computational overhead. To address these issues, we introduce Advanced Functional Maps, which generalizes standard functional maps from fixed basis functions to learnable basis functions, supported by rigorous theoretical guarantees. In this framework, the spectral basis is optimized by learning a set of inhibition functions. Building on this foundation, we propose the first unsupervised spectral basis learning method for efficient and robust non-rigid 3D shape matching, simultaneously optimizing feature extraction and basis functions in an end-to-end manner. A novel heat diffusion module and a new unsupervised loss function are introduced for basis learning, along with a simple yet efficient architecture that eliminates the need for computationally expensive functional map solvers and multiple auxiliary losses. Extensive experiments demonstrate that our method significantly outperforms current state-of-the-art feature-learning-based functional map approaches, especially in challenging non-isometric and topological noise matching scenarios, all while maintaining high computational efficiency. Finally, we demonstrate that optimizing basis functions is equivalent to spectral convolution, with inhibition functions acting as filters. This insight enables enhanced spectral basis representations by designing novel inhibition functions inspired by spectral graph/manifold convolutional networks, opening new avenues for future research.
PaperID: 1000,   Poster  https://arxiv.org/pdf/2506.09217     GitHub
Authors: Boyu Jiang, Liang Shi, Zhengzhi Lin, Lanxin Xiang, Loren Stowe, Feng Guo
Title: Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule
Abstract: The safety of autonomous driving systems (ADS) depends on accurate perception across distance and driving conditions. The outputs of AI perception algorithms are stochastic, which has a major impact on decision making and safety outcomes, including time-to-collision estimation. However, current perception evaluation metrics do not reflect the stochastic nature of perception algorithms. We introduce the Perception Characteristics Distance (PCD), a novel metric incorporating model output uncertainty as represented by the farthest distance at which an object can be reliably detected. To represent a system’s overall perception capability in terms of reliable detection distance, we average PCD values across multiple detection-quality and probability thresholds to produce the average PCD (aPCD). For empirical validation, we present the SensorRainFall dataset, collected on the Virginia Smart Road using a sensor-equipped vehicle (cameras, radar, and LiDAR) controlled under different weather (clear and rainy) and illumination conditions (daylight, streetlight, and nighttime). The dataset includes ground-truth distances, bounding boxes, and segmentation masks for target objects. Experiments with state-of-the-art models show that aPCD captures meaningful differences across weather, daylight, and illumination conditions, which traditional evaluation metrics fail to reflect. PCD provides an uncertainty-aware measure of perception performance, supporting safer and more robust ADS operation, while the SensorRainFall dataset offers a valuable benchmark for evaluation.
PaperID: 1001,   Poster  https://arxiv.org/pdf/2512.10938    
Authors: Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu
Title: Stronger Normalization-Free Transformers
Abstract: Although normalization layers have long been viewed as essential components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. Acting like a normalization layer, the point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work searches for functions that can surpass it. We first study how the intrinsic properties of point-wise functions shape training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$ and identify it as the most performant design. Derf consistently outperforms LayerNorm, RMSNorm, and Dynamic Tanh across a wide range of modalities, tasks, and learning paradigms. Moreover, our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and performance make Derf a practical choice for normalization-free Transformer design.
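Illustrative note (not from the paper): the formula stated in the abstract above, Derf(x) = erf(αx + s), can be written as a point-wise function in a few lines. This is a minimal sketch using only the Python standard library. The names `derf` and `derf_layer`, the default values of `alpha` and `s`, and their treatment as plain scalars are assumptions for illustration; the abstract gives only the functional form, not how the parameters are set or trained.

```python
import math


def derf(x: float, alpha: float = 1.0, s: float = 0.0) -> float:
    """Point-wise Derf(x) = erf(alpha * x + s), as written in the abstract.

    alpha scales the input and s shifts it; the defaults here are
    illustrative assumptions, not values from the paper.
    """
    return math.erf(alpha * x + s)


def derf_layer(values, alpha: float = 1.0, s: float = 0.0):
    """Apply Derf independently to every element of a sequence of activations."""
    return [derf(v, alpha, s) for v in values]
```

Because erf saturates at -1 and 1, every output is bounded regardless of input magnitude, which matches the abstract's description of point-wise functions that constrain extreme values.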
PaperID: 1002,   Poster  https://arxiv.org/pdf/2603.29239    
Authors: Phonphrm Thawatdamrongkit, Sukit Seripanitkarn, Supasorn Suwajanakorn
Title: Diffusion Mental Averages
Abstract: Can a diffusion model produce its own “mental average” of a concept—one that is as sharp and realistic as a typical sample? We introduce Diffusion Mental Averages (DMA), a model-centric answer to this question. While prior methods aim to average image collections, they produce blurry results when applied to diffusion samples from the same prompt. These data-centric techniques operate outside the model, ignoring the generative process. In contrast, DMA averages within the diffusion model’s semantic space, as discovered by recent studies. Since this space evolves across timesteps and lacks a direct decoder, we cast averaging as trajectory alignment: optimize multiple noise latents so their denoising trajectories progressively converge toward shared coarse-to-fine semantics, yielding a single sharp prototype. We extend our approach to multimodal concepts (e.g., dogs with many breeds) by clustering samples in semantically-rich spaces such as CLIP and applying Textual Inversion or LoRA to bridge CLIP clusters into diffusion space. This is, to our knowledge, the first approach that delivers consistent, realistic averages, even for abstract concepts, serving as a concrete visual summary and a lens into model biases and concept representation.
PaperID: 1003,   Poster  https://arxiv.org/pdf/2512.17907    
Authors: Byungjun Kim, Taeksoo Kim, Junyoung Lee, Hanbyul Joo
Title: Dexterous World Models
Abstract: Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static—limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), a scene-action-conditioned video diffusion model enabling embodied interaction within static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human–scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues in the egocentric view to model action-conditioned dynamics directly. We train our model on a synthetic human–scene interaction dataset and a real-world object manipulation dataset, then evaluate it across both synthetic and real-world egocentric benchmarks. Experiments demonstrate that DWM enables realistic, physically grounded interactions, such as grasping, opening, or moving objects, while maintaining camera and scene consistency. This framework establishes the first step toward video diffusion-based interactive digital twins, enabling embodied simulation and 3D scene interactivity from egocentric actions.
PaperID: 1004,   Poster  https://arxiv.org/pdf/2512.21218    
Authors: Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig
Title: Latent Visual Reasoning
Abstract: While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, this strategy imposes restrictive priors on "useful" visual abstractions, creates heavy annotation costs, and struggles to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- in addition to demonstrating strong cross-task generalization.
PaperID: 1005,   Poster  https://arxiv.org/pdf/2601.12796    
Authors: Changwei Jing, Jai Bandi, Jianglong Ye, Rocky Duan, Pieter Abbeel, Xiaolong Wang, Sha Yi
Title: Contact-Aware Neural Dynamics
Abstract: High-fidelity physics simulation is essential for scalable robotic learning, but the sim-to-real gap persists, especially for tasks involving complex, dynamic, and discontinuous interactions like physical contacts. Explicit system identification, which tunes explicit simulator parameters, is often insufficient to align the intricate, high-dimensional, and state-dependent dynamics of the real world. To overcome this, we propose an implicit sim-to-real alignment framework that learns to directly align the simulator's dynamics with contact information. Our method treats the off-the-shelf simulator as a base prior and learns a contact-aware neural dynamics model to refine simulated states using real-world observations. We show that using tactile contact information from robotic hands can effectively model the non-smooth discontinuities inherent in contact-rich tasks, resulting in a neural dynamics model grounded by real-world data. We demonstrate that this learned forward dynamics model improves state prediction accuracy and can be effectively used to predict policy performance and refine policies trained purely in standard simulators, offering a scalable, data-driven approach to sim-to-real alignment.
PaperID: 1006,   Poster  https://arxiv.org/pdf/2512.06255    
Authors: Shijie Wang, Xin Yu, Yadan Luo, Zijian Wang, Peng-Fei Zhang, Zi Huang
Title: Language-driven Fine-grained Retrieval
Abstract: Existing fine-grained image retrieval (FGIR) methods learn discriminative embeddings by adopting semantically sparse one-hot labels derived from category names as supervision. While effective on seen classes, such supervision overlooks the rich semantics encoded in category names, hindering the modeling of comparability among cross-category details and, in turn, limiting generalization to unseen categories. To tackle this, we introduce LaFG, a Language-driven framework for Fine-Grained Retrieval that converts class names into attribute-level supervision using large language models (LLMs) and vision–language models (VLMs). Treating each name as a semantic anchor, LaFG prompts an LLM to generate detailed, attribute-oriented descriptions. To mitigate attribute omission in these descriptions, it leverages a frozen VLM to project them into a vision-aligned space, clustering them into a dataset-wide attribute vocabulary while harvesting complementary attributes from related categories. Leveraging this vocabulary, a global prompt template selects category-relevant attributes, which are aggregated into category-specific linguistic prototypes. These prototypes supervise the retrieval model to steer it toward pinpointing visual details consistent with linguistic descriptions, thus modeling comparability among object details. Extensive evaluations show that LaFG achieves impressive performance on both fine- and coarse-grained benchmarks and generalizes well to unseen categories.
PaperID: 1007,   Poster  https://arxiv.org/pdf/2601.05239    
Authors: Xiao Fu, Shitao Tang, Min Shi, Xian Liu, Jinwei Gu, Ming-Yu Liu, Dahua Lin, Chen-Hsuan Lin
Title: Plenoptic Video Generation
Abstract: Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view settings, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera estimation, and diverse view transformations (e.g., third-person → third-person, and head-view → gripper-view in robotic manipulation).
PaperID: 1008,   Poster  https://arxiv.org/pdf/2506.13212    
Authors: Filippo Maggioli, Simone Melzi, Marco Livesu
Title: Volumetric Functional Maps
Abstract: The computation of volumetric correspondences between 3D shapes is a prominent tool for medical and industrial applications. In this work, we pave the way for spectral volume mapping, extending for the first time the functional maps framework from the surface to the volumetric setting. We show that the eigenfunctions of the volumetric Laplace operator define a functional space that is suitable for high-quality signal transfer. We also experiment with various techniques that edit this functional space, porting them to volume domains. We validate our method on novel volumetric datasets and on tetrahedralizations of well-established surface datasets, also showcasing practical applications involving both discrete and continuous signal mapping, for segmentation transfer, mesh connectivity transfer, and solid texturing. Last but not least, we show that considering the volumetric spectrum greatly improves the accuracy for classical shape matching tasks among surfaces, consistently outperforming existing surface-only spectral methods.
PaperID: 1009,   Poster  https://arxiv.org/pdf/2512.05859    
Authors: Abhijith Punnappurath, Luxi Zhao, Ke Zhao, Hue Nguyen, Radek Grzeszczuk, Michael S. Brown
Title: Edit-aware RAW reconstruction
Abstract: Users frequently edit camera images post-capture to achieve their preferred photofinishing style. While editing in the RAW domain provides greater accuracy and flexibility, most edits are performed on the camera’s display-referred output (e.g., 8-bit sRGB JPEG) since RAW images are rarely stored. Existing RAW reconstruction methods can recover RAW data from sRGB images, but these approaches are typically optimized for pixel-wise RAW reconstruction fidelity and tend to degrade under diverse rendering styles and editing operations. We introduce a plug-and-play, edit-aware loss function that can be integrated into any existing RAW reconstruction framework to make the recovered RAWs more robust to different rendering styles and edits. Our loss formulation incorporates a modular, differentiable image signal processor (ISP) that simulates realistic photofinishing pipelines with tunable parameters. During training, parameters for each ISP module are randomly sampled from carefully designed distributions that model practical variations in real camera processing. The loss is then computed in sRGB space between ground-truth and reconstructed RAWs rendered through this differentiable ISP. Incorporating our loss improves sRGB reconstruction quality by up to 1.5–2 dB PSNR across various editing conditions. Moreover, when applied to metadata-assisted RAW reconstruction methods, our approach enables fine-tuning for target edits, yielding further gains. Since photographic editing is the primary motivation for RAW reconstruction in consumer imaging, our simple yet effective loss function provides a general mechanism for enhancing edit fidelity and rendering flexibility across existing methods.
PaperID: 1010,   Poster  https://arxiv.org/pdf/2601.05083    
Authors: Ellington Kirby, Alexandre Boulch, Yihong Xu, Yuan Yin, Gilles Puy, Éloi Zablocki, Andrei Bursuc, Spyros Gidaris, Renaud Marlet, Florent Bartoccioni, Anh Quan Cao, Nermin Samet, TUAN-HUNG VU, Matthieu Cord
Title: Driving on Registers
Abstract: We present DrivoR, a simple and efficient transformer-based architecture for end-to-end autonomous driving. Our approach builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores representing aspects such as safety, comfort, and efficiency, enabling behavior-conditioned driving at inference. Despite its minimal design, DrivoR outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM benchmark. Our results show that a pure-transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving. Code and checkpoints will be made available.
PaperID: 1011,   Poster  https://arxiv.org/pdf/2512.08443    
Authors: Hithem Lamri, Michail Maniatakos
Title: Fully Decentralized Certified Unlearning
Abstract: Machine unlearning (MU) seeks to remove the influence of specified data from a trained model in response to privacy requests or data poisoning. While certified unlearning has been analyzed in centralized and server-orchestrated federated settings (via guarantees analogous to differential privacy, DP), the decentralized setting, where peers communicate without a coordinator, remains underexplored. We study certified unlearning in decentralized networks with fixed topologies and propose a random-walk procedure that performs one projected gradient ascent step on the forget set at the unlearning client and a geometrically distributed number of projected descent steps on the retained data elsewhere, combined with subsampled Gaussian noise and projection onto a trust region around the original model. We provide (i) convergence guarantees in the convex case and stationarity guarantees in the nonconvex case, (ii) (ε, δ) network-unlearning certificates on client views via subsampled Gaussian Rényi DP (RDP) with segment-level subsampling, and (iii) deletion-capacity bounds that scale with the forget-to-local data ratio and quantify the effect of decentralization (network mixing and randomized subsampling) on the privacy–utility trade-off. Empirically, on image benchmarks (MNIST, CIFAR-10), the proposed method matches a given (ε, δ) while achieving higher test accuracy than decentralized DP baselines and reducing forget accuracy to random guessing (≈ 10%).
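The update structure described in this abstract (one projected ascent step on the forget set, then a geometrically distributed number of projected descent steps on retained data, with Gaussian noise and trust-region projection) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the helper names, the quadratic retained loss, the point at which noise is added, and all hyperparameters are hypothetical. The random walk across clients and the subsampling are omitted, and the sketch carries no unlearning guarantee.

```python
import numpy as np


def project_to_ball(w, center, radius):
    """Project w onto the L2 ball of the given radius around center."""
    delta = w - center
    norm = np.linalg.norm(delta)
    if norm <= radius:
        return w
    return center + delta * (radius / norm)


def noisy_projected_step(w, grad, lr, sigma, center, radius, rng, ascent=False):
    """One gradient step plus Gaussian noise, then projection onto the trust region."""
    sign = 1.0 if ascent else -1.0
    noise = rng.normal(scale=sigma, size=w.shape)
    return project_to_ball(w + sign * lr * grad + noise, center, radius)


rng = np.random.default_rng(0)
w_orig = np.zeros(3)                       # original model (trust-region center)
forget_grad = np.array([1.0, 0.0, 0.0])    # toy forget-set gradient
retain_target = np.array([0.0, 0.5, 0.0])  # toy minimizer of the retained loss

# One projected ascent step on the forget set at the unlearning client.
w = noisy_projected_step(w_orig, forget_grad, lr=10.0, sigma=0.01,
                         center=w_orig, radius=1.0, rng=rng, ascent=True)

# A geometrically distributed number of projected descent steps on retained data.
num_descent = rng.geometric(p=0.5)
for _ in range(num_descent):
    retain_grad = w - retain_target        # gradient of 0.5 * ||w - target||^2
    w = noisy_projected_step(w, retain_grad, lr=0.1, sigma=0.01,
                             center=w_orig, radius=1.0, rng=rng)
```

The projection keeps every iterate within a fixed distance of the original model, which is what bounds the sensitivity that a Gaussian-noise certificate relies on.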
PaperID: 1012,   Poster  https://arxiv.org/pdf/2601.12527    
Authors: Richard Liu, Itai Lang, Rana Hanocka
Title: Deep Feature Deformation Weights
Abstract: Handle-based mesh deformation has been a long-standing paradigm in computer graphics, enabling intuitive shape edits from sparse controls. Classic techniques offer precise and rapid deformation control. However, they solve an optimization problem with constraints defined by the choice of control handles, requiring a user to know a priori the ideal distribution of handles on the shape to accomplish the desired edit. The mapping from handle set to deformation behavior is often unintuitive and, importantly, non-semantic. Modern data-driven methods, on the other hand, leverage the data prior to obtain semantic edits, at the cost of fine-grained control and speed. We propose a technique that achieves the best of both worlds by leveraging the semantic prior of data and the precise control and speed of traditional frameworks. Our approach is surprisingly simple yet effective: deep feature proximity makes for smooth and semantic deformation weights, with no need for additional regularization. Importantly, these weights can be computed in real time for any surface point, whereas all prior methods require optimization of these weights. Moreover, the semantic prior from deep features enables co-deformation of semantic parts. We introduce an improved feature distillation pipeline, barycentric feature distillation, which leverages the full visual signal from shape renders to make the compute cost robust to mesh resolution. This allows deep feature weights to be computed for even high-resolution meshes in under a minute, in contrast to potentially hours for both classical and neural methods. We preserve and extend existing functionality of classical methods through feature space constraints and locality weighting. Our field representation allows for automatic detection of semantic symmetries, which we use to produce symmetry-preserving deformations. We show a proof-of-concept application which can produce deformations for meshes up to 1 million faces in real time on a consumer-grade machine.
PaperID: 1013,   Poster  https://arxiv.org/pdf/2510.26782    
Authors: Zaishuo Xia, Yukuan Lu, Xinyi Li, Yifan Xu, Yubei Chen
Title: Clone Deterministic 3D Worlds
Abstract: A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future physical state of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. However, existing world models often focus on random generation of open worlds, but neglect the need for high-fidelity modeling of deterministic scenarios (such as fixed-map mazes and static space robot navigation). In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone a deterministic 3D world. 1) Through diagnostic experiments, we quantitatively demonstrate that high-fidelity cloning is feasible and the primary bottleneck for long-horizon fidelity is the geometric structure of the latent representation, not the dynamics model itself. 2) Building on this insight, we show that applying the temporal contrastive learning principle as a geometric regularization can effectively curate a latent space that better reflects the underlying physical state manifold, demonstrating that contrastive constraints can serve as a powerful inductive bias for stable world modeling; we call this approach Geometrically-Regularized World Models (GRWM). At its core is a lightweight geometric regularization module that can be seamlessly integrated into standard autoencoders, reshaping their latent space to provide a stable foundation for effective dynamics modeling. By focusing on representation quality, GRWM offers a simple yet powerful pipeline for improving world model fidelity.
PaperID: 1014,   Poster  https://arxiv.org/pdf/2510.14737    
Authors: Seulki Park, Zilin Wang, Stella X. Yu
Title: Free-Grained Hierarchical Visual Recognition
Abstract: Hierarchical image recognition predicts labels across a semantic taxonomy, but existing methods typically assume complete, fine-grained labels, an assumption rarely met in practice. Real-world annotations vary in granularity due to image quality, annotator expertise, and task goals; a distant bird may be labeled "Bird", while a close-up reveals "Bank Swallow". We formalize this realistic setting as free-grain learning, where each image may be labeled at any taxonomy level, while the model must still learn the full hierarchical path. To study this problem, we build diverse benchmarks that provide labels at varying semantic granularity, including a new three-level ImageNet-F and mixed-granularity variants of datasets. We further develop strong baselines that improve learning under mixed supervision through (1) semantic guidance from vision–language models and (2) visual guidance via semi-supervised learning. Together, our benchmarks and methods advance hierarchical recognition under real-world constraints.
PaperID: 1015,   Poster  https://arxiv.org/pdf/2511.13183    
Authors: Alec Sargood, Lemuel Puglisi, Elinor Thompson, Mirco Musolesi, Daniel C. Alexander
Title: GenTract: Generative Global Tractography
Abstract: Tractography is the process of inferring the trajectories of white-matter pathways in the brain from diffusion magnetic resonance imaging (dMRI). Local tractography methods, which construct streamlines by following local fiber orientation estimates stepwise through an image, are prone to error accumulation and high false positive rates, particularly on noisy or low-resolution data. In contrast, global methods, which attempt to optimize a collection of streamlines to maximize compatibility with underlying fiber orientation estimates, are computationally expensive. To address these challenges, we introduce GenTract, the first generative model for global tractography. We frame tractography as a generative task, learning a direct mapping from dMRI to complete, anatomically plausible streamlines. We compare both diffusion-based and flow matching paradigms and evaluate GenTract’s performance against state-of-the-art baselines. Notably, GenTract achieves precision 2.1× higher than the next-best method, TractOracle. This advantage becomes even more pronounced in challenging low-resolution and noisy settings, where it outperforms the closest competitor by an order of magnitude. By producing tractograms with high precision on research-grade data while also maintaining reliability on imperfect, lower-resolution data, GenTract represents a promising solution for global tractography.
PaperID: 1016,   Poster  https://arxiv.org/pdf/2512.13684    
Authors: Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew Hudson, Joao Carreira, Andrew Zisserman
Title: Recurrent Video Masked Autoencoders
Abstract: We present Recurrent Video Masked Autoencoders (RVM): a novel approach to video representation learning that leverages recurrent computation to model the temporal structure of video data. RVM couples an asymmetric masking objective with a transformer-based recurrent neural network to aggregate information over time, training solely on a simple pixel reconstruction loss. This design yields a highly efficient "generalist" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action classification, and point and object tracking, while matching or exceeding the performance of image models (e.g. DINOv2) on tasks that require strong geometric and dense spatial features. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30× greater parameter efficiency than competing video masked autoencoders. Finally, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based video models. Ablation studies further highlight the factors driving the model's success, with qualitative results showing that RVM learns rich representations of scene semantics, structure, and motion.
PaperID: 1017,   Poster  https://arxiv.org/pdf/2601.01608    
Authors: Felix Krause, Stefan Andreas Baumann, Johannes Schusterbauer, Olga Grebenkova, Ming Gui, Vincent Tao Hu, Björn Ommer
Title: Guiding Token-Sparse Diffusion Models
Abstract: Diffusion models deliver high quality in image synthesis but remain expensive during training and inference. Recent works have leveraged the inherent redundancy in visual content to make training more affordable by training only on a subset of visual information. While these methods were successful in providing cheaper and more effective training, sparsely trained diffusion models struggle in inference. This is due to their weak response to Classifier-free Guidance (CFG), leading to underwhelming performance during inference. To overcome this, we propose Sparse Guidance (SG). Instead of using conditional dropout as a signal to guide diffusion models, SG uses token-level sparsity. As a result, SG better preserves the high variance of the conditional prediction, achieving good-quality and high-variance outputs. Leveraging token-level sparsity at inference, SG improves fidelity at lower compute, achieving 1.58 FID on the commonly used ImageNet-256 benchmark with 25% fewer FLOPs, and yields up to 58% FLOP savings at matched baseline quality. To demonstrate the effectiveness of Sparse Guidance, we train a 2.5B text-to-image diffusion model using training-time sparsity and leverage SG during inference. SG achieves improvements in composition and human preference score while increasing throughput at the same time.
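As background, guidance in diffusion sampling is commonly written as an extrapolation from a weaker prediction toward a stronger one. The sketch below shows that generic rule only. In standard CFG the weak prediction comes from a pass with the condition dropped; treating a token-sparse forward pass as the weak prediction is our reading of the abstract, not a confirmed detail of SG. The function name and toy arrays are hypothetical.

```python
import numpy as np


def guided_prediction(strong_pred, weak_pred, scale):
    """Extrapolate from the weak prediction toward the strong one.

    scale = 0 returns the weak prediction, scale = 1 the strong one,
    and scale > 1 pushes beyond the strong prediction.
    """
    return weak_pred + scale * (strong_pred - weak_pred)


# Toy noise predictions standing in for two forward passes of a denoiser.
strong = np.array([1.0, 2.0, 3.0])  # e.g. full conditional pass
weak = np.array([0.5, 1.0, 1.5])    # e.g. unconditional (CFG) or, by assumption, token-sparse pass
guided = guided_prediction(strong, weak, scale=2.0)
```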
PaperID: 1018,   Poster  https://arxiv.org/pdf/2510.27439    
Authors: Yanlong Yang, Guanxiong Luo
Title: Self-Diffusion Driven Blind Imaging
Abstract: Optical imaging systems are inherently imperfect due to diffraction limits, lens manufacturing tolerances, assembly misalignment, and other physical constraints. In addition, unavoidable camera shake and object motion further introduce non-ideal degradations during acquisition. These aberrations and motion-induced variations are typically unknown, difficult to measure, and costly to model or calibrate in practice. Blind inverse problems offer a promising direction by jointly estimating both the latent image and the unknown degradation kernel. However, existing approaches often suffer from convergence instability, limited prior expressiveness, and sensitivity to hyperparameters. Inspired by recent advances in self-diffusion, we propose DeblurSDI, a zero-shot, self-supervised blind imaging framework that requires no pre-training. DeblurSDI formulates blind image recovery as an iterative reverse self-diffusion process that begins from pure noise and progressively refines both the sharp image and the blur kernel. Extensive experiments on combined optical aberrations and motion blur demonstrate that DeblurSDI consistently outperforms other methods by a substantial margin.
PaperID: 1019,   Poster  https://arxiv.org/pdf/2505.12734    
Authors: Junbo Wang, Haofeng Tan, Bowen Liao, Albert Jiang, Teng Fei, Qixing Huang, Bing Zhou, Zhengzhong Tu, Shan Ye, Yuhao Kang
Title: SounDiT: Geo-Contextual Soundscape-to-Landscape Generation
Abstract: Recent audio-to-image models have shown impressive performance in generating images of specific objects conditioned on their corresponding sounds. However, these models fail to reconstruct real-world landscapes conditioned on acoustic environments. To address this challenge, we present Geo-contextual Soundscape-to-Landscape (GeoS2L) generation, a novel and practically significant task that aims to synthesize geographically realistic landscape images from environmental soundscapes. To support this task, we construct two large-scale geo-contextual multi-modal datasets, SoundingSVI and SonicUrban, which pair diverse environmental soundscapes with real-world landscape images. We further propose SounDiT, a diffusion transformer (DiT)-based model that incorporates acoustic environments and geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose the Place Similarity Score (PSS), a practically-informed geo-contextual evaluation framework to measure consistency between input soundscapes and generated landscape images. Extensive experiments demonstrate that SounDiT outperforms existing baselines on the GeoS2L task, while the PSS effectively captures multi-level generation consistency across element, scene, and human perception.
PaperID: 1020,   Poster  https://arxiv.org/pdf/2604.13074    
Authors: Chang Nie, Chaoyou Fu, YiFan Zhang, HaiHuaYang HaiHuaYang, Caifeng Shan
Title: Long-Term Personalized Multimodal LLMs
Abstract: Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited. Prior approaches enable only static, single-turn personalization through input augmentation or output alignment, and thus fail to capture users’ evolving preferences and personality over time (see Fig. 1). In this paper, we introduce Pal-R3, an innovative personalized multimodal agent framework designed for long-term personalization. Pal-R3 transforms a general-purpose MLLM into a personalized assistant by integrating three key capabilities: (a) Remembering: It proactively extracts and summarizes chronological multimodal memories from interactions, consolidating them into a personalized database. (b) Reasoning: It conducts multi-turn reasoning by retrieving and integrating relevant memories from the database. (c) Response Alignment: It infers the user's evolving personality throughout long-term interactions to ensure outputs remain aligned with their unique characteristics. For evaluation, we establish MME-P, a comprehensive benchmark comprising over 2,000 curated interaction cases, designed to assess long-term MLLM personalization across seven key aspects and 14 fine-grained tasks. Extensive experiments validate our method's effectiveness, improving the baseline by 22.4% (MME-P) and 9.8% (PERSONAMEM) under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively. Our code is available in the supplementary materials.
PaperID: 1021,   Poster  https://arxiv.org/pdf/2503.10626    
Authors: Mert Albaba, Chenhao Li, Markos Diomataris, Omid Taheri, Andreas Krause, Michael J. Black
Title: NIL: No-data Imitation Learning
Abstract: Acquiring physically plausible motor skills across diverse and unconventional embodiments, including humanoids and quadrupeds, is essential for advancing character simulation and robotics. Traditional methods, such as reinforcement learning (RL), require extensive reward function engineering. Imitation learning (IL) offers an alternative but relies heavily on curated 3D expert demonstrations, which are scarce and difficult to obtain for non-human morphologies. Video diffusion models, on the other hand, are capable of generating realistic-looking videos of various morphologies, from humans to ants. However, these videos are often not physically plausible, which limits their direct use for skill acquisition. We introduce "No-data Imitation Learning" (NIL): an imitation learning framework that replaces curated expert demonstrations with videos generated by a pretrained video diffusion model. Our key insight is that the physics simulator enforces physical constraints, while the video provides visual guidance. NIL learns 3D motor skills in a physics simulator from 2D-generated videos, with generalization capability to unconventional forms. Specifically, NIL computes a discriminator-free imitation reward that combines (i) a video-embedding similarity between generated and simulated videos using a pretrained video vision transformer, and (ii) an image-based similarity term derived from video segmentation masks. We evaluate NIL on locomotion and whole-body control tasks across unique body configurations. Our experiments show that in humanoid locomotion, NIL matches the performance of state-of-the-art IL baselines trained on motion-capture data; and in whole-body manipulation, it exceeds the performance of RL baselines without requiring any curated data.
PaperID: 1022,   Poster  https://arxiv.org/pdf/2602.14376    
Authors: Yuliang Wu, Wei Zhai, Yuxin Cui, Tiesong Zhao, Yang Cao, Zheng-Jun Zha
Title: Event-based Visual Deformation Measurement
Abstract: Visual Deformation Measurement (VDM) aims to recover dense deformation fields by tracking surface motion from camera observations. Traditional image-based methods rely on minimal inter-frame motion to constrain the correspondence search space, which limits their applicability to highly dynamic scenes or necessitates high-speed cameras at the cost of prohibitive storage and computational overhead. We propose an event-frame fusion framework that exploits events for temporally dense motion cues and frames for spatially dense precise estimation. By revisiting the solid elastic modeling prior, we propose an Affine Invariant Simplicial (AIS) framework that partitions the deformation field into multiple sub-regions and linearizes the deformation within each sub-region using a low-parametric representation, effectively mitigating motion ambiguities arising from the sparse and noisy nature of event observations. To speed up parameter searching and reduce error accumulation, a neighborhood-greedy optimization strategy is introduced, enabling well-converged sub-regions to guide their poorly-converged neighbors and effectively suppressing local error accumulation in long-term dense tracking. To evaluate the proposed method, a benchmark dataset with temporally aligned event streams and high-frame-rate videos is established, encompassing over 120 sequences spanning diverse deformation scenarios. Experimental results show that the proposed method outperforms the state-of-the-art baseline by 1.6× in terms of continuous measurement success rate (survival rate). Remarkably, our approach achieves superior performance while requiring only 18.9% of the data storage and processing resources compared to traditional high-speed video-based methods, without compromising accuracy.
PaperID: 1023,   Poster  https://arxiv.org/pdf/2603.19567    
Authors: Zhenyu Yang, Gensheng Pei, Tao Chen, Yichao Zhou, Tianfei Zhou, Yazhou Yao, Fumin Shen
Title: Efficiency Follows Global-Local Decoupling
Abstract: Modern vision models must capture image-level context without sacrificing local detail while remaining computationally affordable. We revisit this tradeoff and advance a simple principle: decouple the roles of global reasoning and local representation. To operationalize this principle, we introduce ConvNeur, a two-branch architecture in which a lightweight neural memory branch aggregates global context on a compact set of tokens, and a locality-preserving branch extracts fine structure. A learned gate lets global cues modulate local features without entangling their objectives. This separation yields subquadratic scaling with image size, retains inductive priors associated with local processing, and reduces overhead relative to fully global attention. On standard classification, detection, and segmentation benchmarks, ConvNeur matches or surpasses comparable alternatives at similar or lower compute and offers favorable accuracy versus latency trade-offs at similar budgets. These results support the view that efficiency follows global-local decoupling.
PaperID: 1024,   Poster  https://arxiv.org/pdf/2603.27773    
Authors: Maolin Gao, Shao Hu-Chen, Congyue Deng, Riccardo Marin, Leonidas Guibas, Daniel Cremers
Title: RINO: Rotation-Invariant Non-Rigid Correspondences
Abstract: Dense 3D shape correspondence remains a central challenge in computer vision and graphics as many deep learning approaches still rely on intermediate geometric features or handcrafted descriptors, limiting their effectiveness under non-isometric deformations, partial data, and non-manifold inputs. To overcome these issues, we introduce RINO, an unsupervised, rotation-invariant dense correspondence framework that effectively unifies rigid and non-rigid shape matching. The core of our method is the novel RINONet, a feature extractor that integrates vector-based SO(3)-invariant learning with orientation-aware complex functional maps to extract robust features directly from raw geometry. This allows for a fully end-to-end, data-driven approach that bypasses the need for shape pre-alignment or handcrafted features. Extensive experiments show unprecedented performance of RINO across challenging non-rigid matching tasks, including arbitrary poses, non-isometry, partiality, non-manifoldness, and noise.
PaperID: 1025,   Poster  https://arxiv.org/pdf/2512.21334    
Authors: Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou
Title: Streaming Video Instruction Tuning
Abstract: We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.
PaperID: 1026,   Poster  https://arxiv.org/pdf/2512.05478    
Authors: Jingyuan Yang, Zihuan Bai, Hui Huang
Title: EmoStyle: Emotion-Driven Image Stylization
Abstract: Art has long been a profound medium for expressing emotions. While existing image stylization methods effectively transform visual appearance, they often overlook the emotional impact carried by styles. To bridge this gap, we introduce Affective Image Stylization (AIS), a task that applies artistic styles to evoke specific emotions while preserving content. We present EmoStyle, a framework designed to address key challenges in AIS, including the lack of training data and the emotion–style mapping. First, we construct EmoStyleSet, a content-emotion-stylized image triplet dataset derived from ArtEmis to support AIS. We then propose an Emotion–Content Reasoner that adaptively integrates emotional cues with content to learn coherent style queries. Given the discrete nature of artistic styles, we further develop a Style Quantizer that converts continuous style features into emotion-related codebook entries. Extensive qualitative and quantitative evaluations, including user studies, demonstrate that EmoStyle enhances emotional expressiveness while maintaining content consistency. Moreover, the learned emotion-aware style dictionary is adaptable to other generative tasks, highlighting its potential for broader applications. Our work establishes a foundation for emotion-driven image stylization, expanding the creative potential of AI-generated art.
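The idea of converting continuous features into codebook entries can be illustrated with a generic nearest-neighbour vector quantization sketch. This is not the authors' Style Quantizer: the function name `quantize`, the Euclidean distance, and the two-entry toy codebook are all assumptions chosen for illustration.

```python
import numpy as np


def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry.

    features: (n, d) continuous feature vectors.
    codebook: (k, d) discrete code vectors.
    Returns the chosen indices (n,) and the quantized vectors (n, d).
    """
    # Squared Euclidean distance between every feature and every code.
    sq_dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = sq_dists.argmin(axis=1)
    return indices, codebook[indices]


# Toy example with a hypothetical two-entry codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
features = np.array([[0.1, -0.1], [0.9, 1.2]])
indices, quantized = quantize(features, codebook)
```

A learned quantizer would additionally train the codebook, for example with a commitment loss and a straight-through gradient, which this sketch leaves out.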
PaperID: 1027,   Poster  https://arxiv.org/pdf/2503.08485    
Authors: Fengyi Zhang, Xiangyu Sun, Huitong Yang, Zheng Zhang, Zi Huang, Yadan Luo
Title: Test-Time 3D Occupancy Prediction
Abstract: Self-supervised 3D occupancy prediction offers a promising solution for understanding complex driving scenes without requiring costly 3D annotations. However, training dense occupancy decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and once trained, such models struggle to adapt to varying voxel resolutions or novel object categories without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-Occ. Our method incrementally constructs, optimizes and voxelizes time-aware 3D Gaussians from raw sensor streams by integrating vision foundation models (VFMs) at runtime. The flexible nature of 3D Gaussians allows voxelization at arbitrary user-specified resolutions, while the generalization ability of VFMs enables accurate perception and open-vocabulary recognition, without any network training or fine-tuning. Specifically, TT-Occ operates in a lift-track-voxelize symphony: we first lift the geometry and semantics of the surrounding views, extracted from VFMs, to instantiate Gaussians in 3D space; next, we track dynamic Gaussians while accumulating static ones to complete the scene and enforce temporal consistency; finally, we voxelize the optimized Gaussians to generate the occupancy prediction. Optionally, inherent noise in VFM predictions and tracking is mitigated by periodically smoothing neighboring Gaussians during optimization. To validate the generality and effectiveness of our framework, we offer two variants: one LiDAR-based and one vision-centric, and conduct extensive experiments on Occ3D and nuCraft benchmarks with varying voxel resolutions. Source code is available in the supplementary materials.
PaperID: 1028,   Poster  https://arxiv.org/pdf/2504.05741    
Authors: Shuai Wang, Zhi Tian, Weilin Huang, Limin Wang
Title: DDT: Decoupled Diffusion Transformer
Abstract: Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher frequency with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new Decoupled Diffusion Transformer (DDT), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. For ImageNet 256×256, our DDT-XL/2 achieves a new state-of-the-art performance of 1.31 FID (nearly 4× faster training convergence compared to previous diffusion transformers). For ImageNet 512×512, our DDT-XL/2 achieves a new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling the sharing of the self-condition between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.
PaperID: 1029,   Poster  https://arxiv.org/pdf/2602.21341    
Authors: Evan Kim, Hyunwoo Ryu, Thomas W. Mitchel, Vincent Sitzmann
Title: Scaling View Synthesis Transformers
Abstract: Recently, geometry-free view synthesis transformers have achieved state-of-the-art results in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. However, the specific factors that govern how their performance scales with compute remain poorly understood. In this work, we conduct a rigorous analysis of the scaling laws for view synthesis transformers and elucidate a series of design choices for training compute-optimal NVS models. Most significantly, we find that an encoder–decoder architecture, which was previously found to be less scalable, can in fact be compute-optimal. We attribute the inferior performance of previous encoder–decoder methods to certain architectural choices and inconsistent training compute across comparisons. Across several compute levels, we demonstrate that our encoder–decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance–compute Pareto frontier, and outperforms the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.
PaperID: 1030,   Poster  https://arxiv.org/pdf/2511.15487    
Authors: Chen Zhang, Wei Zuo, Bingyang Cheng, Yikun Wang, Wei-Bin Kou, Yik-Chung Wu, Ngai Wong
Title: NTK-Guided Implicit Neural Teaching
Abstract: Implicit Neural Representations (INRs) parameterize continuous signals via multilayer perceptrons (MLPs), enabling compact, resolution-independent modeling for tasks like image, audio, and 3D reconstruction. However, fitting high-resolution signals demands optimizing over millions of coordinates, incurring prohibitive computational costs. To address this, we propose NTK-Guided Implicit Neural Teaching (NINT), which accelerates training by dynamically selecting coordinates that maximize global functional updates. Leveraging the Neural Tangent Kernel (NTK), NINT scores examples by the norm of their NTK-augmented loss gradients, capturing both fitting errors and heterogeneous leverage (self-influence and cross-coordinate coupling). This dual consideration enables faster convergence compared to existing methods. Through extensive experiments, we demonstrate that NINT significantly reduces training time by nearly half while maintaining or improving representation quality, establishing state-of-the-art acceleration among recent sampling-based strategies.
PaperID: 1031,   Poster  https://arxiv.org/pdf/2602.06139    
Authors: Ashish Seth, Xinhao Mei, Changsheng Zhao, Varun Nagaraja, Ernie Chang, Gregory P. Meyer, Gael Le Lan, Yunyang Xiong, Vikas Chandra, Yangyang Shi, Dinesh Manocha, Zhipeng Cai
Title: EgoAVU: Egocentric Audio-Visual Understanding
Abstract: Understanding egocentric videos plays a vital role in embodied intelligence. Recent multimodal large language models (MLLMs) can accept both visual and audio inputs. However, due to the challenge of obtaining text labels with coherent joint modality information, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine to automatically generate egocentric audio-visual narrations, questions and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split of 3K samples. EgoAVU-Bench clearly reveals the limitation of existing MLLMs: they are heavily biased towards visual signals, often neglecting audio cues or failing to associate audio with its visual source. Fine-tuning MLLMs on EgoAVU-Instruct effectively addresses this issue, enabling up to 113% performance improvement on EgoAVU-Bench. This benefit also transfers to other benchmarks such as EgoTempo and EgoIllusion, achieving up to 28% relative performance gain. Code will be released to the community.
PaperID: 1032,   Poster  https://arxiv.org/pdf/2603.21784    
Authors: Woohyeok Kim, Jaesung Rim, Daeyeon Kim, Sunghyun Cho
Title: Dynamic Exposure Burst Image Restoration
Abstract: Burst image restoration aims to reconstruct a high-quality image from burst images, which are typically captured using manually designed exposure settings. Although these exposure settings significantly influence the final restoration performance, the problem of finding optimal exposure settings has been overlooked. In this paper, we present Dynamic Exposure Burst Image Restoration (DEBIR), a novel burst image restoration pipeline that enhances restoration quality by dynamically predicting exposure times tailored to the shooting environment. In our pipeline, Burst Auto-Exposure Network (BAENet) estimates the optimal exposure time for each burst image based on a preview image, as well as motion magnitude and gain. Subsequently, a burst image restoration network reconstructs a high-quality image from burst images captured using these optimal exposure times. For training, we introduce a differentiable burst simulator and a three-stage training strategy. Our experiments demonstrate that our pipeline achieves state-of-the-art restoration quality. Furthermore, we validate the effectiveness of our approach on a real-world camera system, demonstrating its practicality. The code will be made publicly available on our project page.
PaperID: 1033,   Poster  https://arxiv.org/pdf/2511.17339    
Authors: Yassir Bendou, Omar Ezzahir, Eduardo Fernandes Montesuma, Gabriel Mahuas, Victoria Shevchenko, Mike Gartrell
Title: ReBaPL: Repulsive Bayesian Prompt Learning
Abstract: Prompt learning has emerged as an effective technique for fine-tuning large-scale foundation models for downstream tasks. However, conventional prompt tuning methods are prone to overfitting and can struggle with out-of-distribution generalization. To address these limitations, Bayesian prompt learning has been proposed, which frames prompt optimization as a Bayesian inference problem to enhance robustness. This paper introduces Repulsive Bayesian Prompt Learning (ReBaPL), a novel method for Bayesian prompt learning, designed to efficiently explore the complex and often multimodal posterior landscape of prompts. Our method integrates a cyclical step-size schedule with a stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm, enabling alternating phases of exploration to discover new modes, and exploitation to refine existing modes. Furthermore, we introduce a repulsive force derived from a potential function over probability metrics (including Maximum Mean Discrepancy and Wasserstein distance) computed on the distributions of representations produced by different prompts. This representation-space repulsion diversifies exploration and prevents premature collapse to a single mode. Our approach allows for a more comprehensive characterization of the prompt posterior distribution, leading to improved generalization. In contrast to prior Bayesian prompt learning methods, our method provides a modular plug-and-play Bayesian extension of any existing prompt learning method based on maximum likelihood estimation. We demonstrate the efficacy of ReBaPL on several benchmark datasets, showing superior performance over state-of-the-art methods for prompt learning.
PaperID: 1034,   Poster  https://arxiv.org/pdf/2602.21402    
Authors: Jinyoung Jun, Wondong Jang, Wenbin Ouyang, Raghudeep Gadde, Jungbeom Lee
Title: FlowFixer: Towards Detail-Preserving Subject-Driven Generation
Abstract: In this paper, we present FlowFixer, a refinement framework for subject-driven generation (SDG) that restores fine details lost during scene generation caused by changes in scale and perspective of a subject. FlowFixer proposes direct image-to-image translation from visual references, avoiding ambiguities in language prompts. To enable image-to-image training, we introduce a one-step denoising scheme to generate self-supervised training data, which automatically removes high-frequency details while preserving global structure, effectively simulating real-world SDG errors. We further propose a keypoint matching-based metric to properly assess fidelity in details beyond semantic similarities usually measured by CLIP or DINO. Experimental results demonstrate that FlowFixer outperforms state-of-the-art SDG methods in both qualitative and quantitative evaluations, setting a new benchmark for high-fidelity subject-driven generation.
PaperID: 1035,   Poster  https://arxiv.org/pdf/2603.27300    
Authors: Weibang Wang, Kenan Li, Zhuoguang Chen, Yijun Yuan, Hang Zhao
Title: Complet4R: Geometric Complete 4D Reconstruction
Abstract: We introduce Complet4R, a novel end-to-end framework for Geometric Complete 4D Reconstruction, which aims to recover temporally coherent and geometrically complete reconstructions of dynamic scenes. Our method formalizes the task of Geometric Complete 4D Reconstruction as a unified framework of reconstruction and completion, by directly accumulating full contexts onto each frame. Unlike previous approaches that rely on pairwise reconstruction or local motion estimation, Complet4R utilizes a decoder-only transformer to operate on all context globally, directly from sequential video input, reconstructing a complete geometry for every single time step, including occluded regions visible in other frames. Our method demonstrates state-of-the-art performance on our proposed benchmark for Geometric Complete 4D Reconstruction and on the 3D point tracking task. Code will be released to support future research.
PaperID: 1036,   Poster  https://arxiv.org/pdf/2511.22659    
Authors: Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He, Yijin Zhou, Jing Shao, Bohan Zhuang, Lu Sheng
Title: Geometrically-Constrained Agent for Spatial Reasoning
Abstract: Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference but their reasoning operates within a lossy semantic space, misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an "oracle paradox," learning flawed spatial logic from imperfect oracles. Tool-integrated methods constrain the final computation but critically leave the VLM's planning process unconstrained, resulting in geometrically flawed plans. In this work, we propose Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that resolves this gap by introducing a formal task constraint. Specifically, we strategically decouple the VLM's role into two stages. First, acting as a semantic analyst, the VLM translates the user's ambiguous query into the formal, verifiable task constraint, which defines the reference frame and objective. Second, acting as a task solver, the VLM generates and executes tool calls strictly within the deterministic bounds defined by the constraint. This geometrically-constrained strategy successfully resolves the semantic-to-geometric gap, yielding a robust and verifiable reasoning pathway for complex spatial tasks. Comprehensive experiments demonstrate that GCA achieves SOTA performance on multiple spatial reasoning benchmarks, surpassing existing training-based and tool-integrated methods by ~27%.
PaperID: 1037,   Poster  https://arxiv.org/pdf/2603.14741    
Authors: Seung Young Noh, Ju Yong Chang
Title: PHAC: Promptable Human Amodal Completion
Abstract: Conditional image generation methods are increasingly used in human-centric applications, yet existing human amodal completion (HAC) models offer users limited control over the completed content. Given an occluded person image, they hallucinate invisible regions while preserving visible ones, but cannot reliably incorporate user-specified constraints such as a desired pose or spatial extent. As a result, users often resort to repeatedly sampling the model to obtain a satisfactory output. Pose-guided person image synthesis (PGPIS) methods allow explicit pose conditioning, but frequently fail to preserve the instance-specific visible appearance and tend to be biased toward the training distribution, even when built on strong diffusion model priors. To address these limitations, we introduce promptable human amodal completion (PHAC), a new task that completes occluded human images while satisfying both visible appearance constraints and multiple user prompts. Users provide simple point-based prompts, such as additional joints for the target pose or bounding boxes for desired regions; these prompts are encoded using ControlNet modules specialized for each prompt type. These modules inject the prompt signals into a pre-trained diffusion model, and we fine-tune only the cross-attention blocks to obtain strong prompt alignment without degrading the underlying generative prior. To further preserve visible content, we propose an inpainting-based refinement module that starts from a slightly noised coarse completion, faithfully preserves the visible regions, and ensures seamless blending at occlusion boundaries. Extensive experiments on standard HAC and PGPIS benchmarks show that our approach produces more physically plausible, higher-quality completions with significantly improved prompt alignment compared to existing amodal completion and pose-guided synthesis methods.
PaperID: 1038,   Poster  https://arxiv.org/pdf/2512.04076    
Authors: Alexander Mai, Trevor Hedstrom, George Kopanas, Janne Kontkanen, Falko Kuester, Jonathan T. Barron
Title: Radiance Meshes for Volumetric Reconstruction
Abstract: We introduce Radiance Meshes for representing radiance fields with constant-density tetrahedral cells produced with a Delaunay tetrahedralization. Unlike a Voronoi diagram, a Delaunay tetrahedralization yields simple triangles that are natively supported by existing hardware. As such, our model is able to perform exact and fast volume rendering using both rasterization and raytracing. We introduce a new rasterization method that achieves faster rendering speeds than all prior radiance field representations (assuming an equivalent number of primitives and resolution) across a variety of platforms. Optimizing the positions of Delaunay vertices introduces topological discontinuities (edge flips). To solve this, we use a Zip-NeRF-style backbone which allows us to express a smoothly varying field even when the topology changes. Our rendering method exactly evaluates the volume rendering equation and enables high-quality, real-time view synthesis on standard consumer hardware. Our tetrahedral meshes also lend themselves to a variety of exciting applications including fisheye lens distortion, physics-based simulation, editing, and mesh extraction.
PaperID: 1039,   Poster  https://arxiv.org/pdf/2509.09666    
Authors: Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Haochen Wang, Zhendong Wang, Bin Lin, Li Hao, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan
Title: Unified Multimodal Models as Auto-Encoders
Abstract: Image-to-text (I2T) understanding and text-to-image (T2I) generation are two fundamental, important yet traditionally isolated multimodal tasks. Despite their intrinsic connection, existing approaches typically optimize them independently, missing the opportunity for mutual enhancement. In this paper, we argue that both tasks can be connected under a shared Auto-Encoder perspective, where text serves as the intermediate latent representation, bridging the two directions: encoding images into textual semantics (I2T) and decoding text back into images (T2I). Our key insight is that if the encoder truly "understands" the image, it should capture all essential structure, and if the decoder truly "understands" the text, it should recover that structure faithfully. Building upon this principle, we propose Unified-GRPO, a post-training method based on reinforcement learning that jointly optimizes both modules through reconstructive rewards, maximizing the semantic consistency between the input and the generated images. Under this reconstruction objective, the encoder is encouraged to extract as much accurate and comprehensive semantic information as possible from the input image to maximize reconstruction quality, while the decoder is simultaneously optimized to generate conditioned on the encoder's prior, enabling a self-evolving improvement. Empirically, we find that using text as the intermediate representation and training under a reconstructive RL paradigm effectively benefits both I2T and T2I. The I2T module gains stronger fine-grained visual perception, such as small-object recognition, grounding, etc., while its dense embeddings and language priors, in turn, provide richer semantic signals that improve T2I fidelity and complex instruction following. These results demonstrate that reconstructive RL establishes a mutually reinforcing cross-modal synergy within the auto-encoding framework.
PaperID: 1040,   Poster  https://arxiv.org/pdf/2602.19063    
Authors: Quan Liu, Weihao Xuan, Junjue Wang, Naoto Yokoya, Ling Shao, Shijian Lu
Title: Direction-aware 3D Large Multimodal Models
Abstract: 3D large multimodal models (3D LMMs) rely heavily on ego poses for enabling directional question-answering and spatial reasoning. However, most existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, making them inherently ill-posed in 3D large multimodal modelling. In this work, we define a new and rigorous paradigm that enables direction-aware 3D LMMs by identifying and supplementing ego poses into point cloud benchmarks and transforming the corresponding point cloud data according to the identified ego poses. We enable direction-aware 3D LMMs with two novel designs. The first is PoseRecover, a fully automatic pose recovery pipeline that matches questions with ego poses from RGB-D video extrinsics via object–frustum intersection and visibility checks with Z-buffers. The second is PoseAlign, which transforms the point cloud data to be aligned with the identified ego poses instead of either injecting ego poses into textual prompts or introducing pose-encoded features in the projection layers. Extensive experiments show that our designs yield consistent improvements across multiple 3D LMM backbones such as LL3DA, LL3DA-SONATA, Chat-Scene, and 3D-LLAVA, improving ScanRefer mIoU by 30.0% and Scan2Cap LLM-as-judge accuracy by 11.7%. In addition, our approach is simple, generic, and training-efficient, requiring only instruction tuning while establishing a strong baseline for direction-aware 3D LMMs.
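Editorial illustration (not from the paper): the idea of transforming a point cloud so that it is aligned with an ego pose can be sketched as a rigid change of coordinates. The sketch below assumes, purely for illustration, that the ego pose is a 3D position plus a yaw angle about the z axis; the function name `align_to_ego` and this pose parameterization are hypothetical and are not taken from PoseAlign.

```python
import math

def align_to_ego(points, ego_position, ego_yaw):
    """Express world-frame points in a hypothetical ego frame.

    Assumption: the ego pose is (position, yaw about z). After alignment the
    ego sits at the origin and its heading points along +x.
    """
    c, s = math.cos(ego_yaw), math.sin(ego_yaw)
    tx, ty, tz = ego_position
    aligned = []
    for x, y, z in points:
        # Translate so the ego position becomes the origin.
        dx, dy, dz = x - tx, y - ty, z - tz
        # Apply the inverse yaw rotation (rotate by -yaw about z).
        aligned.append((c * dx + s * dy, -s * dx + c * dy, dz))
    return aligned

# A point one unit ahead of an ego that stands at (1, 1, 0) facing +y.
print(align_to_ego([(1.0, 2.0, 0.0)], (1.0, 1.0, 0.0), math.pi / 2))
```

With the ego facing +y, a point directly in front of it lands on the +x axis of the aligned frame, so "in front" always means the same direction to the model regardless of where the question was asked from.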
PaperID: 1041,   Poster  https://arxiv.org/pdf/2601.02046    
Authors: Shaocheng Shen, Jianfeng Liang, Chunlei Cai, Cong Geng, Huiyu Duan, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai
Title: Agentic Retoucher for Text-To-Image Generation
Abstract: Text-to-image (T2I) diffusion models such as SDXL and FLUX have achieved impressive photorealism, yet small-scale distortions remain pervasive in limbs, faces, text, and so on. Existing refinement approaches either perform costly iterative re-generation or rely on vision–language models (VLMs) with weak spatial grounding, leading to semantic drift and unreliable local edits. To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception–reasoning–action loop. Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text–image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) an action agent that adaptively plans localized inpainting guided by user preference. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. To enable fine-grained supervision and quantitative evaluation, we further construct GenBlemish-27K, a dataset of 6K T2I images with 27K annotated artifact regions across 12 categories. Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization and human preference alignment, establishing a new paradigm for self-corrective and perceptually reliable T2I generation.
PaperID: 1042,   Poster  https://arxiv.org/pdf/2604.17190    
Authors: Yuwei Ning, Ganlong Zhao, Yipeng Qin, Si Liu, Yang Liu, Liang Lin, Guanbin Li
Title: LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation
Abstract: Aerial Vision-and-Language Navigation (Aerial VLN) enables unmanned aerial vehicles (UAVs) to follow natural language instructions and navigate complex urban environments. While recent advances have achieved progress through large-scale memory graphs and lookahead path planning, they remain limited by shallow instruction understanding and high computational cost. In particular, existing methods rely primarily on landmark descriptions, overlooking directional cues, a key source of spatial context in human navigation. In this work, we propose LookasideVLN, a new paradigm that exploits directional cues in natural language to achieve both more accurate spatial reasoning and greater computational efficiency. LookasideVLN comprises three core components: (1) an Egocentric Lookaside Graph (ELG) that dynamically encodes instruction-relevant landmarks and their directional relationships, (2) a Spatial Landmark Knowledge Base (SLKB) that provides lightweight memory retrieval from prior navigation experiences, and (3) a Lookaside MLLM Navigation Agent that aligns multimodal information from user instructions, visual observations, and landmark-direction information from ELG for path planning. Extensive experiments show that LookasideVLN significantly outperforms the state-of-the-art CityNavAgent, even with a single-level lookahead, demonstrating that leveraging directional cues is a powerful yet efficient strategy for Aerial VLN.
PaperID: 1043,   Poster  https://arxiv.org/pdf/2603.07240    
Authors: Yingjie Tang, Di Luo, Zixiong Wang, Xiaoli Ling, Jian Yang, Beibei Wang
Title: FabricGen: Microstructure-Aware Woven Fabric Generation
Abstract: Woven fabric materials are widely used in rendering applications, yet designing realistic examples typically involves multiple stages, requiring expertise in weaving principles and texture authoring. Recent advances have explored diffusion models to streamline this process; however, pre-trained diffusion models often struggle to generate intricate yarn-level details that conform to weaving rules. To address this, we present FabricGen, an end-to-end framework for generating high-quality woven fabric materials from textual descriptions. A key insight of our method is the decomposition of macro-scale textures and micro-scale weaving patterns. To generate macro-scale textures free from microstructures, we fine-tune pre-trained diffusion models on a collected dataset of microstructure-free fabrics. As for micro-scale weaving patterns, we develop an enhanced procedural geometric model capable of synthesizing natural yarn-level geometry with yarn sliding and flyaway fibers. The procedural model is driven by a specialized large language model, WeavingLLM, which is fine-tuned on an annotated dataset of formatted weaving drafts, and prompt-tuned with domain-specific fabric expertise. Through fine-tuning and prompt tuning, WeavingLLM learns to design weaving drafts and fabric parameters from textual prompts, enabling the procedural model to produce diverse weaving patterns that adhere to weaving principles. The generated macro-scale texture, along with the micro-scale geometry, can be used for fabric rendering. Consequently, our framework produces materials with significantly richer detail and realism compared to prior generative models.
PaperID: 1044,   Poster  https://arxiv.org/pdf/2506.09839    
Authors: Chen Gao, Liankai Jin, Xingyu Peng, Jiazhao Zhang, Yue Deng, Annan Li, He Wang, Si Liu
Title: OctoNav: Towards Generalist Embodied Navigation
Abstract: Embodied navigation stands as a foundational pillar within the pursuit of embodied intelligence. However, previous navigation research is divided into different tasks/capabilities, e.g., ObjNav, ImgNav and VLN, which differ in task settings/objectives and modalities, so datasets and methods are designed individually. In this work, we take steps toward generalist navigation, which can follow free-form instructions that include arbitrary compounds of modality and capability. To achieve this, we propose a large-scale benchmark and corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench is constructed via a designed automatic annotation pipeline. We thoroughly craft instruction-trajectory pairs, where instructions are diverse in free-form with arbitrary modality and capability. Also, we construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to provide the thinking process behind actions. For OctoNav-R1, we build it upon MLLMs and adapt it to a VLA-type model, which can produce low-level actions solely based on 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) that consists of three stages, i.e., Action-/TBA-SFT, Nav-GRPO, and Online RL stages. Each stage contains designed learning policies and rewards. Specifically, inspired by OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answer, we design TBA-SFT and Nav-GRPO to achieve thinking-before-action for embodied navigation, improving the model's reasoning ability toward generalists. TBA-SFT utilizes the TBA-CoT dataset to fine-tune the model, and then we leverage Nav-GRPO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with the previous methods.
PaperID: 1045,   Poster  https://arxiv.org/pdf/2602.19708    
Authors: Hoyoung Kim, Minwoo Jang, Jabin Koo, Sangdoo Yun, Jungseul Ok
Title: ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets
Abstract: Beyond general recognition tasks, specialized domains including privacy-constrained medical applications and fine-grained settings often encounter data scarcity, especially for tail classes. To obtain less biased and more reliable models under such scarcity, practitioners leverage diffusion models to supplement underrepresented regions of real data. Specifically, recent studies fine-tune pretrained diffusion models with LoRA on few-shot real sets to synthesize additional images. While an image-wise LoRA trained on a single image captures fine-grained details yet offers limited diversity, a class-wise LoRA trained over all shots produces diverse images as it encodes class priors yet tends to overlook fine details. To combine both benefits, we separate the adapter into a class-shared LoRA A for class priors and per-image LoRAs B for image-specific characteristics. To expose coherent class semantics in the shared LoRA A, we propose semantic boosting by preserving class bounding boxes during training. For generation, we compose A with a mixture of B using coefficients drawn from a Dirichlet distribution. Across diverse datasets, our synthesized images are both diverse and detail-rich while closely aligning with the few-shot real distribution, yielding robust gains in downstream classification accuracy.
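Editorial illustration (not from the paper): composing a shared LoRA factor A with a Dirichlet-weighted mixture of per-image factors can be sketched in a few lines. The sketch assumes the usual LoRA convention that the weight update is the product of a "B" matrix (d_out x r) and an "A" matrix (r x d_in); the function names, the matrix shapes, and the concentration parameter `alpha` are hypothetical choices for illustration, not details confirmed by the abstract.

```python
import random

def sample_dirichlet(n, alpha=1.0, rng=random):
    """Draw n mixture weights from a symmetric Dirichlet(alpha) via Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]

def mix_matrices(mats, weights):
    """Weighted sum of equally shaped matrices given as nested lists."""
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(w * m[i][j] for w, m in zip(weights, mats))
             for j in range(cols)] for i in range(rows)]

def matmul(x, y):
    """Plain nested-list matrix product x @ y."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def compose_update(A, Bs, alpha=1.0, rng=random):
    """Return (delta_W, weights) with delta_W = (sum_i w_i * B_i) @ A."""
    weights = sample_dirichlet(len(Bs), alpha, rng)
    return matmul(mix_matrices(Bs, weights), A), weights

# Toy example: rank r = 1, d_in = 2, d_out = 2, two per-image factors.
A = [[1.0, 2.0]]
Bs = [[[1.0], [0.0]], [[0.0], [1.0]]]
delta_W, w = compose_update(A, Bs, rng=random.Random(0))
print(w, delta_W)
```

Each generation call draws fresh weights, so the shared factor stays fixed while the image-specific part varies between samples.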
PaperID: 1046,   Poster  https://arxiv.org/pdf/2512.10421    
Authors: Xiao Chen, Zhongjing Du, Jiazhen Huang, Jiang Xu, Li Lu, Jingyan Jiang, Zhi Wang
Title: Neural Collapse in Test-Time Adaptation
Abstract: Test-Time Adaptation (TTA) enhances model robustness to out-of-distribution (OOD) data by updating the model online during inference, yet existing methods lack theoretical insights into the fundamental causes of performance degradation under domain shifts. Recently, Neural Collapse (NC) has been proposed as an emergent geometric property of deep neural networks (DNNs), providing valuable insights for TTA. In this work, we extend NC to the sample-wise level and discover a novel phenomenon termed Sample-wise Alignment Collapse (NC3+), demonstrating that a sample's feature embedding, obtained by a trained model, aligns closely with the corresponding classifier weight. Building on NC3+, we identify that the performance degradation stems from sample-wise misalignment in adaptation, which is exacerbated under larger distribution shifts. This indicates the necessity of realigning the feature embeddings with their corresponding classifier weights. However, the misalignment makes pseudo-labels unreliable under domain shifts. To address this challenge, we propose NCTTA, a novel feature-classifier alignment method with hybrid targets that blend geometric proximity with predictive confidence to mitigate the impact of unreliable pseudo-labels. Extensive experiments demonstrate the effectiveness of NCTTA in enhancing robustness to domain shifts. For example, NCTTA outperforms Tent by 14.52% on ImageNet-C.
PaperID: 1047,   Poster  https://arxiv.org/pdf/2603.26586    
Authors: Kun Li, Jihao Gu, Fei Wang, Zhiliang Wu, Hehe Fan, Dan Guo
Title: MA-Bench: Towards Fine-grained Micro-Action Understanding
Abstract: With the rapid development of Multimodal Large Language Models (MLLMs), their potential in Micro-Action understanding, which plays a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. MA-Bench contains 12,000 structured question–answer pairs, enabling systematic assessment of both recognition accuracy and action interpretation. The results of 23 representative MLLMs reveal significant challenges in capturing motion granularity and fine-grained body-part dynamics. To address these challenges, we further construct MA-Bench-Train, a large-scale training corpus with 20.5K videos annotated with structured micro-action captions for fine-tuning MLLMs. The results of Qwen3-VL-8B fine-tuned on MA-Bench-Train show clear performance improvements across micro-action reasoning and explanation tasks. Our work aims to establish a foundation benchmark for advancing MLLMs in understanding subtle micro-actions and human-related behaviors.
PaperID: 1048,   Poster  https://arxiv.org/pdf/2604.18867    
Authors: Junhao Dong, Yifei Zhang, Hao Zhu, Yew-Soon Ong, Piotr Koniusz
Title: Hierarchically Robust Zero-shot Vision-Language Models
Abstract: Vision-Language Models (VLMs) can perform zero-shot classification but are susceptible to adversarial attacks. While robust fine-tuning improves their robustness, existing approaches align fixed text embeddings with an image embedding, sacrificing natural performance and robustness. A robustness degradation also occurs when a model faces adversarial attacks targeting superclasses (parent classes, e.g., mammal) in addition to their base (leaf) classes (e.g., cat). Thus, to enhance adversarial robustness and leverage the inherent hierarchical properties of class space, we propose a novel adversarial fine-tuning framework based on hierarchical embeddings and several levels of adversarially robust alignment of image-text modalities. Additional mechanisms place visual embeddings at the desired depth of hierarchy, and we provide a theoretical connection between the depth of embedding in the hierarchy and the maximum viable margin size. Our model naturally realizes several margin sizes boosting generalization of adversaries for robustification. As various trees with different parent labels can share the same leaf labels, we also consider aligning over multiple trees to boost semantic variety. Experiments across several datasets are performed.
PaperID: 1049,   Poster  https://arxiv.org/pdf/2507.07685    
Authors: Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa
Title: Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought
Abstract: Large vision-language models (LVLMs) have demonstrated remarkable capabilities by integrating pre-trained vision encoders with large language models (LLMs). Similar to single-modal LLMs, chain-of-thought (CoT) prompting has been adapted for LVLMs to enhance multi-modal reasoning by generating intermediate rationales based on visual and textual inputs. While CoT is assumed to improve grounding and accuracy in LVLMs, our experiments reveal a key challenge: existing LVLMs often ignore the contents of generated rationales in CoT reasoning. To address this, we re-formulate multi-modal CoT reasoning as a KL-constrained reward maximization focused on rationale-conditional log-likelihood. As the optimal solution, we propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy. RED harmonizes visual and rationale information by multiplying distinct image-conditional and rationale-conditional next token distributions. Extensive experiments show that RED consistently and significantly improves reasoning over standard CoT and other decoding methods across multiple benchmarks and LVLMs. Our work offers a practical and effective approach to improve both the faithfulness and accuracy of CoT reasoning in LVLMs, paving the way for more reliable rationale-grounded multi-modal systems.
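The multiplication of next-token distributions described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name, the toy three-token vocabulary, and the exponent `lam` on the rationale-conditional term are all assumptions made here for illustration. The sketch only shows that a product of two distributions is a sum in log-space followed by renormalization.

```python
import math

def combine_next_token_logprobs(logp_image, logp_rationale, lam=1.0):
    """Return normalized log-probs proportional to p_image * p_rationale**lam.

    `lam` is a hypothetical weighting knob, not a parameter taken from the paper.
    """
    if len(logp_image) != len(logp_rationale):
        raise ValueError("both distributions must cover the same vocabulary")
    # Multiplying probabilities is adding log-probabilities.
    scores = [a + lam * b for a, b in zip(logp_image, logp_rationale)]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

# Toy distributions over a three-token vocabulary (made-up numbers).
p_image = [0.5, 0.3, 0.2]      # conditioned on the image
p_rationale = [0.1, 0.7, 0.2]  # conditioned on the generated rationale

combined = [
    math.exp(v)
    for v in combine_next_token_logprobs(
        [math.log(p) for p in p_image],
        [math.log(p) for p in p_rationale],
    )
]
best_token = max(range(len(combined)), key=combined.__getitem__)
```

In this toy case the image-conditional distribution alone would pick token 0, while the product shifts the choice to token 1, which both distributions support reasonably well.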
PaperID: 1050,   Poster  https://arxiv.org/pdf/2512.11798    
Authors: Ruining Li, Yuxin Yao, Chuanxia Zheng, Christian Rupprecht, Joan Lasenby, Shangzhe Wu, Andrea Vedaldi
Title: Particulate: Feed-Forward 3D Object Articulation
Abstract: We introduce Particulate, a feed-forward model that, given a single static 3D mesh of an everyday object, predicts its 3D parts, kinematic structure, and articulation parameters. Unlike prior work on articulated 3D object modeling that is limited by costly per-object optimization and small retrieval databases or requires large vision or language foundation models, our approach is based on a flexible, scalable and lightweight transformer architecture. Trained on a diverse collection of articulated 3D assets from public datasets, Particulate accurately infers the articulated structure of novel objects, including those generated by image-to-3D models, in a single feed-forward pass. We further introduce a benchmark for articulated 3D object estimation curated from high-quality public 3D assets. Quantitative and qualitative results show that Particulate significantly outperforms state-of-the-art approaches.
PaperID: 1051,   Poster  https://arxiv.org/pdf/2603.07704    
Authors: Yinuo Bai, Peijun Xu, Kuixiang Shao, Yuyang Jiao, Jingxuan Zhang, Kaixin Yao, Jiayuan Gu, Jingyi Yu
Title: PARSE: Part-Aware Relational Spatial Modeling
Abstract: Inter-object relations underpin spatial intelligence, yet existing representations (linguistic prepositions or object-level scene graphs) are too coarse to specify which regions actually support, contain, or contact one another, leading to ambiguous and physically inconsistent layouts. To address these ambiguities, a part-level formulation is needed; therefore, we introduce PARSE, a framework that explicitly models how object parts interact to determine feasible and spatially grounded scene configurations. PARSE centers on the Part-centric Assembly Graph (PAG), which encodes geometric relations between specific object parts, and a Part-Aware Spatial Configuration Solver that converts these relations into geometric constraints to assemble collision-free, physically valid scenes. Using PARSE, we build PARSE-10K, a dataset of 10,000 3D indoor scenes constructed from real-image layout priors and a curated part-annotated shape database, each with dense contact structures and a part-level contact graph. With this structured, spatially grounded supervision, fine-tuning Qwen3-VL on PARSE-10K yields stronger object-level layout reasoning and more accurate part-level relation understanding; furthermore, leveraging PAGs as structural priors in 3D generation models leads to scenes with substantially improved physical realism and structural complexity. Together, these results show that PARSE significantly advances geometry-grounded spatial reasoning and supports the generation of physically consistent 3D scenes.
PaperID: 1052,   Poster  https://arxiv.org/pdf/2503.10120    
Authors: Bingchen Li, Xin Li, Yiting Lu, Zhibo Chen
Title: Hybrid Agents for Image Restoration
Abstract: Existing Image Restoration (IR) studies typically focus on task-specific or universal modes individually, relying on users to select the mode and lacking cooperation between multiple task-specific/universal restoration modes. This leads to insufficient interaction for non-expert users and limits restoration capability in complicated real-world applications. In this work, we present HybridAgent, which aims to incorporate multiple restoration modes into a unified image restoration model and achieve intelligent and efficient user interaction through our proposed hybrid agents. Concretely, we propose the hybrid rule of fast, slow, and feedback restoration agents. Here, the slow restoration agent optimizes the powerful multimodal large language model (MLLM) with our proposed instruction-tuning dataset to identify degradations within images with ambiguous user prompts and invokes proper restoration tools accordingly. The fast restoration agent is designed based on a lightweight large language model (LLM) via in-context learning to understand user prompts with simple and clear requirements, which avoids the unnecessary time and resource costs of the MLLM. Moreover, we introduce the mixed distortion removal mode for our HybridAgents, which is crucial but has not been considered in previous agent-based works. It can effectively prevent the error propagation of step-by-step image restoration and largely improve the efficiency of the agent system. We validate the effectiveness of HybridAgent on both synthetic and real-world IR tasks.
PaperID: 1053,   Poster  https://arxiv.org/pdf/2512.04222    
Authors: Alara Dirik, Tuanfeng Y. Wang, Duygu Ceylan, Stefanos Zafeiriou, Anna Frühstück
Title: ReasonX: MLLM-Guided Intrinsic Image Decomposition
Abstract: Intrinsic image decomposition aims to separate images into physical components such as albedo, depth, normals, and illumination. While recent diffusion- and transformer-based models benefit from paired supervision from synthetic datasets, their generalization to diverse, real-world scenarios remains challenging. We propose ReasonX, a novel framework that leverages a multimodal large language model (MLLM) as a perceptual judge providing relative intrinsic comparisons, and uses these comparisons as GRPO rewards for fine-tuning intrinsic decomposition models on unlabeled, in-the-wild images. Unlike RL methods for generative models, our framework aligns conditional intrinsic predictors by rewarding agreement between the judge’s relational assessments and analytically derived relations from the model’s outputs. ReasonX is model-agnostic and can be applied to different intrinsic predictors. Across multiple base architectures and modalities, ReasonX yields significant improvements, including 9–25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D, highlighting the promise of MLLM-guided comparative supervision to bridge low- and high-level vision reasoning.
PaperID: 1054,   Poster  https://arxiv.org/pdf/2506.06909    
Authors: Vladimir Yugay, Thies Kersten, Luca Carlone, Theo Gevers, Martin R. Oswald, Lukas Schmid
Title: Gaussian Mapping for Evolving Scenes
Abstract: Mapping systems with novel view synthesis (NVS) capabilities are widely used in computer vision, as well as in various applications, including augmented reality, robotics, and autonomous driving. Most notably, 3D Gaussian Splatting-based systems show high NVS performance; however, many current approaches are limited to static scenes. While recent works have begun addressing short-term dynamics (motion within the camera's view), long-term dynamics (the scene evolving through changes out of view) remain less explored. To overcome this limitation, we introduce a dynamic scene adaptation mechanism that continuously updates the 3D representation to reflect the latest changes. In addition, since maintaining geometric and semantic consistency remains challenging due to stale observations disrupting the reconstruction process, we propose a novel keyframe management mechanism that discards outdated observations while preserving as much information as possible. We thoroughly evaluate Gaussian Mapping for Evolving Scenes on both synthetic and real-world datasets, achieving a 29.7% improvement in PSNR and a 3× improvement in L1 depth error over the most competitive baseline.
PaperID: 1055,   Poster  https://arxiv.org/pdf/2603.03956    
Authors: Jinkun You, Jiaxin Cheng, Jie Zhang, Yicong Zhou
Title: Towards Generalized Multimodal Homography Estimation
Abstract: Supervised and unsupervised homography estimation methods depend on image pairs tailored to specific modalities to achieve high accuracy. However, their performance deteriorates substantially when applied to unseen modalities. To address this issue, we propose a training data synthesis method that generates unaligned image pairs with ground-truth offsets from a single input image. Our approach renders the image pairs with diverse textures and colors while preserving their structural information. These synthetic data empower the trained model to achieve greater robustness and improved generalization across various domains. Additionally, we design a network to fully leverage cross-scale information and decouple color information from feature representations, thus improving estimation accuracy. Extensive experiments show that our training data synthesis method improves generalization performance. The results also confirm the effectiveness of the proposed network.
PaperID: 1056,   Poster  https://arxiv.org/pdf/2511.19428    
Authors: Shangyuan Tong, Nanye Ma, Saining Xie, Tommi Jaakkola
Title: Flow Map Distillation Without Data
Abstract: State-of-the-art flow models achieve remarkable quality but require slow, iterative sampling. To accelerate this, flow maps can be distilled from pre-trained teachers, a procedure that conventionally requires sampling from an external dataset. We argue that this data-dependency introduces a fundamental risk of Teacher-Data Mismatch, as a static dataset may provide an incomplete or even misaligned representation of the teacher's full generative capabilities. This leads us to question whether this reliance on data is truly necessary for successful flow map distillation. In this work, we explore a data-free alternative that samples only from the prior distribution, a distribution the teacher is guaranteed to follow by construction, thereby circumventing the mismatch risk entirely. To demonstrate the practical viability of this philosophy, we introduce a principled framework that learns to predict the teacher's sampling path while actively correcting for its own compounding errors to ensure high fidelity. Our approach surpasses all data-based counterparts and establishes a new state-of-the-art by a significant margin. Specifically, distilling from SiT-XL/2+REPA, our method reaches an impressive FID of 1.45 on ImageNet 256×256, and 1.49 on ImageNet 512×512, both with only 1 sampling step. We hope our work establishes a more robust paradigm for accelerating generative models and motivates the broader adoption of flow map distillation without data.
PaperID: 1057,   Poster  https://arxiv.org/pdf/2512.09164    
Authors: Jin Cao, Hong-Xing Yu, Jiajun Wu
Title: WonderZoom: Multi-Scale 3D World Generation
Abstract: We present WonderZoom, a novel approach to generating 3D scenes with contents across multiple spatial scales from a single image. Existing 3D world generation models remain limited to single-scale synthesis and cannot produce coherent scene contents at varying granularities. The fundamental challenge is the lack of a scale-aware 3D representation capable of generating and rendering content with largely different spatial sizes. WonderZoom addresses this through two key innovations: (1) scale-adaptive Gaussian surfels for generating and real-time rendering of multi-scale 3D scenes, and (2) a progressive detail synthesizer that iteratively generates finer-scale 3D contents. Our approach enables users to "zoom into" a 3D region and auto-regressively synthesize previously non-existent fine details from landscapes to microscopic features. Experiments demonstrate that WonderZoom significantly outperforms state-of-the-art video and 3D models in both quality and alignment, enabling multi-scale 3D world creation from a single image. We show video results and an interactive viewer of generated multi-scale 3D worlds on the website of the supplementary materials.
PaperID: 1058,   Poster  https://arxiv.org/pdf/2502.10389    
Authors: Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, Yuqing Yang
Title: Region-Adaptive Sampling for Diffusion Transformers
Abstract: Diffusion models (DMs) have become the state-of-the-art for generative tasks across domains, but their reliance on sequential forward passes limits real-time performance. Prior acceleration methods mainly reduce sampling steps or reuse intermediate results. Leveraging the flexibility of Diffusion Transformers (DiTs) to handle variable token counts, we propose RAS, a training-free sampling strategy that dynamically assigns different update ratios to image regions based on model focus. Our key observation is that at each step, DiTs concentrate on semantically meaningful areas, and these regions exhibit strong continuity across consecutive steps. Exploiting this, RAS updates only focused regions while reusing cached noise for others, with focus determined from the previous step’s output. Evaluated on Stable Diffusion 3 and Lumina-Next-T2I, RAS achieves up to 2.36× and 2.51× speedups, respectively, with minimal quality loss. This demonstrates a practical step toward more efficient diffusion transformers for real-time generation.
PaperID: 1059,   Poster  https://arxiv.org/pdf/2603.06408    
Authors: Lin Geng Foo, Mark He Huang, Alexandros Lattas, Stylianos Moschoglou, Thabo Beeler, Christian Theobalt
Title: Physical Simulator In-the-Loop Video Generation
Abstract: Recent advances in diffusion-based video generation have achieved remarkable visual realism but still struggle to obey basic physical laws such as gravity, inertia, and collision. Generated objects often move inconsistently across frames, exhibit implausible dynamics, or violate physical constraints, limiting the realism and reliability of AI-generated videos. We address this gap by introducing Physical Simulator In-the-Loop Video Generation (PSIVG), a novel framework that integrates a physical simulator into the video diffusion process. Starting from a template video generated by a pre-trained diffusion model, PSIVG reconstructs the 4D scene and foreground object meshes, initializes them within a physical simulator, and generates physically consistent trajectories. These simulated trajectories are then used to guide the video generator toward spatio-temporally physically coherent motion. To further improve texture consistency during object movement, we propose a Test-Time Texture Consistency Optimization (TTCO) technique that adapts text and feature embeddings based on pixel correspondences from the simulator. Comprehensive experiments demonstrate that PSIVG produces videos that better adhere to real-world physics while preserving visual quality and diversity.
PaperID: 1060,   Poster  https://arxiv.org/pdf/2603.26657    
Authors: Md Ashiqur Rahman, Lim Hao, Jeremiah Jiang, Teck-Yian Lim, Raymond A. Yeh
Title: Tunable Soft Equivariance with Guarantees
Abstract: Equivariance is a fundamental property in computer vision models, yet strict equivariance is rarely satisfied in real-world data, which can limit a model’s performance. Controlling the degree of equivariance is therefore desirable. We propose a general framework for constructing soft equivariant models by projecting the model weights into a designed subspace. The method applies to any pre-trained architecture and provides theoretical bounds on the induced equivariance error. Empirically, we demonstrate the effectiveness of our method on multiple pre-trained backbones, including ViT and ResNet, across image classification, semantic segmentation, and human-trajectory prediction tasks. Notably, our approach improves performance while simultaneously reducing equivariance error on the competitive ImageNet benchmark.
PaperID: 1061,   Poster  https://arxiv.org/pdf/2512.06865    
Authors: Xiaosong Jia, Chenhe Zhang, Yule Jiang, Songbur Wong, Zhiyuan Zhang, Chen Chen, Shaofeng Zhang, Xuanhe Zhou, Xue Yang, Junchi Yan, Yu-Gang Jiang
Title: Spatial Retrieval Augmented Autonomous Driving
Abstract: Existing autonomous driving systems rely on onboard sensors (cameras, LiDAR, IMU, etc.) for environmental perception. However, this paradigm is limited by the drive-time perception horizon and often fails under limited view scope, occlusion, or extreme conditions such as darkness and rain. In contrast, human drivers are able to recall road structure even under poor visibility. To endow models with this "recall" ability, we propose the spatial retrieval paradigm, introducing offline retrieved geographic images as an additional input. These images are easy to obtain from offline caches (e.g., Google Maps or stored autonomous driving datasets) without requiring additional sensors, making it a plug-and-play extension for existing AD stacks. For experiments, we first extend the nuScenes dataset with geographic images retrieved via Google Maps APIs and align the new data with ego-vehicle trajectories. We establish baselines across five core autonomous driving tasks: object detection, online mapping, occupancy prediction, end-to-end planning, and generative world modeling. Extensive experiments show that the extended modality could enhance the performance of certain tasks. We will open-source dataset curation code, data, and benchmarks for further study of this new autonomous driving paradigm.
PaperID: 1062,   Poster  https://arxiv.org/pdf/2512.03336    
Authors: Alan T. L. Bacellar, Mustafa Munir, Felipe M.G. França, Priscila Machado Vieira Lima, Radu Marculescu, Lizy Kurian John
Title: Single-Round Scalable Analytic Federated Learning
Abstract: Federated Learning (FL) is plagued by two key challenges: high communication overhead and performance collapse on heterogeneous (non-IID) data. Analytic FL (AFL) provides a single-round, data distribution invariant solution, but is limited to linear models. Subsequent non-linear approaches, like DeepAFL, regain accuracy but sacrifice the single-round benefit. In this work, we break this trade-off. We propose SAFLe, a framework that achieves scalable non-linear expressivity by introducing a structured head of bucketed features and sparse, grouped embeddings. We prove this non-linear architecture is mathematically equivalent to a high-dimensional linear regression. This key equivalence allows SAFLe to be solved with AFL's single-shot, invariant aggregation law. Empirically, SAFLe establishes a new state-of-the-art for analytic FL, significantly outperforming both linear AFL and multi-round DeepAFL in accuracy across all benchmarks, demonstrating a highly efficient and scalable solution for federated vision.
PaperID: 1063,   Poster  https://arxiv.org/pdf/2510.15398    
Authors: Bingyu Li, Feiyu Wang, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Title: MARIS: Marine Open-Vocabulary Instance Segmentation
Abstract: Most existing underwater instance segmentation approaches are constrained by closed-vocabulary prediction, limiting their ability to recognize novel marine categories. To support evaluation, we introduce MARIS (Marine Open-Vocabulary Instance Segmentation), the first large-scale fine-grained benchmark for underwater Open-Vocabulary (OV) Instance Segmentation (UOVIS), featuring a limited set of seen categories and diverse unseen categories. Although OV instance segmentation has shown promise on natural images, our analysis reveals that transfer to underwater scenes suffers from severe visual degradation (e.g., color attenuation) and semantic misalignment caused by a lack of underwater class definitions. To address these issues, we propose a unified framework with two complementary components. The Geometric Prior Enhancement Module (GPEM) leverages stable part-level and structural cues to maintain object consistency under degraded visual conditions. The Semantic Alignment Injection Mechanism (SAIM) enriches language embeddings with domain-specific priors, mitigating semantic ambiguity and improving recognition of unseen categories. Experiments show that our framework consistently outperforms existing OV baselines in both In-Domain and Cross-Domain settings on MARIS, establishing a strong foundation for future underwater perception research. The code of this paper can be found in the supplementary materials.
PaperID: 1064,   Poster  https://arxiv.org/pdf/2602.23618    
Authors: Peng Dai, Yu Zhang, Feng Yiqiang, ZhenFan Fan, Yang Zhang
Title: Egocentric Visibility-Aware Human Pose Estimation
Abstract: Egocentric human pose estimation (HPE) using a head-mounted device is crucial for various VR and AR applications, but it faces significant challenges due to keypoint invisibility. Nevertheless, none of the existing egocentric HPE datasets provide keypoint visibility annotations, and the existing methods often overlook the invisibility problem, treating visible and invisible keypoints indiscriminately during estimation. As a result, their capacity to accurately predict visible keypoints is compromised. In this paper, we first present Eva-3M, a large-scale egocentric visibility-aware HPE dataset comprising over 3.0M frames, with 435K of them annotated with keypoint visibility labels. Additionally, we augment the existing EMHI dataset with keypoint visibility annotations to further facilitate the research in this direction. Furthermore, we propose EvaPose, a novel egocentric visibility-aware HPE method that explicitly incorporates visibility information to enhance pose estimation accuracy. Extensive experiments validate the significant value of ground-truth visibility labels in egocentric HPE settings, and demonstrate that our EvaPose achieves state-of-the-art performance on both the Eva-3M and EMHI datasets.
PaperID: 1065,   Poster  https://arxiv.org/pdf/2512.06662    
Authors: Ruoyu Xue, Hieu Le, Jingyi Xu, Sounak Mondal, Abe Leite, Gregory Zelinsky, Minh Nguyen Nguyen, Dimitris Samaras
Title: Personalized Image Descriptions from Attention Sequences
Abstract: People can view the same image differently: they focus on different regions, objects, and details in varying orders and describe them in distinct linguistic styles. This leads to substantial variability in image descriptions. However, existing models for personalized image description focus on linguistic style alone, with no prior work leveraging individual viewing patterns. We address this gap by explicitly modeling personalized viewing behavior as a core factor in description generation. Our method, DEPER (DEscription–PERception persona encoder), learns a subject embedding that captures both linguistic style and viewing behavior, guided by an auxiliary attention-prediction task. A lightweight adapter aligns these embeddings with a frozen vision–language model, enabling few-shot personalization without retraining. Across four datasets spanning diverse viewing tasks and both short and detailed descriptions, DEPER achieves a 24% average improvement, showing that modeling personalized attention produces more human-aligned and high-quality descriptions. We posit that understanding how people see helps predict what they say; modeling human diversity in perception can improve both performance and human alignment in multi-modal systems.
PaperID: 1066,   Poster  https://arxiv.org/pdf/2511.20643    
Authors: Adhiraj Ghosh, Vishaal Udandarao, Thao Nguyen, Matteo Farina, Mehdi Cherti, Jenia Jitsev, Sewoong Oh, Elisa Ricci, Ludwig Schmidt, Matthias Bethge
Title: Concept-Aware Batch Sampling Improves Language-Image Pretraining
Abstract: What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional dataset bias. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce CABS (Concept-Aware Batch Sampling), a simple yet effective batch-sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization, to curate batches with the broadest coverage of available concepts, and (ii) CABS-FM (Frequency Maximization), to curate batches with maximal object multiplicity. Through extensive evaluations with four visual backbones and a suite of 28 benchmarks, we demonstrate that CABS significantly benefits Language-Image Pretraining (LIP) and yields highly performant models on long-tailed evaluations. Overall, CABS represents a strong open-source alternative to proprietary online curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks. Both DataConcept and the source code for CABS will be made public.
PaperID: 1067,   Poster  https://arxiv.org/pdf/2512.10935    
Authors: Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma, Sebastian Scherer, Deva Ramanan
Title: Any4D: Unified Feed-Forward Metric 4D Reconstruction
Abstract: We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups, both in terms of accuracy (2-3× lower error) and compute efficiency (15× faster), opening avenues for multiple downstream applications.
PaperID: 1068,   Poster  https://arxiv.org/pdf/2510.12225    
Authors: Hritik Bansal, Devendra Singh Sachan, Kai-Wei Chang, Aditya Grover, Gargi Ghosh, Wen-tau Yih, Ramakanth Pasunuru
Title: HoneyBee: Data Recipes for Vision-Language Reasoners
Abstract: Recent advances in vision-language models (VLMs) have made them highly effective at reasoning tasks. However, the principles underlying the construction of performant VL reasoning training datasets remain poorly understood. In this work, we introduce several data curation approaches and study their impacts on VL reasoning capabilities by carefully controlling training and evaluation setups. We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions. Our findings reveal that (a) context source strategies significantly affect VLM performance, (b) interventions such as auxiliary signals from image captions and the inclusion of text-only reasoning yield substantial gains, and (c) scaling all data dimensions (e.g., unique questions per image and unique CoTs per image-question pair) consistently improves reasoning capability. Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples consisting of 350K image-question pairs. VLMs trained with HoneyBee outperform state-of-the-art models across model sizes. For instance, a HoneyBee-trained VLM with 3B parameters outperforms the SOTA model and the base model by 7.8% and 24.8%, respectively, on MathVerse. Furthermore, we propose a test-time scaling strategy that reduces decoding cost by 73% without sacrificing accuracy. Overall, this work presents improved strategies for VL reasoning dataset curation research.
PaperID: 1069,   Poster  https://arxiv.org/pdf/2512.05268    
Authors: Niki Nezakati, Arnab Ghosh, Amit Roy-Chowdhury, Vishwanath Saragadam
Title: CARD: Correlation Aware Restoration with Diffusion
Abstract: Denoising diffusion models have achieved state-of-the-art performance in image restoration by modeling the process as sequential denoising steps. However, most approaches assume independent and identically distributed (i.i.d.) Gaussian noise, while real-world sensors often exhibit spatially correlated noise due to readout mechanisms, limiting their practical effectiveness. We introduce Correlation Aware Restoration with Diffusion (CARD), a training-free extension of DDRM that explicitly handles correlated Gaussian noise. CARD first whitens the noisy observation, which converts the noise into an i.i.d. form. Then, the diffusion restoration steps are replaced with noise-whitened updates, which inherit DDRM's closed-form sampling efficiency while handling correlated noise. To emphasize the importance of addressing correlated noise, we contribute CIN-D, a novel correlated noise dataset captured across diverse illumination conditions to evaluate restoration methods on real rolling-shutter sensor noise. This dataset fills a critical gap in the literature for experimental evaluation with real-world correlated noise. Experiments on standard benchmarks with synthetic correlated noise and on CIN-D demonstrate that CARD consistently outperforms existing methods across denoising, deblurring, and super-resolution tasks.
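The whitening step mentioned in this abstract can be illustrated generically. The sketch below is not the CARD implementation: it assumes a known noise covariance Sigma (a made-up 2×2 example), factors it as Sigma = L Lᵀ with a Cholesky decomposition, and applies L⁻¹ to an observation. All function names are hypothetical. It then checks that the transformed noise covariance L⁻¹ Sigma L⁻ᵀ is the identity, which is what "converting the noise into an i.i.d. form" means for Gaussian noise.

```python
import math

def cholesky(sigma):
    """Lower-triangular L with L @ L^T == sigma (sigma symmetric positive definite)."""
    n = len(sigma)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L x = b for lower-triangular L, i.e. compute L^{-1} b."""
    x = []
    for i in range(len(b)):
        s = sum(L[i][k] * x[k] for k in range(i))
        x.append((b[i] - s) / L[i][i])
    return x

# Hypothetical covariance of two strongly correlated noise samples.
sigma = [[1.0, 0.8],
         [0.8, 1.0]]
L = cholesky(sigma)

# Whitening an observation y: y_white = L^{-1} y.
y = [2.0, 1.0]
y_white = forward_solve(L, y)

# Covariance of the whitened noise: L^{-1} Sigma L^{-T}.
n = len(sigma)
cols = [forward_solve(L, [sigma[r][c] for r in range(n)]) for c in range(n)]
rows_of_a = [[cols[c][r] for c in range(n)] for r in range(n)]  # A = L^{-1} Sigma
whitened_cov = [forward_solve(L, row) for row in rows_of_a]     # A L^{-T}, row by row
```

In a linear observation model y = Hx + n, the same transform would also have to be applied to the operator H so that the whitened system stays consistent; how CARD folds this into the DDRM updates is not specified in the abstract.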
PaperID: 1070,   Poster  https://arxiv.org/pdf/2604.10927    
Authors: Muhammad Usama Saleem, Mayur Jagdishbhai Patel, Ekkasit Pinyoanuntapong, Zhongxing Qin, Li Yang, Hongfei Xue, Ahmed Helmy, Chen Chen, Pu Wang
Title: LiveGesture: Streamable Co-Speech Gesture Generation Model
Abstract: We propose LiveGesture, the first fully streamable, speech-driven full-body gesture generation framework that operates with zero look-ahead and supports arbitrary sequence length. Unlike existing co-speech gesture methods, which are designed for offline generation and either treat body regions independently or entangle all joints within a single model, LiveGesture is built from the ground up for causal, region-coordinated motion generation. LiveGesture consists of two main modules: the Streamable Vector-Quantized Motion Tokenizer (SVQ) and the Hierarchical Autoregressive Transformer (HAR). The SVQ tokenizer converts the motion sequence of each body region into causal, discrete motion tokens, enabling real-time, streamable token decoding. On top of SVQ, HAR employs region-eXpert autoregressive (xAR) transformers to model expressive, fine-grained motion dynamics for each body region. A causal spatio-temporal fusion module (xAR-Fusion) then captures and integrates correlated motion dynamics across regions. Both xAR and xAR-Fusion are conditioned on live, continuously arriving audio signals encoded by a streamable causal audio encoder. To enhance robustness under streaming noise and prediction errors, we introduce autoregressive masking training, which leverages uncertainty-guided token masking and random region masking to expose the model to imperfect, partially erroneous histories during training. Experiments on the BEAT2 dataset demonstrate that LiveGesture produces coherent, diverse, and beat-synchronous full-body gestures in real time, matching or surpassing state-of-the-art offline methods under true zero-look-ahead conditions.
PaperID: 1071,   Poster  https://arxiv.org/pdf/2603.25942    
Authors: Peiyao Wang, Haotian Xu, Noranart Vesdapunt, Rui Hou, Jingyi Zhang, Haibin Ling, Oleksandr Obiednikov, Ning Zhou, Kah Fu Fu
Title: Reinforcing Structured Chain-of-Thought for Video Understanding
Abstract: Multimodal Large Language Models (MLLMs) show promise in video understanding. However, their reasoning often suffers from thinking drift and weak temporal comprehension, even when enhanced by Reinforcement Learning (RL) techniques like Group Relative Policy Optimization (GRPO). Moreover, existing RL methods often depend on Supervised Fine-Tuning (SFT), which requires costly Chain-of-Thought (CoT) annotation and multi-stage training, and enforces fixed reasoning paths, limiting MLLMs' ability to generalize and potentially inducing bias. To overcome these limitations, we introduce Summary-Driven Reinforcement Learning (SDRL), a novel single-stage RL framework that obviates the need for SFT by utilizing a Structured CoT format: Summarize → Think → Answer. SDRL introduces two self-supervised mechanisms integrated into the GRPO objective: 1) Consistency of Vision Knowledge (CVK) enforces factual grounding by reducing KL divergence among generated summaries; and 2) Dynamic Variety of Reasoning (DVR) promotes exploration by dynamically modulating thinking diversity based on group accuracy. This novel integration effectively balances alignment and exploration, supervising both the final answer and the reasoning process. Our method achieves state-of-the-art performance on seven public VideoQA datasets. Additionally, we construct and will release an 80K VideoQA dataset with temporal annotations.
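For readers unfamiliar with GRPO, the group-relative part can be sketched as follows. This shows only the commonly used reward normalization within a group of sampled responses; it does not implement SDRL's CVK or DVR terms, whose exact form the abstract does not give. The function name, the epsilon, and the use of the population standard deviation are assumptions (implementations differ on the last point).

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each response's reward against the other responses sampled
    for the same prompt: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four sampled answers: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage of equal magnitude; when every response in a group earns the same reward, all advantages are zero.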
PaperID: 1072,   Poster  https://arxiv.org/pdf/2602.20989    
Authors: Zheng Gu, Min Lu, Zhida Sun, Dani Lischinski, Daniel Cohen-Or, Hui Huang
Title: Cycle-Consistent Tuning for Layered Image Decomposition
Abstract: Disentangling visual layers in real-world images is a persistent challenge in vision and graphics, as such layers often involve non-linear and globally coupled interactions, including shading, reflection, and perspective distortion. In this work, we present an in-context image decomposition framework that leverages large diffusion foundation models for layered separation. We focus on the challenging case of logo-object decomposition, where the goal is to disentangle a logo from the surface on which it appears while faithfully preserving both layers. Our method fine-tunes a pretrained diffusion model via lightweight LoRA adaptation and introduces a cycle-consistent tuning strategy that jointly trains decomposition and composition models, enforcing reconstruction consistency between decomposed and recomposed images. This bidirectional supervision substantially enhances robustness in cases where the layers exhibit complex interactions. Furthermore, we introduce a progressive self-improving process, which iteratively augments the training set with high-quality model-generated examples to refine performance. Extensive experiments demonstrate that our approach achieves accurate and coherent decompositions and also generalizes effectively across other decomposition types, suggesting its potential as a unified framework for layered image decomposition.
PaperID: 1073,   Poster  https://arxiv.org/pdf/2510.27316    
Authors: Zijia An, Boyu Diao, RuiQi Liu, Libo Huang, Chuanguang Yang, Fei Wang, Zhulin An, Yongjun Xu
Title: Parameterized Prompt for Incremental Object Detection
Abstract: Recent studies have demonstrated that incorporating trainable prompts into pretrained models enables effective incremental learning. However, the application of prompts in incremental object detection (IOD) remains underexplored. Our study reveals that existing prompt-pool-based approaches assume disjoint class sets across incremental tasks, an assumption that is unsuitable for IOD because it overlooks the inherent co-occurrence phenomenon in detection. In co-occurring scenarios, unlabeled objects from previous tasks may appear in current task images, leading to confusion in the prompt pool. In this paper, we hold that prompt structures should exhibit adaptive consolidation properties across tasks, with constrained updates to prevent confusion and catastrophic forgetting. Motivated by this, we introduce Parameterized Prompts for Incremental Object Detection (P^2IOD). Leveraging neural networks' global evolution properties, P^2IOD employs networks as the parameterized prompts to adaptively consolidate knowledge across tasks. To constrain prompt structure updates, P^2IOD further engages a parameterized prompt fusion strategy. Extensive experiments on the PASCAL VOC2007 and MS COCO datasets demonstrate P^2IOD's effectiveness in IOD, where it achieves state-of-the-art performance among existing baselines.
PaperID: 1074,   Poster  https://arxiv.org/pdf/2512.07514    
Authors: JunKai Lin, Hang Long, Huipeng Guo, Jielei Zhang, JiaYi Yang, Tianle Guo, Yang Yang, Jianwen Li, Wenxiao ZHANG, Matthias Nießner, Wei Yang
Title: MeshRipple: Structured Autoregressive Generation of Artist-Meshes
Abstract: Meshes serve as a primary representation for 3D assets. Autoregressive mesh generators serialize faces into sequences and train on truncated segments with sliding-window inference to cope with memory limits. However, this mismatch breaks long-range geometric dependencies, producing holes and fragmented components. To address this critical limitation, we introduce MeshRipple, which expands a mesh outward from an active generation frontier, akin to a ripple on a surface. MeshRipple rests on three key innovations: a frontier-aware BFS tokenization that aligns the generation order with surface topology; an expansive prediction strategy that maintains coherent, connected surface growth; and a sparse-attention global memory that provides an effectively unbounded receptive field to resolve long-range topological dependencies. This integrated design enables MeshRipple to generate meshes with high surface fidelity and topological completeness, outperforming strong recent baselines.
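The idea of ordering faces so that generation order follows surface topology can be sketched with a plain breadth-first traversal over face adjacency. This is a generic BFS, not MeshRipple's frontier-aware tokenization, whose details the abstract does not give; the function names and the toy triangle strip are assumptions made for illustration.

```python
from collections import deque

def face_adjacency(faces):
    """Map each triangle index to the set of triangles sharing an edge with it."""
    edge_to_faces = {}
    for f, (a, b, c) in enumerate(faces):
        for u, v in ((a, b), (b, c), (c, a)):
            edge_to_faces.setdefault(frozenset((u, v)), []).append(f)
    adj = {f: set() for f in range(len(faces))}
    for shared in edge_to_faces.values():
        for f in shared:
            adj[f].update(g for g in shared if g != f)
    return adj

def bfs_face_order(faces, start=0):
    """Return face indices in breadth-first order, growing outward from `start`."""
    if not faces:
        return []
    adj = face_adjacency(faces)
    order, seen = [], set()
    # Try `start` first, then any face not yet reached (disconnected components).
    for seed in [start] + list(range(len(faces))):
        if seed in seen:
            continue
        seen.add(seed)
        queue = deque([seed])
        while queue:
            f = queue.popleft()
            order.append(f)
            for g in sorted(adj[f]):  # sorted for a deterministic order
                if g not in seen:
                    seen.add(g)
                    queue.append(g)
    return order

# Hypothetical strip of four triangles over vertices 0..5.
strip = [(0, 1, 2), (1, 3, 2), (2, 3, 4), (3, 5, 4)]
order_from_end = bfs_face_order(strip)              # [0, 1, 2, 3]
order_from_middle = bfs_face_order(strip, start=2)  # [2, 1, 3, 0]
```

Starting from the middle of the strip, the traversal visits both neighbours before the far end, which is the ripple-like outward growth the abstract describes informally.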
PaperID: 1075,   Poster  https://arxiv.org/pdf/2509.03113    
Authors: Shan Wang, Maying Shen, Nadine Chang, Chuong Nguyen, Hongdong Li, Jose M. Alvarez
Title: Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
Abstract: Multimodal large language models achieve strong performance across diverse tasks but remain prone to hallucinations, where outputs are not grounded in visual inputs. This issue can be attributed to two main biases: text–visual bias, the overreliance on prompts and prior outputs, and co-occurrence bias, spurious correlations between frequently paired objects. We propose Gradient-based Influence-Aware Constrained Decoding (GACD), an inference-based method that addresses both biases without auxiliary models and is readily applicable to existing models without finetuning. The core of our approach is bias estimation, which uses first-order Taylor gradients to understand the contribution of individual tokens (visual features and text tokens) to the current output. Based on this analysis, GACD mitigates hallucinations through two components: (1) suppressing spurious visual features correlated with the output objects, and (2) rebalancing cross-modal contributions by strengthening visual features relative to text. Experiments across multiple benchmarks demonstrate that GACD effectively reduces hallucinations and improves the visual grounding of MLLM outputs.
PaperID: 1076,   Poster  https://arxiv.org/pdf/2511.22533    
Authors: Mengyu Yang, Yanming Yang, Chenyi Xu, Chenxi Song, Yufan Zuo, Tong Zhao, Ruibo Li, Chi Zhang
Title: Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration
Abstract: Diffusion models have achieved impressive generative quality across modalities like 2D images, videos, and 3D shapes, but their inference remains computationally expensive due to the iterative denoising process. While recent caching-based methods effectively reuse redundant computations to speed up 2D and video generation, directly applying these techniques to 3D diffusion models can severely disrupt geometric consistency. In 3D synthesis, even minor numerical errors in cached latent features accumulate, causing structural artifacts and topological inconsistencies. To overcome this limitation, we propose Fast3Dcache, a training-free geometry-aware caching framework that accelerates 3D diffusion inference while preserving geometric fidelity. Our method introduces a Predictive Caching Scheduler Constraint (PCSC) to dynamically determine cache quotas according to voxel stabilization patterns and a Spatiotemporal Stability Criterion (SSC) to select stable features for reuse based on velocity magnitude and acceleration criterion. Comprehensive experiments show that Fast3Dcache accelerates inference significantly, achieving up to a 27.12% speed-up and a 54.8% reduction in FLOPs, with minimal degradation in geometric quality as measured by Chamfer Distance (2.48%) and F-Score (1.95%).
PaperID: 1077,   Poster  https://arxiv.org/pdf/2602.12279    
Authors: Liangyu Chen, Haoyu Ma, Zhipeng Fan, Ziqi Huang, Animesh Sinha, Xiaoliang Dai, Jialiang Wang, Zecheng He, Jianwei Yang, Chunyuan Li, Junzhe Sun, Chu Wang, Serena Yeung, Felix Juefei-Xu
Title: UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
Abstract: Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.
PaperID: 1078,   Poster  https://arxiv.org/pdf/2512.18640    
Authors: Kai Kohyama, Yoshimitsu Aoki, Guillermo Gallego, Shintaro Shiba
Title: Geometric-Photometric Event-based 3D Gaussian Ray Tracing
Abstract: Event cameras offer higher temporal resolution than traditional frame-based cameras, which makes them suitable for motion and structure estimation. However, it has been unclear how event-based 3D Gaussian Splatting (3DGS) approaches could leverage the fine-grained temporal information of sparse events. This work proposes a framework to address the trade-off between accuracy and temporal resolution in event-based 3DGS. Our key idea is to decouple the rendering into two branches: event-by-event geometry (depth) rendering and snapshot-based radiance (intensity) rendering, by using ray-tracing and the image of warped events. The extensive evaluation shows that our method achieves state-of-the-art performance on real-world datasets and competitive performance on synthetic datasets. Also, the proposed method works without prior information (e.g., pretrained image reconstruction models) or COLMAP-based initialization, is more flexible in the event accumulation size, and achieves sharp reconstruction on scene edges. We hope that this work deepens our understanding of the sparse nature of events for 3D reconstruction. We will release the code upon acceptance.
PaperID: 1079,   Poster  https://arxiv.org/pdf/2603.27059    
Authors: Zhihao Zhang, Abhinav Kumar, Xiaoming Liu
Title: Towards Intrinsic-Aware Monocular 3D Object Detection
Abstract: Monocular 3D object detection (Mono3D) aims to infer object locations and dimensions in 3D space from a single RGB image. Despite recent progress, existing methods remain highly sensitive to camera intrinsics and struggle to generalize across diverse settings, since intrinsic changes reshape how 3D scenes are projected onto the image plane. We propose MonoIA, a unified intrinsic-aware framework that models and adapts to intrinsic variation through a language-grounded representation. The key insight is that intrinsic variation is not a numeric difference but a perceptual transformation that alters apparent scale, perspective, and spatial geometry. To capture this effect, MonoIA employs large language models and vision–language models to generate intrinsic embeddings that encode the visual and geometric implications of camera parameters. These embeddings are hierarchically integrated into the detection network via an Intrinsic Adaptation Module, allowing the model to modulate its feature representations according to camera-specific configurations and maintain consistent 3D detection across intrinsics. This shifts intrinsic modeling from numeric conditioning to semantic representation, enabling robust and unified perception across cameras. Extensive experiments show that MonoIA achieves new state-of-the-art (SoTA) results on standard benchmarks including KITTI, Waymo, and nuScenes (e.g., +1.0% on the KITTI leaderboard), and further improves performance under multi-dataset training (e.g., +4.46% on KITTI Val).
PaperID: 1080,   Poster  https://arxiv.org/pdf/2602.13602    
Authors: Chenwei Xu, Zhen Ye, Shang Wu, Weijian Li, Zihan Wang, Zhuofan Xia, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu
Title: Towards Sparse Video Understanding and Reasoning
Abstract: We present ReViSe (Reasoning with Video Sparsity), a framework that combines multi-round reasoning with adaptive frame selection for video question answering (VQA). Existing vision–language models (VLMs) uniformly sample video frames, which introduces redundancy or irrelevancy. In contrast, ReViSe interactively selects informative frames through multi-round reasoning. To achieve this, ReViSe includes three modules: a multi-round conversation module that retains frame selection history as memory; a reasoning tracer that maintains a chain-of-thought across rounds; and a self-correction mechanism that enforces structural and behavioral validity. ReViSe integrates seamlessly with both proprietary and open-source VLMs. It supports proprietary models in a “plug-and-play” manner and enables reinforcement fine-tuning for open-source models. Experiments on multiple VQA benchmarks show that ReViSe improves the video understanding ability of VLMs, raising accuracy while reducing the number of frames used.
PaperID: 1081,   Poster  https://arxiv.org/pdf/2603.28064    
Authors: Renjie Wu, Hongdong Li, Jose M. Alvarez, Miaomiao Liu
Title: 4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction
Abstract: This paper addresses the problem of dynamic scene surface reconstruction using Gaussian Splatting (GS), aiming to recover temporally consistent geometry. While existing GS-based dynamic surface reconstruction methods can yield superior reconstruction, they are typically limited to either a single object or objects with only small deformations, struggling to maintain temporally consistent surface reconstruction of large deformations over time. We propose "4DSurf", a novel and unified framework for generic dynamic surface reconstruction that does not require specifying the number or types of objects in the scene and can handle large surface deformations and temporal inconsistency in reconstruction. The key innovation of our framework is the introduction of Gaussian deformations induced Signed Distance Function Flow Regularization that constrains the motion of Gaussians to align with the evolving surface. To handle large deformations, we introduce an Overlapping Segment Partitioning strategy that divides the sequence into overlapping segments with small deformations and incrementally passes geometric information across segments through the shared overlapping timestep. Experiments on two challenging dynamic scene datasets, Hi4D and CMU Panoptic, demonstrate that our method outperforms state-of-the-art surface reconstruction methods by 49% and 19% in Chamfer distance, respectively, and achieves superior temporal consistency under sparse-view settings.
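The partitioning idea in the abstract, consecutive segments that share one timestep, can be sketched as below. This is only an index-level illustration under assumed parameters: the function name and the fixed `segment_len` are hypothetical, and the abstract does not say how 4DSurf chooses segment boundaries or passes geometry across them.

```python
def overlapping_segments(num_frames, segment_len):
    """Split frame indices 0..num_frames-1 into segments in which each
    segment's last timestep is also the next segment's first timestep."""
    if segment_len < 2:
        raise ValueError("segment_len must be at least 2 to allow an overlap")
    if num_frames <= 0:
        return []
    segments = []
    start = 0
    while True:
        end = min(start + segment_len - 1, num_frames - 1)
        segments.append(list(range(start, end + 1)))
        if end == num_frames - 1:
            break
        start = end  # the shared timestep opens the next segment
    return segments

# Hypothetical 7-frame sequence split into segments of 3 frames.
segments = overlapping_segments(7, 3)  # [[0, 1, 2], [2, 3, 4], [4, 5, 6]]
```

Frames 2 and 4 each appear in two segments; they are where information from one segment could be handed to the next.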
PaperID: 1082,   Poster  https://arxiv.org/pdf/2603.18943    
Authors: Jiayi Yuan, Haobo Jiang, De Soh Soh, Na Zhao
Title: VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
Abstract: This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that together form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT’s perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT’s robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.
PaperID: 1083,   Poster  https://arxiv.org/pdf/2512.00565    
Authors: Nicolas Gorlo, Lukas Schmid, Luca Carlone
Title: Describe Anything Anywhere At Any Moment
Abstract: Computer vision and robotics applications ranging from augmented reality to robot autonomy in large-scale environments require spatio-temporal memory frameworks that capture both geometric structure for accurate language grounding and semantic detail. Existing methods face a tradeoff, where producing rich open-vocabulary descriptions comes at the expense of real-time performance when these descriptions have to be grounded in 3D. To address these challenges, we propose Describe Anything, Anywhere, at Any Moment (DAAAM), a novel spatio-temporal memory framework for large-scale and real-time 4D scene understanding. DAAAM introduces a novel optimization-based frontend to infer detailed semantic descriptions from localized captioning models, such as the Describe Anything Model (DAM), leveraging batch processing to speed up inference by an order of magnitude for online processing. It leverages such semantic understanding to build a hierarchical 4D scene graph (SG), which acts as an effective globally spatially and temporally consistent memory representation. DAAAM constructs 4D SGs with detailed, geometrically grounded descriptions while maintaining real-time performance. We show that DAAAM's 4D SG interfaces well with a tool-calling agent for inference and reasoning. We thoroughly evaluate DAAAM in the complex task of spatio-temporal question answering (SQA) on the NaVQA benchmark and show its generalization capabilities for sequential task grounding on the SG3D benchmark. We further curate an extended OC-NaVQA benchmark for large-scale and long-time evaluations. DAAAM achieves state-of-the-art results in both tasks, improving OC-NaVQA question accuracy by 53.6%, position errors by 21.9%, temporal errors by 21.6%, and SG3D task grounding accuracy by 27.8% over the most competitive baselines, respectively. We release our data and code as open source.
PaperID: 1084,   Poster  https://arxiv.org/pdf/2604.03799    
Authors: Zhiwei Zheng, Shibo Jin, Lingjie Liu, Mingmin Zhao
Title: Next-Scale Autoregressive Models for Text-to-Motion Generation
Abstract: Autoregressive (AR) models offer stable and efficient training, but standard next-token prediction is not well aligned with the temporal structure required for text-conditioned motion generation. We introduce MoScale, a next-scale AR framework that generates motion hierarchically from coarse to fine temporal resolutions. By providing global semantics at the coarsest scale and refining them progressively, MoScale establishes a causal hierarchy better suited for long-range motion structure. To improve robustness under limited text–motion data, we further incorporate cross-scale hierarchical refinement for improving coarse-scale predictions and in-scale temporal refinement for selective bidirectional re-prediction. MoScale achieves state-of-the-art text-to-motion performance with high training efficiency, scales effectively with model size, and generalizes zero-shot to diverse motion generation and editing tasks.
PaperID: 1085,   Poster  https://arxiv.org/pdf/2511.18964    
Authors: Antonia Wüst, Wolfgang Stammer, Hikaru Shindo, Lukas Helff, Devendra Singh Dhami, Kristian Kersting
Title: Synthesizing Visual Concepts as Vision-Language Programs
Abstract: Vision-Language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning tasks, leading to inconsistent or illogical outputs. Neuro-symbolic methods promise to address this by inducing interpretable logical rules, though they rely on rigid, domain-specific perception modules. We propose Vision-Language Programs (VLPs), which combine the perceptual flexibility of VLMs with the systematic reasoning of program synthesis. Rather than embedding reasoning inside the VLM, VLPs leverage the model to produce structured visual descriptions that are compiled into neuro-symbolic programs. The resulting programs execute directly on images, remain consistent with task constraints, and provide human-interpretable explanations that enable easy shortcut mitigation. Experiments on synthetic and real-world datasets demonstrate that VLPs outperform direct and structured prompting, particularly on tasks requiring complex logical reasoning.
PaperID: 1086,   Poster  https://arxiv.org/pdf/2601.11194    
Authors: Boyi Pang, Savva Ignatyev, Vladimir Ippolitov, Ramil Khafizov, Yurii Melnik, Oleg Voynov, Maksim Nakhodnov, Aibek Alanov, Xiaopeng Fan, Peter Wonka, Evgeny Burnaev
Title: One Algorithm to Align Them All
Abstract: We suggest a new multimodal algorithm for joint inference of paired, structurally aligned samples with Rectified Flow models. While some existing methods propose a codependent generation process, they do not view the problem of joint generation from a structural alignment perspective. Recent work uses Score Distillation Sampling (SDS) to generate aligned 3D models, but SDS is known to be time-consuming, prone to mode collapse, and often provides cartoonish results. By contrast, our suggested approach relies on the joint transport of a segment in the sample space, yielding faster computation at inference time. Our approach can be built on top of an arbitrary Rectified Flow model operating on the structured latent space. We show the applicability of our method to the domains of image, video, and 3D shape generation using state-of-the-art baselines and evaluate it against both editing-based and joint inference-based competing approaches. We demonstrate a high degree of structural alignment for the sample pairs obtained with our method and a high visual quality of the samples. Our method improves on the state of the art for image and video generation pipelines. For 3D generation, it shows comparable quality while running orders of magnitude faster.
PaperID: 1087,   Poster  https://arxiv.org/pdf/2512.21985    
Authors: Alexander Koebler, Lukas Kuhn, Ingo Thon, Florian Buettner
Title: LVLM-Aided Alignment of Task-Specific Vision Models
Abstract: In high-stakes domains, small task-specific vision models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do not align well with human domain knowledge, relying instead on spurious correlations. This might result in brittle behavior once deployed in the real world. To address this issue, we introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge by leveraging the generalization capabilities of a Large Vision Language Model (LVLM). Our LVLM-Aided Visual Alignment (LVLM-VA) method provides a bidirectional interface that translates model behavior into natural language and maps human class-level specifications to image-level critiques, enabling effective interaction between domain experts and the model. Our method demonstrates substantial improvement in aligning model behavior with human specifications, as validated on both synthetic and real-world datasets. We show that it effectively reduces the model’s dependence on spurious features and on group-specific biases, without requiring fine-grained feedback.
PaperID: 1088,   Poster  https://arxiv.org/pdf/2603.00595    
Authors: Chenggang Rong, Tao Han, Zhiyuan Zhao, Yaowu Fan, Jia Wan, Song Guo, Yuan Yuan, Junyu Gao
Title: UNICBench: UNIfied Counting Benchmark for MLLM
Abstract: Counting is a core capability for multimodal large language models (MLLMs), yet there is no unified counting dataset to rigorously evaluate this ability across image, text, and audio. We present UNICBench, a unified multimodal, multi‑level counting benchmark and evaluation toolkit with accurate ground truth, deterministic numeric parsing, and stratified reporting. The corpus comprises 5,300 images (5,508 QA), 872 documents (5,888 QA), and 2,069 audio clips (2,905 QA), annotated with a three‑level capability taxonomy and difficulty tags. Under a standardized protocol with fixed splits/prompts/seeds and modality‑specific matching rules, we evaluate 45 state‑of‑the‑art MLLMs across modalities. Results show strong performance on some basic counting tasks but significant gaps on reasoning and the hardest partitions, highlighting long‑tail errors and substantial headroom for improving general counting. UNICBench offers a rigorous and comparable basis for measurement and a public toolkit to accelerate progress.
PaperID: 1089,   Poster  https://arxiv.org/pdf/2405.15965    
Authors: Haiyu Wu, Sicong Tian, Aman Bhatta, Jacob Gutierrez, Grace Bezold, Genesis Argueta, Karl Ricanek, Michael King, Kevin W. Bowyer
Title: Goldilocks Test Sets for Face Verification
Abstract: Reported face verification accuracy has reached a plateau on current well-known test sets. As a result, some difficult test sets have been assembled by reducing the image quality or adding artifacts to the image. However, we argue that test sets can be challenging without artificially reducing the image quality because face recognition (FR) models struggle to correctly recognize 1) the pairs from the same identity (i.e., genuine pairs) with a large face attribute difference, 2) the pairs from different identities (i.e., impostor pairs) with a small face attribute difference, and 3) the pairs of similar-looking identities (e.g., twins and relatives). We propose three challenging test sets to reveal important but ignored weaknesses of the existing FR algorithms. To challenge models on variation of facial attributes, we propose Hadrian and Eclipse to address facial hair differences and face exposure differences. The images in both test sets are high-quality and collected in a controlled environment. To challenge FR models on similar-looking persons, we propose twins-IND, which contains images from a dedicated twins dataset. The LFW test protocol is used to structure the proposed test sets. Moreover, we introduce additional rules to assemble “Goldilocks”-level test sets, including 1) a restricted number of occurrences of hard samples, 2) equal chance evaluation across demographic groups, and 3) constrained identity overlap across validation folds. Quantitatively, without further processing the images, the proposed test sets have on-par or higher difficulty than the existing test sets that add artifacts to the images.
PaperID: 1090,   Poster  https://arxiv.org/pdf/2603.03282    
Authors: M. Hamza Mughal, Rishabh Dabral, Vera Demberg, Christian Theobalt
Title: MIBURI: Towards Expressive Interactive Gesture Synthesis
Abstract: Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, an online, causal framework for generating expressive co-speech gestures and facial expressions synchronized with real-time spoken dialogue. We first employ body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce contrastive objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal and real-time approach produces natural and contextually aligned gestures compared with recent baselines.
PaperID: 1091,   Poster  https://arxiv.org/pdf/2511.18765    
Authors: Hui Shan, LI MING, Haitao Yang, Kai Zheng, Sizhe Zheng, Yanwei Fu, Xiangru Huang
Title: NI-Tex: Non-isometric Image-based Garment Texture Generation
Abstract: Existing industrial 3D garment meshes already cover most real-world clothing geometries, yet their texture diversity remains limited. To acquire more realistic textures, generative methods are often used to extract Physically-based Rendering (PBR) textures and materials from large collections of wild images and project them back onto garment meshes. However, most image-conditioned texture generation approaches require strict topological consistency between the input image and the input 3D mesh, or rely on accurate mesh deformation to match the image poses, which significantly constrains the texture generation quality and flexibility. To address the challenging problem of non-isometric image-based garment texture generation, we construct 3D Garment Videos, a physically simulated, garment-centric dataset that provides consistent geometry and material supervision across diverse deformations, enabling robust cross-pose texture learning. We further employ Nano Banana for high-quality non-isometric image editing, achieving reliable cross-topology texture generation between non-isometric image-geometry pairs. Finally, we propose an iterative baking method via uncertainty-guided view selection and reweighting that fuses multi-view predictions into seamless, production-ready PBR textures. Through extensive experiments, we demonstrate that our feedforward dual-branch architecture generates versatile and spatially aligned PBR materials suitable for industry-level 3D garment design.
PaperID: 1092,   Poster  https://arxiv.org/pdf/2602.19585    
Authors: Chunlei Meng, Jiabin Luo, Zhenglin Yan, Zhenyu Yu, Rong Fu, Zhongxue Gan, Chun Ouyang
Title: Tri-Subspaces Disentanglement for Multimodal Sentiment Analysis
Abstract: Multimodal Sentiment Analysis (MSA) integrates language, visual, and acoustic modalities to infer human sentiment. Most existing methods focus either on globally shared representations or on modality-specific features, while overlooking signals that are shared only by certain modality pairs. This limits the expressiveness and discriminative power of multimodal representations. To address this limitation, we propose a Tri-Subspace Disentanglement (TSD) framework that explicitly factorizes features into three complementary subspaces: a common subspace capturing global consistency, submodally-shared subspaces modeling pairwise cross-modal synergies, and private subspaces preserving modality-specific cues. To keep these subspaces pure and independent, we introduce a decoupling supervisor together with structured regularization losses. We further design a Subspace-Aware Cross-Attention (SACA) fusion module that adaptively models and integrates information from the three subspaces to obtain richer and more robust representations. Experiments on CMU-MOSI and CMU-MOSEI demonstrate that TSD achieves state-of-the-art performance across all key metrics, reaching 0.691 MAE on CMU-MOSI and 54.9% ACC_7 on CMU-MOSEI, and also transfers well to multimodal intent recognition tasks. Ablation studies confirm that tri-subspace disentanglement and SACA jointly enhance the modeling of multi-granular cross-modal sentiment cues.
PaperID: 1093,   Poster  https://arxiv.org/pdf/2602.22625    
Authors: Seongmin Hong, Junghun James Kim, Daehyeop Kim, Insoo Chung, Se Young Chun
Title: DiffBMP: Differentiable Rendering with Bitmap Primitives
Abstract: We introduce DiffBMP, a scalable and efficient differentiable rendering engine for a collection of bitmap images. Our work addresses a limitation that traditional differentiable renderers are constrained to vector graphics, given that most images in the world are bitmaps. Our core contribution is a highly parallelized rendering pipeline, featuring a custom CUDA implementation for calculating gradients. This system can, for example, optimize the position, rotation, scale, color, and opacity of thousands of bitmap primitives all in under 1 min using a consumer GPU. We employ and validate several techniques to facilitate the optimization: soft rasterization via Gaussian blur, structure-aware initialization, noisy canvas, and specialized losses/heuristics for videos or spatially constrained images. We demonstrate DiffBMP is not just an isolated tool, but a practical one designed to integrate into creative workflows. It supports exporting compositions to a native, layered file format, and the entire framework is publicly accessible via an easy-to-hack Python package.
PaperID: 1094,   Poster  https://arxiv.org/pdf/2601.14671    
Authors: Yonghao Yu, Lang Huang, Zerun Wang, Runyi Li, Toshihiko Yamasaki
Title: Mirai: Autoregressive Visual Generation Needs Foresight
Abstract: Autoregressive (AR) visual generators model images as sequences of discrete tokens and are trained with next-token likelihood. This strict causality supervision optimizes each step only by its immediate next token, which diminishes global coherence and slows convergence. We ask whether foresight, training signals that originate from later tokens, can help AR visual generation. We conduct a series of controlled diagnostics along the injection level, foresight layout, and foresight source axes, unveiling a key insight: aligning foresight to AR models' internal representation on the 2D image grids improves causality modeling. We formulate this insight with Mirai (meaning "future" in Japanese), a general framework that injects future information into AR training with no architecture change and no extra inference overhead: Mirai-E uses explicit foresight from multiple future positions of unidirectional representations, whereas Mirai-I leverages implicit foresight from matched bidirectional representations. Extensive experiments show that Mirai significantly accelerates convergence and improves generation quality. For instance, Mirai can speed up LlamaGen-B's convergence by up to 10× and reduce the generation FID from 5.34 to 4.34 on the ImageNet class-conditional image generation benchmark. Our study highlights that visual autoregressive models need foresight.
PaperID: 1095,   Poster  https://arxiv.org/pdf/2511.16857    
Authors: Vineet Bhat, Sungsu Kim, Valts Blukis, Greg Heinrich, Prashanth Krishnamurthy, Ramesh Karri, Stan Birchfield, Farshad Khorrami, Jonathan Tremblay
Title: BOP-ASK: Object-Interaction Reasoning for Vision-Language Models
Abstract: Vision–Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ("left of," "behind," etc.) but ignore fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object-interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets, from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question–answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-sourced VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. We will publicly release our datasets and dataset generation pipeline.
PaperID: 1096,   Poster  https://arxiv.org/pdf/2602.18853    
Authors: Dong Zhao, Qi Zang, Nan Pu, Wenjing Li, Nicu Sebe, Zhun Zhong
Title: Open-Vocabulary Domain Generalization in Urban-Scene Segmentation
Abstract: Domain Generalization in Semantic Segmentation (DG-SS) aims to enable segmentation models to perform robustly in unseen environments. However, conventional DG-SS methods are restricted to a fixed set of known categories, limiting their applicability in open-world scenarios. Recent progress in Vision-Language Models (VLMs) has advanced Open-Vocabulary Semantic Segmentation (OV-SS) by enabling models to recognize a broader range of concepts. Yet, these models remain sensitive to domain shifts and struggle to maintain robustness when deployed in unseen environments, a challenge that is particularly severe in urban-driving scenarios. To bridge this gap, we introduce Open-Vocabulary Domain Generalization in Semantic Segmentation (OVDG-SS), a new setting that jointly addresses unseen domains and unseen categories. We introduce the first benchmark for OVDG-SS in autonomous driving, addressing a previously unexplored problem and covering both synthetic-to-real and real-to-real generalization across diverse unseen domains and unseen categories. In OVDG-SS, we observe that domain shifts often distort text–image correlations in pre-trained VLMs, which hinders the performance of OV-SS models. To tackle this challenge, we propose S^2-Corr, a state-space-driven text–image correlation refinement mechanism that can mitigate domain-induced distortions and produce a more consistent text–image correlation under distribution changes. Extensive experiments on our constructed benchmark demonstrate that the proposed method achieves superior cross-domain performance and efficiency compared to existing OV-SS approaches.
PaperID: 1097,   Poster  https://arxiv.org/pdf/2601.06965    
Authors: Yu Zhong, Tianwei Lin, Ruike Zhu, Yuqian Yuan, Haoyu Zheng, Liang Liang, Wenqiao Zhang, Feifei Shao, Haoyuan Li, Wanggui He, Hao Jiang, Yueting Zhuang
Title: Unified Personalized Understanding, Generating and Editing
Abstract: Unified large multimodal models (LMMs) have achieved remarkable progress in general-purpose multimodal understanding and generation. However, they still operate under a "one-size-fits-all" paradigm and struggle to model user-specific concepts (e.g., generate a photo of \texttt\) in a consistent and controllable manner. Existing personalization methods typically rely on external retrieval, which is inefficient and poorly integrated into unified multimodal pipelines. Recent personalized unified models introduce learnable soft prompts to encode concept information, yet they either couple understanding and generation or depend on complex multi-stage training, leading to cross-task interference and ultimately to fuzzy or misaligned personalized knowledge. We present OmniPersona, an end-to-end personalization framework for unified LMMs that, for the first time, integrates personalized understanding, generation, and image editing within a single architecture. OmniPersona introduces structurally decoupled concept tokens, allocating dedicated subspaces for different tasks to minimize interference, and incorporates an explicit knowledge replay mechanism that propagates personalized attribute knowledge across tasks, enabling consistent personalized behavior. To systematically evaluate unified personalization, we propose OmniPBench, extending the public UnifyBench concept set with personalized editing tasks and cross-task evaluation protocols integrating understanding, generation, and editing. Experimental results demonstrate that OmniPersona delivers competitive and robust performance across diverse personalization tasks. We hope OmniPersona will serve as a strong baseline and spur further research on controllable, unified personalization.
PaperID: 1098,   Poster  https://arxiv.org/pdf/2509.13801    
Authors: Wenlve Zhou, Zhiheng Zhou, Tiantao Xian, Yikui Zhai, Weibin Wu, Biyun MA
Title: Masked Representation Modeling for Domain-Adaptive Segmentation
Abstract: Unsupervised domain adaptation (UDA) for semantic segmentation seeks to transfer models from a labeled source domain to an unlabeled target domain. While auxiliary self-supervised tasks such as contrastive learning have enhanced feature discriminability, masked modeling remains underexplored due to architectural constraints and misaligned objectives. We propose Masked Representation Modeling (MRM), an auxiliary task that performs representation masking and reconstruction directly in the latent space. Unlike prior masked modeling methods that reconstruct low-level signals (e.g., pixels or visual tokens), MRM targets high-level semantic features, aligning its objective with segmentation and integrating seamlessly into standard architectures like DeepLab and DAFormer. To support efficient reconstruction, we design a lightweight auxiliary module, Rebuilder, which is jointly trained with the segmentation network but removed during inference, introducing zero test-time overhead. Extensive experiments demonstrate that MRM consistently improves segmentation performance across diverse architectures and UDA benchmarks. When integrated with four representative baselines, MRM achieves an average gain of +2.3 mIoU on GTA → Cityscapes and +2.8 mIoU on Cityscapes → Synthia, establishing it as a simple, effective, and generalizable strategy for unsupervised domain-adaptive semantic segmentation.
PaperID: 1099,   Poster  https://arxiv.org/pdf/2511.19996    
Authors: Dishanika Dewani Denipitiyage, Naveen Karunanayake, Suranga Seneviratne, Sanjay Chawla
Title: RankOOD - Class Ranking-based Out-of-Distribution Detection
Abstract: We propose RankOOD, a rank-based Out-of-Distribution (OOD) detection approach based on training a model with the Plackett-Luce loss, which is now extensively used for preference alignment tasks in foundation models. Our approach is based on the insight that with a deep learning model trained using the Cross Entropy Loss, in-distribution (ID) class prediction induces a ranking pattern for each ID class prediction. The RankOOD framework formalizes the insight by first extracting a rank list for each class using an initial classifier and then running another round of training with the Plackett-Luce loss, where the class rank, a fixed permutation for each class, is the predicted variable. An OOD example may get assigned with high probability to an ID class, but the probability of it respecting the ranking classification is likely to be small. RankOOD achieves SOTA performance on the near-OOD TinyImageNet evaluation benchmark, reducing FPR95 by 4.3%.
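As a rough illustration of the Plackett-Luce loss named in the abstract above, the sketch below computes the negative log-likelihood of one ranking given per-class scores. It shows only the standard Plackett-Luce formula; the function name `plackett_luce_nll`, the score values, and how RankOOD actually builds or uses its rank lists are assumptions, not details taken from the paper.

```python
import math

def plackett_luce_nll(scores, ranking):
    """Negative log-likelihood of `ranking` (best item first) under Plackett-Luce.

    `scores` holds one real-valued logit per class; `ranking` is a permutation
    of class indices. Both names are hypothetical, chosen for this sketch.
    """
    nll = 0.0
    remaining = list(ranking)
    for idx in ranking:
        # Log-sum-exp over the classes not yet placed, shifted for stability.
        m = max(scores[j] for j in remaining)
        log_z = m + math.log(sum(math.exp(scores[j] - m) for j in remaining))
        # Each step is a softmax choice of `idx` among the remaining classes.
        nll += log_z - scores[idx]
        remaining.remove(idx)
    return nll

# A ranking that agrees with the scores is more likely than its reverse.
agree = plackett_luce_nll([3.0, 1.0, 0.0], [0, 1, 2])
reverse = plackett_luce_nll([3.0, 1.0, 0.0], [2, 1, 0])
```

With equal scores every permutation is equally likely, so the loss for three classes equals log(3!).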
PaperID: 1100,   Poster  https://arxiv.org/pdf/2603.21175    
Authors: Kwanyoung Kim, Byeongsu Sim
Title: Reward Sharpness-Aware Fine-Tuning for Diffusion Models
Abstract: Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises from the non-robustness of reward model gradients, particularly when the reward landscape with respect to the input image is sharp. To mitigate this issue, we introduce methods that exploit gradients from a robustified reward model, without requiring its retraining. Specifically, we employ gradients from a flattened reward model, obtained through parameter perturbations of the diffusion model and perturbations of its generated samples. Empirically, each method independently alleviates reward hacking and improves robustness, while their joint use amplifies these benefits. Our resulting framework, RSA-FT (Reward Sharpness-Aware Fine-Tuning), is simple, broadly compatible, and consistently enhances the reliability of RDRL.
PaperID: 1101,   Poster  https://arxiv.org/pdf/2511.13889    
Authors: Abdul Rehman, Iqra Rasool, Ayisha Imran, Mohsen Ali, Waqas Sultani
Title: Uni-Hema: Unified Model for Digital Hematopathology
Abstract: Digital hematopathology requires cell-level analysis across diverse disease categories, including malignant disorders (e.g., leukemia), infectious conditions (e.g., malaria), and non-malignant red blood cell disorders (e.g., sickle cell disease). Existing approaches, whether single-task, vision-language, WSI-optimized, or single-cell hematology models, share a key limitation: they cannot provide unified, multi-task, multi-modal reasoning across the complexities of digital hematopathology. To overcome these limitations, we propose Uni-Hema, a multi-task, unified model for digital hematopathology integrating detection, classification, segmentation, morphology prediction, and reasoning across multiple diseases. Uni-Hema leverages 46 publicly available datasets, encompassing over 700K images and 21K question–answer pairs, and is built upon Hema-Former, a multimodal module that bridges visual and linguistic representations at the hierarchy level for the different tasks (detection, classification, segmentation, morphology, masked language modeling, and visual question answering) at different granularities. Extensive experiments demonstrate that Uni-Hema achieves comparable or superior performance to models trained on a single task and a single dataset, across diverse hematological tasks, while providing interpretable, morphologically relevant insights at the single-cell level. Our framework establishes a new standard for multi-task and multi-modal digital hematopathology. The code will be made publicly available.
PaperID: 1102,   Poster  https://arxiv.org/pdf/2602.07564    
Authors: Xiaoyan Zhang, Zechen Bai, Haofan Wang, Yiren Song
Title: SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens
Abstract: Recent unified models such as Bagel demonstrate that paired image–edit data can effectively align multiple visual tasks within a single diffusion transformer. However, these models remain limited to single-condition inputs and lack the flexibility needed to synthesize results from multiple heterogeneous sources. We present SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. SIGMA introduces selective multi-attribute tokens, including style, content, subject, and identity tokens, which allow the model to interpret and compose multiple visual conditions in an interleaved text–image sequence. Through post-training on the Bagel unified backbone with 700K interleaved examples, SIGMA supports compositional editing, selective attribute transfer, and fine-grained multimodal alignment. Extensive experiments show that SIGMA improves controllability, cross-condition consistency, and visual quality across diverse editing and generation tasks, with substantial gains over Bagel on compositional tasks.
PaperID: 1103,   Poster  https://arxiv.org/pdf/2603.10158    
Authors: Guangqi Jiang, Yutong Liang, Jianglong Ye, Jia Huang, Changwei Jing, Rocky Duan, Pieter Abbeel, Xiaolong Wang, Xueyan Zou
Title: Cross-Hand Latent Representation for Vision-Language-Action models
Abstract: Dexterous manipulation is essential for real-world robot autonomy, mirroring the central role of human hand coordination in daily activity. Humans rely on rich multimodal perception (vision, sound, and language-guided intent) to perform dexterous actions, motivating vision-based, language-conditioned manipulation systems for robots. However, training reliable vision-language-action (VLA) models for dexterous manipulation requires large-scale demonstrations across many robotic hands. In addition, as new dexterous embodiments appear rapidly, collecting data for each becomes costly and impractical, creating a need for scalable cross-embodiment learning. We introduce a vision-language-action framework integrated with a unified latent action space shared across diverse dexterous hands. This embodiment-invariant latent space is directly pluggable into standard VLA architectures, enabling seamless cross-embodiment training and efficient reuse of both existing and newly collected data. Experimental results demonstrate that our framework consistently outperforms baseline VLA models operating in raw joint spaces, establishing it as an effective solution for scalable cross-embodiment dexterous manipulation.
PaperID: 1104,   Poster  https://arxiv.org/pdf/2511.18378    
Authors: Shijian Wang, Runhao Fu, Siyi Zhao, Qingqin Zhan, Xingjian Wang, Jiarui Jin, Yuan Lu, Hanqian Wu, Cunjian Chen
Title: Synthetic Curriculum Reinforces Compositional Text-to-Image Generation
Abstract: Text-to-Image (T2I) generation has long been an open problem, with compositional synthesis remaining particularly challenging. This task requires accurate rendering of complex scenes containing multiple objects that exhibit diverse attributes as well as intricate spatial and semantic relationships, demanding both precise object placement and coherent inter-object interactions. In this paper, we propose a novel compositional curriculum reinforcement learning framework named CompGen that addresses compositional weakness in existing T2I models. Specifically, we leverage scene graphs to establish a novel difficulty criterion for compositional ability and develop a corresponding adaptive Markov Chain Monte Carlo graph sampling algorithm. This difficulty-aware approach enables the synthesis of training curriculum data that progressively optimize T2I models through reinforcement learning. We integrate our curriculum learning approach into Group Relative Policy Optimization (GRPO) and investigate different curriculum scheduling strategies. Our experiments reveal that CompGen exhibits distinct scaling curves under different curriculum scheduling strategies, with easy-to-hard and Gaussian sampling strategies yielding superior scaling performance compared to random sampling. Extensive experiments demonstrate that CompGen significantly enhances compositional generation capabilities for both diffusion-based and auto-regressive T2I models, highlighting its effectiveness in improving compositional T2I generation systems.
PaperID: 1105,   Poster  https://arxiv.org/pdf/2602.19064    
Authors: QUAN LIU, Xiaoqin Zhang, Ling Shao, Shijian Lu
Title: L3DR: 3D-aware LiDAR Diffusion and Rectification
Abstract: Range-view (RV) based LiDAR diffusion has recently made huge strides towards 2D photo-realism. However, it neglects 3D geometry realism and often generates various RV artifacts such as depth bleeding and wavy surfaces. We design L3DR, a 3D-aware LiDAR Diffusion and Rectification framework that can regress and cancel RV artifacts in 3D space and restore local geometry accurately. Our theoretical and empirical analysis reveals that 3D models are inherently superior to 2D models in generating sharp and authentic boundaries. Leveraging such analysis, we design a 3D residual regression network that rectifies RV artifacts and achieves superb geometry realism by predicting point-level offsets in 3D space. On top of that, we design a Welsch Loss that helps focus on local geometry and ignore anomalous regions effectively. Extensive experiments over multiple benchmarks including KITTI, KITTI360, nuScenes, and Waymo show that the proposed L3DR achieves state-of-the-art generation and superior geometry realism consistently. In addition, L3DR is generally applicable to different LiDAR diffusion models with little computational overhead. Code will be released.
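To make the role of a Welsch-type loss concrete, here is a minimal sketch of the classical Welsch robust loss from robust statistics. The abstract does not give the exact form used in L3DR, so the parametrization, the scale parameter `c`, and the function name `welsch_loss` are assumptions for illustration only.

```python
import math

def welsch_loss(residual, c=1.0):
    """Classical Welsch robust loss (assumed form, not taken from the paper).

    Behaves like residual**2 / 2 near zero and saturates at c**2 / 2 for
    large residuals, so outlying regions contribute a bounded penalty.
    """
    return (c * c / 2.0) * (1.0 - math.exp(-(residual / c) ** 2))

small = welsch_loss(0.01)    # close to the quadratic value 0.01**2 / 2
large = welsch_loss(100.0)   # close to the saturation value c**2 / 2
```

Because the loss flattens for large residuals, its gradient there goes to zero, which is the usual reason such a loss can down-weight anomalous points while still fitting local structure.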
PaperID: 1106,   Poster  https://arxiv.org/pdf/2510.12798    
Authors: Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, Lei Zhang
Title: Detect Anything via Next Point Prediction
Abstract: Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs to tackle this task, they face challenges like low recall rate, duplicate predictions, coordinate misalignment, etc. In this work, we bridge this gap and propose Rex-Omni, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: 1) Task Formulation: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model's learning difficulty and improving token efficiency for coordinate prediction; 2) Data Engines: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; 3) Training Pipelines: we employ a two-stage training process, combining supervised fine-tuning on 22 million data samples with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors like duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni's inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR, and key-pointing, all systematically evaluated on dedicated benchmarks. We believe that Rex-Omni paves the way for more versatile and language-aware visual perception systems.
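The coordinate quantization described in design 1) can be sketched as follows. This is only an illustration of mapping pixel coordinates to integer bins 0 to 999 and back; the helper names (`quantize`, `dequantize`, `box_to_tokens`), the rounding rule, and the box layout are assumptions, and the model's real token vocabulary is not shown in the abstract.

```python
NUM_BINS = 1000  # bins 0..999, matching the range stated in the abstract

def quantize(value, size):
    """Map a pixel coordinate in [0, size] to an integer bin in [0, 999]."""
    frac = min(max(value / size, 0.0), 1.0)        # normalise and clamp
    return min(int(frac * NUM_BINS), NUM_BINS - 1)  # right edge falls in bin 999

def dequantize(bin_id, size):
    """Return the pixel coordinate at the centre of a bin."""
    return (bin_id + 0.5) / NUM_BINS * size

def box_to_tokens(box, width, height):
    """Convert an (x0, y0, x1, y1) pixel box into four coordinate bins."""
    x0, y0, x1, y1 = box
    return [quantize(x0, width), quantize(y0, height),
            quantize(x1, width), quantize(y1, height)]

tokens = box_to_tokens((0, 0, 640, 480), 640, 480)
```

Under this scheme a box always costs four tokens regardless of image resolution, and the round-trip error is at most half a bin width, which is the discrete-to-continuous gap the abstract refers to.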
PaperID: 1107,   Poster  https://arxiv.org/pdf/2512.06674    
Authors: yueming lyu, Rufan Qian, Yueming Lyu, Qinglong Liu, Linzhuang Zou, Jie Qin, Songhua Liu, Caifeng Shan
Title: RunawayEvil: Jailbreaking the Image-to-Video Generative Models
Abstract: Image-to-Video (I2V) generation represents a frontier in content creation, where models synthesize dynamic visual sequences by jointly reasoning from both image and text prompts. This multimodal grounding enables diverse controllability over video attributes. However, it is precisely this capability that introduces a critical security blind spot: by exploiting the interplay between visual and textual cues, attackers can launch multimodal jailbreak attacks that severely compromise output security. Despite the increasing implementation of security mechanisms in real-world I2V systems, such cross-modal threats remain unexplored. Existing attack methods remain confined to single-modal settings, relying solely on isolated text or image perturbations, which severely limits their effectiveness. To bridge this gap, we propose RunawayEvil, the first multimodal jailbreaking framework for I2V models with dynamic evolutionary capability. Built on a Strategy-Tactic-Action paradigm, our framework exhibits self-amplifying attacks through three core components: (1) a strategy-aware command unit that enables the attack to self-evolve its strategies through reinforcement learning-driven strategy customization and large language model (LLM)-based strategy exploration; (2) a multimodal tactical planning unit that generates synergistic text jailbreak instructions and image tampering guidelines based on the selected strategies; and (3) a tactical action unit that executes and evaluates the coordinated attacks. This self-evolving architecture allows the framework to continuously adapt and intensify its attack strategies without human intervention. Extensive experiments demonstrate that RunawayEvil achieves state-of-the-art attack success rates on commercial I2V models, such as Open-Sora 2.0 and CogVideoX. This work provides a critical tool for probing and mitigating multimodal vulnerabilities, laying a foundation for building more robust video generation systems.
PaperID: 1108,   Poster  https://arxiv.org/pdf/2504.10190    
Authors: Kaushik S, Paul Henderson, Fani Deligianni
Title: Differentially Private 2D Human Pose Estimation
Abstract: Human pose estimation (HPE) underpins critical applications in healthcare, activity recognition, and human-computer interaction. However, the privacy implications of processing sensitive visual data present significant deployment barriers in critical domains. Conventional anonymization techniques offer weak protection and are unquantifiable, while Differential Privacy (DP) provides formal guarantees but often results in steep performance costs. We introduce the first unified framework for differentially private 2D Human Pose Estimation (2D-HPE) that achieves strong privacy-utility trade-offs for structured visual prediction through complementary noise mitigation mechanisms. Our Feature-Projective DP integrates: (1) subspace projection that reduces noise variance by a factor of k/p by restricting gradient updates to a k-principal subspace within the full p-dimensional parameter space, and (2) feature-level privacy, which selectively privatizes sensitive features while retaining public visual cues. Together these mechanisms yield a multiplicative utility gain under formal privacy constraints. We further propose a feature-projective hybrid that combines both mechanisms within a single post-processing framework. Extensive experiments on MPII and HumanART datasets across privacy budgets (ε ∈ {0.2, 0.4, 0.6, 0.8}), clipping thresholds (C ∈ {0.01, 0.1, 1.0}), and training strategies demonstrate consistent improvements over vanilla DP-SGD. At ε = 0.8, our method achieves 82.61% PCKh@0.5, recovering 73% of the privacy-induced performance gap. Cross-dataset evaluation on HumanART confirms generalization (51.6 AP). Our study provides the first rigorous benchmark and a practical blueprint for privacy-preserving pose estimation in sensitive, real-world applications. Our source code will be made public on acceptance.
PaperID: 1109,   Poster  https://arxiv.org/pdf/2603.18495    
Authors: Jooyoung Kim, Wonje Choi, Younguk Song, Honguk Woo
Title: Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning
Abstract: Recent advances in Vision-Language Models (VLMs) have enabled video-instructed robotic programming, allowing agents to interpret video demonstrations and generate executable control code. We formulate video-instructed robotic programming as a cross-domain adaptation problem, where perceptual and physical differences between demonstration and deployment induce procedural mismatches. However, current VLMs lack the procedural understanding needed to reformulate causal dependencies and achieve task-compatible behavior under such domain shifts. We introduce NeSyCR, a neurosymbolic counterfactual reasoning framework that enables verifiable adaptation of task procedures, providing a reliable synthesis of code policies. NeSyCR abstracts video demonstrations into symbolic trajectories that capture the underlying task procedure. Given deployment observations, it derives counterfactual states that reveal cross-domain incompatibilities. By exploring the symbolic state space with verifiable checks, NeSyCR proposes procedural revisions that restore compatibility with the demonstrated procedure. NeSyCR achieves a 31.14% improvement in task success over the strongest baseline, Statler, showing robust cross-domain adaptation across both simulated and real-world manipulation tasks.
PaperID: 1110,   Poster  https://arxiv.org/pdf/2511.19065    
Authors: JinYoung Kim, Hyojun Go, Lea Bogensperger, Julius Erbach, Nikolai Kalischek, Federico Tombari, Konrad Schindler, Dominik Narnhofer
Title: Understanding, Accelerating, and Improving MeanFlow Training
Abstract: MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: with the same DiT-XL backbone, our method reaches an FID of 2.87 on 1-NFE ImageNet 256×256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5× shorter training time, or with a smaller DiT-L backbone.
PaperID: 1111,   Poster  https://arxiv.org/pdf/2510.23588    
Authors: GuangTing Zheng, Qinyu Zhao, Tao Yang, Fei Xiao, Zhijie Lin, Jie Wu, Jiajun Deng, Yanyong Zhang, Rui Zhu
Title: FARMER: Flow AutoRegressive Transformer over Pixels
Abstract: Directly modeling the explicit likelihood of the raw data distribution is a key topic in machine learning, and it has achieved scaling success in Large Language Models through autoregressive modeling. However, continuous AR modeling over visual pixel data suffers from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NFs) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is implicitly modeled by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference speed and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.
PaperID: 1112,   Poster  https://arxiv.org/pdf/2603.09771    
Authors: Soroush Seifi, Simon Gardier, Vaggelis Dorovatas, Daniel Olmeda Reino, Rahaf Aljundi
Title: Ego: Embedding-Guided Personalization of Vision-Language Models
Abstract: AI assistants that support humans in daily life are becoming increasingly feasible, driven by the rapid advancements in multimodal language models. A key challenge lies in overcoming the generic nature of these models to deliver personalized experiences. Existing approaches to personalizing large vision language models often rely on additional training stages, which limit generality and scalability, or on engineered pipelines with external pretrained modules, which hinder deployment efficiency. In this work, we propose an efficient personalization method that leverages the model’s inherent ability to capture personalized concepts. Specifically, we extract visual tokens that predominantly represent the target concept by utilizing the model’s internal attention mechanisms. These tokens serve as a memory of that specific concept, enabling the model to recall and describe it when it appears in test images. We conduct a comprehensive and unified evaluation of our approach and SOTA methods across various personalization settings including single-concept, multi-concept, and video personalization, demonstrating strong performance gains with minimal personalization overhead.
PaperID: 1113,   Poster  https://arxiv.org/pdf/2509.23724    
Authors: Lars Doorenbos, Federico Spurio, Jürgen Gall
Title: Video Panels for Long Video Understanding
Abstract: Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long context modeling of VLMs by introducing novel modules and additional complexity. In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models. To this end, we propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing VLMs. Extensive experiments on five established benchmarks across a wide range of model architectures, sizes, and context windows confirm the consistency of our approach. For the TimeScope (Long) dataset, which has the longest videos, the accuracy for video question answering is improved by up to 19.4%. Overall, our method raises the bar for long video understanding models. We will make our code available upon acceptance.
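As a rough illustration of the panel idea described in this abstract, the sketch below tiles several frames into one grid image. It is a minimal sketch under stated assumptions: frames are plain nested lists of pixel values of identical size, and the function name `make_panel`, the row-major layout, and the zero padding are hypothetical choices, not details taken from the paper. Any downsampling of frames before tiling, which is where a spatial-for-temporal trade-off would appear in practice, is left out.

```python
from typing import List

Frame = List[List[int]]  # one grayscale frame as rows of pixel values


def make_panel(frames: List[Frame], cols: int) -> Frame:
    """Tile equally sized frames into one image, row-major, `cols` per row.

    Hypothetical helper for illustration only; empty cells in the last
    panel row are padded with zero-valued frames.
    """
    if not frames or cols <= 0:
        raise ValueError("need at least one frame and a positive column count")
    height, width = len(frames[0]), len(frames[0][0])
    rows = (len(frames) + cols - 1) // cols  # ceiling division
    blank: Frame = [[0] * width for _ in range(height)]
    padded = frames + [blank] * (rows * cols - len(frames))

    panel: Frame = []
    for r in range(rows):
        row_frames = padded[r * cols:(r + 1) * cols]
        for y in range(height):
            # Concatenate the y-th pixel row of every frame in this panel row.
            line: List[int] = []
            for frame in row_frames:
                line.extend(frame[y])
            panel.append(line)
    return panel
```

With three 2x2 frames and `cols=2`, the result is a single 4x4 image whose last cell is padding, so one image slot carries three time steps.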
PaperID: 1114,   Poster  https://arxiv.org/pdf/2509.15496    
Authors: Shen Sang, Tiancheng Zhi, Tianpei Gu, Jing Liu, Linjie Luo
Title: Lynx: Towards High-Fidelity Personalized Video Generation
Abstract: We present Lynx, a high-fidelity model for personalized video synthesis from a single input image. Built on an open-source Diffusion Transformer (DiT) foundation model, Lynx introduces two lightweight adapters to ensure identity fidelity. The ID-adapter employs a Perceiver Resampler to convert ArcFace-derived facial embeddings into compact identity tokens for conditioning, while the Ref-adapter integrates dense VAE features from a frozen reference pathway, injecting fine-grained details across all transformer layers through cross attention. These modules collectively enable robust identity preservation while maintaining temporal coherence and visual realism. Through evaluation on a curated benchmark of 40 subjects and 20 unbiased prompts, which yielded 800 test cases, Lynx has demonstrated superior face resemblance, competitive prompt following, and strong video quality, thereby advancing the state of personalized video generation. Code and models will be released publicly upon publication.
PaperID: 1115,   Poster  https://arxiv.org/pdf/2504.10833    
Authors: Shubham Kumar, Narendra Ahuja
Title: Measuring the (Un)Faithfulness of Concept-Based Explanations
Abstract: Deep vision models perform input-output computations that are hard to interpret. Concept-based explanation methods (CBEMs) increase interpretability by re-expressing parts of the model with human-understandable semantic units, or concepts. Checking if the derived explanations are faithful, that is, whether they represent the model's internal computation, requires a surrogate that combines concepts to compute the output. Simplifications made for interpretability inevitably reduce faithfulness, resulting in a tradeoff between the two. State-of-the-art unsupervised CBEMs (U-CBEMs) have reported increasingly interpretable concepts, while also being more faithful to the model. However, we observe that the reported improvement in faithfulness artificially results from either (1) using overly complex surrogates, which introduces an unmeasured cost to the explanation's interpretability, or (2) relying on deletion-based approaches that, as we demonstrate, do not properly measure faithfulness. We propose Surrogate Faithfulness (SURF), which (1) replaces prior complex surrogates with a simple, linear surrogate that measures faithfulness without changing the explanation's interpretability and (2) introduces well-motivated metrics that assess loss across all output classes, not just the predicted class. We validate SURF with a measure-over-measure study by proposing a simple sanity check (explanations with random concepts should be less faithful), which prior surrogates fail. SURF enables the first reliable faithfulness benchmark of U-CBEMs, revealing that many visually compelling U-CBEMs are not faithful. Code to be released.
PaperID: 1116,   Poster  https://arxiv.org/pdf/2512.14008    
Authors: Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen
Title: Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Models
Abstract: Masked Discrete Diffusion Models (MDMs) have achieved strong performance across a wide range of multimodal tasks, including image understanding, generation, and editing. However, their inference speed remains suboptimal due to the need to repeatedly process redundant masked tokens at every sampling step. In this work, we propose Sparse-LaViDa, a novel modeling framework that dynamically truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. To preserve generation quality, we introduce specialized register tokens that serve as sparse representations for the truncated tokens. Furthermore, to ensure consistency between training and inference, we design a specialized attention mask that faithfully matches the truncated sampling procedure during training. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2× speedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining generation quality.
PaperID: 1117,   Poster  https://arxiv.org/pdf/2602.22394    
Authors: Cheng Shi, Yizhou Yu, Sibei Yang
Title: Vision Transformers Need More Than Registers
Abstract: Vision Transformers (ViTs), when pretrained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks. Through systematic analysis of artifacts in ViTs, we find that their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior. All the code and weights will be made publicly available.
PaperID: 1118,   Poster  https://arxiv.org/pdf/2603.02413    
Authors: Filippo Ghilotti, Edoardo Palladin, Samuel Brucker, Adam Sigal, Mario Bijelic, Felix Heide
Title: TruckDrive: Long-Range Autonomous Highway Driving Dataset
Abstract: Safe highway autonomy for heavy trucks remains an open and unsolved challenge: due to long braking distances, scene understanding of hundreds of meters is required for anticipatory planning and to allow safe braking margins. However, existing driving datasets primarily cover urban scenes, with perception effectively limited to short ranges of only up to 100 meters. To address this gap, we introduce TruckDrive, a highway-scale multimodal driving dataset, captured with a sensor suite purpose-built for long-range sensing: seven long-range FMCW LiDARs measuring range and radial velocity, three high-resolution short-range LiDARs, eleven 8MP surround cameras with varying focal lengths and ten 4D FMCW radars. The dataset offers 475 thousand samples with 165 thousand densely annotated frames for driving perception benchmarking up to 1,000 meters for 2D detection and 400 meters for 3D detection, depth estimation, tracking, planning and end-to-end driving over 20-second sequences at highway speeds. We find that state-of-the-art autonomous driving models do not generalize to ranges beyond 150 meters, with drops between 31% and 99% in 3D perception tasks, exposing a systematic long-range gap that current architectures and training signals cannot close.
PaperID: 1119,   Poster  https://arxiv.org/pdf/2601.11404    
Authors: Linqing Zhong, Yi Liu, Yifei Wei, Ziyu Xiong, Si Liu, Guanghui Ren
Title: ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models
Abstract: Vision-Language-Action (VLA) models have emerged as essential generalist robot policies for diverse manipulation tasks, conventionally relying on directly translating multimodal inputs into actions via Vision-Language Model (VLM) embeddings. Recent advancements have introduced explicit intermediary reasoning, such as subtask prediction (language) or goal image synthesis (vision), to guide action generation. However, these intermediate reasoning steps are often indirect and inherently limited in their capacity to convey the full, granular information required for precise action execution. Instead, we posit that the most effective form of reasoning is one that deliberates directly in the action space. We introduce Action Chain-of-Thought (ACoT), a paradigm where the reasoning process itself is formulated as a structured sequence of coarse action intents that guide the final policy. In this paper, we propose ACoT-VLA, a novel architecture that materializes the ACoT paradigm. Specifically, we introduce two complementary components: an Explicit Action Reasoner (EAR) and an Implicit Action Reasoner (IAR). The former proposes coarse reference trajectories as explicit action-level reasoning steps, while the latter extracts latent action priors from internal representations of multimodal input, co-forming an ACoT that conditions the downstream action head to enable grounded policy learning. Extensive experiments in real-world and simulation environments demonstrate the superiority of our proposed method, which achieves 98.45%, 84.14% and 47.4% on LIBERO, LIBERO-Plus and VLABench, respectively.
PaperID: 1120,   Poster  https://arxiv.org/pdf/2602.20208    
Authors: Longhua Li, Lei Qi, Qi Tian, Xin Geng
Title: Model Merging in the Essential Subspace
Abstract: Model merging aims to integrate multiple task-specific fine-tuned models derived from a shared pre-trained checkpoint into a single multi-task model without additional training. Despite extensive research, task interference remains a major obstacle that often undermines the performance of merged models. In this paper, we propose ESM (Essential Subspace Merging), a robust framework for effective model merging. We begin by performing Principal Component Analysis (PCA) on feature shifts induced by parameter updates. The resulting principal directions span an essential subspace that dominantly influences feature representations. Each task's parameter update matrix is projected onto its respective essential subspace for low-rank decomposition before merging. This methodology mitigates inter-task interference while preserving core task-specific functionality. Furthermore, we introduce a multi-level polarized scaling strategy that amplifies parameters containing critical knowledge and suppresses redundant ones, preventing essential knowledge from being overwhelmed during fusion. Extensive experiments across multiple task sets and model scales demonstrate that our method achieves state-of-the-art performance in multi-task model merging.
PaperID: 1121,   Poster  https://arxiv.org/pdf/2604.08077    
Authors: Handong Li, Zikang Liu, Longteng Guo, Tongtian Yue, Yepeng Tang, Xinxin Zhu, Chuanyang Zheng, Ziming Wang, Zhibin Wang, Jun Song, YuCheng YuCheng, Bo Zheng, Jing Liu
Title: Adaptive Sparsity for Efficient Long-Video Understanding
Abstract: Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which dynamically selects a subset of relevant video cubes to attend to for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark significantly reduces computational load, by up to 57% in FLOPs, while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.
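The adaptive behavior of a Top-p rule can be sketched as follows. This is a generic top-p (nucleus-style) selection over already normalized relevance scores, offered only as an illustration: the abstract does not specify how entropy enters the AdaSpark mechanism, and the function name `top_p_select` and its inputs are hypothetical.

```python
from typing import List


def top_p_select(probs: List[float], p: float) -> List[int]:
    """Return indices of the smallest high-probability set with mass >= p.

    `probs` is assumed to be a normalized distribution (for example,
    softmaxed relevance scores of cubes or tokens). Illustrative only.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Visit candidates from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected: List[int] = []
    mass = 0.0
    for i in order:
        selected.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return selected
```

A peaked distribution such as `[0.9, 0.05, 0.05]` keeps a single element at `p=0.7`, while a flat one such as `[0.25, 0.25, 0.25, 0.25]` keeps three, so the kept budget grows with the spread of the scores rather than being fixed in advance.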
PaperID: 1122,   Poster  https://arxiv.org/pdf/2604.11064    
Authors: Wei Li, Hangjie Yuan, Zixiang Zhao, Borui Kang, Ziwei Liu, Tao Feng
Title: A Faster Path to Continual Learning
Abstract: Continual Learning (CL) aims to train neural networks on a dynamic stream of tasks without forgetting previously learned knowledge. Among optimization-based approaches, C-Flat has emerged as a promising solution due to its plug-and-play nature and its ability to encourage uniformly low-loss regions for both new and old tasks. However, C-Flat requires three additional gradient computations per iteration, imposing substantial overhead on the optimization process. In this work, we propose C-Flat Turbo, a faster yet stronger optimizer that significantly reduces the training cost. We show that the gradients associated with first-order flatness contain direction-invariant components relative to the proxy-model gradients, enabling us to skip redundant gradient computations in the perturbed ascent steps. Moreover, we observe that these flatness-promoting gradients progressively stabilize across tasks, which motivates a linear scheduling strategy with an adaptive trigger to allocate larger turbo steps for later tasks. Experiments demonstrate that C-Flat Turbo accelerates a wide range of CL methods by at least 1× (and up to 1.25×) compared to C-Flat, while achieving comparable or even improved accuracy.
PaperID: 1123,   Poster  https://arxiv.org/pdf/2505.17353    
Authors: Minseo Kim, Axel Levy, Gordon Wetzstein
Title: Dual Ascent Diffusion for Inverse Problems
Abstract: Ill-posed inverse problems are fundamental in many domains, ranging from astrophysics to medical imaging. Emerging diffusion models provide a powerful prior for solving these problems. Existing maximum-a-posteriori (MAP) or posterior sampling approaches, however, rely on different computational approximations, leading to inaccurate or suboptimal samples. To address this issue, we introduce a new approach to solving MAP problems with diffusion model priors using a dual ascent optimization framework. Our framework achieves better image quality as measured by various metrics for image restoration problems, it is more robust to high levels of measurement noise, it is faster, and it estimates solutions that represent the observations more faithfully than the state of the art.
PaperID: 1124,   Poster  https://arxiv.org/pdf/2512.21627    
Authors: Junjun Hu, Xinda Xue, Botao Ren, Minghua Luo, Jintao Chen, Haochen Bai, Liangliang You, Mu Xu
Title: AstraNav-Memory: Contexts Compression for Long Memory
Abstract: Lifelong embodied navigation requires agents to accumulate, retain, and exploit spatial–semantic experience across tasks, enabling efficient exploration in novel environments and rapid goal reaching in familiar ones. While object-centric memory is interpretable, it depends on detection and reconstruction pipelines that limit robustness and scalability. We propose an image-centric memory framework that achieves long-term implicit memory via an efficient visual context compression module coupled end-to-end with a Qwen2.5-VL–based navigation policy. Built atop a ViT backbone with frozen DINOv3 features and lightweight PixelUnshuffle+Conv blocks, our visual tokenizer reduces native vision tokens by roughly 10–20×, representing each image with about 30 tokens and allowing the agent to maintain hundreds of historical frames within a single context. Experimental results on GOAT-Bench and HM3D-OVON show that our method achieves state-of-the-art navigation performance, improving exploration in unfamiliar environments and shortening paths in familiar ones. Ablation studies further reveal that moderate compression provides the best balance between efficiency and accuracy. These findings position compressed image-centric memory as a practical and scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.
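To make the token-reduction step concrete, here is a minimal sketch of a pixel-unshuffle (space-to-depth) rearrangement in pure Python. It only shows how a downscale factor `r` trades `r * r` times fewer spatial positions for `r * r` times more channels; the convolution layers, the actual factor, and the channel ordering used in AstraNav-Memory are not given in the abstract, so the ordering and the function name `pixel_unshuffle` below are assumptions.

```python
from typing import List

Image = List[List[List[float]]]  # indexed as [channel][row][col]


def pixel_unshuffle(image: Image, r: int) -> Image:
    """Fold each r x r spatial block into channels (space-to-depth).

    Input shape (C, H, W) becomes (C * r * r, H // r, W // r). The
    channel order (c, then row offset, then column offset) is an
    assumed convention for illustration.
    """
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    if r <= 0 or height % r or width % r:
        raise ValueError("height and width must be divisible by a positive r")
    out: Image = []
    for c in range(channels):
        for i in range(r):        # row offset inside each block
            for j in range(r):    # column offset inside each block
                out.append([
                    [image[c][h * r + i][w * r + j] for w in range(width // r)]
                    for h in range(height // r)
                ])
    return out
```

For a 4x4 single-channel input and `r=2`, the output has 4 channels of size 2x2, so the number of spatial positions (and hence tokens, if one token is formed per position) drops from 16 to 4.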
PaperID: 1125,   Poster  https://arxiv.org/pdf/2603.05768    
Authors: Levy Chaves, Chao Zhou, Rebekka Burkholz, Eduardo Valle, Sandra Avila
Title: Bridging Domains through Subspace-Aware Model Merging
Abstract: Model merging integrates multiple task-specific models into a single consolidated one. Recent research has made progress in improving merging performance for in-distribution or multi-task scenarios, but domain generalization in model merging remains underexplored. We investigate how merging models fine-tuned on distinct domains affects generalization to unseen domains. Through an analysis of parameter competition in the task matrix using singular value decomposition (SVD), we show that merging models trained under different distribution shifts induces stronger conflicts between their subspaces compared to traditional multi-task settings. To mitigate this issue, we propose SCORE (Subspace COnflict-Resolving mErging), a method designed to alleviate such singular subspace conflicts. SCORE finds a shared orthogonal basis by computing the principal components of the concatenated leading singular vectors of all models. It then projects each task matrix into the shared basis, pruning off-diagonal components to remove conflicting singular directions. SCORE consistently outperforms, on average, existing model merging approaches in domain generalization settings across a variety of architectures and model scales, demonstrating its effectiveness and scalability.
PaperID: 1126,   Poster  https://arxiv.org/pdf/2604.12665    
Authors: Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang, Xinchao Wang
Title: Hypergraph-State Collaborative Reasoning for Multi-Object Tracking
Abstract: Motion reasoning serves as the cornerstone of multi-object tracking (MOT), as it enables consistent association of targets across frames. However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear. To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. By allowing objects with similar motion states to mutually constrain and refine each other, our framework stabilizes noisy trajectories and infers plausible motion continuity even when a target is occluded. To realize this concept, we design HyperSSM, an architecture that integrates Hypergraph computation and a State Space Model (SSM) for unified spatial–temporal reasoning. The Hypergraph module captures spatial motion correlations through dynamic hyperedges, while the SSM enforces temporal smoothness via structured state transitions. This synergistic design enables simultaneous optimization of spatial consensus and temporal coherence, resulting in robust and stable motion estimation. Extensive experiments on four mainstream and diverse benchmarks (MOT17, MOT20, DanceTrack, and SportsMOT) covering various motion patterns and scene complexities demonstrate that our approach achieves state-of-the-art performance across a wide range of tracking scenarios.
PaperID: 1127,   Poster  https://arxiv.org/pdf/2604.10032    
Authors: CHI ZHANG, Jingpu Cheng, Zhixian Wang, Ping Liu
Title: Closed-Form Concept Erasure via Double Projections
Abstract: Modern generative models, including diffusion-based architectures, produce highly creative outputs but also pose safety and ethical risks. These concerns have led to growing interest in concept erasure, the process of removing unwanted concepts from model representations. Existing approaches often achieve strong erasure performance but rely on iterative optimization and may inadvertently distort unrelated concepts. In this work, we present a simple yet principled alternative: a linear transformation framework that achieves concept erasure analytically, without any training. Our method adapts a pretrained model through two sequential, closed-form steps: first, computing a proxy projection of the target concept, and second, applying a constrained transformation within the left null space of known concept directions. This design yields a deterministic and geometrically interpretable procedure for safe, efficient, and theory-grounded concept removal. Across a wide range of experiments, including object and style erasure on multiple Stable Diffusion variants and the flow-matching model (FLUX), our approach matches or surpasses the performance of state-of-the-art methods while preserving non-target concepts more faithfully. Requiring only a few seconds to apply, it offers a lightweight and drop-in tool for controlled model editing, advancing the goal of safer and more responsible generative models.
PaperID: 1128,   Poster  https://arxiv.org/pdf/2511.19995    
Authors: Jiyeon Han, Ali Mahdavi Amiri, Hao Zhang, Haedong Jeong
Title: CREward: A Type-Specific Creativity Reward Model
Abstract: Creativity is a complex phenomenon. When it comes to representing and assessing creativity, treating it as a single undifferentiated quantity would appear naive and underwhelming. In this work, we learn the first type-specific creativity reward model, coined CREward, which spans three creativity “axes,” geometry, material, and texture, to allow us to view creativity through the lens of the image formation pipeline. To build our reward model, we first conduct a human benchmark evaluation to capture human perception of creativity for each type across various creative images. We then analyze the correlation between human judgments and predictions by large vision-language models (LVLMs), confirming that LVLMs exhibit strong alignment with human perception. Building on this observation, we collect LVLM-generated labels to train our CREward model that is applicable to both evaluation and generation of creative images. We explore three applications of CREward: creativity assessment, explainable creativity, and creative sample acquisition for both human design inspiration and guiding creative generation through low-rank adaptation.
PaperID: 1129,   Poster  https://arxiv.org/pdf/2512.20157    
Authors: Sofian Chaybouti, Sanath Narayan, Yasser Dahou, Phúc H. Lê Khắc, Ankit Singh, Ngoc Huynh, Wamiq Reyaz Para, Hilde Kuehne, Hakim Hacid
Title: AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model
Abstract: Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce Agglomerative Mixture-of-Experts Vision Foundation Models (AMoE), which distill knowledge from SigLIP2 and DINOv3 simultaneously into a Mixture-of-Experts student. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching that packs varying-resolution images into sequences with uniform token budgets stabilizes representation learning across resolutions without sacrificing performance, (3) hierarchical clustering and sampling of training data (typically reserved for self-supervised learning) substantially improves sample efficiency over random sampling for multi-teacher distillation, and (4) the resulting representations transfer effectively to early-fusion Grounding-VLMs, outperforming models trained from scratch. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation. Instantiated in a Mixture-of-Experts, our AMoE initializes an early-fusion Grounding-VLM that replaces the conventional ViT→LLM stack, demonstrating improved performance compared to a model trained from scratch. We release OpenLVD200M and distilled checkpoints.
PaperID: 1130,   Poster  https://arxiv.org/pdf/2603.13395    
Authors: Chiensheng Chiang, Kuan-Hsun Tu, Jia-Wei Liao, Cheng-Fu Chou, Tsung-Wei Ke
Title: COT-FM: Cluster-wise Optimal Transport Flow Matching
Abstract: We introduce COT-FM, a general framework that reshapes the probability path in Flow Matching (FM) to achieve faster and more reliable generation. FM models often produce curved trajectories due to random or batch-wise couplings, which increase discretization error and reduce sample quality. COT-FM fixes this by clustering target samples and assigning each cluster a dedicated source distribution obtained by reversing pretrained FM models. This divide-and-conquer strategy yields more accurate local transport and significantly straighter vector fields, all without changing the model architecture. As a plug-and-play approach, COT-FM consistently accelerates sampling and improves generation quality across 2D datasets, image benchmarks, and robotic manipulation tasks.
PaperID: 1131,   Poster  https://arxiv.org/pdf/2604.02654    
Authors: Yuqing Huang, Liting Lin, Weijun Zhuang, Zhenyu He, Xin Li
Title: Drift-Resilient Temporal Priors for Visual Tracking
Abstract: Temporal information is crucial for visual tracking, but existing multi-frame trackers are vulnerable to model drift caused by naively aggregating noisy historical predictions. In this paper, we introduce DTPTrack, a lightweight and generalizable module designed to be seamlessly integrated into existing trackers to suppress drift. Our framework consists of two core components: (1) a Temporal Reliability Calibrator (TRC) mechanism that learns to assign a per-frame reliability score to historical states, filtering out noise while anchoring on the ground-truth template; and (2) a Temporal Guidance Synthesizer (TGS) module that synthesizes this calibrated history into a compact set of dynamic temporal priors to provide predictive guidance. To demonstrate its versatility, we integrate DTPTrack into three diverse tracking architectures (OSTrack, ODTrack, and LoRAT) and show consistent, significant performance gains across all baselines. Our best-performing model, built upon an extended LoRATv2 backbone, sets a new state-of-the-art on several benchmarks, achieving a 77.5% Success rate on LaSOT and an 80.3% AO on GOT-10k.
PaperID: 1132,   Poster  https://arxiv.org/pdf/2511.11005    
Authors: SungHeon Jeong, Ryozo Masukawa, Jihong Park, Sanggeon Yun, Wenjun Huang, Hanning Chen, Mahdi Imani, Mohsen Imani
Title: Draft and Refine with Visual Experts
Abstract: While recent Large Vision–Language Models (LVLMs) exhibit impressive multimodal reasoning abilities, they often produce ungrounded, hallucinated responses by overrelying on linguistic priors rather than visual evidence. This critical limitation arises from the lack of a quantitative measure of how much these models actually rely on visual inputs during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a novel question-conditioned utilization metric. This metric quantifies the model’s actual reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific evidence, and then assessing dependence through relevance-based probabilistic masking. Guided by this metric, the DnR agent refines its initial draft through targeted feedback from external visual experts. Each expert’s output (e.g., boxes, masks) is rendered as visual cues on the image, and the VLM is re-queried to select the response that yields the greatest improvement in utilization. This process strengthens visual grounding of predictions without retraining or architectural changes. Experiments across a broad range of VQA and captioning benchmarks demonstrate consistent accuracy gains and reduced hallucination. These results show that quantifying visual utilization provides a principled path for designing more interpretable and evidence-driven multimodal agent systems that effectively leverage visual experts.
PaperID: 1133,   Poster  https://arxiv.org/pdf/2510.06679    
Authors: Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, JiyangLiu JiyangLiu, Jingyao Li, Haoru Tan, WU Sitong, Chengyao Wang, Yitong Wang, Bei Yu, Jiaya Jia
Title: DreamOmni2: Multimodal Instruction-based Generation and Editing
Abstract: Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based editing. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. Additionally, we introduce joint training with the VLM and our generation/editing model to better process complex instructions. We also propose comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 has achieved impressive results. Models and code will be released.
PaperID: 1134,   Poster  https://arxiv.org/pdf/2604.14925    
Authors: Dongsheng Wang, Jinsen Zhang, Dawei Su, Hui Huang
Title: Improving Sparse Autoencoder with Dynamic Attention
Abstract: Recently, sparse autoencoders (SAEs) have emerged as a promising technique for interpreting activations in foundation models by disentangling features into a sparse set of concepts. However, identifying the optimal level of sparsity for each neuron remains challenging in practice: excessive sparsity can lead to poor reconstruction, whereas insufficient sparsity may harm interpretability. While existing activation functions such as ReLU and TopK provide certain sparsity guarantees, they typically require additional sparsity regularization or cherry-picked hyperparameters. We show in this paper that dynamically sparse attention mechanisms using sparsemax can bridge this trade-off, due to their ability to determine the activation numbers in a data-dependent manner. Specifically, we first explore a new class of SAEs based on the cross-attention architecture with the latent features as queries and the learnable dictionary as the key and value matrices. To encourage sparse pattern learning, we employ a sparsemax-based attention strategy that automatically infers a sparse set of elements according to the complexity of each neuron, resulting in a more flexible and general activation function. Through comprehensive evaluation and visualization, we show that our approach successfully achieves lower reconstruction loss while producing high-quality concepts, particularly in top-n classification tasks.
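As background for the abstract above, the following is a minimal sketch of the standard sparsemax transform (the Euclidean projection of a score vector onto the probability simplex), which is what lets the number of active entries vary with the input. It is not the paper's attention-based SAE; the function name and the example scores are illustrative assumptions.

```python
def sparsemax(scores):
    """Project a list of scores onto the probability simplex (sparsemax).

    Unlike softmax, the result can contain exact zeros, and how many
    entries stay non-zero depends on the scores themselves.
    """
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The support condition holds for a prefix of the sorted scores;
        # the last k that satisfies it determines the threshold tau.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]


if __name__ == "__main__":
    # Peaked scores keep one active entry, flat scores keep all three.
    print(sparsemax([2.0, 1.0, 0.1]))
    print(sparsemax([0.5, 0.4, 0.1]))
```

With a fixed-k rule such as TopK the number of active entries would be the same for both inputs; here it is one for the peaked scores and three for the flat ones.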
PaperID: 1135,   Poster  https://arxiv.org/pdf/2603.26900    
Authors: Sasidharan Mahalingam, Rachel Brown, Atul Ingle
Title: Computer Vision with a Superpixelation Camera
Abstract: Conventional cameras generate a lot of data that can be challenging to process in resource-constrained applications. Usually, cameras generate data streams on the order of the number of pixels in the image. However, most of this captured data is redundant for many downstream computer vision algorithms. We propose a novel camera design, which we call SuperCam, that adaptively processes captured data by performing superpixel segmentation on the fly. We show that SuperCam performs better than current state-of-the-art superpixel algorithms under memory-constrained situations. We also compare how well SuperCam performs when the compressed data is used for downstream computer vision tasks. Our results demonstrate that the proposed design provides superior output for image segmentation, object detection, and monocular depth estimation in situations where the available memory on the camera is limited. We posit that superpixel segmentation will play a crucial role as more computer vision inference models are deployed in edge devices. SuperCam would allow computer vision engineers to design more efficient systems for these applications.
PaperID: 1136,   Poster  https://arxiv.org/pdf/2603.13843    
Authors: Lv Bo, Qingwang Zhang, Le Wu, Yuanyuan Li, YINGYING ZHU
Title: MOGeo: Beyond One-to-One Cross-View Object Geo-localization
Abstract: Cross-View Object Geo-Localization (CVOGL) aims to locate an object of interest in a query image within a corresponding satellite image. Existing methods typically assume that the query image contains only a single object, which does not align with the complex, multi-object geo-localization requirements in real-world applications, making them unsuitable for practical scenarios. To bridge the gap between the realistic setting and the existing task, we propose a new task, called Cross-View Multi-Object Geo-Localization (CVMOGL). To advance the CVMOGL task, we first construct a benchmark, CMLocation, which includes two datasets: CMLocation-V1 and CMLocation-V2. Furthermore, we propose a novel cross-view multi-object geo-localization method, MOGeo, and benchmark it against existing state-of-the-art methods. Extensive experiments are conducted under various application scenarios to validate the effectiveness of our method. The results demonstrate that cross-view geo-localization in the more realistic setting remains a challenging problem, encouraging further research in this area.
PaperID: 1137,   Poster  https://arxiv.org/pdf/2511.23227    
Authors: Lihan Li, Haofeng Zhong, Rui Bu, Mingchao Sun, Wenzheng Chen, Baoquan Chen, Yangyan Li
Title: PointCNN++: Performant Convolution on Native Points
Abstract: Existing convolutional learning methods for 3D point cloud data are divided into two paradigms: point-based methods that preserve geometric precision but often face performance challenges, and voxel-based methods that achieve high efficiency through quantization at the cost of geometric fidelity. This loss of precision is a critical bottleneck for tasks such as point cloud registration. We propose PointCNN++, a novel architectural design that fundamentally mitigates this precision-performance trade-off. It generalizes sparse convolution from voxels to points, treating voxel-based convolution as a specialized, degraded case of our more general point-based convolution. First, we introduce a point-centric convolution where the receptive field is centered on the original, high-precision point coordinates. Second, to make this high-fidelity operation performant, we design a computational strategy that operates natively on points. We formulate the convolution on native points as a Matrix-Vector Multiplication and Reduction (MVMR) problem, for which we develop a dedicated, highly optimized GPU kernel. Experiments demonstrate that PointCNN++ uses an order of magnitude less memory and is several times faster than representative point-based methods. Furthermore, when used as a simple replacement for the voxel-based backbones it generalizes, it significantly improves point cloud registration accuracies while proving both more memory-efficient and faster. PointCNN++ shows that preserving geometric detail and achieving high performance are not mutually exclusive, paving the way for a new class of 3D learning with high fidelity and efficiency.
PaperID: 1138,   Poster  https://arxiv.org/pdf/2603.12915    
Authors: Kiseong Hong, JungKyoo Shin, Eunwoo Kim
Title: Stake the Points: Structure-Faithful Instance Unlearning
Abstract: Machine unlearning (MU) addresses privacy risks in pretrained models. The main goal of MU is to remove the influence of designated data while preserving the utility of retained knowledge. Achieving this goal requires preserving semantic relations among retained instances, which existing studies often overlook. We observe that without such preservation, models suffer from progressive structural collapse, undermining the deletion–retention balance. In this work, we propose a novel structure-faithful framework that introduces stakes, i.e., semantic anchors that serve as reference points to maintain the knowledge structure. By leveraging these anchors, our framework captures and stabilizes the semantic organization of knowledge. Specifically, we instantiate the anchors from language-driven attribute descriptions encoded by a semantic encoder (e.g., CLIP). We enforce preservation of the knowledge structure via structure-aware alignment and regularization: the former aligns the organization of retained knowledge before and after unlearning around anchors, while the latter regulates updates to structure-critical parameters. Results from image classification, retrieval, and face recognition show average gains of 32.9%, 22.5%, and 19.3% in performance, balancing the deletion–retention trade-off and enhancing generalization.
PaperID: 1139,   Poster  https://arxiv.org/pdf/2511.12898    
Authors: Zhiqi Li, Yuchen Sun, Greg Turk, Bo Zhu
Title: Functional Mean Flow in Hilbert Space
Abstract: We present Functional Mean Flow (FMF) as a one-step generative model defined in infinite-dimensional Hilbert space. FMF extends the one-step Mean Flow framework to functional domains by providing a theoretical formulation for Functional Flow Matching and a practical implementation for efficient training and sampling. We also introduce an x_1-prediction variant that improves stability over the original u-prediction form. The resulting framework is a practical one-step Flow Matching method applicable to a wide range of functional data generation tasks such as time series, images, PDEs, and 3D geometry.
PaperID: 1140,   Poster  https://arxiv.org/pdf/2603.20836    
Authors: Wenjun Huang, Shenghao Fu, Yian Jin, Yang Ni, Ziteng Cui, Hanning Chen, Yirui He, Yezi Liu, Sanggeon Yun, SungHeon Jeong, Ryozo Masukawa, William Chung, Mohsen Imani
Title: MERIT: Multi-domain Efficient RAW Image Translation
Abstract: RAW images captured by different camera sensors exhibit substantial domain shifts due to varying spectral responses, noise characteristics, and tone behaviors, complicating their direct use in downstream vision tasks. Prior methods address this by training one-to-one RAW-to-RAW translators for each source-target domain pair, but such approaches do not scale to real-world scenarios with multiple cameras. We introduce MERIT, the first unified framework for multi-domain RAW image translation, which leverages a single model to perform translations across arbitrary camera domains. To address domain-specific noise discrepancies, we propose a sensor-aware noise modeling loss that explicitly aligns the signal-dependent noise statistics of the generated images with those of the target domain. Additionally, we enhance the generator’s context modeling with a conditional multi-scale large kernel attention module, enabling efficient capture of both global illumination and fine-grained sensor cues. To support standardized evaluation, we construct MDRAW, a new dataset of paired and unpaired RAW images from five diverse camera sensors. Extensive experiments on existing and newly proposed benchmarks demonstrate that MERIT significantly outperforms prior models in both accuracy and scalability, offering a practical and generalizable solution to cross-domain RAW image harmonization.
PaperID: 1141,   Poster  https://arxiv.org/pdf/2512.20136    
Authors: Hyeongcheol Park, Jiyoung Seo, Jaewon Mun, Hogun Park, Wonmin Byeon, Sung June Kim, Hyeonsoo Im, JeungSub Lee, Sangpil Kim
Title: M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
Abstract: Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite this recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M$^3$KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct a multi-hop MMKG (M$^3$KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M$^3$KG-RAG significantly enhances MLLMs’ multimodal reasoning and grounding over existing approaches.
PaperID: 1142,   Poster  https://arxiv.org/pdf/2504.11434    
Authors: Yifan Ding, Xixi Liu, Jonas Unger, Gabriel Eilertsen
Title: Enhancing Out-of-Distribution Detection with Extended Logit Normalization
Abstract: Out-of-distribution (OOD) detection is essential for the safe deployment of machine learning models. Extensive work has focused on devising various scoring functions for detecting OOD samples, while only a few studies focus on training neural networks using certain model calibration objectives, which often lead to a compromise in predictive accuracy and support only limited choices of scoring functions. In this work, we first identify the feature collapse phenomenon in Logit Normalization (LogitNorm), then propose a novel hyperparameter-free formulation that significantly benefits a wide range of post-hoc detection methods. To be specific, we devise a feature distance-awareness loss term in addition to LogitNorm, termed ELogitNorm, which enables improved OOD detection and in-distribution (ID) confidence calibration. Extensive experiments across standard benchmarks demonstrate that our approach outperforms state-of-the-art training-time methods in OOD detection while maintaining strong ID classification accuracy.
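As background for the abstract above, here is a minimal sketch of the LogitNorm objective as it is commonly described: cross-entropy computed on logits that are divided by their L2 norm and a temperature. The function name, the temperature value 0.04, and the eps constant are illustrative assumptions. The additional distance-awareness term of ELogitNorm is not specified in the abstract and is therefore not shown.

```python
import math


def logitnorm_loss(logits, target, temperature=0.04, eps=1e-7):
    """Cross-entropy on L2-normalized, temperature-scaled logits.

    logits: list of floats for one sample; target: index of the true class.
    temperature and eps are illustrative values, not taken from the abstract.
    """
    norm = math.sqrt(sum(v * v for v in logits)) + eps
    scaled = [v / (norm * temperature) for v in logits]
    # Numerically stable log-sum-exp.
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_z - scaled[target]


if __name__ == "__main__":
    # Rescaling the logits leaves the loss (almost) unchanged, so the
    # network cannot lower the loss just by inflating logit magnitudes.
    print(logitnorm_loss([1.0, 2.0, 3.0], target=2))
    print(logitnorm_loss([10.0, 20.0, 30.0], target=2))
```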
PaperID: 1143,   Poster  https://arxiv.org/pdf/2511.16555    
Authors: Junpeng Jing, Weixun Luo, Ye Mao, Krystian Mikolajczyk
Title: Lite Any Stereo: Efficient Zero-Shot Stereo Matching
Abstract: Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong zero-shot generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% of their computational cost, setting a new standard for efficient stereo matching. Code will be made publicly available.
PaperID: 1144,   Poster  https://arxiv.org/pdf/2601.17900    
Authors: Shengjun Zhang, Min Chen, Yibo Wei, Mingyu Dong, Yueqi Duan
Title: Revisiting 3D Reconstruction Kernels as Low-Pass Filters
Abstract: 3D reconstruction aims to recover 3D signals from sampled discrete 2D pixels, with the goal of converging to continuous 3D spaces. In this paper, we revisit 3D reconstruction from the perspective of signal processing, identifying the periodic spectral extension induced by discrete sampling as the fundamental challenge. Previous 3D reconstruction kernels, such as Gaussians, Exponential functions, and Student's t distributions, serve as low-pass filters to isolate the baseband spectrum. However, their non-ideal low-pass property results in the overlap of high-frequency components with low-frequency components in the discrete-time signal’s spectrum. To this end, we introduce the Jinc kernel, whose magnitude drops instantaneously to zero exactly at the cutoff frequency, corresponding to the ideal low-pass filter. As the Jinc kernel suffers from slow decay in the spatial domain, we further propose modulated kernels to strike an effective balance, achieving superior rendering performance by reconciling spatial efficiency and frequency-domain fidelity. Experimental results demonstrate the effectiveness of our Jinc and modulated kernels.
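To illustrate the slow spatial decay mentioned in the abstract above, the sketch below evaluates a jinc function numerically and compares it with a Gaussian. It assumes the normalization jinc(x) = 2 J1(x) / x with jinc(0) = 1, which is one of several conventions and may differ from the paper's; the function names and the integration step count are illustrative choices, and the paper's modulated kernels are not shown.

```python
import math


def bessel_j1(x, steps=200):
    """Bessel function J1 via Bessel's integral, trapezoid rule.

    J1(x) = (1/pi) * integral from 0 to pi of cos(t - x*sin(t)) dt.
    """
    h = math.pi / steps
    total = 0.5 * (math.cos(0.0) + math.cos(math.pi - x * math.sin(math.pi)))
    for i in range(1, steps):
        t = i * h
        total += math.cos(t - x * math.sin(t))
    return total * h / math.pi


def jinc(x):
    """jinc(x) = 2*J1(x)/x, with the limit value 1 at x = 0 (assumed convention)."""
    if x == 0.0:
        return 1.0
    return 2.0 * bessel_j1(x) / x


def gaussian(x):
    """Unit-variance Gaussian profile, used only for comparison."""
    return math.exp(-0.5 * x * x)


if __name__ == "__main__":
    # The jinc side lobes decay polynomially, the Gaussian exponentially.
    for x in (0.0, 2.0, 5.0, 10.0, 21.0):
        print(x, jinc(x), gaussian(x))
```

The oscillating, slowly decaying tail is why a pure jinc kernel needs a large spatial footprint, which is the motivation the abstract gives for the modulated kernels.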
PaperID: 1145,   Poster  https://arxiv.org/pdf/2604.19257    
Authors: Hongyuan Liu, Bochao Zou, Qiankun Liu, Haochen Yu, Qi Mei, Jianfei Jiang, Chen Liu, Cheng Bi, Zhao Wang, Xueyang Zhang, Yifei Zhan, Jiansheng Chen, Huimin Ma
Title: Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
Abstract: Creating realistic and simulation-ready 3D assets is crucial for autonomous driving research and virtual environment construction. However, existing 3D vehicle generation methods are often trained on synthetic data with significant domain gaps from real-world distributions. The generated models often exhibit arbitrary poses and undefined scales, resulting in poor visual consistency when integrated into driving scenes. In this paper, we present Unposed-to-3D, a novel framework that learns to reconstruct 3D vehicles from real-world driving images using image-only supervision. Our approach consists of two stages. In the first stage, we train an image-to-3D reconstruction network using posed images with known camera parameters. In the second stage, we remove camera supervision and use a camera prediction head that directly estimates the camera parameters from unposed images. The predicted pose is then used for differentiable rendering to provide self-supervised photometric feedback, enabling the model to learn 3D geometry purely from unposed data. To ensure simulation readiness, we further introduce a scale-aware module to predict real-world size information, and a harmonization module that adapts the generated vehicles to the target driving scene with consistent lighting and appearance. Extensive experiments demonstrate that Unposed-to-3D effectively reconstructs realistic, pose-consistent, and harmonized 3D vehicle models from real-world images, providing a scalable path toward creating high-quality assets for driving scene simulation and digital twin environments.
PaperID: 1146,   Poster  https://arxiv.org/pdf/2603.16100    
Authors: Jonas Herzog, Yue Wang
Title: Reevaluating the Intra-modal Misalignment Hypothesis in CLIP
Abstract: Recent research has indicated that the embeddings generated by contrastive language-image training like CLIP may not be ideal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated similarities between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine the theoretical arguments and techniques that seek to demonstrate the misalignment. Our findings reveal that neither the distribution of cosine similarities nor few-shot or retrieval metrics serve as reliable indicators of misalignment. In fact, these metrics yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2), which indicates there is no intra-modal misalignment stemming from contrastive language-image training. We argue the observed phenomena can be explained without assuming a fundamental flaw in the image embedding space. Experiments on the commonly studied intra-modal tasks of retrieval and few-shot classification confirm that addressing supposed misalignment is unnecessary for achieving strong performance.
PaperID: 1147,   Poster  https://arxiv.org/pdf/2602.20053    
Authors: Jiahui Chen, Zehang Deng, Zeyu Zhang, Chaoyang Li, Lianchen Jia, Lifeng Sun
Title: Decoupling Defense Strategies for Robust Image Watermarking
Abstract: Deep learning-based image watermarking, while robust against conventional distortions, remains vulnerable to advanced adversarial and regeneration attacks. Conventional countermeasures, which jointly optimize the encoder and decoder via a noise layer, face two inevitable challenges: (1) a decrease in clean accuracy due to decoder adversarial training and (2) limited robustness due to simultaneous training of all three advanced attacks. To overcome these issues, we propose AdvMark, a novel two-stage fine-tuning framework that decouples the defense strategies. In stage 1, we address adversarial vulnerability via a tailored adversarial training paradigm that primarily fine-tunes the encoder while only conditionally updating the decoder. This approach learns to move the image into a non-attackable region, rather than modifying the decision boundary, thus preserving clean accuracy. In stage 2, we tackle distortion and regeneration attacks via direct image optimization. To preserve the adversarial robustness gained in stage 1, we formulate a principled, constrained image loss with theoretical guarantees, which balances the deviation from cover and previous encoded images. We also propose a quality-aware early-stop to further guarantee the lower bound of visual quality. Extensive experiments demonstrate that AdvMark outperforms prior methods with the highest image quality and comprehensive robustness, i.e., up to 29%, 33% and 46% accuracy improvement for distortion, regeneration and adversarial attacks, respectively.
PaperID: 1148,   Poster  https://arxiv.org/pdf/2602.19900    
Authors: Junyi Wang, Yudong Guo, Boyang Guo, Shengming Yang, Juyong Zhang
Title: ExpPortrait: Expressive Portrait Generation via Personalized Representation
Abstract: While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, have limited disentanglement capabilities and cannot express personalized details due to their sparse or low-rank representation. Therefore, existing methods based on these models struggle to accurately preserve subject identity and expressions, hindering the generation of highly expressive portrait videos. To overcome these limitations, we propose a high-fidelity personalized head representation that more effectively disentangles expression and identity. This representation captures both static, subject-specific global geometry and dynamic, expression-related details. Furthermore, we introduce an expression transfer module to achieve personalized transfer of head pose and expression details between different identities. We use this sophisticated and highly expressive head model as a conditional signal to train a diffusion transformer (DiT)-based generator to synthesize richly detailed portrait videos. Extensive experiments on self- and cross-reenactment tasks demonstrate that our method outperforms previous models in terms of identity preservation, expression accuracy, and temporal stability, particularly in capturing fine-grained details of complex motion.
PaperID: 1149,   Poster  https://arxiv.org/pdf/2604.01466    
Authors: Scott Xu, Dian Chen, Kelvin Wong, Chris Zhang, Kion Fallah, Raquel Urtasun
Title: Efficient Equivariant Transformer for Self-Driving Agent Modeling
Abstract: Accurately modeling agent behaviors is an important task in self-driving. It is also a task with many symmetries, such as equivariance to the order of agents and objects in the scene or equivariance to arbitrary roto-translations of the entire scene as a whole, i.e., SE(2)-equivariance. The transformer architecture is a ubiquitous tool for modeling these symmetries. While standard self-attention is inherently permutation equivariant, explicit pairwise relative positional encodings have been the standard for introducing SE(2)-equivariance. However, this approach introduces an additional cost that is quadratic in the number of agents, limiting its scalability to larger scenes and batch sizes. In this work, we propose DriveGATr, a novel transformer-based architecture for agent modeling that achieves SE(2)-equivariance without the computational cost of existing methods. Inspired by recent advances in geometric deep learning, DriveGATr encodes scene elements as multivectors in the 2D projective geometric algebra $\mathbb{R}^*_{2,0,1}$ and processes them with a stack of equivariant transformer blocks. Crucially, DriveGATr models geometric relationships using standard attention between multivectors, eliminating the need for costly explicit pairwise relative positional encodings. Experiments on the Waymo Open Motion Dataset demonstrate that DriveGATr is comparable to the state-of-the-art in traffic simulation and establishes a superior Pareto front for performance vs. computational cost.
PaperID: 1150,   Poster  https://arxiv.org/pdf/2511.16993    
Authors: junhong min, Jimin Kim, Minwook Kim, Cheol-Hui Min, YOUNGPIL JEON, Minyong Choi
Title: DepthFocus: Controllable Depth Estimation for See-Through Scenes
Abstract: Depth in the real world is rarely singular. Transmissive materials create layered ambiguities that confound conventional perception systems. Existing models remain passive, attempting to estimate static depth maps anchored to the nearest surface, while humans actively shift focus to perceive a desired depth. We introduce DepthFocus, a steerable Vision Transformer that redefines stereo depth estimation as intent-driven control. Conditioned on a scalar depth preference, the model dynamically adapts its computation to focus on the intended depth, enabling selective perception within complex scenes. The training primarily leverages our newly constructed 500k multi-layered synthetic dataset, designed to capture diverse see-through effects. DepthFocus not only achieves state-of-the-art performance on conventional single-depth benchmarks like BOOSTER, a dataset notably rich in transparent and reflective objects, but also quantitatively demonstrates intent-aligned estimation on our newly proposed real and synthetic multi-depth datasets. Moreover, it exhibits strong generalization capabilities on unseen see-through scenes, underscoring its robustness as a significant step toward active and human-like 3D perception.
PaperID: 1151,   Poster  https://arxiv.org/pdf/2511.18822    
Authors: Zhennan Chen, junwei zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, Ying Tai
Title: DiP: Taming Diffusion Models in Pixel Space
Abstract: Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP achieves up to 10× faster inference than previous methods while increasing the total number of parameters by only 0.3%, and attains a 1.90 FID score on ImageNet 256×256.
PaperID: 1152,   Poster  https://arxiv.org/pdf/2505.18604    
Authors: ISAAC NING LEE, Leila Mahmoodi, Trung Le, Mehrtash Harandi
Title: Exemplar-Free Continual Learning for State Space Models
Abstract: State-Space Models (SSMs) excel at capturing long-range dependencies with structured recurrence, making them well-suited for sequence modeling. However, their evolving internal states pose unique challenges in Continual Learning (CL). Without access to the full distribution of previous tasks, updates to the state-space dynamics become unconstrained, leading to catastrophic forgetting. To address this, we propose Inf-SSM, a geometry-aware regularization framework for CL in SSMs. It constrains state evolution via the infinite-dimensional Grassmannian of SSM observability subspaces, without requiring any exemplars from past tasks. Unlike classical CL methods that restrict weight updates, Inf-SSM directly regularizes the infinite-horizon state evolution encoded by the extended observability subspace of the SSM. We show that enforcing this regularization requires solving a matrix equation known as the Sylvester equation, which typically incurs $\mathcal{O}(n^3)$ complexity. Thus, we develop an $\mathcal{O}(n^2)$ solution by exploiting the structure and properties of SSMs. This leads to an efficient regularization mechanism that can be seamlessly integrated into existing CL methods. Comprehensive experiments on challenging benchmarks of ImageNet-R, CIFAR-100, and Caltech-256 demonstrate a significant reduction in forgetting while improving accuracy across sequential tasks.
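The complexity claim in the abstract above can be made concrete with a small sketch. It assumes the continuous-time Sylvester form A X + X B = C and, as a hypothetical example of exploitable structure, diagonal A and B; the abstract does not say which form or which structure the paper uses, so this is not the paper's derivation. Under that assumption each entry decouples as X[i][j] = C[i][j] / (a[i] + b[j]), which costs O(n^2) in total instead of the O(n^3) of a general dense solver.

```python
def solve_sylvester_diagonal(a, b, c):
    """Solve diag(a) @ X + X @ diag(b) = C entry by entry.

    a, b: lists holding the diagonals of A and B (assumed diagonal).
    c: right-hand side as a list of rows with len(a) rows and len(b) columns.
    Runs in O(len(a) * len(b)) time.
    """
    x = [[0.0] * len(b) for _ in range(len(a))]
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            denom = ai + bj
            if denom == 0.0:
                raise ValueError("a[i] + b[j] must be non-zero for a unique solution")
            x[i][j] = c[i][j] / denom
    return x


def sylvester_residual(a, b, x, c):
    """Largest absolute entry of diag(a) @ X + X @ diag(b) - C."""
    return max(
        abs(a[i] * x[i][j] + x[i][j] * b[j] - c[i][j])
        for i in range(len(a))
        for j in range(len(b))
    )


if __name__ == "__main__":
    a = [1.0, 2.0]
    b = [3.0, 4.0]
    c = [[4.0, 5.0], [5.0, 6.0]]
    x = solve_sylvester_diagonal(a, b, c)
    print(x, sylvester_residual(a, b, x, c))
```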
PaperID: 1153,   Poster  https://arxiv.org/pdf/2604.12894    
Authors: Prashanth Chandran, Daoye Wang, Timo Bolkart
Title: Representing 3D Faces with Learnable B-Spline Volumes
Abstract: We present CUBE (Control-based Unified B-Spline Encoding), a new geometric representation for digital humans that combines B-Spline volumes with learned features, and demonstrate its use as a decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-Spline representations that use 3D control points, CUBE is parametrized by a lattice (e.g., 8 × 8 × 8) of high-dimensional control features, increasing the model's expressivity. These control features define a continuous mapping from a 3D parametric domain to 3D Euclidean space through an intermediate feature space, which is evaluated in two stages. First, high-dimensional control features are locally blended using the B-Spline bases, yielding a high-dimensional feature vector, where the first three values are the 3D coordinates of a coarse base mesh. This feature vector is input to a small MLP to predict a residual from the base shape, resulting in refined 3D point coordinates. To reconstruct 3D surfaces in dense semantic correspondence, we query CUBE at 3D coordinates sampled from a fixed template mesh. Crucially, CUBE retains the local support of traditional B-spline representations, enabling us to locally edit the surface by updating individual control features. We demonstrate the strengths of this representation by training two transformer-based encoders to predict CUBE's control features from unstructured point clouds and monocular images, achieving state-of-the-art scan registration results compared to recent geometric and multi-view baselines.
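The first evaluation stage described in the abstract above (local blending of control features with B-Spline bases) can be sketched in one dimension. The sketch assumes uniform cubic B-splines and a 1D row of control features instead of the paper's 3D lattice; the spline degree, the function names, and the toy feature values are assumptions, and the residual MLP of the second stage is omitted.

```python
import math


def cubic_bspline_weights(t):
    """Uniform cubic B-spline basis weights for local parameter t in [0, 1]."""
    return [
        (1.0 - t) ** 3 / 6.0,
        (3.0 * t ** 3 - 6.0 * t ** 2 + 4.0) / 6.0,
        (-3.0 * t ** 3 + 3.0 * t ** 2 + 3.0 * t + 1.0) / 6.0,
        t ** 3 / 6.0,
    ]


def blend_features(control, u):
    """Blend the four control features that support parameter u.

    control: list of feature vectors (lists of floats), at least four entries.
    u: parameter in [0, len(control) - 3]; segment i uses control[i..i+3].
    """
    n_segments = len(control) - 3
    i = min(int(math.floor(u)), n_segments - 1)
    w = cubic_bspline_weights(u - i)
    dim = len(control[0])
    return [sum(w[k] * control[i + k][d] for k in range(4)) for d in range(dim)]


if __name__ == "__main__":
    # Eight hypothetical 2-dimensional control features.
    control = [[float(i), float(i * i)] for i in range(8)]
    before = blend_features(control, 0.5)
    control[7] = [100.0, 100.0]  # edit a single control feature
    after = blend_features(control, 0.5)
    print(before == after)  # the edit is outside the support at u = 0.5
```

Because each query touches only four neighbouring control features per axis, editing one feature changes the output only within its support, which is the local-editing property the abstract highlights.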
PaperID: 1154,   Poster  https://arxiv.org/pdf/2603.17910    
Authors: Marcos Ferreira, Tianao Li, John Mamish, Josiah Hester, Yaman Sangar, Qi Guo, Emma Alexander
Title: SpiderCam: Low-Power Snapshot Depth from Differential Defocus
Abstract: We introduce SpiderCam, an FPGA-based snapshot depth-from-defocus camera which produces 480x400 sparse depth maps in real time at 32.5 FPS over a working range of 52 cm while consuming 611 mW of power in total. SpiderCam comprises a custom camera which simultaneously captures two differently focused images of the same scene, processed with a SystemVerilog implementation of depth from differential defocus (DfDD) on a low-power FPGA. To achieve state-of-the-art power consumption, we present algorithmic improvements to DfDD that overcome challenges caused by low-power sensors, and design a memory-local implementation for streaming depth computation on a device that is too small to store even a single image pair. We report the first sub-Watt total power measurement for passive FPGA-based 3D cameras in the literature.
PaperID: 1155,   Poster  https://arxiv.org/pdf/2511.22287    
Authors: Kate Feingold, Omri Kaduri, Tali Dekel
Title: Match-and-Fuse: Consistent Generation from Unstructured Image Sets
Abstract: We present Match-and-Fuse, a zero-shot, training-free method for consistent controlled generation of unstructured image sets: collections that share a common visual element, yet differ in viewpoint, time of capture, and surrounding content. Unlike existing methods that operate on individual images or densely sampled videos, our framework performs set-to-set generation: given a source set and user prompts, it produces a new set that preserves cross-image consistency of shared content. Our key idea is to model the task as a graph, where each node corresponds to an image and each edge triggers a joint generation of image pairs. This formulation consolidates all pairwise generations into a unified framework, enforcing their local consistency while achieving global coherence across the entire set. This is achieved by fusing internal features across image pairs, guided by dense input correspondences, without requiring masks or manual supervision. This also allows us to leverage an emergent prior in text-to-image models that encourages coherent generation when multiple views share a single canvas. Match-and-Fuse achieves state-of-the-art consistency and visual quality, and unlocks new capabilities for content creation from image collections.
PaperID: 1156,   Poster  https://arxiv.org/pdf/2603.29634    
Authors: Hengyu Zeng, Xin Gao, Guanghao Li, Yuxiang Yan, Jiaoyang Ruan, Ma Junpeng, Haoyu Albert Wang, Jian Pu
Title: MacTok: Robust Continuous Tokenization for Image Generation
Abstract: Continuous image tokenizers enable efficient visual generation, and those based on variational frameworks can learn smooth, structured latent representations through KL regularization. Yet this often leads to posterior collapse when using fewer tokens, where the encoder fails to encode informative features into the compressed latent space. To address this, we introduce MacTok, a Masked Augmenting 1D Continuous Tokenizer that leverages image masking and representation alignment to prevent collapse while learning compact and robust representations. MacTok applies both random masking to regularize latent learning and DINO-guided semantic masking to emphasize informative regions in images, forcing the model to encode robust semantics from incomplete visual evidence. Combined with global and local representation alignment, MacTok preserves rich discriminative information in a highly compressed 1D latent space, requiring only 64 or 128 tokens. On ImageNet, MacTok achieves a competitive gFID of 1.44 at 256×256 and a state-of-the-art 1.52 at 512×512 with SiT-XL, while reducing token usage by up to 64×. These results confirm that masking and semantic guidance together prevent posterior collapse and achieve efficient, high-fidelity tokenization.
PaperID: 1157,   Poster  https://arxiv.org/pdf/2601.09881    
Authors: Weili Nie, Julius Berner, Nanye Ma, Chao Liu, Saining Xie, Arash Vahdat
Title: Transition Matching Distillation for Fast Video Generation
Abstract: Large video diffusion and flow models have achieved remarkable success in high-quality video generation, but their use in real-time interactive applications remains limited due to their inefficient multi-step sampling process. In this work, we present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. The central idea of TMD is to match the multi-step denoising trajectory of a diffusion model with a few-step probability transition process, where each transition is modeled as a lightweight conditional flow. To enable efficient distillation, we decompose the original diffusion backbone into two components: (1) a main backbone, comprising the majority of early layers, that extracts semantic representations at each outer transition step; and (2) a flow head, consisting of the last few layers, that leverages these representations to perform multiple inner flow updates. Given a pretrained video flow model, we first introduce a flow head to the model, and adapt it into a conditional flow map. We then apply distribution matching distillation to the student model with flow head rollout in each transition step. Extensive experiments on distilling Wan2.1 1.3B and 14B text-to-video models demonstrate that TMD provides a flexible and strong trade-off between generation speed and visual quality. In particular, TMD outperforms existing distilled models under comparable inference costs in terms of visual fidelity and prompt adherence.
PaperID: 1158,   Poster  https://arxiv.org/pdf/2602.23361    
Authors: Sven Elflein, Ruilong Li, Sérgio Agostinho, Žan Gojčič, Laura Leal-Taixe, Qunjie Zhou, Aljoša Ošep
Title: VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale
Abstract: We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and achieves an 11.6× speed-up over baselines that rely on softmax attention for reconstructing a 1k image collection in just 54 seconds. Because our method retains global scene aggregation capability, our resulting point map reconstruction error is comparable to VGGT.
PaperID: 1159,   Poster  https://arxiv.org/pdf/2511.21681    
Authors: Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han
Title: Seeing without Pixels: Perception from Camera Trajectories
Abstract: Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
PaperID: 1160,   Poster  https://arxiv.org/pdf/2604.01659    
Authors: Yukai Ma, Honglin He, Selina Song, Wayne Wu, Bolei Zhou
Title: AURA: Multi-modal Shared Autonomy for Urban Navigation
Abstract: Long-horizon navigation in complex urban environments still relies heavily on continuous human operation, which leads to fatigue, reduced efficiency, and safety concerns. Shared autonomy, where a Vision-Language AI agent and a human operator collaborate on maneuvering the mobile machine, presents a promising solution to address these issues. However, existing shared autonomy methods often require humans and AI to operate in the same action space, resulting in high cognitive overhead. We present Assistive Urban Robot Autonomy (AURA), a new multi-modal framework that decomposes urban navigation into high-level human instruction and low-level AI control. AURA incorporates a Spatial-Aware Instruction Encoder to align human instructions with visual and spatial context. To facilitate training, we construct UrbanWalks, a large-scale dataset composed of teleoperation and vision-language description data. Experiments in simulation and the real world demonstrate that AURA effectively follows human instructions, reduces manual operation effort, and improves navigation stability, while enabling online adaptation and continuous learning. Moreover, under similar takeover conditions, our hierarchical shared autonomy framework reduces human operation frequency by over 75%. Code and data will be made available.
PaperID: 1161,   Poster  https://arxiv.org/pdf/2507.04482    
Authors: Kyoungmin Lee, Jihun Park, Jongmin Gim, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Sunghoon Im
Title: A Training-Free Style-Personalization via SVD-Based Feature Decomposition
Abstract: We present a training-free framework for style-personalized image generation that operates during inference using a scale-wise autoregressive model. Our method generates a stylized image guided by a single reference style while preserving semantic consistency and mitigating content leakage. Through a detailed step-wise analysis of the generation process, we identify a pivotal step where the dominant singular values of the internal feature encode style-related components. Building upon this insight, we introduce two lightweight control modules: Principal Feature Blending, which enables precise modulation of style through SVD-based feature reconstruction, and Structural Attention Correction, which stabilizes structural consistency by leveraging content-guided attention correction across fine stages. Without any additional training, extensive experiments demonstrate that our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.
PaperID: 1162,   Poster  https://arxiv.org/pdf/2603.14647    
Authors: Guangyu Meng, Pengfei Gu, Peixian Liang, John P. Lalor, Erin Chambers, Danny Chen
Title: TopoCL: Topological Contrastive Learning for Medical Imaging
Abstract: Contrastive learning (CL) has become a powerful approach for learning representations from unlabeled images. However, existing CL methods focus predominantly on visual appearance features while neglecting topological characteristics (e.g., connectivity patterns, boundary configurations, cavity formations) that provide valuable cues for medical image analysis. To address this limitation, we propose a new topological CL framework (TopoCL) that explicitly exploits topological structures during contrastive learning for medical imaging. Specifically, we first introduce topology-aware augmentations that control topological perturbations using a relative bottleneck distance between persistence diagrams, preserving medically relevant topological properties while enabling controlled structural variations. We then design a Hierarchical Topology Encoder that captures topological features through self-attention and cross-attention mechanisms. Finally, we develop an adaptive mixture-of-experts (MoE) module to dynamically integrate visual and topological representations. TopoCL can be seamlessly integrated with existing CL methods. We evaluate TopoCL on five representative CL methods (SimCLR, MoCo-v3, BYOL, DINO, and Barlow Twins) and five diverse medical image classification datasets. The experimental results show that TopoCL achieves consistent improvements: an average gain of +3.26% in linear probe classification accuracy with strong statistical significance, verifying its effectiveness.
PaperID: 1163,   Poster  https://arxiv.org/pdf/2603.26113    
Authors: Kang Zhang, Suyeon Lee, Arda Senocak, Joon Chung
Title: Cinematic Audio Source Separation Using Visual Cues
Abstract: Cinematic Audio Source Separation (CASS) aims to decompose mixed film audio into speech, music, and sound effects, enabling applications like dubbing and remastering. Existing CASS approaches are audio-only, overlooking the inherent audio-visual nature of films, where sounds often align with visual cues. We present the first framework for audio-visual CASS (AV-CASS), leveraging visual context to enhance separation quality. Our method formulates CASS as a conditional generative modeling problem using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream setup. Trained entirely on synthetic data, our model generalizes effectively to real-world cinematic content and achieves strong performance on synthetic, real-world, and audio-only CASS benchmarks.
PaperID: 1164,   Poster  https://arxiv.org/pdf/2511.19629    
Authors: Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman
Title: SkillSight: Efficient First-Person Skill Assessment with Gaze
Abstract: Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce SkillSight for power-efficient skill assessment from first-person data. Central to our approach is the hypothesis that skill level is evident not only in how a person performs an activity (video), but also in how they direct their attention when doing so (gaze). Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. At inference, the student model requires only gaze input, drastically reducing power consumption by eliminating continuous video processing. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding across diverse real-world settings. Our SkillSight teacher model achieves state-of-the-art performance, while our gaze-only student variant maintains high accuracy using 73× less power than competing methods. These results pave the way for in-the-wild AI-supported skill learning.
PaperID: 1165,   Poster  https://arxiv.org/pdf/2603.08483    
Authors: Youngseo Kim, Kwan Yun, Seokhyeon Hong, Sihun Cha, Colette Koo, Junyong Noh
Title: X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection
Abstract: The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a generator-side view and observe that internal cross-attention mechanisms in these models encode fine-grained speech–motion alignment, offering useful correspondence cues for forgery detection. Building on this insight, we propose X-AVDT, a robust and generalizable deepfake detector that probes generator-internal audio-visual signals accessed via DDIM inversion to expose these cues. X-AVDT extracts two complementary signals: (i) a video composite capturing inversion-induced discrepancies, and (ii) an audio–visual cross-attention feature reflecting modality alignment enforced during generation. To enable faithful, cross-generator evaluation, we further introduce MMDF, a new multi-modal deepfake dataset spanning diverse manipulation types and rapidly evolving synthesis paradigms, including GANs, diffusion, and flow-matching. Extensive experiments demonstrate that X-AVDT achieves leading performance on MMDF and generalizes strongly to external benchmarks and unseen generators, outperforming existing methods with accuracy improved by +13.1%. Our findings highlight the importance of leveraging internal audio–visual consistency cues for robustness to future generators in deepfake detection. The code and dataset will soon be available.
PaperID: 1166,   Poster  https://arxiv.org/pdf/2603.19039    
Authors: Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota
Title: TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
Abstract: Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
PaperID: 1167,   Poster  https://arxiv.org/pdf/2603.17571    
Authors: Yijing Guo, Mengjun Chao, Luo Wang, Tianyang Zhao, Haizhao Dai, Yingliang Zhang, Jingyi Yu, Yujiao Shi
Title: PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery
Abstract: Panoramic imagery offers a full 360° field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction. Existing feed-forward models, built for perspective cameras, generalize poorly to this setting. We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth and 6-DoF pose annotations. Extensive experiments on PanoCity and standard benchmarks demonstrate that PanoVGGT achieves competitive accuracy, strong robustness, and improved cross-domain generalization. Code and dataset will be released.
PaperID: 1168,   Poster  https://arxiv.org/pdf/2512.22096    
Authors: Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Kaipeng Zhang
Title: Yume1.5: A Text-Controlled Interactive World Generation Model
Abstract: Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance, and they lack text-controlled generation capabilities. To address these challenges, we propose Yume1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation method combining unified context compression and linear attention; (2) a context compression-based bidirectional attention distillation approach with an enhanced text embedding scheme for real-time streaming video generation; and (3) a text-controlled method for generating world events. Yume1.5 achieves an average generation speed of 12 fps at 540p resolution using only a single A100 GPU. We have provided the codebase in the supplementary material. The model weights and full codebase will be made public.
PaperID: 1169,   Poster  https://arxiv.org/pdf/2511.21662    
Authors: Tianyi Xiong, Yi Ge, Ming Li, Zuolong Zhang, Pranav Kulkarni, Kaishen Wang, Qi He, Zeying Zhu, Chenxi Liu, Ruibo Chen, Tong Zheng, Yanshuo Chen, Xiyao Wang, Renrui Zhang, Wenhu Chen, Heng Huang
Title: Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
Abstract: Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria, especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.
PaperID: 1170,   Poster  https://arxiv.org/pdf/2506.13130    
Authors: Yuiga Wada, Kazuki Matsuda, Komei Sugiura, Graham Neubig
Title: ZINA: Multimodal Fine-grained Hallucination Detection and Editing
Abstract: Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting hallucinations at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. Moreover, we propose ZINA, a novel method that identifies hallucinated spans at a fine-grained level, classifies their error types into six categories, and suggests appropriate refinements. To train and evaluate models for this task, we constructed VisionHall, a dataset comprising 6.9k outputs from twelve MLLMs manually annotated by 211 annotators, and 20k synthetic samples generated using a graph-based method that captures dependencies among error types. We demonstrated that ZINA outperformed existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.
PaperID: 1171,   Poster  https://arxiv.org/pdf/2512.10805    
Authors: Akshay R. Kulkarni, Tsui-Wei Weng, Vivek Narayanaswamy, Shusen Liu, Wesam A. Sakla, Kowshik Thopalli
Title: Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
Abstract: Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires that the learned features be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics and conduct a systematic analysis on LVLMs. Our analysis uncovers two observations: (i) a majority of SAE neurons exhibit either low interpretability or low steerability or both, rendering them ineffective for downstream use; and (ii) due to the unsupervised nature of SAEs, user-desired concepts are often absent in the learned dictionary, thus limiting their practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE), a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a user-defined concept set. The resulting CB-SAE improves interpretability by +32.1% and steerability by +14.5% across LVLMs and image generation tasks. We will make our code and model weights available.
PaperID: 1172,   Poster  https://arxiv.org/pdf/2603.09104    
Authors: Zixuan Wang, Ziqin Zhou, Feng Chen, Duo Peng, Yixin Hu, Changsheng Li, Yinjie Lei
Title: Training-free Motion Factorization for Compositional Video Generation
Abstract: Compositional video generation aims to synthesize multiple instances with diverse appearance and motion, which is widely applicable in real-world scenarios. However, current approaches mainly focus on binding semantics, neglecting to understand diverse motion categories specified in prompts. In this paper, we propose a motion factorization framework that decomposes complex motion into three primary categories: motionlessness, rigid motion, and non-rigid motion. Specifically, our framework follows a planning-before-generation paradigm. (1) During planning, we reason about motion laws on the motion graph to obtain frame-wise changes in the shape and position of each instance. This alleviates semantic ambiguities in the user prompt by organizing it into a structured representation of instances and their interactions. (2) During generation, we modulate the synthesis of distinct motion categories in a disentangled manner. Conditioned on the motion cues, guidance branches stabilize appearance in motionless regions, preserve rigid-body geometry, and regularize local non-rigid deformations. Crucially, our two modules are model-agnostic and can be seamlessly incorporated into various diffusion model architectures. Extensive experiments demonstrate that our framework achieves impressive performance in motion synthesis on real-world benchmarks. Our code will be released soon.
PaperID: 1173,   Poster  https://arxiv.org/pdf/2604.09206    
Authors: Jiahao Wang, Zikun Xu, Yuner Zhang, Zhongwei Jiang, Chenyang Lu, Shuocheng Yang, Yuxuan Wang, Jiaru Zhong, Chuang Zhang, Shaobing Xu, Jianqiang Wang
Title: Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception
Abstract: Cooperative 3D perception via Vehicle-to-Everything (V2X) communication is a promising paradigm for enhancing autonomous driving, offering extended sensing horizons and occlusion resolution. However, the practical deployment of existing methods is hindered at long distances by two critical bottlenecks: the quadratic computational scaling of dense BEV representations and the fragility of feature association mechanisms under significant observation and alignment errors. To overcome these limitations, we introduce Long-SCOPE, a fully sparse framework designed for robust long-distance cooperative 3D perception. Our method introduces two novel components: a Geometry-guided Query Generation module to accurately detect small, distant objects, and a learnable Context-Aware Association module that robustly matches cooperative queries despite severe positional noise. Experiments on the V2X-Seq and Griffin datasets validate that Long-SCOPE achieves state-of-the-art performance, particularly in challenging long-range settings, while maintaining superior computational efficiency and a highly competitive transmission cost.
PaperID: 1174,   Poster  https://arxiv.org/pdf/2511.17181    
Authors: Dragos-Alexandru Boldisor, Stefan Smeu, Dan Oneata, Elisabeta Oneata
Title: Investigating Self-Supervised Representations for Audio-Visual Deepfake Detection
Abstract: Self-supervised representations excel at many vision and speech tasks, but their potential for audio-visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buried within complex architectures, we systematically evaluate them across modalities (audio, video, multimodal) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of encoded information, and cross-modal complementarity. We find that most self-supervised features capture deepfake-relevant information, and that this information is complementary. Moreover, the models attend to semantically meaningful regions rather than spurious artifacts. Yet none generalize reliably across datasets. This generalization failure likely stems from dataset characteristics, not from the features themselves latching onto superficial patterns. These results expose both the promise and fundamental challenges of self-supervised representations for deepfake detection: while they learn meaningful patterns, achieving robust cross-domain performance remains elusive.
PaperID: 1175,   Poster  https://arxiv.org/pdf/2602.24275    
Authors: Junxian Huang, Ruichu Cai, Juntao Fang, Hao Zhu, Boyan Xu, Weilin Chen, Zijian Li, Shenghua Gao
Title: Hierarchical Action Learning for Weakly-Supervised Action Segmentation
Abstract: Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features, tend to over-segment. This highlights the difficulty of enabling hierarchical reasoning in video understanding. Interestingly, we observe that low-level visual and high-level action latent variables evolve at different rates, with low-level visual variables changing rapidly, while high-level action variables evolve more slowly, making them easier to identify. Building on this insight, we propose the Hierarchical Action Learning (HAL) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process, where high-level latent actions govern the dynamics of low-level visual features. To model these varying timescales effectively, we introduce deterministic processes to align these latent variables over time. The HAL model employs a hierarchical pyramid transformer to capture both visual features and latent variables, and a sparse transition constraint is applied to enforce the slower dynamics of high-level action variables. This mechanism enhances the identification of these latent variables over time. Under mild assumptions, we prove that these latent action variables are strictly identifiable. Experimental results on several benchmarks show that the HAL model significantly outperforms existing methods for weakly-supervised action segmentation, confirming its practical effectiveness in real-world applications.
PaperID: 1176,   Poster  https://arxiv.org/pdf/2511.15605    
Authors: Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang, Liming Liu, Jinlong Hou, Jingjing Gong, Xianzhong Zhao, Xipeng Qiu
Title: SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
Abstract: Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model’s own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of Latent World Representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model’s latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO’s efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.
PaperID: 1177,   Poster  https://arxiv.org/pdf/2601.02536    
Authors: Shaden Shaar, Bradon Michael Thymes, Sirawut Chaixanien, Claire Cardie, Bharath Hariharan
Title: MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark
Abstract: Understanding real-world videos such as movies requires integrating visual and dialogue cues to answer complex questions. Yet existing VideoQA benchmarks struggle to capture this multimodal reasoning and are largely not open-ended, given the difficulty of evaluating free-form answers. In this paper, we introduce a novel open-ended multi-modal VideoQA benchmark, MovieRecaps, created using movie recap videos, a distinctive type of YouTube content that summarizes a film by presenting its key events through synchronized visual (recap video) and textual (recap summary) modalities. Using the recap summary, we generate 8.2K question-answer (QA) pairs (aligned with movie-subtitles) and provide the necessary "facts" needed to verify an answer in a reference-free manner. To our knowledge, this is the first open-ended VideoQA benchmark that supplies explicit textual context of the input (video and/or text), which we use for evaluation. Our benchmark provides videos of multiple lengths (i.e., recap-segments, movie-segments) and categorizations of questions (by modality and type) to enable fine-grained analysis. We evaluate the performance of seven state-of-the-art MLLMs using our benchmark and observe that: 1) visual-only questions remain the most challenging; 2) models default to textual inputs whenever available; 3) extracting factually accurate information from video content is still difficult for all models; and 4) proprietary and open-source models perform comparably on video-dependent questions.
PaperID: 1178,   Poster  https://arxiv.org/pdf/2601.06928    
Authors: Shenghao Zhang, Runtao Liu, Christopher Schroers, Yang Zhang
Title: RenderFlow: Single-Step Neural Rendering via Flow Matching
Abstract: Conventional physically based rendering (PBR) pipelines generate photorealistic images through computationally expensive light transport simulations. Although recent deep learning approaches leverage diffusion model priors with geometry buffers (G-buffers) to produce visually compelling results without explicit scene geometry or light simulation, they remain constrained by two major limitations. First, the iterative nature of the diffusion process introduces substantial latency. Second, the inherent stochasticity of these generative models compromises physical accuracy and temporal consistency. In response to these challenges, we propose a novel, end-to-end, deterministic single-step neural rendering framework, RenderFlow, built upon a flow matching paradigm. To further strengthen both rendering quality and generalization, we propose an efficient and effective module for sparse keyframe guidance. Our method significantly accelerates the rendering process and, by optionally incorporating sparsely rendered keyframes as guidance, enhances both the physical plausibility and overall visual quality of the output. The resulting pipeline achieves near real-time performance with photorealistic rendering quality, effectively bridging the gap between the efficiency of modern generative models and the precision of traditional physically based rendering. Furthermore, we demonstrate the versatility of our framework by introducing a lightweight, adapter-based module that efficiently repurposes the pretrained forward model for the inverse rendering task of intrinsic decomposition.
PaperID: 1179,   Poster  https://arxiv.org/pdf/2510.24021    
Authors: Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren
Title: SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs
Abstract: Knowledge distillation (KD) is a standard route to compress Large Language Models (LLMs) into compact students, yet most pipelines uniformly apply token-wise loss regardless of teacher confidence. This indiscriminate supervision amplifies noisy, high-entropy signals and is especially harmful under large teacher-student capacity gaps. We introduce SelecTKD, a plug-and-play Selective Token-Weighted distillation framework that shifts the focus from "how to measure divergence" to "where to apply learning". At each step, the student proposes tokens that are verified by the teacher through a robust propose-and-verify procedure with two variants: greedy Top-k and non-greedy Spec-k. Accepted tokens receive full loss, while rejected tokens are masked or down-weighted. This objective-agnostic design works with on- and off-policy data, induces an implicit curriculum quantified by Token Acceptance Rate (TAR), and stabilizes optimization. Across instruction following, mathematical reasoning, code generation, and a VLM setting, SelecTKD consistently improves strong baselines and achieves state-of-the-art results for small models, without architectural changes or extra reference models.
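The propose-and-verify idea can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it assumes the greedy variant accepts a position when the student's argmax token lies in the teacher's top-k, uses forward KL as the per-token divergence (the abstract describes the design as objective-agnostic), and all function names and the `reject_weight` parameter are hypothetical.

```python
import math


def top_k_ids(probs, k):
    """Indices of the k largest probabilities."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def kl_divergence(p, q, eps=1e-12):
    """Forward KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def selective_kd_loss(teacher_probs, student_probs, k=1, reject_weight=0.0):
    """Token-weighted distillation loss plus the token acceptance rate.

    Assumption: a position is accepted when the student's greedy token
    is among the teacher's top-k tokens at that position.
    """
    weighted_sum, weight_sum, accepted_count = 0.0, 0.0, 0
    for t, s in zip(teacher_probs, student_probs):
        proposal = max(range(len(s)), key=lambda i: s[i])  # student proposes
        accepted = proposal in top_k_ids(t, k)              # teacher verifies
        weight = 1.0 if accepted else reject_weight         # mask or down-weight
        accepted_count += int(accepted)
        weighted_sum += weight * kl_divergence(t, s)
        weight_sum += weight
    loss = weighted_sum / weight_sum if weight_sum > 0 else 0.0
    return loss, accepted_count / len(teacher_probs)


# Toy example: two positions over a three-token vocabulary.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
student = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
loss, tar = selective_kd_loss(teacher, student, k=1)
print(f"loss={loss:.4f}, TAR={tar:.2f}")
```

In this toy case the first position is accepted and the second is masked, so the acceptance rate is 0.5 and only the first position contributes to the loss. Setting `reject_weight` to a small positive value gives the down-weighting behaviour instead of hard masking.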
PaperID: 1180,   Poster  https://arxiv.org/pdf/2602.11124    
Authors: Tianyi Xiong, Shihao Wang, Guilin Liu, Yi Dong, Ming Li, Heng Huang, Jan Kautz, Zhiding Yu
Title: PhyCritic: Multimodal Critic Models for Physical AI
Abstract: With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
PaperID: 1181,   Poster  https://arxiv.org/pdf/2603.06973    
Authors: Chaohong Guo, Yihan He, Yongwei Nie, Fei Ma, Xuemiao Xu, Chengjiang Long
Title: T2SGrid: Temporal-to-Spatial Gridification for Video Temporal Grounding
Abstract: Video Temporal Grounding (VTG) aims to localize the video segment that corresponds to a natural language query, which requires a comprehensive understanding of complex temporal dynamics. Existing Vision-LMMs typically perceive temporal dynamics via positional encoding, text-based timestamps, or visual frame numbering. However, these approaches exhibit notable limitations: assigning each frame a text-based timestamp token introduces additional computational overhead and leads to sparsity in visual attention, positional encoding struggles to capture absolute temporal information, and visual frame numbering often compromises spatial detail. To address these issues, we propose Temporal-to-Spatial Gridification (T2SGrid), a novel framework that reformulates video temporal understanding as a spatial understanding task. The core idea of T2SGrid is to process video content in clips rather than individual frames. We employ an overlapping sliding-window mechanism to segment the video into temporal clips. Within each window, frames are arranged chronologically in row-major order into a composite grid image, effectively transforming temporal sequences into structured 2D layouts. The gridification not only encodes temporal information but also enhances local attention within each grid. Furthermore, T2SGrid enables the use of composite text timestamps to establish global temporal awareness. Experiments on standard VTG benchmarks demonstrate that T2SGrid achieves superior performance.
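The windowing and row-major layout described above can be sketched on frame indices alone. This is a minimal illustration, not the paper's code: the window size, stride, column count, and function names are assumptions, and trailing frames that do not fill a complete window are simply dropped here.

```python
def sliding_windows(num_frames, window, stride):
    """Split frame indices into clips; stride < window makes clips overlap."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = max(num_frames - window, 0)
    return [
        list(range(start, min(start + window, num_frames)))
        for start in range(0, last_start + 1, stride)
    ]


def grid_layout(frame_ids, cols):
    """Map each frame id to its (row, col) cell in row-major order."""
    return {frame: divmod(pos, cols) for pos, frame in enumerate(frame_ids)}


# Ten frames, clips of four frames, stride two (adjacent clips share two frames).
clips = sliding_windows(num_frames=10, window=4, stride=2)
layout = grid_layout(clips[1], cols=2)  # second clip arranged as a 2x2 grid
print(clips)
print(layout)
```

Each cell position can then be multiplied by the frame height and width to obtain the pixel offset at which a frame is pasted into the composite grid image.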
PaperID: 1182,   Poster  https://arxiv.org/pdf/2511.22686    
Authors: Yiwen Zhang, Joseph Tung, Ruojin Cai, David Fouhey, Hadar Averbuch-Elor
Title: Emergent Extreme-View Geometry in 3D Foundation Models
Abstract: 3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with dedicated test splits for both relative pose estimation and dense 3D reconstruction. All code and data will be released.
PaperID: 1183,   Poster  https://arxiv.org/pdf/2603.24383    
Authors: Songjin Cai, Linjie Zhong, Ling Guo, Changxing Ding
Title: ViHOI: Human-Object Interaction Synthesis with Visual Priors
Abstract: Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction priors from easily accessible 2D images. Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. We utilize a large Vision-Language Model (VLM) as a powerful prior-extraction engine and adopt a layer-decoupled strategy to obtain visual and textual priors. Concurrently, we design a Q-Former-based adapter that compresses the VLM’s high-dimensional features into compact prior tokens, which significantly facilitates the conditional training of our diffusion model. Our framework is trained on motion-rendered images from the dataset to ensure strict semantic alignment between visual inputs and motion sequences. During inference, it leverages reference images synthesized by a text-to-image generation model to improve generalization to unseen objects and interaction categories. Experimental results demonstrate that ViHOI achieves state-of-the-art performance, outperforming existing methods across multiple benchmarks and demonstrating superior generalization. The code for this work will be released.
PaperID: 1184,   Poster  https://arxiv.org/pdf/2512.19049    
Authors: Hwanhee Jung, Seunggwan Lee, Jeongyoon Yoon, SeungHyeon Kim, Giljoo Nam, Qixing Huang, Sangpil Kim
Title: Decoupled Generative Modeling for Human-Object Interaction Synthesis
Abstract: Synthesizing realistic human-object interaction (HOI) is essential for 3D computer vision and robotics, underpinning animation and embodied control. Existing approaches often require manually specified intermediate waypoints and place all optimization objectives on a single network, which increases complexity, reduces flexibility, and leads to errors such as unsynchronized human and object motion or penetration. To address these issues, we propose Decoupled Generative Modeling for Human-Object Interaction Synthesis (DecHOI), which separates path planning and action synthesis. A trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator conditions on these paths to synthesize detailed motions. To further improve contact realism, we employ adversarial training with a discriminator that focuses on the dynamics of distal joints. The framework also models a moving counterpart and supports responsive, long-sequence planning in dynamic scenes, while preserving plan consistency. Across two benchmarks, FullBodyManipulation and 3D-FUTURE, DecHOI surpasses prior methods on most quantitative metrics and qualitative evaluations, and perceptual studies likewise prefer our results.
PaperID: 1185,   Poster  https://arxiv.org/pdf/2504.07503    
Authors: Jinze Chen, Wei Zhai, Yang Cao, Bin Li, Zheng-Jun Zha
Title: Event Stream Filtering via Probability Flux Estimation
Abstract: Event cameras asynchronously capture brightness changes with microsecond latency, offering exceptional temporal precision but suffering from severe noise and signal inconsistencies. Unlike conventional signals, events carry state information through polarities and process information through inter-event time intervals. However, existing event filters often ignore the latter, producing outputs that are sparser than the raw input and limiting the reconstruction of continuous irradiance dynamics. We propose the Event Density Flow Filter (EDFilter), a framework that models event generation as threshold-crossing probability fluxes arising from the stochastic diffusion of irradiance trajectories. EDFilter performs nonparametric, kernel-based estimation of probability flux and reconstructs the continuous event density flow using an O(1) recursive solver, enabling real-time processing. The Rotary Event Dataset (RED), featuring microsecond-resolution ground-truth irradiance flow under controlled illumination, is also presented for event quality evaluation. Experiments demonstrate that EDFilter achieves high-fidelity, physically interpretable event denoising and motion reconstruction.
PaperID: 1186,   Poster  https://arxiv.org/pdf/2602.14122    
Authors: Bingwen Zhu, Yuqian Fu, Qiaole Dong, Guolei Sun, Tianwen Qian, Yuzheng Wu, Danda Paudel, Yanwei Fu, Xiangyang Xue
Title: EgoSound: Benchmarking Sound Understanding in Egocentric Videos
Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in visual-language understanding. Yet, human perception is inherently multisensory, integrating sight, sound, and motion to reason about the world. Among these modalities, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, particularly in egocentric settings where auditory and visual signals are tightly coupled. To this end, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Constructed through a multi-stage auto-generative pipeline, EgoSound contains 7315 validated QA pairs across 900 videos. Comprehensive experiments on nine state-of-the-art MLLMs reveal that current models exhibit emerging auditory reasoning abilities but remain limited in fine-grained spatial and causal understanding. EgoSound establishes a challenging foundation for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world.
PaperID: 1187,   Poster  https://arxiv.org/pdf/2603.26068    
Authors: Elkhan Ismayilzada, Yufei Zhang, Zijun Cui
Title: PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery
Abstract: Significant advancements made in reconstructing hands from images have delivered accurate single-frame estimates, yet they often lack physics consistency and provide no notion of how confidently the motion satisfies physics. In this paper, we propose a novel physics-aware conditional diffusion framework that refines noisy pose sequences into physically plausible hand motion while estimating the physics variance in motion estimates. Building on a MeshCNN–Transformer backbone, we formulate Euler–Lagrange dynamics for articulated hands. Unlike prior works that enforce zero residuals, we treat the resulting dynamic residuals as virtual observables to more effectively integrate physics. Through a last-layer Laplace approximation, our method produces per-joint, per-time variances that measure physics consistency and offers interpretable variance maps indicating where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods. Qualitative results confirm that our variance estimations are aligned with the physical plausibility of the motion in image-based estimates.
PaperID: 1188,   Poster  https://arxiv.org/pdf/2512.05955    
Authors: Haowen Liu, Shaoxiong Yao, Haonan Chen, Jiawei Gao, Jiayuan Mao, Jia-Bin Huang, Yilun Du
Title: SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
Abstract: Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities. However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning. To overcome this, we present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training. From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning. By integrating language reasoning with physics prediction, our simulation-enabled VLM can understand contact dynamics and action outcomes in a physically grounded way. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks that require fine-grained physical reasoning, outperforming existing general-purpose robotic manipulation models. Our results demonstrate that embedding physics understanding via efficient simulation into VLM reasoning at test time offers a promising path towards generalizable embodied intelligence.
PaperID: 1189,   Poster  https://arxiv.org/pdf/2603.23381    
Authors: Yating Xu, Yunqi Miao, Evangelos Ververas, Jiankang Deng, Jifei Song
Title: FG-portrait: 3D Flow Guided Editable Portrait Animation
Abstract: Motion transfer from the driving to the source portrait remains a key challenge in portrait animation. Current diffusion-based approaches condition only on the driving motion, which fails to capture source-to-driving correspondences and consequently yields suboptimal motion transfer. Although flow estimation provides an alternative, predicting dense correspondences from 2D input is ill-posed and often yields inaccurate animation. We address this problem by introducing 3D flows, a learning-free and geometry-driven motion correspondence directly computed from parametric 3D head models. To integrate this 3D prior into the diffusion model, we introduce 3D flow encoding to query potential 3D flows for each target pixel to indicate its displacement back to the source location. To obtain 3D flows aligned with 2D motion changes, we further propose depth-guided sampling to accurately locate the corresponding 3D points for each pixel. Beyond high-fidelity portrait animation, our model further supports user-specified editing of facial expression and head pose. Extensive experiments demonstrate the superiority of our method on consistent driving motion transfer as well as faithful source identity preservation. The source code will be released upon acceptance.
PaperID: 1190,   Poster  https://arxiv.org/pdf/2603.08809    
Authors: Mingshu Cai, Jiajun Li, Osamu Yoshie, Yuya Ieiri, Yixuan Li
Title: Where, What, Why: Toward Explainable 3D-GS Watermarking
Abstract: As 3D Gaussian Splatting becomes the de facto representation for interactive 3D assets, robust yet imperceptible watermarking is critical. We present a representation-native framework that separates where to write from how to preserve quality. A Trio-Experts module operates directly on Gaussian primitives to derive priors for carrier selection, while a Safety and Budget Aware Gate (SBAG) allocates Gaussians to watermark carriers (optimized for bit resilience under perturbation and bitrate budgets) and to visual compensators that are insulated from watermark loss. To maintain fidelity, we introduce a channel-wise group mask that controls gradient propagation for carriers and compensators, thereby limiting Gaussian parameter updates, repairing local artifacts, and preserving high-frequency details without increasing runtime. Our design yields view-consistent watermark persistence and strong robustness against common image distortions such as compression and noise, while achieving a favorable robustness–quality trade-off compared with prior methods. In addition, the decoupled finetuning provides per-Gaussian attributions that reveal where the message is carried and why those carriers are selected, enabling auditable explainability. Compared with state-of-the-art methods, our approach achieves a PSNR improvement of +0.83 dB and a bit-accuracy gain of +1.15%.
PaperID: 1191,   Poster  https://arxiv.org/pdf/2512.13597    
Authors: Christophe Bolduc, Julien Philip, Li Ma, Mingming He, Paul Debevec, Jean-François Lalonde
Title: Lighting in Motion: Spatiotemporal HDR Lighting Estimation
Abstract: We present LiMo, a diffusion-based approach to spatiotemporal lighting estimation. LiMo targets both realistic high-frequency detail prediction and accurate illuminance estimation. To account for both, we propose generating a set of mirrored and diffuse spheres at different exposures, based on their 3D positions in the input. Making use of diffusion priors, we fine-tune powerful existing diffusion models on a large-scale customized dataset of indoor and outdoor scenes, paired with spatiotemporal light probes. For accurate spatial conditioning, we demonstrate that depth alone is insufficient and we introduce a new geometric condition to provide the relative position of the scene to the target 3D position. Finally, we combine diffuse and mirror predictions at different exposures into a single HDRI map leveraging differentiable rendering. We thoroughly evaluate our method and design choices to establish LiMo as state-of-the-art for both spatial control and prediction accuracy.
PaperID: 1192,   Poster  https://arxiv.org/pdf/2512.07480    
Authors: Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Zihan Zheng, Yuan Zhang, Yan Lu
Title: Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance
Abstract: While traditional and neural video codecs (NVCs) have achieved remarkable rate–distortion performance, improving perceptual quality at low bitrates remains challenging. Some NVCs incorporate perceptual or adversarial objectives but still suffer from artifacts due to limited generation capacity, whereas others leverage pretrained diffusion models to improve quality at the cost of heavy sampling complexity. To overcome these challenges, we propose S^2VC, a Single-Step diffusion-based Video Codec that integrates a conditional coding framework with an efficient single-step diffusion generator, enabling realistic reconstruction at low bitrates with reduced sampling cost. Recognizing the importance of semantic conditioning in single-step diffusion, we introduce Contextual Semantic Guidance to extract frame-adaptive semantics from buffered features. It replaces text captions with efficient, fine-grained conditioning, thereby improving generation realism. In addition, Temporal Consistency Guidance is incorporated into the diffusion U-Net to enforce temporal coherence across frames and ensure stable generation. Extensive experiments show that S^2VC delivers state-of-the-art perceptual quality with an average 52.73% bitrate saving over prior perceptual methods, underscoring the promise of single-step diffusion for efficient, high-quality video compression.
PaperID: 1193,   Poster  https://arxiv.org/pdf/2511.12368    
Authors: Yiqing Shen, Mathias Unberath
Title: Fast Reasoning Segmentation for Images and Videos
Abstract: Reasoning segmentation enables open-set object segmentation via implicit text queries, thereby serving as a foundation for embodied agents that should operate autonomously in real-world environments. However, existing methods for reasoning segmentation require multimodal large language models with billions of parameters, which exceed the computational capabilities of the edge devices on which embodied AI systems are typically deployed. Distillation offers a pathway to compress these models while preserving their capabilities. Yet, existing distillation approaches fail to transfer the multi-step reasoning capabilities that reasoning segmentation demands, as they focus on matching output predictions and intermediate features rather than preserving reasoning chains. The emerging paradigm of reasoning over digital twin representations presents an opportunity for more effective distillation by re-framing the problem. Consequently, we propose FastReasonSeg, which employs digital twin representations that decouple perception from reasoning to enable more effective distillation. Our distillation scheme first relies on supervised fine-tuning on teacher-generated reasoning chains. This is followed by reinforcement fine-tuning with joint rewards evaluating both segmentation accuracy and reasoning quality alignment. Experiments on two video (JiTBench, RVTBench) and two image benchmarks (ReasonSeg, LLM-Seg40K) demonstrate that our FastReasonSeg achieves state-of-the-art reasoning segmentation performance. Moreover, the distilled 0.6B variant outperforms models with 20 times more parameters while achieving 7.79 FPS throughput with only 2.1GB memory consumption. This efficiency allows deployment in resource-constrained environments for real-time reasoning segmentation.
PaperID: 1194,   Poster  https://arxiv.org/pdf/2512.02541    
Authors: Xianbing Sun, Zhikai Zhu, Zhengyu Lou, Bo Yang, Jinyang Tang, Liqing Zhang, He Wang, Jianfu Zhang
Title: AVGGT: Rethinking Global Attention for Accelerating VGGT
Abstract: Since DUSt3R, models such as VGGT and \pi^3 have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and \pi^3 to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and \pi^3 and evaluate across standard pose and point-map benchmarks. Our method achieves up to 8-10× speedup in inference time while matching or slightly improving the accuracy of the original models, and remains robust even in extremely dense multi-view settings where prior sparse-attention baselines fail.
PaperID: 1195,   Poster  https://arxiv.org/pdf/2604.13305    
Authors: Salma Abdel Magid, Grace Guo, Esin Tureci, Amaya Dharmasiri, Vikram V. Ramaswamy, Hanspeter Pfister, Olga Russakovsky
Title: Bias at the End of the Score
Abstract: Reward models (RMs) are inherently non-neutral value functions designed and trained to encode specific objectives, such as human preferences or text-image alignment. RMs have become crucial components of text-to-image (T2I) generation systems, where they are used during pretraining and finetuning of models, test-time optimization, and post-generation safety and quality filtering of T2I outputs. While specific problems with the integration of RMs into the T2I pipeline have been studied (e.g. reward hacking or mode collapse during training), their robustness and fairness as scoring functions remain largely unknown. We conduct a large-scale audit of RMs' robustness with respect to demographic biases during T2I model training and generation. We provide quantitative and qualitative evidence that while originally developed as quality measures, RMs encode demographic biases, which cause reward-guided optimization to sexualize female images (especially darker-skinned females), reinforce gender/racial stereotypes, and collapse demographic diversity. These findings highlight the shortcomings of current RMs, challenging their reliability as quality metrics and underscoring the critical need for alternative data collection, training, and optimization procedures to establish more robust scoring.
PaperID: 1196,   Poster  https://arxiv.org/pdf/2512.10226    
Authors: Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Thomas Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krähenbühl, Marco Pavone, Boris Ivanovic
Title: Latent Chain-of-Thought World Modeling for End-to-End Driving
Abstract: Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model’s output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model’s action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.
PaperID: 1197,   Poster  https://arxiv.org/pdf/2508.06982    
Authors: Yixin Zhu, Zuo-Liang Zhu, Jian Yang, Milos Hasan, Jin Xie, Beibei Wang
Title: WeatherDiffusion: Controllable Weather Editing in Intrinsic Space
Abstract: We present WeatherDiffusion, a diffusion-based framework for controllable weather editing in intrinsic space. Our framework includes two components based on diffusion priors: an inverse renderer that estimates material properties, scene geometry, and lighting as intrinsic maps from an input image, and a forward renderer that utilizes these geometry and material maps along with a text prompt that describes specific weather conditions to generate a final image. The intrinsic maps enhance controllability compared to traditional pixel-space editing approaches. We propose an intrinsic map-aware attention mechanism that improves spatial correspondence and decomposition quality in large outdoor scenes. For forward rendering, we leverage CLIP-space interpolation of weather prompts to achieve fine-grained weather control. We also introduce a synthetic and a real-world dataset, containing 38k and 18k images under various weather conditions, each with intrinsic map annotations. WeatherDiffusion outperforms state-of-the-art pixel-space editing approaches, weather restoration methods, and rendering-based methods, showing promise for downstream tasks such as autonomous driving, enhancing the robustness of detection and segmentation in challenging weather scenarios.
PaperID: 1198,   Poster  https://arxiv.org/pdf/2603.15780    
Authors: Hippolyte Verninas, Caner Korkmaz, Stefanos Zafeiriou, Tolga Birdal, Simone Foti
Title: Parallelised Differentiable Straightest Geodesics for 3D Meshes
Abstract: Machine learning has been progressively generalised to operate within non-Euclidean domains, but geometrically accurate methods for learning on surfaces are still falling behind. The lack of closed-form Riemannian operators, the non-differentiability of their discrete counterparts, and poor parallelisation capabilities have been the main obstacles to the development of the field on meshes. A principled framework to compute the exponential map on Riemannian surfaces discretised as meshes is straightest geodesics, which also allows tracing geodesics and parallel-transporting vectors as a by-product. We provide a parallel GPU implementation and derive two different methods for differentiating through the straightest geodesics, one leveraging an extrinsic proxy function and one based upon a geodesic finite differences scheme. After proving our parallelisation performance and accuracy, we demonstrate how our differentiable exponential map can supercharge geometrically-correct learning and optimisation pipelines. In particular, to showcase the versatility of our method, we propose a new geodesic convolutional layer, a new flow matching method for learning on meshes, and a second-order optimiser that we apply to centroidal Voronoi tessellation. Our code, pre-trained models, and pip-installable library will be made available upon publication.
PaperID: 1199,   Poster  https://arxiv.org/pdf/2512.07348    
Authors: Xinyu Wei, Kangrui Cen, Hongyang Wei, Zhen Guo, Bairui Li, Zeqing Wang, Jinrui Zhang, Lei Zhang
Title: MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition
Abstract: In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly hindered by the lack of high-quality training data. To bridge this gap, we conduct a systematic study of MICo, categorizing it into 7 representative tasks, curating a large-scale collection of high-quality source images, and constructing diverse MICo prompts. Leveraging powerful proprietary models, we synthesize a rich amount of balanced composite images, followed by human-in-the-loop filtering and refinement, resulting in MICo-150K, a comprehensive dataset for MICo with identity consistency. We further build a Decomposition-and-Recomposition (De&Re) subset, where 11K real-world complex images are decomposed into components and recomposed, enabling both real and synthetic compositions. To enable comprehensive evaluation, we construct MICo-Bench with 100 cases per task and 300 challenging De&Re cases, and further introduce a new metric, Weighted-Ref-VIEScore, specifically tailored for MICo evaluation. Finally, we fine-tune multiple models on MICo-150K and evaluate them on MICo-Bench. The results show that MICo-150K effectively equips models without MICo capability and further enhances those with existing skills. Notably, Qwen-MICo, fine-tuned from Qwen-Image-Edit, matches Qwen-Image-2509 in 3-image composition while supporting arbitrary multi-image inputs beyond the latter’s limitation. Our dataset and benchmark will be valuable resources for advancing MICo research.
PaperID: 1200,   Poster  https://arxiv.org/pdf/2603.11795    
Authors: Hanyu Shi, Hong Tao, Guoheng Huang, Jianbin Jiang, Xuhang Chen, Chi-Man Pun, Shanhu Wang, Pan Pan
Title: Intrinsic Concept Extraction Based on Compositional Interpretability
Abstract: Unsupervised Concept Extraction aims to extract concepts from a single image, yet existing methods suffer from the inability to extract composable intrinsic concepts. To address this, this paper introduces a new task called Compositional and Interpretable Intrinsic Concept Extraction (CI-ICE). The CI-ICE task aims to leverage diffusion-based text-to-image models to extract composable object-level and attribute-level concepts from a single image, such that the original concept can be reconstructed through the combination of these concepts. To achieve this goal, we propose a method called HyperExpress, which addresses the CI-ICE task through two core aspects. Specifically, first, we propose a concept learning approach that leverages the inherent hierarchical modeling capability of hyperbolic space to achieve accurate concept disentanglement while preserving the hierarchical structure and relational dependencies among concepts; second, we introduce a concept-wise optimization method that maps the concept embedding space to maintain complex inter-concept relationships while ensuring concept composability. Our method demonstrates outstanding performance in extracting compositionally interpretable intrinsic concepts from a single image.
PaperID: 1201,   Poster  https://arxiv.org/pdf/2503.12799    
Authors: Qiong Wu, Xiangcong Yang, Yiyi Zhou, Chenxin Fang, Baiyang Song, Xiaoshuai Sun, Rongrong Ji
Title: Grounded Chain-of-Thought for Multimodal Large Language Models
Abstract: Despite great progress, existing multimodal large language models (MLLMs) are still inferior in visual-spatial reasoning, which greatly impedes their trustworthy applications in scenarios such as Embodied AI. To facilitate the research, we propose a new MLLM task in this paper, called Grounded Chain-of-Thought (GCoT). Different from recent visual CoT studies, which focus more on visual knowledge reasoning, GCoT aims to improve the visual-spatial reasoning capabilities of MLLMs via recognizing and grounding the relevant visual cues step by step, which are also supported by step-wise grounding coordinates as the intuitive basis. To facilitate this task, we also carefully design and construct a benchmark called multimodal grounded chain-of-thought (MM-GCoT). Besides, a comprehensive consistency evaluation system is also introduced, including the metrics of answer accuracy, grounding accuracy and answer-grounding consistency. We further design and conduct a series of experiments on 12 advanced MLLMs, and reveal some notable findings: i. most MLLMs perform poorly on the consistency evaluation, indicating obvious visual hallucination; ii. visual hallucination is not directly related to the parameter size and general multimodal performance; iii. a larger and stronger MLLM is not less affected by this issue.
PaperID: 1202,   Poster  https://arxiv.org/pdf/2602.21591    
Authors: Xihua Sheng, lingyu ZHU, Tianyu Zhang, Dong Liu, Shiqi Wang, Jing Wang
Title: CADC: Content Adaptive Diffusion-Based Generative Image Compression
Abstract: Diffusion-based generative image compression has demonstrated remarkable potential for achieving realistic reconstruction at ultra-low bitrates. The key to unlocking this potential lies in making the entire compression process content-adaptive, ensuring that the encoder's representation and the decoder's generative prior are dynamically aligned with the semantic and structural characteristics of the input image. However, existing methods suffer from three critical limitations that prevent effective content adaptation. First, isotropic quantization applies a uniform quantization step, failing to adapt to the spatially varying complexity of image content and creating a misalignment with the diffusion model's noise-dependent prior. Second, the information concentration bottleneck, arising from the dimensional mismatch between the high-dimensional noisy latent and the diffusion decoder's fixed input, prevents the model from adaptively preserving essential semantic information in the primary channels. Third, existing textual conditioning strategies either need significant textual bitrate overhead or rely on generic, content-agnostic textual prompts, thereby failing to provide adaptive semantic guidance efficiently. To overcome these limitations, we propose a content-adaptive diffusion-based image codec (CADC) with three technical innovations: 1) an Uncertainty-Guided Adaptive Quantization (UGAQ) method that learns spatial uncertainty maps to adaptively align quantization distortion with content characteristics; 2) an Auxiliary Decoder-Guided Information Concentration (ADGIC) method that uses a lightweight auxiliary decoder to enforce content-aware information preservation in the primary latent channels; and 3) a Bitrate-Free Adaptive Textual Conditioning (BFATC) method that derives content-aware textual descriptions from the auxiliary reconstructed image, enabling semantic guidance without bitrate cost. Comprehensive experimental results show that our codec achieves state-of-the-art perceptual quality at ultra-low bitrates.
PaperID: 1203,   Poster  https://arxiv.org/pdf/2603.22953    
Authors: Weijun Zhuang, Yuqing Huang, Weikang Meng, Xin Li, Ming Liu, Xiaopeng Hong, Yaowei Wang, Wangmeng Zuo
Title: Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining
Abstract: Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensures that the retained tokens capture holistic video content while exhibiting strong temporal correlation. Additionally, we introduce a video–text relevance reconstruction objective that aligns high-level multimodal semantics beyond conventional visual reconstruction. Extensive experiments across multiple benchmarks demonstrate that ClusterSTM achieves superior performance on video-text retrieval, video question answering, and video captioning tasks, establishing a new state-of-the-art among efficient video-language models.
PaperID: 1204,   Poster  https://arxiv.org/pdf/2505.20107    
Authors: Ziyi Zhang, Li Shen, Deheng Ye, Yong Luo, Huangxuan Zhao, Meng Liu, Wei Yu, Lefei Zhang
Title: Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning
Abstract: Text-to-multiview (T2MV) diffusion models have shown great promise in generating multiple views of a scene from a single text prompt. While few-step backbones enable real-time T2MV generation, they often compromise key aspects of generation quality, such as per-view fidelity and cross-view consistency. Reinforcement learning (RL) finetuning offers a potential solution, yet existing approaches designed for single-image diffusion do not readily extend to the few-step T2MV setting, as they neglect cross-view coordination and suffer from weak learning signals in few-step regimes. To address this, we propose MVC-ZigAL, a tailored RL finetuning framework for few-step T2MV diffusion models. Specifically, its core insights are: (1) a new MDP formulation that jointly models all generated views and assesses their collective quality via a joint-view reward; (2) a novel advantage learning strategy that exploits the performance gains of a self-refinement sampling scheme over standard sampling, yielding stronger learning signals for effective RL finetuning; and (3) a unified RL framework that extends advantage learning with a Lagrangian dual formulation for multiview-constrained optimization, balancing single-view and joint-view objectives through adaptive primal-dual updates under a self-paced threshold curriculum that harmonizes exploration and constraint enforcement. Collectively, these designs enable robust and balanced RL finetuning for few-step T2MV diffusion models, yielding substantial gains in both per-view fidelity and cross-view consistency.
PaperID: 1205,   Poster  https://arxiv.org/pdf/2512.15926    
Authors: Lucas Monteiro Paes, Nivedha Sivakumar, Yinong Oliver Wang, Masha Fedzechkina, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff
Title: DSO: Direct Steering Optimization for Bias Mitigation
Abstract: Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have varying needs for balancing bias mitigation with overall model capabilities, highlighting the demand for methods that enable controllable bias reduction during inference. Activation steering is a popular approach for inference-time controllability that has shown potential in inducing safer behavior in large language models (LLMs). However, we observe that current steering methods struggle to correct biases, where equiprobable outcomes across demographic groups are required. To address this, we propose Direct Steering Optimization (DSO) which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance. We demonstrate that DSO achieves state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, while offering practitioners inference-time control over the trade-off. Overall, our work highlights the benefit of designing steering strategies that are directly optimized to control model behavior, providing more effective bias intervention than methods that rely on pre-defined heuristics for controllability.
PaperID: 1206,   Poster  https://arxiv.org/pdf/2512.03575    
Authors: Chao Yuan, Shimin Chen, Minliang Lin, Limeng Qiao, Guanglu Wan, Lin Ma
Title: UniComp: Rethinking Video Compression Through Informational Uniqueness
Abstract: Distinct from attention-based compression methods, this paper presents an information-uniqueness-driven video compression framework, termed UniComp, which aims to maximize the information fidelity of video representations under constrained computational budgets. Starting from the information-theoretic perspective, we formulate vision compression as an optimization problem that minimizes conditional entropy (reconstruction error) between retained and full tokens. To achieve this, we introduce the notion of information uniqueness to measure intrinsic redundancy among tokens and link it to reconstruction error. Based on uniqueness, we design three modules (Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression) that progressively perform semantic frame grouping, adaptive resource allocation, and fine-grained spatial compression. Extensive experiments demonstrate that UniComp consistently outperforms existing compression methods in preserving essential visual tokens under limited computational budgets, highlighting the pivotal role of information uniqueness in token compression efficacy.
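The idea of retaining tokens by uniqueness can be illustrated with a minimal sketch. The abstract does not define the uniqueness measure, so this example assumes a hypothetical one: a token's uniqueness is one minus its highest cosine similarity to the tokens already kept, and tokens are added greedily until a budget is reached. The names `cosine` and `select_unique_tokens` are illustrative and do not come from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def select_unique_tokens(tokens, budget):
    """Greedily keep up to `budget` token indices.

    Assumption (not from the paper): uniqueness of a candidate is
    1 - max cosine similarity to the tokens kept so far.
    """
    if budget <= 0 or not tokens:
        return []
    kept = [0]  # start from the first token
    while len(kept) < min(budget, len(tokens)):
        best_index, best_uniqueness = None, -1.0
        for i, token in enumerate(tokens):
            if i in kept:
                continue
            uniqueness = 1.0 - max(cosine(token, tokens[k]) for k in kept)
            if uniqueness > best_uniqueness:
                best_index, best_uniqueness = i, uniqueness
        kept.append(best_index)
    return sorted(kept)


# Three near-duplicate tokens and one distinct token.
tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [1.0, 0.02]]
kept_indices = select_unique_tokens(tokens, budget=2)
```

With a budget of two, the near-duplicates of the first token are skipped in favour of the one orthogonal token, which is the behaviour a redundancy-aware budget is meant to produce.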
PaperID: 1207,   Poster  https://arxiv.org/pdf/2511.17221    
Authors: Adam Lilja, Ji Lan, Junsheng Fu, Lars Hammarstrand
Title: QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy
Abstract: Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning.
PaperID: 1208,   Poster  https://arxiv.org/pdf/2603.19929    
Authors: Sen Jia, Ning Zhu, Jinqin Zhong, Jiale Zhou, Zhang Huaping, Jenq-Neng Hwang, Lei Li
Title: RAM: Recover Any 3D Human Motion in-the-Wild
Abstract: Recovering 3D human motion from monocular videos in-the-wild remains challenging due to occlusions, rapid movements, and viewpoint variations. To address these challenges, we introduce the Recover-Anyone Module (RAM), a unified framework for real-time and accurate 3D human motion reconstruction. RAM incorporates a motion-aware semantic tracker with adaptive Kalman filtering to achieve robust identity association under severe occlusions and dynamic interactions. A memory-augmented Temporal HMR module further enhances human motion reconstruction by injecting spatio-temporal priors for consistent and smooth motion estimation. Moreover, a lightweight Predictor module forecasts future poses to maintain reconstruction continuity, while a gated combiner adaptively fuses reconstructed and predicted features to ensure coherence and robustness. Experiments on in-the-wild multi-person benchmarks such as PoseTrack and 3DPW demonstrate that RAM substantially outperforms previous state-of-the-art in both zero-shot tracking stability and 3D accuracy, offering a generalizable paradigm for markerless 3D human motion capture in-the-wild.
PaperID: 1209,   Poster  https://arxiv.org/pdf/2603.26317    
Authors: Wonyoung Lee, Wooseong Jeong, Kuk-Jin Yoon
Title: Label-Free Cross-Task LoRA Merging with Null-Space Compression
Abstract: Model merging constructs a single model by combining multiple independently finetuned checkpoints without joint multi-task training. In the era of foundation models, fine-tuning with Low-Rank Adaptation (LoRA) is prevalent, making LoRA merging a promising target. Existing approaches can work in homogeneous settings where all target tasks are classification but often fail when tasks span classification and regression. Approaches that use entropy-based surrogates do not apply to regression and are costly for large language models due to long token sequences. We introduce Null-Space Compression (NSC) Merging, a label-free, output-agnostic method that sets merge weights from adapter geometry. Our key observation is that during LoRA finetuning the down-projection factor A in \Delta W = BA compresses its null space, and the compression correlates with performance. NSC uses this as an optimization signal for merging that can generalize across classification, regression, and sequence generation. NSC achieves state-of-the-art performance across twenty heterogeneous vision tasks with balanced gains where prior methods overfit subsets of tasks. It also outperforms baselines on six NLI benchmarks and on vision–language evaluations for VQA and image captioning, demonstrating scalability and effectiveness.
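For readers unfamiliar with the factorization \Delta W = BA, the following minimal sketch shows the shapes involved and what the null space of the down-projection A means: any input direction that A maps to zero receives no update from the adapter. It does not implement NSC, whose compression measure is not specified in the abstract. The matrices and the helper names `matmul` and `matvec` are made up for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def matvec(M, v):
    """Multiply a matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Illustrative LoRA factors with rank r = 2, d_in = 3, d_out = 4.
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]            # down-projection, shape (r, d_in)
B = [[2.0, 0.0],
     [0.0, 3.0],
     [1.0, 1.0],
     [0.0, 0.0]]                 # up-projection, shape (d_out, r)

delta_w = matmul(B, A)           # weight update, shape (d_out, d_in)

# The third input axis lies in the null space of A, so the adapter ignores it.
null_direction = [0.0, 0.0, 1.0]
update_on_null = matvec(delta_w, null_direction)

# A direction outside the null space does change the layer output.
update_on_active = matvec(delta_w, [1.0, 0.0, 0.0])
```

Because A has only r rows, its null space has dimension at least d_in - r, which is why the geometry of A alone already says something about which inputs an adapter can affect.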
PaperID: 1210,   Poster  https://arxiv.org/pdf/2603.05042    
Authors: Zhaonian Kuang, Rui Ding, Haotian Wang, Xinhu Zheng, Meng Yang, Gang Hua
Title: CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
Abstract: Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies across source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches feature space by integrating four spatial representations: focal length, ground depth, ground gradient, and Plücker coordinates. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.
PaperID: 1211,   Poster  https://arxiv.org/pdf/2601.16093    
Authors: yikang zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Shunping Ji, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng CHEN, Xiangtai Li
Title: SAMTok: Representing Any Mask with Two Words
Abstract: Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To solve these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two textual special tokens and reconstructs masks from these tokens with high fidelity. By treating masks as a new language, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications or specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a simple and scalable paradigm for equipping MLLMs with strong pixel-wise capabilities. Code and models will be available.
PaperID: 1212,   Poster  https://arxiv.org/pdf/2510.01399    
Authors: Shubhankar Borse, Farzad Farhadzadeh, Munawar Hayat, Fatih Porikli
Title: Resolving the Identity Crisis in Text-to-Image Generation
Abstract: State-of-the-art text-to-image models demonstrate impressive realism but suffer from a persistent identity crisis when generating scenes with multiple humans: producing duplicate faces, merging identities, and miscounting individuals. We present DisCo, Reinforcement with DiverSity Constraints, a novel reinforcement learning framework that directly optimizes identity diversity both within images and across groups of generated samples. DisCo fine-tunes flow-matching models using Group-Relative Policy Optimization (GRPO), guided by a compositional reward that: (i) penalizes facial similarity within images, (ii) discourages identity repetition across samples, (iii) enforces accurate person counts, and (iv) preserves visual fidelity via human preference scores. A single-stage curriculum stabilizes training as prompt complexity increases, requiring no additional annotations. On the DiverseHumans Testset, DisCo achieves 98.6% Unique Face Accuracy and near-perfect Global Identity Spread, outperforming both open-source and proprietary models (e.g., Gemini, GPT-Image) while maintaining competitive perceptual quality. Our results establish cross-sample diversity as a critical axis for resolving identity collapse in generative models, and position DisCo as a scalable, annotation-free solution for multi-human image synthesis.
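The combination of a compositional reward with group-relative advantages can be sketched as follows. This is a minimal illustration under stated assumptions, not DisCo's implementation: the weighted-sum form of the reward, the unit weights, and the names `composite_reward` and `group_relative_advantages` are hypothetical, while standardizing rewards within a group of samples for the same prompt is the usual GRPO-style advantage.

```python
import math


def composite_reward(face_similarity, identity_repetition, count_error, preference,
                     weights=(1.0, 1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of the four reward terms listed in the abstract.

    The first three terms are penalties (lower is better); the preference
    score is a bonus (higher is better). Weights are illustrative only.
    """
    w_sim, w_rep, w_cnt, w_pref = weights
    return (-w_sim * face_similarity
            - w_rep * identity_repetition
            - w_cnt * count_error
            + w_pref * preference)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Three generated images for one prompt, with made-up measurements.
rewards = [
    composite_reward(0.9, 0.5, 1.0, 0.6),  # duplicate faces and a wrong count
    composite_reward(0.1, 0.0, 0.0, 0.7),  # distinct faces and a correct count
    composite_reward(0.4, 0.2, 0.0, 0.5),
]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group mean, no learned value function is needed: the sample with distinct faces and the correct count simply receives the largest positive advantage.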
PaperID: 1213,   Poster  https://arxiv.org/pdf/2603.06120    
Authors: Zhipeng Yao, Rui Yu, Guisong Chang, Ying Li, Yu Zhang, Dazhou Li
Title: Dynamic Momentum Recalibration in Online Gradient Learning
Abstract: Stochastic Gradient Descent (SGD) and its momentum-driven variants form the backbone of deep learning optimization, yet the underlying dynamics of their gradient behavior remain insufficiently understood. In this work, we reinterpret gradient updates through the lens of signal processing and reveal that fixed momentum coefficients inherently distort the balance between bias and variance, leading to skewed or suboptimal parameter updates. To address this, we propose SGDF (SGD with Filter), an optimizer inspired by the principles of Optimal Linear Filtering. SGDF computes an online, time-varying gain to dynamically refine gradient estimation by minimizing the mean-squared error, thereby achieving an optimal trade-off between noise suppression and signal preservation. Furthermore, our approach could extend to adaptive optimizers, enhancing their generalization potential. Extensive experiments across diverse architectures and benchmarks demonstrate that SGDF outperforms conventional momentum-based methods and achieves performance on par with or surpassing state-of-the-art optimizers.
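To make the notion of an online, time-varying gain concrete, here is a minimal sketch of a scalar Kalman-style filter, one classical form of optimal linear filtering, applied to a stream of noisy gradient observations. It is not the SGDF update rule, which the abstract does not spell out. The function name `kalman_filter_gradients`, the noise variances, and the synthetic data are assumptions made for illustration.

```python
import random


def kalman_filter_gradients(observations, obs_var, process_var=1e-3):
    """Fuse a running gradient estimate with each new noisy observation.

    The gain is recomputed at every step from the current estimate
    uncertainty, instead of using a fixed momentum coefficient.
    """
    estimate, estimate_var = observations[0], obs_var
    filtered = [estimate]
    for grad in observations[1:]:
        estimate_var += process_var                   # predict: uncertainty grows
        gain = estimate_var / (estimate_var + obs_var)  # gain minimizing mean-squared error
        estimate += gain * (grad - estimate)          # correct toward the new gradient
        estimate_var *= (1.0 - gain)                  # uncertainty shrinks after the update
        filtered.append(estimate)
    return filtered


def mse(values, target):
    """Mean-squared error of a sequence against a constant target."""
    return sum((v - target) ** 2 for v in values) / len(values)


# Synthetic example: a constant true gradient observed under Gaussian noise.
rng = random.Random(0)
true_grad = 1.0
noisy = [true_grad + rng.gauss(0.0, 0.5) for _ in range(200)]
smoothed = kalman_filter_gradients(noisy, obs_var=0.25)
```

The gain starts large, so early observations move the estimate quickly, and then decays as the estimate becomes more certain. A fixed momentum coefficient cannot reproduce that schedule, which is the contrast the abstract draws.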
PaperID: 1214,   Poster  https://arxiv.org/pdf/2603.23272    
Authors: Xue Wang, Zheng Guan, Wenhua Qian, Chengchao Wang, Runzhuo MA
Title: Multi-Modal Image Fusion via Intervention-Stable Feature Learning
Abstract: Multimodal image fusion integrates complementary information from different modalities into a unified representation. Current methods predominantly optimize statistical correlations between modalities, often capturing dataset-induced spurious associations that degrade under distribution shifts. In this paper, we propose an intervention-based framework inspired by causal principles to identify robust cross-modal dependencies. Drawing insights from Pearl's causal hierarchy, we design three principled intervention strategies to probe different aspects of modal relationships: i) complementary masking with spatially disjoint perturbations tests whether modalities can genuinely compensate for each other's missing information, ii) random masking of identical regions identifies feature subsets that remain informative under partial observability, and iii) modality dropout evaluates the irreplaceable contribution of each modality. Based on these interventions, we introduce a Causal Feature Integrator (CFI) that learns to identify and prioritize intervention-stable features maintaining importance across different perturbation patterns through adaptive invariance gating, thereby capturing robust modal dependencies rather than spurious correlations. Extensive experiments demonstrate that our method achieves SOTA performance on both public benchmarks and downstream high-level vision tasks.
PaperID: 1215,   Poster  https://arxiv.org/pdf/2512.13030    
Authors: Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, Jun Zhu
Title: Motus: A Unified Latent Action World Model
Abstract: While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction). Motus further leverages optical flow to learn latent actions and adopts a recipe with a three-phase training pipeline and a six-layer data pyramid, thereby extracting pixel-level "delta action" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi-0.5) and real-world scenarios (improved by +11~48%), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
PaperID: 1216,   Poster  https://arxiv.org/pdf/2512.07459    
Authors: Xiangjun Tang, Biao Zhang, Peter Wonka
Title: Human Geometry Distribution for 3D Animation Generation
Abstract: Generating realistic human geometry animations remains a challenging task, as it requires modeling natural clothing dynamics with fine-grained geometric details under limited data. To address these challenges, we propose two novel designs. First, we propose a compact distribution-based latent representation that enables efficient and high-quality geometry generation. We improve upon previous work by establishing a more uniform mapping between SMPL and avatar geometries. Second, we introduce a generative animation model that fully exploits the diversity of limited motion data. We focus on short-term transitions while maintaining long-term consistency through an identity-conditioned design. These two designs formulate our method as a two-stage framework: the first stage learns a latent space, while the second learns to generate animations within this latent space. We conducted experiments on both our latent space and animation model. We demonstrate that our latent space produces high-fidelity human geometry surpassing previous methods (90% lower Chamfer Dist.). The animation model synthesizes diverse animations with detailed and natural dynamics (2.2x higher user study score), achieving the best results across all evaluation metrics.
PaperID: 1217,   Poster  https://arxiv.org/pdf/2603.28152    
Authors: Yuhuan Xie, Aoxuan Pan, Yihua Huang, Chirui Chang, Peng Dai, Xin Yu, Xiaojuan Qi
Title: ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS
Abstract: Achieving precise, object-level control in image editing remains challenging: 2D methods lack 3D awareness and often yield ambiguous or implausible results, while existing 3D-aware approaches rely on heavy optimization or incomplete monocular reconstructions. We present ObjectMorpher, a unified, interactive framework that converts ambiguous 2D edits into geometry-grounded operations. ObjectMorpher lifts target instances with an image-to-3D generator into editable 3D Gaussian Splatting (3DGS), enabling fast, identity-preserving manipulation. Users drag control points; a graph-based non-rigid deformation with as-rigid-as-possible (ARAP) constraints ensures physically sensible shape and pose changes. A composite diffusion module harmonizes lighting, color, and boundaries for seamless reintegration. Across diverse categories, ObjectMorpher delivers fine-grained, photorealistic edits with superior controllability and efficiency, outperforming 2D drag and 3D-aware baselines on KID, LPIPS, SIFID, and user preference.
PaperID: 1218,   Poster  https://arxiv.org/pdf/2603.07430    
Authors: Lei Jiang, Xin Liu, Xinze Tong, Zhiliang Li, Jie Liu, Jie Tang, Gangshan Wu
Title: Disentangled Textual Priors for Diffusion-based Image Super-Resolution
Abstract: Image Super-Resolution (SR) aims to reconstruct high-resolution images from degraded low-resolution inputs. While diffusion-based SR methods offer powerful generative capabilities, their performance heavily depends on how semantic priors are structured and integrated into the generation process. Existing approaches often rely on entangled or coarse-grained priors that mix global layout with local details, or conflate structural and textural cues, thereby limiting semantic controllability and interpretability. In this work, we propose DTPSR, a novel diffusion-based SR framework that introduces disentangled textual priors along two complementary dimensions: spatial hierarchy (global vs. local) and frequency semantics (low- vs. high-frequency). By explicitly separating these priors, DTPSR enables the model to simultaneously capture scene-level structure and object-specific details with frequency-aware semantic guidance. The corresponding embeddings are injected via specialized cross-attention modules, forming a progressive generation pipeline that reflects the semantic granularity of visual content, from global layout to fine-grained textures. To support this paradigm, we construct DisText-SR, a large-scale dataset containing approximately 95,000 image-text pairs with carefully disentangled global, low-frequency, and high-frequency descriptions. To further enhance controllability and consistency, we adopt a multi-branch classifier-free guidance strategy with frequency-aware negative prompts to suppress hallucinations and semantic drift. Extensive experiments on synthetic and real-world benchmarks show that DTPSR achieves high perceptual quality, competitive fidelity, and strong generalization across diverse degradation scenarios.
PaperID: 1219,   Poster  https://arxiv.org/pdf/2511.20573    
Authors: Chenhui Gou, Zilong Chen, Zeyu Wang, Feng Li, Deyao Zhu, Zicheng Duan, Kunchang Li, Chaorui Deng, Hongyi Yuan, Haoqi Fan, Cihang Xie, Jianfei Cai, Hamid Rezatofighi
Title: VQ-VA World: Towards High-Quality Visual Question-Visual Answering
Abstract: This paper studies Visual Question–Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question, an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To also bring this capability to open-source models, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls ~1.8M high-quality, interleaved image–text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1), and significantly narrowing the gap toward leading proprietary systems (e.g., 81.67 from NanoBanana; 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, our work will greatly stimulate future research on VQ-VA.
PaperID: 1220,   Poster  https://arxiv.org/pdf/2512.12309    
Authors: Shenghao Fu, Yukun Su, Fengyun Rao, Jing LYU, Xiaohua Xie, Wei-Shi Zheng
Title: WeDetect: Fast Open-Vocabulary Object Detection as Retrieval
Abstract: Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, i.e., matching regions to text queries in a shared embedding space. In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect: (1) State-of-the-art performance. WeDetect is a real-time detector with a dual-tower architecture. We show that, with well-curated data and full training, the non-fusion WeDetect surpasses other fusion models and establishes a strong open-vocabulary foundation. (2) Fast backtrack of historical data. WeDetect-Uni is a universal proposal generator based on WeDetect. We freeze the entire detector and only finetune an objectness prompt to retrieve generic object proposals across categories. Importantly, the proposal embeddings are class-specific and enable a new application, object retrieval, supporting the retrieval of objects in historical data. (3) Integration with LMMs for referring expression comprehension (REC). We further propose WeDetect-Ref, an LMM-based object classifier to handle complex referring expressions, which retrieves target objects from the proposal list extracted by WeDetect-Uni. It discards next-token prediction and classifies objects in a single forward pass. Together, the WeDetect family unifies detection, proposal generation, object retrieval, and REC under a coherent retrieval framework, achieving state-of-the-art performance across 15 benchmarks with high inference efficiency. We will open-source all models.
PaperID: 1221,   Poster  https://arxiv.org/pdf/2510.15831    
Authors: Do Xuan Long, Xingchen Wan, Hootan Nakhost, Chen-Yu Lee, Tomas Pfister, Sercan O Arik
Title: VISTA: A Test-Time Self-Improving Video Generation Agent
Abstract: Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user's idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA's outputs in 66.4% of comparisons.
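The selection step described above (picking the best candidate through pairwise comparisons) can be illustrated with a minimal sketch. The abstract does not specify the tournament structure or the judge, so the single-elimination bracket, the `pairwise_tournament` function, and the score-based `prefer` callback below are all hypothetical stand-ins, not VISTA's actual procedure.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def pairwise_tournament(candidates: Sequence[T], prefer: Callable[[T, T], bool]) -> T:
    """Single-elimination selection: return the candidate that survives all rounds.

    prefer(a, b) must return True when a is judged better than b.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        # Compare neighbours pairwise; the winner of each pair advances.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if prefer(a, b) else b)
        # An unpaired last candidate advances without a comparison (a bye).
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


# Hypothetical judge: prefer the candidate with the higher numeric score.
videos = [
    {"id": "v1", "score": 0.4},
    {"id": "v2", "score": 0.9},
    {"id": "v3", "score": 0.7},
]
best = pairwise_tournament(videos, lambda a, b: a["score"] >= b["score"])
```

In a real system the `prefer` callback would wrap a model-based or human comparison of two videos; a bracket like this needs only one comparison per eliminated candidate.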
PaperID: 1222,   Poster  https://arxiv.org/pdf/2603.12967    
Authors: WuDing Weng, Tongshu Wu, chen liucheng, Siyu xie, Zheng Wang, Xing Xu, Jingkuan Song, Heng Tao Shen
Title: Language-Grounded Decoupled Action Representation for Robotic Manipulation
Abstract: The heterogeneity between high-level vision-language understanding and low-level action control remains a fundamental challenge in robotic manipulation. Although recent methods have advanced task-specific action alignment, they often struggle to generate robust and accurate actions for novel or semantically related tasks. To address this, we propose the Language-Grounded Decoupled Action Representation (LaDA) framework, which leverages natural language as a semantic bridge to connect perception and control. LaDA introduces a fine-grained intermediate layer of three interpretable action primitives (translation, rotation, and gripper control), providing explicit semantic structure for low-level actions. It further employs a semantic-guided soft-label contrastive learning objective to align similar action primitives across tasks, enhancing generalization and motion consistency. An adaptive weighting strategy, inspired by curriculum learning, dynamically balances contrastive and imitation objectives for stable and effective training. Extensive experiments on simulated benchmarks (LIBERO and MimicGen) and real-world demonstrations validate that LaDA achieves strong performance and generalizes effectively to unseen or related tasks.
PaperID: 1223,   Poster  https://arxiv.org/pdf/2601.16672    
Authors: LI MING, Hui Shan, Kai Zheng, Chentao Shen, Siyu Liu, Yanwei Fu, Zhen Chen, Xiangru Huang
Title: ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction
Abstract: High-quality 3D garment reconstruction plays a crucial role in mitigating the sim-to-real gap in applications such as digital avatars, virtual try-on and robotic manipulation. However, existing garment reconstruction methods typically rely on unstructured representations, such as 3D Gaussian Splats, which struggle to provide accurate reconstructions of garment topology and sewing structures. As a result, the reconstructed outputs are often unsuitable for high-fidelity physical simulation. We propose ReWeaver, a novel framework for topology-accurate 3D garment and sewing pattern reconstruction from sparse multi-view RGB images. Given as few as four input views, ReWeaver predicts seams and panels as well as their connectivities in both the 2D UV space and the 3D space. The reconstructed seams and panels align precisely with the input images, and can be easily converted into simulation-ready and photorealistic 3D garments suitable for high-fidelity physics-based animation and virtual content creation. To enable effective training, we construct a large-scale dataset GCD-TS, comprising multi-view RGB images, 3D garment geometries, textured human body meshes and annotated sewing patterns. The dataset contains over 100,000 synthetic samples covering a wide range of complex geometries and topologies. Extensive experiments show that ReWeaver consistently outperforms existing methods in terms of topology accuracy, geometry alignment and seam-panel consistency.
PaperID: 1224,   Poster  https://arxiv.org/pdf/2505.19161    
Authors: Jialun Pei, Diandian Guo, Donghui Yang, Zhixi Li, Yuxin Feng, Long Ma, Bo Du, Pheng-Ann Heng
Title: Benchmarking Endoscopic Surgical Image Restoration and Beyond
Abstract: In endoscopic surgery, a clear and high-quality visual field is critical for surgeons to make accurate intraoperative decisions. However, persistent visual degradation, including smoke generated by energy devices, lens fogging from thermal gradients, and lens contamination due to blood or tissue fluid splashes during surgical procedures, severely impairs visual clarity. These degradations can seriously hinder surgical workflow and pose risks to patient safety. To systematically investigate and address various forms of surgical scene degradation, we introduce a real-world open-source surgical image restoration dataset covering laparoscopic environments, called SurgClean, which involves multi-type image restoration tasks from two medical sites, i.e., desmoking, defogging, and desplashing. SurgClean comprises 3,113 images with diverse degradation types and corresponding paired reference labels. Based on SurgClean, we establish a standardized evaluation benchmark and report the performance of 22 representative image restoration approaches, including 12 generic and 10 task-specific ones. Experimental results reveal substantial performance gaps relative to clinical requirements, highlighting a critical opportunity for algorithm advancements in intelligent surgical restoration. Furthermore, we explore the degradation discrepancies between surgical and natural scenes from structural perception and semantic understanding perspectives, providing fundamental insights for domain-specific image restoration research. Our work aims to empower restoration algorithms and improve the efficiency of clinical procedures. Data and code are available.
PaperID: 1225,   Poster  https://arxiv.org/pdf/2601.00759    
Authors: Zhaiyu Chen, Yuqing Wang, Xiao Xiang Zhu
Title: Unified Primitive Proxies for Structured Shape Completion
Abstract: Structured shape completion recovers missing geometry as primitives rather than as unstructured points, which enables primitive-based surface reconstruction. Instead of following the prevailing cascade, we rethink how primitives and points should interact, and find it more effective to decode primitives in a dedicated pathway that attends to shared shape features. Following this principle, we present UniCo, which in a single feed-forward pass predicts a set of primitives with complete geometry, semantics, and inlier membership. To drive this unified representation, we introduce primitive proxies, learnable queries that are contextualized to produce assembly-ready outputs. To ensure consistent optimization, our training strategy couples primitives and points with online target updates. Across synthetic and real-world benchmarks with four independent assembly solvers, UniCo consistently outperforms recent baselines, lowering Chamfer distance by up to 50% and improving normal consistency by up to 7%. These results establish an attractive recipe for structured 3D understanding from incomplete data.
PaperID: 1226,   Poster  https://arxiv.org/pdf/2511.12834    
Authors: Rohit Kundu, Vishal Mohanty, Hao Xiong, Shan Jia, Athula Balachandran, Amit Roy-Chowdhury
Title: SAGA: Source Attribution of Generative AI Videos
Abstract: The proliferation of generative AI has led to hyper-realistic synthetic videos, escalating misuse risks and outstripping binary real/fake detectors. We introduce SAGA (Source Attribution of Generative AI videos), the first comprehensive framework to address the urgent need for AI-generated video source attribution at a large scale. Unlike traditional detection, SAGA identifies the specific generative model used. It uniquely provides multi-granular attribution across five levels: authenticity, generation task (e.g., T2V/I2V), model version, development team, and the precise generator, offering far richer forensic insights. Our novel video transformer architecture, leveraging features from a robust vision foundation model, effectively captures spatio-temporal artifacts. Critically, we introduce a data-efficient pretrain-and-attribute strategy, enabling SAGA to achieve state-of-the-art attribution using only 0.5% of source-labeled data per class, matching fully supervised performance. Furthermore, we propose Temporal Attention Signatures (T-Sig), a novel interpretability method that visualizes learned temporal differences, offering the first explanation for why different video generators are distinguishable. Extensive experiments on public datasets, including cross-domain scenarios, demonstrate that SAGA sets a new benchmark for synthetic video provenance, providing crucial, interpretable insights for forensic and regulatory applications.
PaperID: 1227,   Poster  https://arxiv.org/pdf/2603.08021    
Authors: Xiaofei Wu, Yi Zhang, Yumeng Liu, Yuexin Ma, Yujiao Shi, Xuming He
Title: AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis
Abstract: Generating human grasping poses that accurately reflect both object geometry and user-specified interaction semantics is essential for natural hand–object interactions in AR/VR and embodied AI. However, existing semantic grasping approaches struggle with the large modality gap between 3D object representations and textual instructions, and often lack explicit spatial or semantic constraints, leading to physically invalid or semantically inconsistent grasps. In this work, we present AffordGrasp, a diffusion-based framework that produces physically stable and semantically faithful human grasps with high precision. We first introduce a scalable annotation pipeline that automatically enriches hand–object interaction datasets with fine-grained structured language labels capturing interaction intent. Building upon these annotations, AffordGrasp integrates an affordance-aware latent representation of hand poses with a dual-conditioning diffusion process, enabling the model to jointly reason over object geometry, spatial affordances, and instruction semantics. A distribution adjustment module further enforces physical contact consistency and semantic alignment. We evaluate AffordGrasp across four instruction-augmented benchmarks derived from HO-3D, OakInk, GRAB, and AffordPose, and observe substantial improvements over state-of-the-art methods in grasp quality, semantic accuracy, and diversity.
PaperID: 1228,   Poster  https://arxiv.org/pdf/2512.14266    
Authors: Shreedhar Govil, Didier Stricker, Jason Rambach
Title: DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance
Abstract: Predicting driver attention is a critical problem for developing explainable autonomous driving systems and understanding driver behavior in mixed human-autonomous vehicle traffic scenarios. Although significant progress has been made through large-scale driver attention datasets and deep learning architectures, existing works are constrained by narrow frontal field-of-view and limited driving diversity. Consequently, they fail to capture the full spatial context of driving environments, especially during lane changes, turns, and interactions involving peripheral objects such as pedestrians or cyclists. In this paper, we introduce DriverGaze360, a large-scale 360° field of view driver attention dataset, containing ~1 million gaze-labeled frames collected from 19 human drivers, enabling comprehensive omnidirectional modeling of driver gaze behavior. Moreover, our panoramic attention prediction approach, DriverGaze360-Net, jointly learns attention maps and attended objects by employing an auxiliary semantic segmentation head. This improves spatial awareness and attention prediction across wide panoramic inputs. Extensive experiments demonstrate that DriverGaze360-Net achieves state-of-the-art attention prediction performance on multiple metrics on panoramic driving images. Dataset and method will be made publicly available.
PaperID: 1229,   Poster  https://arxiv.org/pdf/2511.14019    
Authors: Kaichen Zhou, Laura Dodds, Sayed Afzal, Fadel Adib
Title: RISE: Single Static Radar-based Indoor Scene Understanding
Abstract: Robust and privacy-preserving indoor scene understanding remains a fundamental open problem. While optical sensors such as RGB and LiDAR offer high spatial fidelity, they suffer from severe occlusions and introduce privacy risks in indoor environments. In contrast, millimeter-wave (mmWave) radar preserves privacy and penetrates obstacles, but its inherently low spatial resolution makes reliable geometric reasoning difficult. We introduce RISE, the first benchmark and system for single-static-radar indoor scene understanding, jointly targeting layout reconstruction and object detection. RISE is built upon the key insight that multipath reflections, traditionally treated as noise, encode rich geometric cues. To exploit this, we propose a Bi-Angular Multipath Enhancement that explicitly models Angle-of-Arrival and Angle-of-Departure to recover secondary (ghost) reflections and reveal invisible structures. On top of these enhanced observations, a simulation-to-reality Hierarchical Diffusion framework transforms fragmented radar responses into complete layout reconstruction and object detection. Our benchmark contains 50,000 frames collected across 100 real indoor trajectories, forming the first large-scale dataset dedicated to radar-based indoor scene understanding. Extensive experiments show that RISE reduces the Chamfer Distance by 60% (down to 16 cm) compared to the state of the art in layout reconstruction, and delivers the first mmWave-based object detection, achieving 58% IoU. These results establish RISE as a new foundation for geometry-aware and privacy-preserving indoor scene understanding using a single static radar.
PaperID: 1230,   Poster  https://arxiv.org/pdf/2503.17182    
Authors: Patrick Rim, Hyoungseob Park, Vadim Ezhov, Changil Jeffrey Moon, Alex Wong
Title: Radar-Guided Polynomial Fitting for Metric Depth Estimation
Abstract: We propose POLAR, a novel radar-guided depth estimation method that introduces polynomial fitting to efficiently transform scaleless depth predictions from pretrained monocular depth estimation (MDE) models into metric depth maps. Unlike existing approaches that rely on complex architectures or expensive sensors, our method is grounded in a fundamental insight: although MDE models often infer reasonable local depth structure within each object or local region, they may misalign these regions relative to one another, making a linear scale and shift (affine) transformation insufficient given three or more of these regions. To address this limitation, we use polynomial coefficients predicted from cheap, ubiquitous radar data to adaptively adjust predictions non-uniformly across depth ranges. In this way, POLAR generalizes beyond affine transformations and is able to correct such misalignments by introducing inflection points. Importantly, our polynomial fitting framework preserves structural consistency through a novel training objective that enforces local monotonicity via first-derivative regularization. POLAR achieves state-of-the-art performance across three datasets, outperforming existing methods by an average of 24.9% in MAE and 33.2% in RMSE, while also achieving state-of-the-art efficiency in terms of latency and computational cost.
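To make the affine-versus-polynomial argument above concrete, here is a small numerical sketch. It is not the POLAR method: in the paper the polynomial coefficients are predicted from radar data, whereas this toy example simply fits them by least squares with `numpy.polyfit` on synthetic points. The data, the function name `fit_and_apply`, and the chosen degrees are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data: scaleless depth z and the corresponding metric depth d.
# The mapping is deliberately non-affine but still strictly increasing.
z = np.linspace(0.0, 1.0, 50)
d_true = 2.0 + 10.0 * z + 6.0 * (z - 0.5) ** 3


def fit_and_apply(z, d, degree):
    """Least-squares polynomial fit of d as a function of z, evaluated at z."""
    coeffs = np.polyfit(z, d, deg=degree)
    return np.polyval(coeffs, z)


affine = fit_and_apply(z, d_true, 1)  # scale and shift only
cubic = fit_and_apply(z, d_true, 3)   # higher-order terms allow an inflection point

mae_affine = float(np.mean(np.abs(affine - d_true)))
mae_cubic = float(np.mean(np.abs(cubic - d_true)))

# Monotonicity check: the fitted mapping should preserve depth ordering.
is_monotone = bool(np.all(np.diff(cubic) >= 0))

print(f"MAE affine: {mae_affine:.4f}, MAE cubic: {mae_cubic:.6f}, monotone: {is_monotone}")
```

The affine fit leaves a systematic residual because no single scale and shift can follow the curved mapping, while the cubic fit recovers it. The final check mirrors, in a very simplified form, the idea of keeping the transformation monotone so that depth ordering is preserved.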
PaperID: 1231,   Poster  https://arxiv.org/pdf/2604.07966    
Authors: Ziqi Cai, Taoyu Yang, Zheng Chang, Si Li, Han Jiang, Shuchen Weng, Boxin Shi
Title: Lighting-grounded Video Generation with Renderer-based Agent Reasoning
Abstract: Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation. To achieve this, we introduce a novel framework that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.
PaperID: 1232,   Poster  https://arxiv.org/pdf/2603.26127    
Authors: Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, Narendra Ahuja
Title: Finding Distributed Object-Centric Properties in Self-Supervised Transformers
Abstract: Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in [CLS] token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components (q, k, v), unlike prior work that uses only key features or the [CLS] token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO's effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding. Our results demonstrate that using this distributed object-centric information improves downstream tasks without additional training.
PaperID: 1233,   Poster  https://arxiv.org/pdf/2602.23732    
Authors: Xinyi Qi, Kai Ye, Chengchun Shi, Ying Yang, Jin Zhu, Hongyi Zhou
Title: A Difference-in-Difference Approach to Detecting AI-Generated Images
Abstract: Diffusion models are able to produce AI-generated images that are almost indistinguishable from real ones, raising concerns about their potential misuse and posing substantial challenges for detecting them. Many existing detectors rely on reconstruction error (the difference between the input image and its reconstructed version) as the basis for distinguishing real from fake images. However, these detectors become less effective as modern AI-generated images become increasingly similar to real ones. To address this challenge, we propose a novel difference-in-difference method. Instead of directly using the reconstruction error (a first-order difference), we compute the difference in reconstruction error (a second-order difference) for variance reduction and improving detection accuracy. Extensive experiments demonstrate that our method achieves strong generalization performance, enabling reliable detection of AI-generated images in the era of generative AI.
PaperID: 1234,   Poster  https://arxiv.org/pdf/2603.16641    
Authors: Zhenqi He, Lin Li, Long Chen
Title: FlowComposer: Composable Flows for Compositional Zero-Shot Learning
Abstract: Compositional zero-shot learning (CZSL) aims to recognize unseen attribute–object compositions by recombining primitives learned from seen pairs. Recent CZSL methods built on vision-language models (VLMs) typically adopt parameter-efficient fine-tuning (PEFT). They apply visual disentanglers for decomposition and manipulate token-level prompts or prefixes to encode compositions. However, such PEFT-based designs suffer from two fundamental limitations: (1) Implicit Composition Construction, where composition is realized only via token concatenation or branch-wise prompt tuning rather than an explicit operation in the embedding space; (2) Remained Feature Entanglement, where imperfect disentanglement leaves attribute, object, and composition features mutually contaminated. Together, these issues limit the generalization ability of current CZSL models. In this paper, we are the first to systematically study flow matching for CZSL and introduce FlowComposer, a model-agnostic framework that learns two primitive flows to transport visual features toward attribute and object text embeddings, and a learnable Composer that explicitly fuses their velocity fields into a composition flow. To exploit the inevitable residual entanglement, we further devise a leakage-guided augmentation scheme that reuses leaked features as auxiliary signals. We thoroughly evaluate FlowComposer on three public CZSL benchmarks by integrating it as a plug-and-play component into various baselines, consistently achieving significant improvements.
PaperID: 1235,   Poster  https://arxiv.org/pdf/2603.12493    
Authors: Ali Mosleh, Faraz Ali, Fengjia Zhang, Stavros Tsogkas, Junyong Lee, Michael S. Brown, Alex Levinshtein
Title: RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution
Abstract: Digital zoom on smartphones relies on learning-based super-resolution (SR) models that operate on RAW sensor images, but obtaining sensor-specific training data is challenging due to the lack of ground-truth images. Synthetic data generation via "unprocessing" pipelines offers a potential solution by simulating the degradations that transform high-resolution (HR) images into their low-resolution (LR) counterparts. However, these pipelines can introduce domain gaps due to incomplete or unrealistic degradation modeling. In this paper, we demonstrate that principled and carefully designed degradation modeling can enhance SR performance in real-world conditions. Instead of relying on generic priors for camera blur and noise, we model device-specific degradations through calibration and unprocess publicly available rendered images into the RAW domain of different smartphones. Using these image pairs, we train a single-image RAW-to-RGB SR model and evaluate it on real data from a held-out device. Our experiments show that accurate degradation modeling leads to noticeable improvements, with our SR model outperforming baselines trained on large pools of arbitrarily chosen degradations. We will make our calibrated kernels and noise models publicly available, to facilitate research on image enhancement for mobile photography.
PaperID: 1236,   Poster  https://arxiv.org/pdf/2603.01142    
Authors: Penghao Wang, Siyuan Xie, Jiawei Zhou, Xianghui Yang, Jingwei Huang, Chunchao Guo, Jiayuan Gu
Title: ArtLLM: Generating Articulated Assets via 3D LLM
Abstract: Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a variable number of parts and joints, inferring their kinematic structure in a unified manner from the object’s point cloud. This articulation-aware layout then conditions a 3D generative model to synthesize high-fidelity part geometries. Experiments on the PartNet-Mobility dataset show that ArtLLM significantly outperforms state-of-the-art methods in both part layout accuracy and joint prediction, while generalizing robustly to real-world objects. Finally, we demonstrate its utility in constructing digital twins, highlighting its potential for scalable robot learning.
PaperID: 1237,   Poster  https://arxiv.org/pdf/2511.10914    
Authors: Zihan Gu, Ruoyu Chen, Junchi Zhang, Yue Hu, Hua Zhang, Xiaochun Cao
Title: PhaseWin Search Framework Enable Efficient Object-Level Interpretation
Abstract: Attribution is essential for interpreting object-level foundation models. Recent methods based on submodular subset selection have achieved high faithfulness, but their efficiency limitations hinder practical deployment in real-world scenarios. To address this, we propose PhaseWin, a novel phase-window search algorithm that enables faithful region attribution with near-linear complexity. PhaseWin replaces traditional quadratic-cost greedy selection with a phased coarse-to-fine search, combining adaptive pruning, windowed fine-grained selection, and dynamic supervision mechanisms to closely approximate greedy behavior while dramatically reducing model evaluations. Theoretically, PhaseWin retains near-greedy approximation guarantees under mild monotone submodular assumptions. Empirically, PhaseWin achieves over 95% of greedy attribution faithfulness using only 20% of the computational budget, and consistently outperforms other attribution baselines across object detection and visual grounding tasks with Grounding DINO and Florence-2. PhaseWin establishes a new state of the art in scalable, high-faithfulness attribution for object-level multimodal models.
PaperID: 1238,   Poster  https://arxiv.org/pdf/2505.21147    
Authors: Xuanning Zhou, Zihao Shi, Hao Zeng, Xiaobo Xia, Bingyi Jing, Hongxin Wei
Title: Semi-Supervised Conformal Prediction With Unlabeled Nonconformity Score
Abstract: Conformal prediction (CP) is a powerful framework for uncertainty quantification, generating prediction sets with coverage guarantees. Split conformal prediction relies on labeled data in the calibration procedure. However, labeled data is often limited in real-world scenarios, leading to unstable coverage performance across different runs. To address this issue, we extend CP to the semi-supervised setting and propose SemiCP, a new paradigm that leverages both labeled and unlabeled data for calibration. To achieve this, we introduce an unlabeled nonconformity score, the Nearest Neighbor Matching (NNM) score. Specifically, NNM estimates the nonconformity scores of unlabeled samples using their most similar pseudo-labeled counterparts during calibration, while maintaining the original scores for labeled data. Theoretically, we demonstrate that the average coverage gap (i.e., the absolute difference between the empirical marginal coverage and the target coverage) of SemiCP can decrease significantly at a rate $\mathcal{O}(1/N)$, where $N$ is the number of unlabeled data. Extensive experiments validate the effectiveness of SemiCP under limited labeled data, reducing the average coverage gap by up to 77% on common benchmarks with 4000 unlabeled examples, when there are only 20 labeled examples.
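For readers unfamiliar with the calibration step mentioned above, the following is a minimal sketch of standard split conformal prediction for classification, i.e., the labeled-only baseline that the abstract says becomes unstable with few labels. It does not implement SemiCP or the NNM score. The nonconformity score `1 - p(true class)`, the function names, and the toy data are assumptions chosen for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n (too few calibration points), return infinity,
    which makes every prediction set contain all classes.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """Keep every class whose score 1 - p(class) does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]


# Hypothetical calibration scores: 1 - predicted probability of the true label.
cal_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.33, 0.40, 0.45, 0.50,
              0.52, 0.55, 0.60, 0.62, 0.70, 0.72, 0.80, 0.85, 0.90, 0.95]

tau = conformal_threshold(cal_scores, alpha=0.1)    # rank 19 of 20 scores
pred = prediction_set([0.6, 0.3, 0.08, 0.02], tau)  # classes kept for one test input
```

With only 20 calibration scores the threshold is determined by a single order statistic, so it can shift noticeably from one calibration draw to another. That sensitivity is the instability the abstract attributes to limited labeled data.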
PaperID: 1239,   Poster  https://arxiv.org/pdf/2604.09232    
Authors: Zizhao Li, Zhengkang Xiang, Jiayang Ao, Feng Liu, Joseph West, Kourosh Khoshelham
Title: Neural Distribution Prior for LiDAR Out-of-Distribution Detection
Abstract: LiDAR-based perception is critical for autonomous driving due to its robustness to poor lighting and visibility conditions. Yet, current models operate under the closed-set assumption and often fail to recognize unexpected out-of-distribution (OOD) objects in the open world. Existing OOD scoring functions exhibit limited performance because they ignore the pronounced class imbalance inherent in LiDAR OOD detection and assume a uniform class distribution. To address this limitation, we propose the Neural Distribution Prior (NDP), a framework that models the distributional structure of network predictions and adaptively reweights OOD scores based on alignment with a learned distribution prior. NDP dynamically captures the logit distribution patterns of training data and corrects class-dependent confidence bias through an attention-based module. We further introduce a Perlin noise–based OOD synthesis strategy that generates diverse auxiliary OOD samples from input scans, enabling robust OOD training without external datasets. Extensive experiments on the SemanticKITTI and STU benchmarks demonstrate that NDP substantially improves OOD detection performance, achieving a point-level AP of 61.31% on the STU test set, which is more than 10× higher than the previous best result. Our framework is compatible with various existing OOD scoring formulations, providing an effective solution for open-world LiDAR perception.
PaperID: 1240,   Poster  https://arxiv.org/pdf/2601.09211    
Authors: Chunghyun Park, Seunghyeon Lee, Minsu Cho
Title: Affostruction: 3D Affordance Grounding with Generative Reconstruction
Abstract: This paper addresses the problem of affordance grounding from RGB-D images of an object, which aims to localize surface regions corresponding to a text query that describes an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose a unified framework for affordance grounding and reconstruction, dubbed Affostruction, where affordance grounding actively combines with shape generation. In our approach, reconstructing complete geometry from partial observations enables affordance prediction on unobserved regions, while affordance heatmaps guide active view selection to improve reconstruction quality of functional regions. We make three core contributions: generative multi-view reconstruction via sparse voxel fusion that extrapolates unseen geometry while maintaining constant token complexity, flow-based affordance grounding that captures inherent ambiguity in affordance distributions, and affordance-driven active view selection that leverages predicted affordances for intelligent viewpoint sampling. Affostruction achieves 19.1 aIoU on affordance grounding (40.4% improvement) and 32.67 IoU for 3D reconstruction (67.7% improvement), enabling accurate affordance prediction on complete shapes.
PaperID: 1241,   Poster  https://arxiv.org/pdf/2602.19254    
Authors: Bowen Chen, Jake Zuena, Alan Bovik, Divya Kothandaraman
Title: RegionRoute: Regional Style Transfer with Diffusion Model
Abstract: Precise spatial control in diffusion-based style transfer remains challenging. This challenge arises because diffusion models treat style as a global feature and lack explicit spatial grounding of style representations, making it difficult to restrict style application to specific objects or regions. To our knowledge, existing diffusion models are unable to perform true localized style transfer, typically relying on handcrafted masks or multi-stage post-processing that introduce boundary artifacts and limit generalization. To address this, we propose an attention-supervised diffusion framework that explicitly teaches the model where to apply a given style by aligning the attention scores of style tokens with object masks during training. Two complementary objectives, a Focus loss based on KL divergence and a Cover loss using binary cross-entropy, jointly encourage accurate localization and dense coverage. A modular LoRA-MoE design further enables efficient and scalable multi-style adaptation. To evaluate localized stylization, we introduce the Regional Style Editing Score, which measures Regional Style Matching through CLIP-based similarity within the target region and Identity Preservation via masked LPIPS and pixel-level consistency on unedited areas. Experiments show that our method achieves mask-free, single-object style transfer at inference, producing regionally accurate and visually coherent results that outperform existing diffusion-based editing approaches.
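One plausible reading of the two attention-supervision objectives is sketched below on flattened attention maps. The abstract only states that the Focus loss is KL-based and the Cover loss uses binary cross-entropy; the KL direction, the normalization, and the names `focus_loss` and `cover_loss` are assumptions, not the authors' definitions.

```python
import math

EPS = 1e-8


def focus_loss(attention, mask):
    """KL(mask distribution || attention distribution) over spatial positions.

    Both inputs are normalized to sum to one, so the loss is large when
    attention mass falls outside the masked region.
    """
    a_sum = sum(attention) + EPS
    m_sum = sum(mask) + EPS
    loss = 0.0
    for a, m in zip(attention, mask):
        p = m / m_sum
        q = a / a_sum
        if p > 0:
            loss += p * math.log(p / (q + EPS))
    return loss


def cover_loss(attention, mask):
    """Mean binary cross-entropy between per-position attention in [0, 1] and a binary mask."""
    total = 0.0
    for a, m in zip(attention, mask):
        a = min(max(a, EPS), 1 - EPS)  # clamp to keep the logarithms finite
        total += -(m * math.log(a) + (1 - m) * math.log(1 - a))
    return total / len(mask)


# Toy 4-position example (hypothetical): the object occupies the first two positions.
mask = [1, 1, 0, 0]
on_object = [0.9, 0.9, 0.1, 0.1]
off_object = [0.1, 0.1, 0.9, 0.9]
```

Under these assumptions the first term rewards concentrating attention inside the mask, while the second penalizes masked positions that receive little attention.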
PaperID: 1242,   Poster  https://arxiv.org/pdf/2505.05212    
Authors: Xiaotong Yu, Chang-Wen Chen
Title: HQC-NBV: A Hybrid Quantum-Classical View Planning Approach
Abstract: Efficient view planning is a fundamental challenge in computer vision and robotic perception, critical for tasks ranging from search and rescue operations to autonomous navigation. While classical approaches, including sampling-based and deterministic methods, have shown promise in planning camera viewpoints for scene exploration, they often struggle with computational scalability and solution optimality in complex settings. This study introduces HQC-NBV, a hybrid quantum-classical framework for view planning that leverages quantum properties to efficiently explore the parameter space while maintaining robustness and scalability. We propose a specific Hamiltonian formulation with multi-component cost terms and a parameter-centric variational ansatz with bidirectional alternating entanglement patterns that capture the hierarchical dependencies between viewpoint parameters. Comprehensive experiments demonstrate that quantum-specific components provide measurable performance advantages. Compared to classical methods, our approach achieves 7.9-49.2% higher exploration efficiency across diverse environments. Our analysis of entanglement architecture and coherence-preserving terms provides insights into the mechanisms of quantum advantage in robotic exploration tasks. This work represents a significant advancement in integrating quantum computing into robotic perception systems, offering a paradigm-shifting solution for various robot vision tasks.
PaperID: 1243,   Poster  https://arxiv.org/pdf/2603.01228    
Authors: Caiyong Piao, Zhiyuan Yan, Haoming Xu, Yunzhen Zhao, Kaiqing Lin, Feiyang Xu, Shuigeng Zhou
Title: Towards Policy-Adaptive Image Guardrail: Benchmark and Method
Abstract: Accurate rejection of sensitive or harmful visual content, i.e., harmful image guardrail, is critical in many application scenarios. This task must continuously adapt to the evolving safety policies and content across various domains and over time. However, traditional classifiers, confined to fixed categories, require frequent retraining when new policies are introduced. Vision-language models (VLMs) offer a more adaptable and generalizable foundation for dynamic safety guardrails. Despite this potential, existing VLM-based safeguarding methods are typically trained and evaluated under only a fixed safety policy. We find that these models are heavily overfitted to the seen policy, fail to generalize to unseen policies, and even lose the basic instruction-following ability and general knowledge. To address this issue, in this paper we make two key contributions. First, we benchmark the cross-policy generalization performance of existing VLMs with SafeEditBench, a new evaluation suite. SafeEditBench leverages image-editing models to convert unsafe images into safe counterparts, producing policy-aligned datasets where each safe–unsafe image pair remains visually similar except for localized regions violating specific safety rules. Human annotators then provide accurate safe/unsafe labels under five distinct policies, enabling fine-grained assessment of policy-aware generalization. Second, we introduce SafeGuard-VL, a reinforcement learning–based method with verifiable rewards (RLVR) for robust unsafe-image guardrails. Instead of relying solely on supervised fine-tuning (SFT) under fixed policies, SafeGuard-VL explicitly optimizes the model with policy-grounded rewards, promoting verifiable adaptation across evolving policies. Extensive experiments verify the effectiveness of our method for unsafe image guardrails across various policies.
PaperID: 1244,   Poster  https://arxiv.org/pdf/2506.18496    
Authors: Seonghak Kim
Title: Distilling Balanced Knowledge from a Biased Teacher
Abstract: Conventional knowledge distillation, designed for model compression, fails on long-tailed distributions because the teacher model tends to be biased toward head classes and provides limited supervision for tail classes. We propose Long-Tailed Knowledge Distillation (LTKD), a novel framework that reformulates the conventional objective into two components: a cross-group loss, capturing mismatches in prediction distributions across class groups (head, medium, and tail), and a within-group loss, capturing discrepancies within each group's distribution. This decomposition reveals the specific sources of the teacher's bias. To mitigate the inherited bias, LTKD introduces (1) a rebalanced cross-group loss to calibrate the teacher's group-level predictions and (2) a reweighted within-group loss to ensure equal contribution from all groups. Extensive experiments on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT demonstrate that LTKD significantly outperforms existing methods in both overall and tail-class accuracy, thereby proving its ability to distill balanced knowledge from a biased teacher for real-world applications.
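The two-term split described above is consistent with the chain rule for KL divergence, which the sketch below checks numerically. It assumes the conventional objective is KL(teacher || student) and that classes are partitioned into groups; the exact LTKD losses, including the rebalancing and reweighting, may differ, and `decompose_kl` is a hypothetical name.

```python
import math


def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def decompose_kl(teacher, student, groups):
    """Split KL(teacher || student) into a cross-group and a within-group term.

    groups is a list of index lists that partition the classes. Every group
    is assumed to have nonzero probability mass under both models.
    """
    t_group = [sum(teacher[i] for i in g) for g in groups]
    s_group = [sum(student[i] for i in g) for g in groups]
    cross = kl(t_group, s_group)
    within = 0.0
    for g, tg, sg in zip(groups, t_group, s_group):
        t_cond = [teacher[i] / tg for i in g]  # teacher distribution inside the group
        s_cond = [student[i] / sg for i in g]  # student distribution inside the group
        within += tg * kl(t_cond, s_cond)      # weighted by the teacher's group mass
    return cross, within


# Toy 5-class example (hypothetical): head = {0, 1}, medium = {2, 3}, tail = {4}.
teacher = [0.5, 0.2, 0.15, 0.1, 0.05]
student = [0.3, 0.25, 0.2, 0.15, 0.1]
groups = [[0, 1], [2, 3], [4]]
cross, within = decompose_kl(teacher, student, groups)
```

Note that the within-group term is weighted by the teacher's group mass, so a head-biased teacher gives tail groups little weight; this is the kind of imbalance a reweighted within-group loss would target.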
PaperID: 1245,   Poster  https://arxiv.org/pdf/2602.19019    
Authors: Li Zhang, Shruti Agarwal, John Collomosse, Pengtao Xie, Vishal Asnani
Title: TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery
Abstract: Generative AI models pose a significant challenge to intellectual property (IP), as they can replicate unique artistic styles and concepts without attribution. While watermarking offers a potential solution, existing methods often fail in complex scenarios where multiple concepts (e.g., an object and an artistic style) are composed within a single image. These methods struggle to disentangle and attribute each concept individually. In this work, we introduce TokenTrace, a novel proactive watermarking framework for robust, multi-concept attribution. Our method embeds secret signatures into the semantic domain by simultaneously perturbing the text prompt embedding and the initial latent noise that guide the diffusion model's generation process. For retrieval, we propose a query-based TokenTrace module that takes the generated image and a textual query specifying which concepts need to be retrieved (e.g., a specific object or style) as inputs. This query-based mechanism allows the module to disentangle and independently verify the presence of multiple concepts from a single generated image. Extensive experiments show that our method achieves state-of-the-art performance on both single-concept (object and style) and multi-concept attribution tasks, significantly outperforming existing baselines while maintaining high visual quality and robustness to common transformations.
PaperID: 1246,   Poster  https://arxiv.org/pdf/2511.17097    
Authors: Shuo Wang, Yucheng Wang, Guoxin Lian, Yongcai Wang, Maiyue Chen, kaihui.wang kaihui.wang, Bo Zhang, Zhizhong Su, Yutian Zhou, Wanting Li, Deying Li, Zhaoxin Fan
Title: Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation
Abstract: Vision-Language Navigation requires agents to act coherently over long horizons by understanding not only local visual context but also how far they have advanced within a multi-step instruction. However, recent Vision-Language-Action models focus on direct action prediction and earlier progress methods predict numeric achievements; both overlook the monotonic co-progression property of the observation and instruction sequences. Building on this insight, Progress-Think introduces semantic progress reasoning, predicting instruction-style progress from visual observations to enable more accurate navigation. To achieve this without annotations, we propose a three-stage framework. In the initial stage, Self-Aligned Progress Pretraining bootstraps a reasoning module via a novel differentiable alignment between visual history and instruction prefixes. Then, Progress-Guided Policy Pretraining injects learned progress states into the navigation context, guiding the policy toward consistent actions. Finally, Progress-Policy Co-Finetuning jointly optimizes both modules with tailored progress-aware reinforcement objectives. Experiments on R2R-CE and RxR-CE show substantial gains in success, efficiency, and interpretability, demonstrating that semantic progress provides a more consistent and generalizable representation of navigation advancement.
PaperID: 1247,   Poster  https://arxiv.org/pdf/2603.26192    
Authors: Xuerui Zhang, Xuehao Wang, Zhan Zhuang, Linglan Zhao, Ziyue Li, Xinmin Zhang, Zhihuan Song, Yu Zhang
Title: HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning
Abstract: Lifelong learning aims to preserve knowledge acquired from previous tasks while incorporating knowledge from a sequence of new tasks. However, most prior work explores only streams of homogeneous tasks (e.g., only classification tasks) and neglects the scenario of learning across heterogeneous tasks that possess different structures of outputs. In this work, we formalize this broader setting as lifelong heterogeneous learning (LHL). Departing from conventional lifelong learning, the task sequence of LHL spans different task types, and the learner needs to retain heterogeneous knowledge for different output space structures. To instantiate LHL, we focus on LHL in the context of dense prediction (LHL4DP), a realistic and challenging scenario. To this end, we propose the Heterogeneity-Aware Distillation (HAD) method, an exemplar-free approach that preserves previously gained heterogeneous knowledge by self-distillation in each training phase. The proposed HAD comprises two complementary components: a distribution-balanced heterogeneity-aware distillation loss to alleviate the global imbalance of prediction distribution, and a salience-guided heterogeneity-aware distillation loss that concentrates learning on informative edge pixels extracted with the Sobel operator. Extensive experiments demonstrate that the proposed HAD method significantly outperforms existing methods in this new scenario.
PaperID: 1248,   Poster  https://arxiv.org/pdf/2604.08337    
Authors: Ashutosh Kumar, Rajat Saini, Jingjing Pan, Mustafa Erdogan, Mingfang Zhang, Betty Dem, Norimasa Kobori, Quan Kong
Title: InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
Abstract: Current vision–language pretraining (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global image–text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial–temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.
PaperID: 1249,   Poster  https://arxiv.org/pdf/2603.12149    
Authors: Yuetian Du, Yucheng Wang, Rongyu Zhang, Zhijie Xu, BOYU YANG, Ming Kong, Jie Liu, Qiang Zhu
Title: Linking Perception, Confidence and Accuracy in MLLMs
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. Further ablation studies demonstrate the effectiveness of each module and the scaling superiority. Our code will be released upon acceptance.
PaperID: 1250,   Poster  https://arxiv.org/pdf/2603.20185    
Authors: Jingyang Lin, Jialian Wu, Jiang Liu, Ximeng Sun, Ze Wang, Xiaodong Yu, Jiebo Luo, Zicheng Liu, Emad Barsoum
Title: VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
Abstract: Video agentic models have substantially advanced video-language understanding performance. However, most agentic approaches heavily rely on greedy parsing over densely sampled video frames, resulting in high computational cost. Instead, we argue that leveraging the logical flow of videos allows models to use far fewer frames while maintaining, or even improving, their video understanding capability. In this paper, we introduce VideoSeek, a long-horizon video agent that actively seeks informative content via tool use, conditioned on the underlying logic flows throughout videos. Specifically, the VideoSeek agent follows a think–act–observe loop: it reasons over collected evidence to determine a tool-using plan, then acts by calling tools to gather new observations, and stops once the evidence is sufficient to answer the given question. Experiments on four long-form video understanding and complex reasoning benchmarks demonstrate the superiority of VideoSeek. Notably, VideoSeek achieves a 10.2-point absolute improvement on LVBench over its base model, GPT-5, while using 93% fewer frames. Further, a comprehensive analysis highlights the significance of leveraging logic flow, strong reasoning capability, and toolkit design for video agents.
PaperID: 1251,   Poster  https://arxiv.org/pdf/2511.19315    
Authors: Weiliang Tang, Jialin Gao, Jia-Hui Pan, Gang Wang, Li Erran Li, Yunhui Liu, Mingyu Ding, Pheng-Ann Heng, Chi-Wing Fu
Title: Rethinking Intermediate Representation for VLM-based Robot Manipulation
Abstract: The Vision-Language Model (VLM) is now an important component for enabling robust robot manipulation. Yet, using it to translate human instructions into an action-resolvable intermediate representation often requires a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar structure, we design the Semantic Assembly representation, named SEAM, by decomposing the intermediate representation into vocabulary and grammar. Doing so leads us to a concise vocabulary of semantically rich operations and a VLM-friendly grammar for handling diverse unseen tasks. Also, we design a novel open-vocabulary segmentation paradigm with an in-context learning strategy to precisely localize fine-grained object parts for manipulation (e.g., cup handle, teapot opening), with the shortest inference time among all state-of-the-art parallel works. We then formulate new metrics for action-generalizability and VLM-comprehensibility to evaluate mainstream representations, demonstrating the strong performance of SEAM on both aspects. Extensive real-world experiments further demonstrate the SOTA performance of SEAM under varying settings and tasks.
PaperID: 1252,   Poster  https://arxiv.org/pdf/2510.02226    
Authors: Shira Schiber, Ofir Lindenbaum, Idan Schwartz
Title: TempoControl: Temporal Attention Guidance for Text-to-Video Models
Abstract: Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal pattern with a control signal (correlation), adjusting its strength where visibility is required (magnitude), and preserving semantic consistency (entropy). TempoControl provides precise temporal control while maintaining high video quality and diversity. We demonstrate its effectiveness across various applications, including temporal reordering of single and multiple objects, action timing, and audio-aligned video generation.
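The correlation principle can be illustrated with a small sketch that compares a per-frame attention profile against a binary control signal. This is only an illustration under assumptions: the abstract does not specify the loss, so the use of Pearson correlation, the per-frame mean pooling, and the names `temporal_profile` and `correlation_loss` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def temporal_profile(attn_maps):
    """Reduce each frame's flattened cross-attention map to its mean value."""
    return [sum(frame) / len(frame) for frame in attn_maps]


def correlation_loss(attn_maps, control):
    """Low when the attention profile rises and falls together with the control signal."""
    return 1.0 - pearson(temporal_profile(attn_maps), control)


# Toy example (hypothetical): the concept should appear only in the last two of four frames.
control = [0, 0, 1, 1]
aligned = [[0.1, 0.1], [0.1, 0.1], [0.8, 0.9], [0.9, 0.8]]
misaligned = list(reversed(aligned))
```

In an inference-time guidance setting, a quantity like this would be minimized with respect to the latent at selected denoising steps rather than used for training.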
PaperID: 1253,   Poster  https://arxiv.org/pdf/2603.00461    
Authors: Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong
Title: ReMoT: Reinforcement Learning with Motion Contrast Triplets
Abstract: We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatiotemporal consistency, a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (i) a rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation; and (ii) Group Relative Policy Optimization, which we empirically find yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves SOTA performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1 performance leap on spatio-temporal reasoning tasks.
PaperID: 1254,   Poster  https://arxiv.org/pdf/2511.18838    
Authors: Xiaofan Li, Chenming Wu, Yanpeng Sun, Jiaming Zhou, Delin Qu, Yansong Qu, Weihao Bo, Haibao Yu, Dingkang Liang
Title: Visual Autoregressive Modeling via Next Focus Prediction
Abstract: Visual autoregressive models achieve remarkable generation quality through next-scale predictions across multi-scale token pyramids. However, the conventional method uses uniform scale downsampling to build these pyramids, leading to aliasing artifacts that compromise fine details and introduce unwanted jaggies and moiré patterns. To tackle this issue, we present FVAR, which reframes the paradigm from next-scale prediction to next-focus prediction, mimicking the natural process of camera focusing from blur to clarity. Our approach introduces three key innovations: 1) Next-Focus Prediction Paradigm that transforms multi-scale autoregression by progressively reducing blur rather than simply downsampling; 2) Progressive Refocusing Pyramid Construction that uses physics-consistent defocus kernels to build clean, alias-free multi-scale representations; and 3) High-Frequency Residual Learning that employs a specialized residual teacher network to effectively incorporate alias information during training while maintaining deployment simplicity. Specifically, we construct optical low-pass views using defocus point spread function (PSF) kernels with decreasing radius, creating smooth blur-to-clarity transitions that eliminate aliasing at its source. To further enhance detail generation, we introduce a High-Frequency Residual Teacher that learns from both clean structure and alias residuals, distilling this knowledge to a vanilla VAR deployment network for seamless inference. Extensive experiments on ImageNet demonstrate that FVAR substantially reduces aliasing artifacts, improves fine detail preservation, and enhances text readability, achieving superior performance with perfect compatibility to existing VAR frameworks.
PaperID: 1255,   Poster  https://arxiv.org/pdf/2506.21349    
Authors: Yizhe Cheng, Chunxun Tian, Haoru Wang, Wentao Zhu, Xiaoxuan Ma, Yizhou Wang
Title: Electromagnetic Inverse Scattering from a Single Transmitter
Abstract: Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from the scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging, especially under sparse transmitter setups, e.g., with only one transmitter. A recent machine learning-based approach, Img-Interiors, shows promising results by leveraging continuous implicit functions. However, it requires time-consuming case-specific optimization and fails under sparse transmitter setups. To address these limitations, we revisit EISP from a data-driven perspective. The scarcity of transmitters leads to an insufficient amount of measured data, which fails to capture adequate physical information for stable inversion. Built on this insight, we propose a fully end-to-end and data-driven framework that predicts the relative permittivity of scatterers from measured fields, leveraging data distribution priors to compensate for the lack of physical information. This design enables data-driven training and feed-forward prediction of relative permittivity while maintaining strong robustness to transmitter sparsity. Extensive experiments show that our method outperforms state-of-the-art approaches in reconstruction accuracy and robustness. Notably, it achieves high-quality results even with a single transmitter, a setting where previous methods consistently fail. This work offers a fundamentally new perspective on electromagnetic inverse scattering and represents a major step toward cost-effective practical solutions for electromagnetic imaging.
PaperID: 1256,   Poster  https://arxiv.org/pdf/2511.15200    
Authors: Tairan He, Zi Wang, Haoru Xue, Qingwei Ben, Zhengyi Luo, Wenli Xiao, Ye Yuan, Xingye Da, Fernando Castañeda, Shankar Sastry, Changliu Liu, Guanya Shi, Linxi Fan, Yuke Zhu
Title: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation
Abstract: A core barrier to the real-world productivity of humanoid robots is the lack of autonomous loco-manipulation skills. We introduce VIRAL, a visual sim-to-real framework that learns humanoid loco-manipulation entirely in simulation and deploys it zero-shot to real hardware. VIRAL follows a teacher-student design: a privileged RL teacher, operating on full state, learns long-horizon loco-manipulation using a delta action space and reference state initialization. A vision-based student policy is then distilled from the teacher via large-scale simulation with tiled rendering, trained with a mixture of online DAgger and behavior cloning. We find that compute scale is critical: scaling simulation to tens of GPUs (up to 64) makes both teacher and student training reliable, while low-compute regimes often fail. To bridge the sim-to-real gap, VIRAL combines large-scale visual domain randomization (over lighting, materials, camera parameters, image quality, and sensor delays) with real-to-sim alignment of the dexterous hand and cameras. Deployed on a Unitree G1 humanoid, the resulting RGB-based policy performs continuous loco-manipulation for up to 54 cycles, generalizing to diverse spatial and appearance variations without any real-world fine-tuning, and approaching expert-level teleoperation performance. Extensive ablations dissect the key design choices required to make RGB-based humanoid loco-manipulation work in practice.
PaperID: 1257,   Poster  https://arxiv.org/pdf/2603.27573    
Authors: Minzhang Li, Kuixiang Shao, xuebing li, Yuyang Jiao, Yinuo Bai, Hengan Zhou, Sixian Shen, Jiayuan Gu, Jingyi Yu
Title: SPREAD: Spatial-Physical Reasoning via gEometry Aware Diffusion
Abstract: Automated 3D scene generation is pivotal for applications spanning virtual reality, digital content creation, and Embodied AI. While computer graphics prioritizes aesthetic layouts, vision and robotics demand scenes that mirror real-world complexity, which current data-driven methods struggle to achieve due to limited unstructured training data and insufficient spatial and physical modeling. We propose SPREAD, a diffusion-based framework that jointly learns spatial and physical relationships through a graph transformer, explicitly conditioning on posed scene point clouds for geometric awareness. Moreover, our model integrates differentiable guidance for collision avoidance, relational constraint, and gravity, ensuring physically coherent scenes without sacrificing relational context. Our experiments on 3D-FRONT and ProcTHOR datasets demonstrate state-of-the-art performance in spatial-relational reasoning and physical metrics. Moreover, our method significantly outperforms baselines in scene consistency and stability during pre- and post-physics simulation, proving its capability to generate simulation-ready environments for embodied AI agents.
PaperID: 1258,   Poster  https://arxiv.org/pdf/2601.09499    
Authors: Edgar Sucar, Eldar Insafutdinov, Zihang Lai, Andrea Vedaldi
Title: V-DPM: Video Reconstruction with Dynamic Point Maps
Abstract: New, powerful 3D representations such as DUSt3R’s invariant point maps, which encode 3D shape and camera parameters, have significantly advanced feedforward 3D reconstruction. While point maps assume static scenes, Dynamic Point Maps (DPMs) extend the concept to dynamic 3D content, also representing 3D scene motion. However, DPMs have so far been limited to image pairs and, like DUSt3R, require post-processing via optimization when more than two views are involved. We argue that DPMs are far more meaningful when applied to videos and introduce V-DPM to demonstrate this. First, we show how to set up DPMs for videos to optimize their representational power, ease of neural prediction, and reuse of pre-trained models. Second, we implement these ideas on top of VGGT, a recent state-of-the-art 3D reconstructor. Although VGGT was trained on static scenes, we show that a small amount of synthetic data suffices to adapt it into an effective V-DPM predictor. This yields state-of-the-art 3D and 4D reconstruction in dynamic settings. In particular, unlike recent dynamic extensions of VGGT such as P3, DPMs reconstruct not only dynamic depth but also the full 3D motion of every point in the scene.
PaperID: 1259,   Poster  https://arxiv.org/pdf/2603.02943    
Authors: Shaoxuan He, Benlei Cui, Bukun Huang, Zhizeng Ye, Yunyun Sun, Longtao Huang, Hui Xue, Yang Yang, Haiwen Hong, Jingqun Tang, Zhou Zhao
Title: TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration
Abstract: Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process. While feature caching techniques achieve effective acceleration at higher step counts (e.g., 50 steps), they exhibit critical limitations in the practical low-step regime of 20-30 steps. As the interval between steps increases, polynomial-based extrapolators like TaylorSeer suffer from error accumulation and trajectory drift. Meanwhile, conventional caching strategies often overlook the distinct dynamical properties of different denoising phases. To address these challenges, we propose Trajectory-Consistent Padé (TC-Padé) approximation, a feature prediction framework grounded in Padé approximation. By modeling feature evolution through rational functions, our approach captures asymptotic and transitional behaviors more accurately than Taylor-based methods. To enable stable and trajectory-consistent sampling under reduced step counts, TC-Padé incorporates (1) adaptive coefficient modulation that leverages historical cached residuals to detect subtle trajectory transitions, and (2) step-aware prediction strategies tailored to the distinct dynamics of early, mid, and late sampling stages. Extensive experiments on DiT-XL/2, FLUX.1-dev, and Wan2.1 across both image and video generation demonstrate the effectiveness of TC-Padé. For instance, TC-Padé achieves 2.88× acceleration on FLUX.1-dev and 1.72× on Wan2.1 while maintaining high quality across FID, CLIP, Aesthetic, and VBench-2.0 metrics, substantially outperforming existing feature caching methods.
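Editorial illustration (not from the paper): the abstract contrasts rational (Padé) extrapolation with polynomial (Taylor) extrapolation. The sketch below shows only the textbook [1/1] Padé approximant built from three Taylor coefficients, applied to a scalar function with a decaying tail. The function names `taylor_eval`, `pade_1_1`, and `pade_eval` are hypothetical, and nothing here reproduces TC-Padé's coefficient modulation or step-aware strategies.

```python
def taylor_eval(c, x):
    """Evaluate the truncated Taylor series sum(c[k] * x**k)."""
    return sum(ck * x**k for k, ck in enumerate(c))


def pade_1_1(c):
    """Return (a0, a1, b1) of the [1/1] Pade approximant
    (a0 + a1*x) / (1 + b1*x) that matches Taylor coefficients c[0..2]."""
    c0, c1, c2 = c
    if c1 == 0:
        raise ValueError("the [1/1] Pade approximant needs c1 != 0")
    b1 = -c2 / c1        # cancels the x**2 term
    a0 = c0              # matches the constant term
    a1 = c1 + c0 * b1    # matches the linear term
    return a0, a1, b1


def pade_eval(coeffs, x):
    """Evaluate the rational function (a0 + a1*x) / (1 + b1*x)."""
    a0, a1, b1 = coeffs
    return (a0 + a1 * x) / (1 + b1 * x)


# Taylor coefficients of f(x) = 1 / (1 + x) around x = 0.
c = [1.0, -1.0, 1.0]
x = 2.0  # a "large step" away from the expansion point
taylor_value = taylor_eval(c, x)          # polynomial extrapolation
pade_value = pade_eval(pade_1_1(c), x)    # rational extrapolation
true_value = 1.0 / (1.0 + x)
```

For this particular function the rational form recovers the true value at the large step, while the quadratic Taylor polynomial diverges from it; real feature trajectories will not be this clean, so the example only indicates why a rational model can behave better far from the expansion point.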
PaperID: 1260,   Poster  https://arxiv.org/pdf/2604.16630    
Authors: Craig Iaboni, Pramod Abichandani
Title: Tri-Modal Fusion Transformers for UAV-based Object Detection
Abstract: Reliable UAV object detection requires robustness to illumination changes, motion blur, and scene dynamics that suppress RGB cues. Thermal longwave infrared (LWIR) sensing preserves contrast in low light, and event cameras retain microsecond-level temporal edges, but integrating all three modalities in a unified detector has not been systematically studied. We present a tri-modal framework that processes RGB, thermal, and event data with a dual-stream hierarchical vision transformer. At selected encoder depths, a Modality-Aware Gated Exchange (MAGE) applies inter-sensor channel and spatial gating, and a Bidirectional Token Exchange (BiTE) module performs bidirectional token-level attention with depthwise–pointwise refinement, producing resolution-preserving fused maps for a standard feature pyramid and two-stage detector. We introduce a 10,489-frame UAV dataset with synchronized and pre-aligned RGB–thermal–event streams and 24,223 annotated vehicles across day and night flights. Through 61 controlled ablations, we evaluate fusion placement, mechanism (baseline MAGE+BiTE, CSSA, GAFF), modality subsets, and backbone capacity. Tri-modal fusion improves over all dual-modal baselines, with fusion depth having a significant effect and a lightweight CSSA variant recovering most of the benefit at minimal cost. This work provides the first systematic benchmark and modular backbone for tri-modal UAV-based object detection.
PaperID: 1261,   Poster  https://arxiv.org/pdf/2512.11395    
Authors: Yilei Jiang, Zhen Wang, Yanghao Wang, Jun Yu, Yueting Zhuang, Jun Xiao, Long Chen
Title: FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing
Abstract: With the surge of pretrained text-to-image flow matching models, text-based image editing has improved remarkably, especially for simple editing that contains only a single editing target. However, to satisfy rapidly growing editing requirements, complex editing that contains multiple editing targets poses a more challenging task. Current complex editing solutions, single-round and multi-round editing, are limited by long-text following and cumulative inconsistency, respectively. Thus, they struggle to strike a balance between semantic alignment and source consistency. In this paper, we propose FlowDC, which decouples complex editing into multiple sub-editing effects and superposes them in parallel during the editing process. Meanwhile, we observe that the velocity component orthogonal to the editing displacement harms preservation of the source structure. Thus, we decompose the velocity and decay the orthogonal part for better source consistency. To evaluate effectiveness in complex editing settings, we construct a complex editing benchmark: Complex-PIE-Bench. On two benchmarks, FlowDC shows superior results compared with existing methods. We also detail the ablations of our module designs.
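Editorial illustration (not from the paper): the decomposition described in the abstract can be pictured as a plain vector projection. The sketch below splits a velocity into the part parallel to an editing displacement and the part orthogonal to it, then scales the orthogonal part down. The function name `decay_orthogonal`, the flat-list representation, and the constant `decay` factor are assumptions made for illustration; the abstract does not specify how FlowDC schedules or applies the decay.

```python
def decay_orthogonal(velocity, displacement, decay=0.5):
    """Keep the component of `velocity` along `displacement` and
    multiply the orthogonal remainder by `decay` (hypothetical constant)."""
    if len(velocity) != len(displacement):
        raise ValueError("velocity and displacement must have equal length")
    dot_vd = sum(v * d for v, d in zip(velocity, displacement))
    dot_dd = sum(d * d for d in displacement)
    # With a zero displacement there is no editing direction,
    # so the whole velocity counts as orthogonal.
    scale = dot_vd / dot_dd if dot_dd > 0.0 else 0.0
    parallel = [scale * d for d in displacement]
    orthogonal = [v - p for v, p in zip(velocity, parallel)]
    return [p + decay * o for p, o in zip(parallel, orthogonal)]


# Velocity (3, 4) with an editing displacement along the first axis:
# the parallel part (3, 0) is kept, the orthogonal part (0, 4) is halved.
adjusted = decay_orthogonal([3.0, 4.0], [1.0, 0.0], decay=0.5)
```

Setting `decay=1.0` leaves the velocity unchanged and `decay=0.0` removes the orthogonal part entirely, which makes the two extremes easy to compare.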
PaperID: 1262,   Poster  https://arxiv.org/pdf/2512.09299    
Authors: Daili Hua, Xizhi Wang, Bohan Zeng, Xinyi Huang, Hao Liang, Junbo Niu, Xinlong Chen, Quanqing Xu, Wentao Zhang
Title: VABench: A Comprehensive Benchmark for Audio-Video Generation
Abstract: Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.
PaperID: 1263,   Poster  https://arxiv.org/pdf/2603.25819    
Authors: Yancheng Zhang, Xiaohan Zhang, Guangyu Sun, Zonglin Lyu, Safwan Wshah, Chen Chen
Title: Geo^2: Geometry-Guided Cross-view Geo-Localization and Image Synthesis
Abstract: Cross-view geo-spatial learning consists of two important tasks: Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS), both of which rely on establishing geometric correspondences between ground and aerial views. Recent Geometric Foundation Models (GFMs) have demonstrated strong capabilities in extracting generalizable 3D geometric features from images, but their potential in cross-view geo-spatial tasks remains underexplored. In this work, we present Geo^2, a unified framework that leverages Geometric priors from GFMs (e.g., VGGT) to jointly perform geo-spatial tasks, CVGL and bidirectional CVIS. Despite the 3D reconstruction ability of GFMs, directly applying them to CVGL and CVIS remains challenging due to the large viewpoint gap between ground and aerial imagery. We propose GeoMap, which embeds ground and aerial features into a shared 3D-aware latent space, effectively reducing cross-view discrepancies for localization. This shared latent space naturally bridges cross-view image synthesis in both directions. To exploit this, we propose GeoFlow, a flow-matching model conditioned on geometry-aware latent embeddings. We further introduce a consistency loss to enforce latent alignment between the two synthesis directions, ensuring bidirectional coherence. Extensive experiments on standard benchmarks, including CVUSA, CVACT, and VIGOR, demonstrate that Geo^2 achieves state-of-the-art performance in both localization and synthesis, highlighting the effectiveness of 3D geometric priors for cross-view geo-spatial learning.
PaperID: 1264,   Poster  https://arxiv.org/pdf/2512.03052    
Authors: Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Haolin Liu, Qingxiang Lin, Jingwei Huang, Chunchao Guo, Xiangyu Yue
Title: LATTICE: Democratize High-Fidelity 3D Generation at Scale
Abstract: We present LATTICE, a new framework for high-fidelity 3D asset generation that bridges the quality and scalability gap between 3D and 2D generative models. While 2D image synthesis benefits from fixed spatial grids and well-established transformer architectures, 3D generation remains fundamentally more challenging due to the need to predict both spatial structure and detailed geometric surfaces from scratch. These challenges are exacerbated by the computational complexity of existing 3D representations and the lack of structured and scalable 3D asset encoding schemes. To address this, we propose VoxSet, a semi-structured representation that compresses 3D assets into a compact set of latent vectors anchored to a coarse voxel grid, enabling efficient and position-aware generation. VoxSet retains the simplicity and compression advantages of prior VecSet methods while introducing explicit structure into the latent space, allowing positional embeddings to guide generation and enabling strong token-level test-time scaling. Built upon this representation, LATTICE adopts a two-stage pipeline: first generating a sparse voxelized geometry anchor, then producing detailed geometry using a rectified flow transformer. Our method is simple at its core, but supports arbitrary resolution decoding, low-cost training, and flexible inference schemes, achieving state-of-the-art performance on various aspects, and offering a significant step toward scalable, high-quality 3D asset creation.
PaperID: 1265,   Poster  https://arxiv.org/pdf/2512.14126    
Authors: Junyi Wu, Van Nguyen Nguyen, Benjamin Planche, Jiachen Tao, Changchang Sun, Zhongpai Gao, Zhenghao Zhao, Anwesa Choudhuri, Gengyu Zhang, Meng Zheng, Feiran Wang, Terrence Chen, Yan Yan, Ziyan Wu
Title: Consistent Instance Field for Dynamic Scene Understanding
Abstract: We introduce Consistent Instance Field, a continuous and probabilistic spatiotemporal representation for dynamic scene understanding. Unlike prior methods that rely on discrete tracking or view-dependent features, our approach disentangles visibility from persistent object identity by modeling each space–time point with an occupancy probability and a conditional instance distribution. To realize this, we introduce a novel instance-embedded representation based on deformable 3D Gaussians, which jointly encode radiance and semantic information and are learned directly from input RGB images and instance masks through differentiable rasterization. Furthermore, we introduce new mechanisms to calibrate per-Gaussian identities and resample Gaussians toward semantically active regions, ensuring consistent instance representations across space and time. Experiments on HyperNeRF and Neu3D datasets demonstrate that our method significantly outperforms state-of-the-art methods on novel-view panoptic segmentation and open-vocabulary 4D querying tasks.
PaperID: 1266,   Poster  https://arxiv.org/pdf/2603.27967    
Authors: Suchae Jeong, Jaehwi Song, Haeone Lee, Hanna Kim, Jian Kim, Dongjun Lee, Dong Shin, Changyeon Kim, Dongyoon Hahm, Woogyeol Jin, Juheon Choi, Kimin Lee
Title: Learning Multi-View Spatial Reasoning from Cross-View Relations
Abstract: Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across different viewpoints. In this work, we introduce Cross-View Relations (XVR), a large-scale dataset designed to teach VLMs spatial reasoning across multiple views. XVR comprises 100K vision-question-answer samples derived from 18K diverse 3D scenes and 70K robotic manipulation trajectories, spanning three fundamental spatial reasoning tasks: Correspondence (matching objects across views), Verification (validating spatial relationships), and Localization (identifying object positions). VLMs fine-tuned on XVR achieve substantial improvements on established multi-view and robotic spatial reasoning benchmarks (MindCube and RoboSpatial). When integrated as backbones in Vision-Language-Action models, XVR-trained representations improve success rates on RoboCasa. Our results demonstrate that explicit training on cross-view spatial relations significantly enhances multi-view reasoning and transfers effectively to real-world robotic manipulation.
PaperID: 1267,   Poster  https://arxiv.org/pdf/2602.20309    
Authors: Jingxuan Zhang, Yun-Ta Hsieh, Zhongwei Wan, Haokun Lin, Xin Wang, Ziqi Wang, Yingtie Lei, Mi Zhang
Title: QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
Abstract: Vision–language–action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models, QuantVLA surpasses full-precision baselines, substantially reduces memory usage within quantized modules, and lowers inference latency, providing a practical pathway toward scalable, low-bit embodied intelligence under strict compute, memory, and power constraints.
PaperID: 1268,   Poster  https://arxiv.org/pdf/2512.22323    
Authors: ZHIBIN QIN, Zhenxiong Tan, Zeqing Wang, Songhua Liu, Xinchao Wang
Title: SpotEdit: Selective Region Editing in Diffusion Transformers
Abstract: Diffusion Transformer (DiT)-based models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.
PaperID: 1269,   Poster  https://arxiv.org/pdf/2512.01759    
Authors: Zhuoqian Yang, Mathieu Salzmann, Sabine Süsstrunk
Title: Weight Space Representation Learning with Neural Fields
Abstract: In this work, we investigate the potential of weights to serve as effective representations, focusing on neural fields. Our key insight is that constraining the optimization space through a pretrained base model and multiplicative low-rank adaptation (mLoRA) can induce structure in weight space. Across reconstruction, generation, and analysis tasks on 2D and 3D data, we find that mLoRA weights achieve high representation quality while exhibiting distinctiveness and semantic structure. When used with latent diffusion models, mLoRA weights enable higher-quality generation than existing weight-space methods. Source code will be made publicly available.
PaperID: 1270,   Poster  https://arxiv.org/pdf/2511.18437    
Authors: Chi Zhang, Haibo Qiu, Qiming Zhang, Yufei Xu, Zhixiong Zeng, Siqi Yang, Peng Shi, Lin Ma, Jing Zhang
Title: Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) and is now being applied to Vision-Language Models (VLMs). However, vanilla RLVR for VLMs verifies only the final textual output, critically neglecting the foundational step of visual perception. This oversight leads to visual hallucinations and reward hacking, as reasoning built upon flawed perception is inherently unreliable. To address this, we propose PEARL (Perceptual-Evidence Anchored Reinforced Learning), a dual-branch, perception-reasoning synergistic framework that strengthens multimodal reasoning by explicitly anchoring it to verified visual evidence. For each reasoning-oriented QA instance, PEARL first derives a perception checklist: a set of perception-oriented sub-questions with verifiable answers that probe the model's understanding of key visual evidence. During training, auxiliary rollouts on this checklist yield a perceptual reward that both directly reinforces the model's perception ability and acts as a fidelity gate for reasoning. If the model passes the perception check, its policy update is biased towards evidence-anchored reasoning. Otherwise, the process is halted to prevent reasoning from flawed premises. PEARL can be seamlessly integrated with popular RL methods like GRPO and DAPO. Comprehensive experiments show PEARL achieves substantial gains on multimodal reasoning benchmarks, e.g., a +9.7% improvement over the baseline and +6.6% over GRPO on MathVerse.
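Editorial illustration (not from the paper): the "fidelity gate" described in the abstract can be pictured as a two-step reward computation. The sketch below scores a perception checklist by exact-match accuracy and passes the reasoning reward through only when that score reaches a threshold. The function names, the exact-match scoring, the `threshold` value, and the use of `None` for a halted reasoning branch are all assumptions; the abstract does not state how PEARL scores the checklist or combines the two signals inside GRPO or DAPO.

```python
def perceptual_reward(predicted, reference):
    """Fraction of checklist sub-questions answered correctly
    (case-insensitive exact match; a hypothetical scoring rule)."""
    if not reference or len(predicted) != len(reference):
        raise ValueError("need one prediction per checklist item")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predicted, reference)
    )
    return correct / len(reference)


def gated_rewards(predicted, reference, reasoning_reward, threshold=1.0):
    """Return the perception reward and, only if the perception check
    passes, the reasoning reward; otherwise the reasoning branch is halted."""
    perception = perceptual_reward(predicted, reference)
    passed = perception >= threshold
    return {
        "perception": perception,
        "passed": passed,
        "reasoning": reasoning_reward if passed else None,
    }


# All checklist items correct: the reasoning reward is passed through.
ok = gated_rewards(["Red", "2"], ["red", "2"], reasoning_reward=1.0)
# One item wrong: perception is still rewarded, reasoning is halted.
halted = gated_rewards(["blue", "2"], ["red", "2"], reasoning_reward=1.0)
```

In this sketch the perception reward is always returned, so the perception branch still receives a learning signal even when the reasoning branch is gated off.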
PaperID: 1271,   Poster  https://arxiv.org/pdf/2512.12372    
Authors: Peixuan Zhang, Zijian Jia, Kaiqi Liu, Shuchen Weng, Si Li, Boxin Shi
Title: STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative
Abstract: While recent advancements in generative models have achieved remarkable visual fidelity in video synthesis, creating coherent multi-shot narratives remains a significant challenge. To address this, keyframe-based approaches have emerged as a promising alternative to computationally intensive end-to-end methods, offering the advantages of fine-grained control and greater efficiency. However, these methods often fail to maintain cross-shot consistency and capture cinematic language. In this paper, we introduce STAGE, a SToryboard-Anchored GEneration workflow to reformulate the keyframe-based multi-shot video generation task. Instead of using sparse keyframes, we propose STEP^2 to predict a structural storyboard composed of start-end frame pairs for each shot. We introduce the multi-shot memory pack to ensure long-range entity consistency, the dual-encoding strategy for intra-shot coherence, and the two-stage training scheme to learn cinematic inter-shot transition. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained annotations for story progression, cinematic attributes, and human preferences. Extensive experiments demonstrate that STAGE achieves superior performance in structured narrative control and cross-shot coherence.
PaperID: 1272,   Poster  https://arxiv.org/pdf/2603.02256    
Authors: Kejia Yin, Zhihao Shi, Weilin Wan, Yuhongze Zhou, YUANHAO YU, Xinxin Zuo, Qiang Sun, Juwei Lu
Title: CamDirector: Towards Long-Term Coherent Video Trajectory Editing
Abstract: Video (camera) trajectory editing (VTE) aims to synthesize new videos that follow user-defined camera paths while preserving scene content and plausibly inpainting previously unseen regions, upgrading amateur footage into professionally styled videos. Existing VTE methods struggle with precise camera control and long-range consistency because they either inject target poses through a limited-capacity embedding or rely on single-frame warping with only implicit cross-frame aggregation in video diffusion models. To address these issues, we introduce a new VTE framework that 1) explicitly aggregates information across the entire source video via a hybrid warping scheme: static regions are progressively fused into a world cache and then rendered to target camera poses, while dynamic regions are directly warped, and their fusion yields globally consistent coarse frames that guide refinement; and 2) processes video segments jointly with their history via a history-guided autoregressive diffusion model, while the world cache is incrementally updated to reinforce already inpainted content, enabling long-term temporal coherence. Finally, we present iPhone-PTZ, a new VTE benchmark with diverse camera motions and large trajectory variations, and achieve state-of-the-art performance with fewer parameters.
PaperID: 1273,   Poster  https://arxiv.org/pdf/2602.19516    
Authors: Ruikun Li, Jun Yao, Yingfan Hua, SHIXIANG TANG, Biqing Qi, Bin Liu, Wanli Ouyang, Yan Lu
Title: Pixel2Phys: Distilling Governing Laws from Visual Dynamics
Abstract: Discovering physical laws directly from high-dimensional visual data is a long-standing human pursuit but remains a formidable challenge for machines, representing a fundamental goal of scientific intelligence. This task is inherently difficult because physical knowledge is low-dimensional and structured, whereas raw video observations are high-dimensional and redundant, with most pixels carrying little or no physical meaning. Extracting concise, physically relevant variables from such noisy data remains a key obstacle. To address this, we propose Pixel2Phys, a collaborative multi-agent framework adaptable to any Multimodal Large Language Model (MLLM). It emulates human scientific reasoning by employing a structured workflow to extract formalized physical knowledge through iterative hypothesis generation, validation, and refinement. By repeatedly formulating and refining candidate equations on high-dimensional data, it identifies the most concise representations that best capture the underlying physical evolution. This automated exploration mimics the iterative workflow of human scientists, enabling AI to reveal interpretable governing equations directly from raw observations. Across diverse simulated and real-world physics videos, Pixel2Phys discovers accurate, interpretable governing equations and maintains stable long-term extrapolation where baselines rapidly diverge.
PaperID: 1274,   Poster  https://arxiv.org/pdf/2512.05959    
Authors: David Anugraha, Patrick Irawan, Anshul Singh, En-Shiun Annie Lee, Genta Indra Winata
Title: M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
Abstract: Vision–language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image–question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.
PaperID: 1275,   Poster  https://arxiv.org/pdf/2602.05217    
Authors: Jiahao Nie, Guanqiao Fu, Wenbin An, Yap-Peng Tan, Alex C. Kot, Shijian Lu
Title: Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation
Abstract: Cross-Domain Few-Shot Segmentation aims to segment categories in data-scarce domains conditioned on a few exemplars. Typical methods first establish few-shot capability in a large-scale source domain and then adapt it to target domains. However, due to the limited quantity and diversity of target samples, existing methods still exhibit constrained performance. Moreover, the source-trained model's initially weak few-shot capability in target domains, coupled with substantial domain gaps, severely hinders the effective utilization of target samples and further impedes adaptation. To this end, we propose Multi-view Progressive Adaptation (MPA), which progressively adapts few-shot capability to target domains from both data and strategy perspectives. (i) From the data perspective, we introduce Hybrid Progressive Augmentation, which progressively generates more diverse and complex views through cumulative strong augmentations, thereby creating increasingly challenging learning scenarios. (ii) From the strategy perspective, we design Dual-chain Multi-view Prediction, which fully leverages these progressively complex views through sequential and parallel learning paths under extensive supervision. By jointly enforcing prediction consistency across diverse and complex views, MPA achieves both robust and accurate adaptation to target domains. Extensive experiments demonstrate that MPA effectively adapts few-shot capability to target domains, outperforming state-of-the-art methods by a large margin (+7.0%).
PaperID: 1276,   Poster  https://arxiv.org/pdf/2511.18444    
Authors: Arpit Garg, Hemanth Saratchandran, Simon Lucey
Title: SineProject: Machine Unlearning for Stable Vision–Language Alignment
Abstract: Multimodal Large Language Models (MLLMs) increasingly need to forget specific knowledge, such as unsafe or private information, without full retraining. However, existing unlearning methods often disrupt vision–language alignment, causing models to reject both harmful and benign queries simultaneously. We trace this failure to the projector network: during unlearning, its Jacobian becomes severely ill-conditioned, leading to unstable optimization and drift in cross-modal embeddings. We introduce SineProject, a simple approach that augments the frozen projector with sinusoidally modulated trainable parameters that improve the Jacobian’s spectral conditioning and stabilize alignment throughout unlearning. Evaluated across standard safety and privacy unlearning benchmarks using LLaVA-v1.5-7B and 13B, SineProject reduces benign-query refusals while achieving complete forgetting of targeted information, delivering state-of-the-art forget–retain trade-offs with negligible computational overhead.
PaperID: 1277,   Poster  https://arxiv.org/pdf/2603.26810    
Authors: Qi Zhang, Denis Rozumny, Francesco Girlanda, Sezer Karaoglu, Marc Pollefeys, Theo Gevers, Martin R. Oswald
Title: Unblur-SLAM: Dense Neural SLAM for Blurry Inputs
Abstract: We propose Unblur-SLAM, an RGB SLAM pipeline for sharp 3D reconstruction from blurred image inputs. In contrast to previous work, our approach is able to handle different types of blur and demonstrates state-of-the-art performance in the presence of both motion blur and defocus blur. Moreover, we adjust the computational effort according to the amount of blur in the input image. As a first stage, our method uses a feed-forward image deblurring model for which we propose a suitable training scheme that can improve both tracking and mapping modules. Frames that are successfully deblurred by the feed-forward network obtain refined poses and depth through local-global multi-view optimization and loop closure. Frames that fail the first-stage deblurring are directly modeled through the global 3DGS representation and an additional blur network to model multiple blurred sub-frames and simulate the blur formation process in 3D space, thereby learning sharp details and refined sub-frame poses. Experiments on several real-world datasets demonstrate consistent improvements in both pose estimation and sharp reconstruction results of geometry and texture.
PaperID: 1278,   Poster  https://arxiv.org/pdf/2603.05971    
Authors: Dingkun Yan, Xinrui Wang, Ru Wang, Zhuoru Li, Jinze Yu, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo
Title: Towards High-resolution and Disentangled Reference-based Sketch Colorization
Abstract: Sketch colorization models have been widely studied to automate and assist in the creation of animation frames and digital illustrations. However, current methods still fall short of industry-standard requirements for high-resolution synthesis and precise controllability of details. To further enhance the synthesis quality and controllability, we propose an image-referenced sketch colorization method based on the powerful SDXL backbone and leverage sketches as spatial guidance and RGB images as color references. A split cross-attention mechanism is coupled with spatial masks to separately colorize the foreground and background regions to avoid spatial entanglement. A tagger network trained on a massive anime-style image dataset is employed to extract attribution-level information from reference images and integrated into the pipeline to provide precise control signals for synthesis. However, the increased resolution and number of attention layers in the SDXL backbone and precise reference information from the tagger network cause severe entanglement during colorization. We consequently combine a foreground encoder and a background encoder for disentanglement and better synthesis quality. Furthermore, a high-quality annotated and paired sketch colorization dataset is collected for fine-tuning. The proposed method is the first to achieve high-resolution, high-quality sketch colorization with precise control, and clearly outperforms existing methods in quantitative and qualitative validations, as well as user studies in both quality and controllability. An ablation study reveals the influence of each component. Code and dataset will be made publicly available upon paper acceptance.
PaperID: 1279,   Poster  https://arxiv.org/pdf/2512.19048    
Authors: Utae Jeong, Sumin In, Hyunju Ryu, Jaewan Choi, Feng Yang, Jongheon Jeong, Seungryong Kim, Sangpil Kim
Title: WaTeRFlow: Watermark Temporal Robustness via Flow Consistency
Abstract: Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V), in which per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder–decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.
PaperID: 1280,   Poster  https://arxiv.org/pdf/2511.19435    
Authors: Zechuan Zhang, Zhenyuan Chen, Zongxin Yang, Yi Yang
Title: Are Image-to-Video Models Good Zero-Shot Image Editors?
Abstract: Large-scale video diffusion models exhibit strong world-simulation and temporal reasoning capabilities, yet their potential as zero-shot image editors remains underexplored. We present IF-Edit (Image Edit by Generating Frames), a tuning-free framework that repurposes pre-trained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three core obstacles—prompt misalignment, redundant temporal latents, and blurry late-stage frames—via: (1) a Chain-of-Thought Prompt Enhancement module that reformulates static editing instructions into temporally grounded reasoning prompts; (2) a Temporal Latent Dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving global semantics and temporal coherence; and (3) a Self-Consistent Post-Refinement step that refines the sharpest late-stage frame through a brief still-video trajectory, leveraging the video prior for sharper and more faithful results. Extensive experiments across four public benchmarks—covering non-rigid deformations, physical and temporal reasoning, and general instruction editing—show that IF-Edit achieves strong performance on non-rigid and reasoning-centric tasks while remaining competitive on general-purpose edits. Our study offers a systematic view of video diffusion models as image editors, revealing their unique strengths, limitations, and a simple recipe for unified video–image generative reasoning.
PaperID: 1281,   Poster  https://arxiv.org/pdf/2511.14152    
Authors: Laura Dodds, Maisy Lam, Waleed Akbar, Yibo Cheng, Fadel Adib
Title: Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion
Abstract: We present Wave-Former, a novel method capable of high-accuracy 3D shape reconstruction for completely occluded, diverse, everyday objects. This capability can open new applications spanning robotics, augmented reality, and logistics. Our approach leverages millimeter-wave (mmWave) wireless signals, which can penetrate common occlusions and reflect off hidden objects. In contrast to past mmWave reconstruction methods, which suffer from limited coverage and high noise, Wave-Former introduces a physics-aware shape completion model capable of inferring full 3D geometry. At the heart of Wave-Former's design is a novel three-stage pipeline which bridges raw wireless signals with recent advancements in vision-based shape completion by incorporating physical properties of mmWave signals. The pipeline proposes candidate geometric surfaces, employs a transformer-based shape completion model designed specifically for mmWave signals, and finally performs entropy-guided surface selection. This enables Wave-Former to be trained using entirely synthetic point-clouds, while demonstrating impressive generalization to real-world data. In head-to-head comparisons with state-of-the-art baselines, Wave-Former raises recall from 54% to 72% while maintaining a high precision of 85%.
PaperID: 1282,   Poster  https://arxiv.org/pdf/2603.11675    
Authors: Haohua Chen, Tianze Zhou, Wei Zhu, Runqi Wang, Yandong Guan, Dejia Song, Yibo Chen, Xu Tang, Yao Hu, Lu Sheng, Zhiyong Wu
Title: PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On
Abstract: Virtual Try-On (VTON) has become a core capability for online retail, where realistic try-on results provide reliable fit guidance, reduce returns, and benefit both consumers and merchants. Diffusion-based VTON methods achieve photorealistic synthesis, yet often rely on intricate architectures such as auxiliary reference networks and suffer from slow sampling, making the trade-off between fidelity and efficiency a persistent challenge. We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. Under this perspective, our training framework is generic and transfers to broader image editing tasks. Moreover, the paired data produced by VTON constitutes a rich supervisory resource for training general-purpose editors. We present PROMO, a promptable virtual try-on framework built upon a Flow Matching DiT backbone with latent multi-modal conditional concatenation. By leveraging conditioning efficiency and self-reference mechanisms, our approach substantially reduces inference overhead. On standard benchmarks, PROMO surpasses both prior VTON methods and general image editing models in visual fidelity while delivering a competitive balance between quality and speed. These results demonstrate that flow-matching transformers, coupled with latent multi-modal conditioning and self-reference acceleration, offer an effective and training-efficient solution for high-quality virtual try-on.
PaperID: 1283,   Poster  https://arxiv.org/pdf/2512.03794    
Authors: Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye
Title: AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Abstract: Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
PaperID: 1284,   Poster  https://arxiv.org/pdf/2603.24934    
Authors: Sungho Moon, Seunghun Lee, Jiwan Seo, Sunghoon Im
Title: CVA: Context-aware Video-text Alignment for Video Temporal Grounding
Abstract: We propose Context-aware Video-text Alignment (CVA), a novel framework to address a significant challenge in video temporal grounding: achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. Our framework is built on three key components. First, we propose Query-aware Context Diversification (QCD), a new data augmentation strategy that ensures only semantically unrelated content is mixed in. It builds a video-text similarity-based pool of replacement clips to simulate diverse contexts while preventing the false negatives caused by query-agnostic mixing. Second, we introduce the Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that enforces semantic consistency at challenging temporal boundaries, making their representations robust to contextual shifts and hard negatives. Third, we introduce the Context-enhanced Transformer Encoder (CTE), a hierarchical architecture that combines windowed self-attention and bidirectional cross-attention with learnable queries to capture multi-scale temporal context. Through the synergy of these data-centric and architectural enhancements, CVA achieves state-of-the-art performance on major VTG benchmarks, including QVHighlights and Charades-STA. Notably, our method achieves a significant improvement of approximately 5 points in Recall@1 (R1) scores over state-of-the-art methods, highlighting its effectiveness in mitigating false negatives.
PaperID: 1285,   Poster  https://arxiv.org/pdf/2601.17950    
Authors: Matthew Walmer, Saksham Suri, Anirud Aggarwal, Abhinav Shrivastava
Title: UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders
Abstract: The space of task-agnostic feature upsampling has emerged as a promising area of research to efficiently create denser features from pre-trained visual backbones. These methods act as a shortcut to achieve dense features for a fraction of the cost by learning to map low-resolution features to high-resolution versions. While early works in this space used iterative upsampling approaches, more recent works have switched to cross-attention-based methods, which risk falling into the same efficiency scaling problems as the backbones they are upsampling. In this work, we demonstrate that iterative upsampling methods can still compete with cross-attention-based methods; moreover, they can achieve state-of-the-art performance with lower inference costs. We propose UPLiFT, an architecture for Universal Pixel-dense Lightweight Feature Transforms. We also propose an efficient Local Attender operator to overcome the limitations of prior iterative feature upsampling methods. This operator uses an alternative attentional pooling formulation defined fully locally. We show that our Local Attender allows UPLiFT to maintain stable features throughout upsampling, enabling state-of-the-art performance with lower inference costs than existing pixel-dense feature upsamplers. In addition, we apply UPLiFT to generative downstream tasks and show that it achieves competitive performance with state-of-the-art Coupled Flow Matching models for VAE feature upsampling. Altogether, UPLiFT offers a versatile and efficient approach to creating denser features.
PaperID: 1286,   Poster  https://arxiv.org/pdf/2512.24426    
Authors: Zhenghao Peng, Wenhao Ding, Yurong You, Yuxiao Chen, Wenjie Luo, Thomas Tian, Yulong Cao, Apoorva Sharma, Danfei Xu, Boris Ivanovic, Boyi Li, Yan Wang, Marco Pavone
Title: Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning
Abstract: Recent reasoning-augmented Vision-Language-Action (VLA) models have improved the interpretability of end-to-end autonomous driving by generating intermediate reasoning traces. Yet these models primarily describe what they perceive and intend to do, rarely questioning whether their planned actions are safe or appropriate. This work introduces Counterfactual VLA (CF-VLA), a self-reflective VLA framework that enables the model to reason about and revise its planned actions before execution. CF-VLA first generates time-segmented meta-actions that summarize driving intent, then performs a counterfactual reasoning pass conditioned on both the meta-actions and the visual context. This step simulates potential outcomes, identifies unsafe behaviors, and outputs corrected meta-actions that guide the final trajectory generation. To efficiently obtain such self-reflection capabilities, we propose a rollout–filter–label pipeline that mines high-value scenes from a base (non-counterfactual) VLA's rollouts and labels counterfactual reasoning traces for subsequent counterfactual training rounds. Experiments on large-scale driving datasets show that CF-VLA improves trajectory accuracy by up to 17.6%, enhances safety metrics, and exhibits adaptive thinking: it only enables counterfactual reasoning in challenging scenarios. By transforming reasoning traces from one-shot descriptions to causal self-correction signals, CF-VLA takes a step toward self-reflective autonomous driving agents that learn to think before they act.
PaperID: 1287,   Poster  https://arxiv.org/pdf/2602.23205    
Authors: Wenjia Wang, Liang Pan, Huaijin Pi, Yuke Lou, Xuqian Ren, Yifan Wu, Zhouyingcheng Liao, Lei Yang, Rishabh Dabral, Christian Theobalt, Taku Komura
Title: EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
Abstract: Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.
PaperID: 1288,   Poster  https://arxiv.org/pdf/2512.12090    
Authors: Samar Fares, Nurbek Tastan, Karthik Nandakumar
Title: SPDMark: Selective Parameter Displacement for Robust Video Watermarking
Abstract: The advent of high-quality video generation models has amplified the need for robust watermarking schemes that can be used to reliably detect and track the provenance of generated videos. Existing video watermarking methods based on both post-hoc and in-generation approaches fail to simultaneously achieve imperceptibility, robustness, and computational efficiency. This work introduces a novel framework for in-generation video watermarking called SPDMark (pronounced 'SpeedMark') based on selective parameter displacement of a video diffusion model. Watermarks are embedded into the generated videos by modifying a subset of parameters in the generative model. To make the problem tractable, the displacement is modeled as an additive composition of layer-wise basis shifts, where the final composition is indexed by the watermarking key. For parameter efficiency, this work specifically leverages low-rank adaptation (LoRA) to implement the basis shifts. During the training phase, the basis shifts and the watermark extractor are jointly learned by minimizing a combination of message recovery, perceptual similarity, and temporal consistency losses. To detect and localize temporal modifications in the watermarked videos, we use a cryptographic hashing function to derive frame-specific watermark messages from the given base watermarking key. During watermark extraction, maximum bipartite matching is applied to recover the correct frame order, even from temporally tampered videos. Evaluations on both text-to-video and image-to-video generation models demonstrate the ability of SPDMark to generate imperceptible watermarks that can be recovered with high accuracy and also establish its robustness against a variety of common video modifications.
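Illustrative sketch (not from the paper): the abstract above describes deriving frame-specific messages from a base key with a cryptographic hash and recovering frame order with maximum bipartite matching. The Python below is a minimal stand-in for that idea under loud assumptions: SHA-256 as the hash, 64-bit messages, a Hamming-distance threshold for admissible matches, and Kuhn's augmenting-path algorithm for the matching. The names `frame_message`, `recover_order`, and `max_errors` are hypothetical and do not come from SPDMark.

```python
import hashlib


def frame_message(base_key, frame_index, n_bits=64):
    """Derive a frame-specific bit list by hashing the base key with the frame index.

    Assumption: SHA-256 and a 4-byte big-endian index; the paper does not specify these.
    """
    digest = hashlib.sha256(base_key + frame_index.to_bytes(4, "big")).digest()
    bits = [(byte >> shift) & 1 for byte in digest for shift in range(7, -1, -1)]
    return bits[:n_bits]


def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def recover_order(extracted, base_key, max_errors=4):
    """Map each observed frame position to its original frame index.

    An edge (position i, original index j) exists when the extracted message is
    within `max_errors` bit flips of the expected message for j. A maximum
    bipartite matching over these edges is found with augmenting paths.
    Unmatched positions are returned as None.
    """
    n = len(extracted)
    n_bits = len(extracted[0])
    expected = [frame_message(base_key, j, n_bits) for j in range(n)]
    candidates = [
        [j for j in range(n) if hamming(extracted[i], expected[j]) <= max_errors]
        for i in range(n)
    ]
    owner = [None] * n  # owner[j] = observed position currently matched to index j

    def try_assign(i, seen):
        for j in candidates[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take j if it is free, or if its current owner can be moved elsewhere.
            if owner[j] is None or try_assign(owner[j], seen):
                owner[j] = i
                return True
        return False

    for i in range(n):
        try_assign(i, set())

    order = [None] * n
    for j, i in enumerate(owner):
        if i is not None:
            order[i] = j
    return order


if __name__ == "__main__":
    key = b"demo-key"  # hypothetical base watermarking key
    messages = [frame_message(key, i) for i in range(6)]
    shuffled = [3, 0, 5, 1, 4, 2]          # simulated temporal tampering
    observed = [list(messages[i]) for i in shuffled]
    observed[2][0] ^= 1                     # simulated single-bit extraction error
    print(recover_order(observed, key))
```

A thresholded, unweighted matching is the simplest variant; a weighted assignment over Hamming distances would be a natural alternative when extraction noise is higher.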
PaperID: 1289,   Poster  https://arxiv.org/pdf/2510.20586    
Authors: Muhammad Atif Butt, Alexandra Gomez-Villa, Tao Wu, Javier Vazquez-Corral, Joost van de Weijer, Kai Wang
Title: GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation
Abstract: Recent years have seen impressive advances in text-to-image generation, with image-generative and unified models producing high-quality images from text. Yet these models still struggle with fine-grained color controllability, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assesses color precision. Color is fundamental to human visual perception and communication, critical for applications from art to design workflows requiring brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities like interpreting RGB values or aligning with human expectations. To this end, we propose GenColorBench, the first comprehensive benchmark for T2I color generation, grounded in color systems like ISCC-NBS and CSS3/X11, including numerical colors which are absent elsewhere. With 44K color-focused prompts covering 400+ colors, it reveals models’ true capabilities via perceptual and automated assessments. Evaluations of popular T2I models using GenColorBench show performance variations, highlighting which color conventions models understand best and identifying failure modes. Our GenColorBench assessments can guide improvements in precise color generation. The benchmark will be made public upon acceptance.
PaperID: 1290,   Poster  https://arxiv.org/pdf/2603.29301    
Authors: Jiaju Ma, R. Kenny Jones, Jiajun Wu, Maneesh Agrawala
Title: Self-Consistency for LLM-based Motion Trajectory Generation and Verification
Abstract: Self-consistency has proven to be an effective technique for improving LLM performance on natural language reasoning tasks in a lightweight, unsupervised manner. In this work, we study how to adapt self-consistency to visual domains; specifically, we consider the generation and verification of LLM-produced motion graphics trajectories. Given a prompt (e.g., "Move the circle in a spiral path"), we first sample diverse motion trajectories from an LLM, and then identify groups of consistent trajectories via clustering. Our key insight is to model the family of shapes associated with a prompt as a prototype trajectory paired with a group of geometric transformations (e.g., rigid, similarity, affine). Two trajectories can then be considered consistent if one can be transformed into the other under the warps allowable by the transformation group. We propose an algorithm that automatically recovers a shape family, using hierarchical relationships between a set of candidate transformation groups. Our approach improves the accuracy of LLM-based trajectory generation by 4–6%. We further extend our method to support verification, observing 11% precision gains over VLM baselines.
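Illustrative sketch (not from the paper): the consistency criterion in the abstract above, where two trajectories agree if one maps onto the other under an allowed transformation group, can be tried out for the 2D similarity group (rotation, uniform scale, translation, no reflection). The Python below fits the least-squares similarity transform using complex arithmetic and then groups trajectories greedily. It assumes trajectories are already resampled to the same number of points in corresponding order; the tolerance `tol`, the function names, and the greedy grouping are hypothetical simplifications, not the paper's hierarchical algorithm.

```python
import math


def similarity_residual(src, dst):
    """Relative RMS error after the best fit of z -> a*z + b from src onto dst.

    Points are (x, y) pairs treated as complex numbers; a complex `a` encodes
    rotation and uniform scale, `b` encodes translation. The result is divided
    by the RMS spread of dst so that it is scale-free.
    """
    if len(src) != len(dst) or not src:
        raise ValueError("trajectories must be non-empty and equally sampled")
    n = len(src)
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean, q_mean = sum(p) / n, sum(q) / n
    pc = [z - p_mean for z in p]
    qc = [z - q_mean for z in q]
    p_energy = sum(abs(z) ** 2 for z in pc)
    q_energy = sum(abs(z) ** 2 for z in qc)
    if p_energy == 0 or q_energy == 0:
        raise ValueError("degenerate trajectory (all points coincide)")
    # Closed-form least-squares solution for the complex scale-rotation factor.
    a = sum(z.conjugate() * w for z, w in zip(pc, qc)) / p_energy
    err = sum(abs(a * z - w) ** 2 for z, w in zip(pc, qc))
    return math.sqrt(err / q_energy)


def consistent(src, dst, tol=0.05):
    """True when src matches dst up to a similarity transform within `tol`."""
    return similarity_residual(src, dst) <= tol


def largest_consistent_group(trajectories, tol=0.05):
    """Greedy grouping: each trajectory joins the first group whose prototype it matches."""
    groups = []
    for idx, traj in enumerate(trajectories):
        for group in groups:
            if consistent(trajectories[group[0]], traj, tol):
                group.append(idx)
                break
        else:
            groups.append([idx])
    return max(groups, key=len)


if __name__ == "__main__":
    ts = [0.2 * k for k in range(50)]
    spiral = [(t * math.cos(t), t * math.sin(t)) for t in ts]
    line = [(t, 0.0) for t in ts]
    # A rotated, scaled, and shifted copy of the spiral.
    c, s = math.cos(0.7), math.sin(0.7)
    moved = [(2 * (c * x - s * y) + 3, 2 * (s * x + c * y) - 1) for x, y in spiral]
    print(largest_consistent_group([spiral, line, moved]))
```

Swapping in a rigid or affine fit in place of `similarity_residual` would change which trajectories count as consistent, which is the role the candidate transformation groups play in the abstract.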
PaperID: 1291,   Poster  https://arxiv.org/pdf/2601.16771    
Authors: Jiahao Li, Yunpeng Bai, Yongkang Dai, Hao Guo, Hongping Gan, Yilei Shi
Title: AutoRegressive Generation with B-rep Holistic Token Sequence Representation
Abstract: Previous representation and generation approaches for B-rep relied on graph-based representations that disentangle geometric and topological features through decoupled computational pipelines, thereby precluding the application of sequence-based generative frameworks, such as transformer architectures that have demonstrated remarkable performance. In this paper, we propose BrepARG, the first attempt to encode B-rep's geometry and topology into a holistic token sequence representation, enabling sequence-based B-rep generation with an autoregressive architecture. Specifically, BrepARG encodes B-rep into 3 types of tokens: geometry and position tokens representing geometric features, and face index tokens representing topology. Then the holistic token sequence is constructed hierarchically, starting with constructing the geometry blocks (i.e., faces and edges) using the above tokens, followed by geometry block sequencing. Finally, we assemble the holistic sequence representation for the entire B-rep. We also construct a transformer-based autoregressive model that learns the distribution over holistic token sequences via next-token prediction, using a multi-layer decoder-only architecture with causal masking. Experiments demonstrate that BrepARG achieves state-of-the-art (SOTA) performance. BrepARG validates the feasibility of representing B-rep as holistic token sequences, opening new directions for B-rep generation.
PaperID: 1292,   Poster  https://arxiv.org/pdf/2604.05621    
Authors: Alexandros Delitzas, Chenyangguang Zhang, Alexey Gavryushin, Tommaso Di Mario, Boyang Sun, Rishabh Dabral, Leonidas Guibas, Christian Theobalt, Marc Pollefeys, Francis Engelmann, Daniel Barath
Title: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos
Abstract: We present FunREC, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunREC operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunREC surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5-10× lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation, hand-guided affordance mapping and robot-scene interaction. All code and data will be released.
PaperID: 1293,   Poster  https://arxiv.org/pdf/2603.06932    
Authors: Lin Zhao, Xinru Jiang, Xi Xiao, Qihui Fan, Lei Lu, Yanzhi Wang, Xue Lin, Octavia Camps, Pu Zhao, Jianyang Gu
Title: HierAmp: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation
Abstract: Dataset distillation often prioritizes global semantic proximity when creating small surrogate datasets for original large-scale ones. However, object semantics are inherently hierarchical. For example, the position and appearance of a bird's eyes are constrained by the outline of its head. Global proximity alone fails to capture how object-relevant structures at different levels support recognition. In this work, we investigate the contributions of hierarchical semantics to effective distilled data. We leverage the vision autoregressive (VAR) model whose coarse-to-fine generation mirrors this hierarchy and propose HierAmp to amplify semantics at different levels. At each VAR scale, we inject class tokens that dynamically identify salient regions and use their induced maps to guide amplification at that scale. This adds only marginal inference cost while steering synthesis toward discriminative parts and structures. Empirically, we find that semantic amplification leads to more diverse token choices in constructing coarse-scale object layouts. Conversely, at fine scales, the amplification concentrates token usage, increasing focus on object-related details. Across popular dataset distillation benchmarks, HierAmp consistently improves validation performance without explicitly optimizing global proximity, demonstrating the importance of semantic amplification for effective dataset distillation.
PaperID: 1294,   Poster  https://arxiv.org/pdf/2509.13762    
Authors: CHEN KAI, Jin Xiao, Leheng Zhang, Kexuan Shi, Shuhang Gu
Title: Task-Aware Image Signal Processor for Advanced Visual Perception
Abstract: In recent years, there has been a growing trend in computer vision towards exploiting RAW sensor data, which preserves richer information compared to conventional low-bit RGB images. Early studies mainly focused on enhancing visual quality, while more recent efforts aim to leverage the abundant information in RAW data to improve the performance of visual perception tasks such as object detection and segmentation. However, existing approaches still face two key limitations: large-scale ISP networks impose heavy computational overhead, while methods based on tuning traditional ISP pipelines are restricted by limited representational capacity. To address these issues, we propose Task-Aware Image Signal Processor (TA-ISP), a compact RAW-to-RGB framework that produces task-oriented representations for pretrained vision models. Instead of heavy dense convolutional pipelines, TA-ISP predicts a small set of lightweight, multi-scale modulation operators that act at global, regional, and pixel scales to reshape image statistics across different spatial extents. This factorized control significantly expands the range of spatially varying transforms that can be represented while keeping memory usage, computation, and latency tightly constrained. Evaluated on several RAW-domain detection and segmentation benchmarks under both daytime and nighttime conditions, TA-ISP consistently improves downstream accuracy while markedly reducing parameter count and inference time, making it well suited for deployment on resource-constrained devices.
PaperID: 1295,   Poster  https://arxiv.org/pdf/2508.21496    
Authors: Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu
Title: ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Video Understanding
Abstract: We revisit video hallucination in multimodal large language models (Video-MLLMs) from a semantic aggregation perspective. While prior work attributes hallucinations to language priors, missing frames, or visual encoder biases, these explanations overlook errors arising during the aggregation of correct frame-level semantics into event-level interpretations. We term this phenomenon Semantic Aggregation Hallucination (SAH), which becomes increasingly prevalent in complex, multi-event video understanding tasks with rich temporal dependencies. To systematically study SAH, we introduce ELV-Halluc, the first benchmark designed for fine-grained evaluation of semantic aggregation errors. Our experiments reveal that SAH correlates with both semantic complexity and rapid semantic transitions. We further propose mitigation strategies: improved positional encoding preserves temporal structure, and reinforcement learning such as DPO enhances the model’s ability to distinguish semantics within and across events. Using a curated 8K adversarial video-text pair dataset, our approach achieves consistent gains across benchmarks, including a 27.7% reduction in SAH rate on ELV-Halluc and Video-MME.
PaperID: 1296,   Poster  https://arxiv.org/pdf/2512.17492    
Authors: Oskar Kristoffersen, Alba Reinders, Morten Hannemose, Anders Dahl, Dim Papadopoulos
Title: MMLandmarks: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
Abstract: Geo-spatial analysis of our world benefits from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, and geographic coordinates). Current geo-spatial benchmarks have limited coverage across modalities, considerably restricting progress in the field, as current approaches cannot integrate all relevant modalities within a unified framework. We introduce the Multi-Modal Landmark dataset (MMLandmarks), a benchmark composed of four modalities: 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates from 18,557 distinct landmarks in the United States. The MMLandmarks dataset has a one-to-one correspondence across every modality, which enables training and benchmarking models on various geo-spatial tasks, including cross-view Ground-to-Satellite retrieval, ground and satellite geolocalization, Text-to-Image, and Text-to-GPS retrieval. We demonstrate broad generalization and competitive performance against off-the-shelf foundational models and specialized state-of-the-art models across different tasks by employing a simple CLIP-inspired baseline, illustrating the necessity for multimodal datasets to achieve broad geo-spatial understanding. The dataset, labels, and code will be released upon acceptance.
PaperID: 1297,   Poster  https://arxiv.org/pdf/2511.14899    
Authors: Daniel Gilo, Or Litany
Title: InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
Abstract: We address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify the scene according to a textual instruction while preserving consistency across all views. Existing methods, based on per-scene neural fields or temporal attention mechanisms, struggle in this setting, often producing artifacts and incoherent edits. We propose InstructMix2Mix (I-Mix2Mix), a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging its data-driven 3D prior for cross-view consistency. A key contribution is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations: incremental student updates across timesteps, a specialized teacher noise scheduler to prevent degeneration, and an attention modification that enhances cross-view coherence without additional cost. Experiments demonstrate that I-Mix2Mix significantly improves multi-view consistency while maintaining high per-frame edit quality.
PaperID: 1298,   Poster  https://arxiv.org/pdf/2511.20721    
Authors: Guillaume Letellier, Siddharth Srivastava, Frederic Jurie, Gaurav Sharma
Title: Foundry: Distilling 3D Foundation Models for the Edge
Abstract: Foundation models pretrained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient 'specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach, Foundry, trains a student to learn a compressed set of SuperTokens that reconstruct the teacher’s token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks—classification, part segmentation, and few-shot scenarios—approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resource-constrained hardware.
PaperID: 1299,   Poster  https://arxiv.org/pdf/2512.02441    
Authors: Junghwan Park, Woojin Cho, Junhyuk Heo, Darongsae Kwon, Kookjin Lee
Title: Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation
Abstract: Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. Meta-learning approaches explicitly learn good initializations, but they require an additional meta-training phase over many tasks, incur high training cost, and can be unstable. At the same time, the number of task-specific pre-trained models continues to grow, yet the question of how to transfer them to new tasks with minimal additional training remains relatively underexplored. We propose BOLT (Basis-Oriented Low-rank Transfer), a framework that reuses existing fine-tuned models not by merging weights, but instead by extracting an orthogonal, task-informed spectral basis and adapting within that subspace. In the offline phase, BOLT collects dominant singular directions from multiple task vectors and orthogonalizes them per layer to form reusable bases. In the online phase, we freeze these bases and train only a small set of diagonal coefficients per layer for the new task, yielding a rank-controlled update with very few trainable parameters. This design provides (i) a strong, training-free initialization for unseen tasks, obtained by pooling source-task coefficients—along with a lightweight rescaling step—while leveraging the shared orthogonal bases, and (ii) a parameter-efficient fine-tuning (PEFT) path that, in our experiments, achieves robust performance compared to common PEFT baselines as well as a representative meta-learned initialization. Our results show that constraining adaptation to a task-informed orthogonal subspace provides an effective alternative for unseen-task transfer.
PaperID: 1300,   Poster  https://arxiv.org/pdf/2510.27606    
Authors: Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
Title: Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
Abstract: Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
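Illustrative sketch (not the authors' code): the first pretext task named in the abstract above, shuffled patch reordering, shows why such tasks need no annotation, since the ground-truth answer is produced by the shuffle itself. The code below is a minimal pure-Python sketch; the grid size, the exact-match reward, and the function names `split_into_patches`, `make_reordering_task`, and `verifiable_reward` are assumptions made for illustration.

```python
import random

def split_into_patches(image, grid):
    """Split a square 2D list into grid x grid patches in row-major order."""
    size = len(image) // grid
    patches = []
    for gr in range(grid):
        for gc in range(grid):
            patch = [row[gc * size:(gc + 1) * size]
                     for row in image[gr * size:(gr + 1) * size]]
            patches.append(patch)
    return patches

def make_reordering_task(image, grid, seed):
    """Shuffle the patches; the label comes for free from the permutation."""
    patches = split_into_patches(image, grid)
    order = list(range(len(patches)))
    random.Random(seed).shuffle(order)
    shuffled = [patches[i] for i in order]
    # shuffled[k] was taken from original position order[k]
    return shuffled, order

def verifiable_reward(predicted_order, true_order):
    """Binary reward: 1.0 only if the predicted permutation is exactly right."""
    return 1.0 if list(predicted_order) == list(true_order) else 0.0

# Toy 4x4 "image" with unique pixel values, split into a 2x2 grid of patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
shuffled, answer = make_reordering_task(image, grid=2, seed=0)

# Applying the answer restores the original patch layout.
restored = [None] * len(shuffled)
for k, src in enumerate(answer):
    restored[src] = shuffled[k]
```

A model's output would be scored with `verifiable_reward`, so the reward signal can be checked mechanically without a human or LVLM judge.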
PaperID: 1301,   Poster  https://arxiv.org/pdf/2512.11438    
Authors: Tariq Berrada, John Nguyen, Karteek Alahari, Jakob Verbeek, Ricky T. Q. Chen
Title: Flowception: Temporally Expansive Flow Matching for Video Generation
Abstract: We present Flowception, a novel non-autoregressive and variable-length video generation framework. Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. Compared to autoregressive methods, Flowception alleviates error accumulation/drift, as the frame insertion mechanism during sampling serves as an efficient compression mechanism to handle long-term context. Compared to full-sequence flows, our method reduces FLOPs for training three-fold, while also being more amenable to local attention variants and allowing the length of videos to be learned jointly with their content. Quantitative experimental results show improved FVD and VBench metrics over autoregressive and full-sequence baselines, which is further validated with qualitative results. Finally, by learning to insert and denoise frames in a sequence, Flowception seamlessly integrates different tasks such as image-to-video generation and video interpolation.
PaperID: 1302,   Poster  https://arxiv.org/pdf/2601.08617    
Authors: Leo Fillioux, Omprakash Chakraborty, Ismail Ben Ayed, Paul-Henry Cournède, Stergios Christodoulidis, Maria Vakalopoulou, Jose Dolz
Title: SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning
Abstract: With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare or autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet, this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving their discriminative performance. Recent state-of-the-art work advocates enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we theoretically show in this work, the inherent gradients from fully orthogonal constraints will strongly push semantically related classes away, ultimately making the model overconfident. Based on our findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality-based approaches. Across a comprehensive empirical validation, we demonstrate that SoC consistently improves calibration performance, while also maintaining competitive discriminative capabilities.
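Illustrative sketch (not the paper's loss): the abstract above does not give the SoC formula, so the code below only demonstrates the general property of a Huber penalty that makes it a plausible fit for "smooth separation". It is quadratic for small inputs and linear beyond a threshold, so its gradient is capped. Applying it to pairwise cosine similarities of class prototypes, the threshold `delta`, and the names `huber`, `huber_grad`, and `separation_penalty` are all assumptions.

```python
import math

def huber(x, delta):
    """Standard Huber function: quadratic near zero, linear in the tails."""
    a = abs(x)
    return 0.5 * x * x if a <= delta else delta * (a - 0.5 * delta)

def huber_grad(x, delta):
    """Derivative of the Huber function; its magnitude never exceeds delta."""
    if abs(x) <= delta:
        return x
    return delta if x > 0 else -delta

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def separation_penalty(prototypes, delta):
    """Mean Huber penalty over all pairwise cosine similarities (assumed form)."""
    total, pairs = 0.0, 0
    for i in range(len(prototypes)):
        for j in range(i + 1, len(prototypes)):
            total += huber(cosine(prototypes[i], prototypes[j]), delta)
            pairs += 1
    return total / pairs

# Two semantically close prototypes and one unrelated prototype (toy values).
prototypes = [[1.0, 0.0], [0.9, 0.4], [0.0, 1.0]]
penalty = separation_penalty(prototypes, delta=0.3)

# For a highly similar pair (similarity 0.9), a squared penalty 0.5*s^2 pushes
# with gradient 0.9, whereas the Huber gradient is capped at delta.
squared_push = 0.9
huber_push = huber_grad(0.9, delta=0.3)
```

The capped gradient is the point of the comparison: highly similar (semantically related) prototypes are still separated, but with a bounded force rather than one that grows with their similarity.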
PaperID: 1303,   Poster  https://arxiv.org/pdf/2512.18312    
Authors: Zeyu Zhang, Wei Zhai, Jian Yang, Yang Cao
Title: MatE: Material Extraction from Single-Image via Geometric Prior
Abstract: The creation of high-fidelity, physically-based rendering (PBR) materials remains a bottleneck in many graphics pipelines, typically requiring specialized equipment and expert-driven post-processing. To democratize this process, we present MatE, a novel method for generating tileable PBR materials from a single image taken under unconstrained, real-world conditions. Given an image and a user-provided mask, MatE first performs coarse rectification using an estimated depth map as a geometric prior, and then employs a dual-branch diffusion model. Leveraging a learned consistency from rotation-aligned and scale-aligned training data, this model further rectifies residual distortions from the coarse result and translates it into a complete set of material maps, including albedo, normal, roughness and height. Our framework achieves invariance to the unknown illumination and perspective of the input image, allowing for the recovery of intrinsic material properties from casual captures. Through comprehensive experiments on both synthetic and real-world data, we demonstrate the efficacy and robustness of our approach, enabling users to create realistic materials from real-world images.
PaperID: 1304,   Poster  https://arxiv.org/pdf/2603.19158    
Authors: Kwanyoung Lee, SeungJu Cha, Yebin Ahn, Hyunwoo Oh, Sungho Koh, Dong-Jin Kim
Title: Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation
Abstract: Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB), a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie's identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.
PaperID: 1305,   Poster  https://arxiv.org/pdf/2512.10821    
Authors: Leijie Wang, Otilia Stretcu, Wei Qiao, Thomas Denby, Krishnamurthy Viswanathan, Enming Luo, Chun-Ta Lu, Tushar Dogra, Ranjay Krishna, Ariel Fuxman
Title: Agile Deliberation: Concept Deliberation for Subjective Visual Classification
Abstract: From content moderation to content curation, applications requiring vision classifiers for visual concepts are rapidly expanding. Existing human-in-the-loop approaches typically assume users begin with a clear, stable concept understanding to be able to provide high-quality supervision. In reality, users often start with a vague idea and must iteratively refine it through "concept deliberation", a practice we uncovered through structured interviews with content moderation experts. We operationalize the common strategies in deliberation used by real content moderators into a human-in-the-loop framework called Agile Deliberation that explicitly supports evolving and subjective concepts. The system supports users in defining the concept for themselves by exposing them to borderline cases. The system does this with two deliberation stages: (1) concept scoping, which decomposes the initial concept into a structured hierarchy of sub-concepts, and (2) concept iteration, which surfaces semantically borderline examples for user reflection and feedback to iteratively align an image classifier with the user’s evolving intent. Since concept deliberation is inherently subjective and interactive, we painstakingly evaluate the framework through 18 user sessions, each 1.5h long, rather than standard benchmarking datasets. We find that Agile Deliberation achieves 7.5% higher F1 scores than automated decomposition baselines and more than 3% higher than manual deliberation, while participants reported clearer conceptual understanding and lower cognitive effort.
PaperID: 1306,   Poster  https://arxiv.org/pdf/2511.19476    
Authors: Jin Cui, Boran Zhao, Jiajun Xu, Jiaqi Guo, Shuo Guan, Pengju Ren
Title: FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection
Abstract: Coreset selection compresses large datasets into compact, representative subsets, reducing the energy and computational burden of training deep neural networks. Existing methods are either: (i) DNN-based, which are inherently coupled with network-specific parameters, inevitably introducing architectural bias and compromising generalization; or (ii) DNN-free, which utilize heuristics that lack rigorous theoretical guarantees for stability and accuracy. Neither approach explicitly constrains distributional equivalence of the representative subsets, largely because continuous distribution matching is broadly considered inapplicable to discrete dataset sampling. Furthermore, prevalent distribution metrics (e.g., MSE, KL, MMD, and CE) are often incapable of accurately capturing higher-order moment differences. These deficiencies lead to suboptimal coreset performance, preventing the selected coreset from being truly equivalent to the original dataset. In this work, we propose FAST (Frequency-domain Aligned Sampling via Topology), the first DNN-free distribution-matching coreset selection framework that formulates coreset selection as a graph-constrained optimization problem grounded in spectral graph theory and employs the Characteristic Function Distance (CFD) to capture full distributional information (i.e., all moments and intrinsic correlations) in the frequency domain. We further discover that naive CFD suffers from a “vanishing phase gradient” issue in medium and high-frequency regions; to address this, we introduce an Attenuated Phase-Decoupled CFD. Furthermore, for better convergence, we design a Progressive Discrepancy-Aware Sampling strategy that progressively schedules frequency selection from low to high. This preserves global structures before refining local details, enabling accurate matching with few frequencies while preventing overfitting. Extensive experiments demonstrate that FAST significantly outperforms state-of-the-art coreset selection methods across all evaluated benchmarks, achieving an average accuracy gain of 9.12%. Compared to other baseline coreset methods, it reduces power consumption by 96.57% and achieves a 2.2× average speedup even on CPU with 1.7GB of memory, underscoring its high performance and energy efficiency.
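Illustrative sketch (not the authors' code): the characteristic function distance mentioned in the abstract above compares two sample sets through their empirical characteristic functions, phi(t) = mean of exp(i·t·x), evaluated at a set of frequencies. The code below computes a plain, unweighted version for 1-D data in pure Python. It does not implement the Attenuated Phase-Decoupled variant, the graph constraints, or the progressive frequency schedule; the frequency grid and the names `empirical_cf` and `cf_distance` are assumptions.

```python
import cmath

def empirical_cf(samples, t):
    """Empirical characteristic function of 1-D samples at frequency t."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

def cf_distance(full, subset, freqs):
    """Mean squared gap between the two empirical CFs over the given frequencies."""
    return sum(abs(empirical_cf(full, t) - empirical_cf(subset, t)) ** 2
               for t in freqs) / len(freqs)

# Toy dataset: 41 evenly spaced points in [-2, 2].
full = [i / 10 for i in range(-20, 21)]

# Two candidate coresets of 9 points each.
spread = full[::5]    # covers the whole range
clumped = full[:9]    # covers only the left edge

freqs = [0.5, 1.0, 2.0]  # assumed frequency grid, low to high
d_spread = cf_distance(full, spread, freqs)
d_clumped = cf_distance(full, clumped, freqs)
```

In this toy case the subset that covers the whole range has the smaller distance, which is the behavior a distribution-matching selection criterion should reward.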
PaperID: 1307,   Poster  https://arxiv.org/pdf/2511.11851    
Authors: Wei-Jia Chen, Min-Yan Tsai, Cheng-Yi Lee, Chia-Mu Yu
Title: Defending Unauthorized Model Merging via Dual-Stage Weight Protection
Abstract: Traditional multi-task learning often relies on separately fine-tuned models for each task, leading to high training costs and inefficiency. Recent advances in model merging alleviate this issue by linearly combining parameters from multiple task-specific models to create new multi-task models. Such approaches can match or even surpass fine-tuning performance while greatly reducing computational overhead. However, the increasing openness of model-sharing platforms also introduces intellectual property risks. Malicious users can easily merge publicly available models to build new commercial systems without authorization, undermining the rights of original developers. To address this emerging threat, we propose MergeGuard, a two-stage preprocessing mechanism that protects models against unauthorized merging. MergeGuard subtly adjusts a model’s internal parameter structure to maintain its original task performance while degrading the performance of any merged derivatives. The key challenge lies in defending against unpredictable merging behaviors, as the attacker’s chosen models, strategies, and tasks remain unknown. MergeGuard effectively achieves this balance, ensuring normal functionality before merging but significant performance degradation afterward, to safeguard model ownership in open AI ecosystems.
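Illustrative sketch (not MergeGuard itself): the abstract above describes merging as "linearly combining parameters from multiple task-specific models". One common form of this is to add a scaled sum of per-task weight differences to a shared base model, often called task arithmetic. The code below shows that operation in pure Python so the threat being defended against is concrete; the dictionary layout, the `scale` parameter, and the function name `merge_linear` are assumptions.

```python
def merge_linear(base, finetuned_models, scale):
    """Merge models as base + scale * sum_i (finetuned_i - base), per parameter."""
    merged = {}
    for name, base_weights in base.items():
        merged[name] = [
            b + scale * sum(model[name][i] - b for model in finetuned_models)
            for i, b in enumerate(base_weights)
        ]
    return merged

# Toy parameter dictionaries: one layer with two weights.
base = {"w": [1.0, 2.0]}
task_a = {"w": [1.5, 2.0]}   # fine-tuning for task A changed the first weight
task_b = {"w": [1.0, 1.0]}   # fine-tuning for task B changed the second weight

merged_full = merge_linear(base, [task_a, task_b], scale=1.0)
merged_half = merge_linear(base, [task_a, task_b], scale=0.5)
```

Because the operation needs only the published weights, anyone with access to the checkpoints can perform it, which is why a defense has to act on the weights themselves before release.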
PaperID: 1308,   Poster  https://arxiv.org/pdf/2603.02945    
Authors: Bo Xu, Haotian Wu, Hehai Lin, Weiquan Huang, Beier Zhu, Yao Shu, Chengwei Qin
Title: ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation
Abstract: Model merging aims to combine multiple task-specific expert models into a single model while preserving generalization across diverse tasks. However, interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. Despite recent progress, resolving this interference without data access, retraining, or architectural modification remains a fundamental challenge. This paper provides a theoretical analysis demonstrating that the input covariance of each task, which is a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting. Building on this insight, we introduce ACE-Merging, an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference. Our approach features a principled, closed-form solution that contrasts with prior iterative or heuristic methods. Extensive experiments on both vision and language benchmarks demonstrate that ACE-Merging sets a new state-of-the-art among data-free methods. It consistently outperforms existing baselines; for example, ACE-Merging achieves an average absolute improvement of 4% over the previous methods across seven tasks on GPT-2. Owing to its efficient closed-form formulation, ACE-Merging delivers superior performance with a modest computational cost, providing a practical and theoretically grounded solution for model merging.
PaperID: 1309,   Poster  https://arxiv.org/pdf/2602.22779    
Authors: Chenhao Zheng, Jieyu Zhang, Jianing Zhang, Weikai Huang, Ashutosh Kumar, Quan Kong, Oncel Tuzel, Chun-Liang Li, Ranjay Krishna
Title: TrajTok: Learning Trajectory Tokens enables better Video Understanding
Abstract: Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex, external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight, efficient, and yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision–language models (TrajVLM) with especially strong performance in long-video reasoning.
PaperID: 1310,   Poster  https://arxiv.org/pdf/2603.20012    
Authors: Zheng Gao, Debin Meng, Yunqi Miao, Zhensong Zhang, Songcen Xu, Ioannis Patras, Jifei Song
Title: Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features
Abstract: Current diffusion-based makeup transfer methods commonly use the makeup information encoded by off-the-shelf foundation models (e.g., CLIP) as a condition to preserve the makeup style of the reference image in the generation. Although effective, these works have two main limitations: (1) foundation models pre-trained for generic tasks struggle to capture makeup styles; (2) the makeup features of the reference image are injected into the diffusion denoising model as a whole for global makeup transfer, overlooking facial region-aware makeup features (e.g., eyes and mouth) and limiting the controllability of region-specific makeup transfer. To address these, we propose Facial Region-Aware Makeup features (FRAM), which has two stages: (1) makeup CLIP fine-tuning; (2) identity and facial region-aware makeup injection. For makeup CLIP fine-tuning, unlike prior works using off-the-shelf CLIP, we synthesize annotated makeup style data using GPT-o3 and a text-driven image editing model, and then use the data to train a makeup CLIP encoder through self-supervised and image-text contrastive learning. For identity and facial region-aware makeup injection, we construct before-and-after makeup image pairs from the edited images in stage 1 and then use them to learn to inject the identity of the source image and the makeup of the reference image into the diffusion denoising model for makeup transfer. Specifically, we use learnable tokens to query the makeup CLIP encoder to extract facial region-aware makeup features for makeup injection, which is learned via an attention loss to enable regional control. As for identity injection, we use a ControlNet Union to encode the source image and its 3D mesh simultaneously. The experimental results verify the superiority of our regional controllability and our makeup transfer performance.
PaperID: 1311,   Poster  https://arxiv.org/pdf/2508.17034    
Authors: Jiayi Li, Yuxin Yao, Qiuhang Lu, Juyong Zhang
Title: DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration
Abstract: Noisy, partially overlapping data and the need for real-time processing pose major challenges for rigid registration. Considering that feature-based matching can handle large transformation differences but suffers from limited accuracy, while local geometry-based matching can achieve fine-grained local alignment but relies heavily on a good initial transformation, we propose a novel dual-space paradigm to fully leverage the strengths of both approaches. First, we introduce an efficient filtering mechanism that incorporates a computationally lightweight single-point RANSAC algorithm followed by a refinement module to eliminate unreliable feature-based correspondences. Subsequently, we treat filtered correspondences as anchor points, extract geometric proxies, and formulate an effective objective function with a tailored solver to estimate the transformation. Experiments verify our method's effectiveness, as shown by achieving up to a 32x CPU-time speedup over MAC on KITTI with comparable accuracy. The code will be made publicly available.
PaperID: 1312,   Poster  https://arxiv.org/pdf/2601.09665    
Authors: Yuchen Wu, Jiahe Li, Xiaohan Yu, Lina Yu, Jin Zheng, Xiao Bai
Title: SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings
Abstract: Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustment that anchors current estimates to the reference scale through explicit 3D coordinate constraints decoded from the scene coordinate embeddings. Experiments on KITTI, Waymo, and vKITTI demonstrate substantial improvements: our method reduces absolute trajectory error by 8.36 m on KITTI compared to the best prior approach, while maintaining 36 FPS and achieving scale consistency across large-scale scenes.
PaperID: 1313,   Poster  https://arxiv.org/pdf/2512.19949    
Authors: Zixuan Huang, Xiang Li, Zhaoyang Lv, James Rehg
Title: How Much 3D Do Video Foundation Models Encode?
Abstract: Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.
PaperID: 1314,   Poster  https://arxiv.org/pdf/2602.24014    
Authors: Na Min An, Yoonna Jang, Yusuke Hirota, Ryo Hachiuma, Isabelle Augenstein, Hyunjung Shim
Title: Interpretable Debiasing of Vision-Language Models for Social Fairness
Abstract: The rapid advancement of Vision-Language Models (VLMs) has raised growing concern that their black-box reasoning processes could lead to unintended forms of social bias. Current debiasing approaches focus on mitigating surface-level bias signals through post-hoc learning or test-time algorithms, while leaving the internal dynamics of the model largely unexplored. In this work, we introduce an interpretable, model-agnostic bias mitigation framework, DEBIASLENS, that localizes social attribute neurons in VLMs through sparse autoencoders (SAEs) applied to multimodal encoders. Building upon the disentanglement ability of SAEs, we train them on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics, including those that are less represented. By selectively deactivating the social neurons most strongly tied to bias for each group, we effectively mitigate socially biased behaviors of VLMs without degrading their semantic knowledge. Our research lays the groundwork for future auditing tools, prioritizing social fairness in emerging real-world AI systems.
PaperID: 1315,   Poster  https://arxiv.org/pdf/2511.18706    
Authors: Zhaoyang Jia, Zihan Zheng, Naifu Xue, Jiahao Li, Bin Li, Zongyu Guo, Xiaoyi Zhang, Houqiang Li, Yan Lu
Title: CoD: A Diffusion Foundation Model for Image Compression
Abstract: Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion. However, text conditioning is suboptimal from a compression perspective, hindering the potential of downstream diffusion codecs, particularly at ultra-low bitrates. To address this, we introduce CoD, the first Compression-oriented Diffusion foundation model, trained from scratch to enable end-to-end optimization of both compression and generation. CoD is not a fixed codec but a general foundation model designed for various diffusion-based codecs. It offers several advantages: (i) high compression efficiency: replacing Stable Diffusion with CoD in downstream codecs like DiffC achieves SOTA results, especially at ultra-low bitrates (e.g., 0.0039 bpp); (ii) low-cost and reproducible training: 300× faster training than Stable Diffusion (~20 vs. ~6,250 A100 GPU days) on entirely open image-only datasets; (iii) new insights: e.g., we find that pixel-space diffusion can achieve VTM-level PSNR with high perceptual quality and can outperform GAN-based codecs using fewer parameters. We hope CoD lays the foundation for future diffusion codec research. Code will be released.
PaperID: 1316,   Poster  https://arxiv.org/pdf/2602.16968    
Authors: Dahye Kim, Deepti Ghadiyaram, Raghudeep Gadde
Title: DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers
Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to 3.52× and 3.2× speedup on FLUX-1.Dev and Wan 2.1, respectively, without compromising the generation quality and prompt adherence.
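Illustrative sketch (not the authors' scheduler): the key insight in the abstract above, coarse patches early and fine patches late, can be made concrete by counting tokens per denoising step. The code below uses a fixed step threshold to switch patch sizes, which is a simplifying assumption; the described method also adapts to content complexity. The latent size, patch sizes, and the names `num_tokens`, `patch_schedule`, and `relative_token_cost` are hypothetical.

```python
def num_tokens(height, width, patch):
    """Number of tokens when an H x W latent is cut into patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("latent size must be divisible by the patch size")
    return (height // patch) * (width // patch)

def patch_schedule(num_steps, coarse_patch, fine_patch, switch_fraction):
    """Coarse patches for the first switch_fraction of steps, fine patches after."""
    switch = int(num_steps * switch_fraction)
    return [coarse_patch if step < switch else fine_patch
            for step in range(num_steps)]

def relative_token_cost(height, width, schedule, fine_patch):
    """Total tokens processed under the schedule, relative to all-fine patches."""
    baseline = num_tokens(height, width, fine_patch) * len(schedule)
    used = sum(num_tokens(height, width, p) for p in schedule)
    return used / baseline

# Assumed setup: 64x64 latent, 10 denoising steps, switch halfway.
schedule = patch_schedule(num_steps=10, coarse_patch=4, fine_patch=2,
                          switch_fraction=0.5)
cost = relative_token_cost(64, 64, schedule, fine_patch=2)
```

This counts tokens only. Since self-attention cost grows faster than linearly in the token count, the actual compute saving from coarse steps would typically be larger than the token ratio suggests.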
PaperID: 1317,   Poster  https://arxiv.org/pdf/2603.24166    
Authors: Xu Zhang, Zhe Chen, Jing Zhang, Dacheng Tao
Title: Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection
Abstract: Most referring object detection (ROD) models, especially modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, face severe label scarcity. In such regimes, end-to-end grounding detectors need to learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: Can explicit reasoning priors help models learn more efficiently when data is scarce? To explore this, we first introduce a Data-efficient Referring Object Detection (De-ROD) task, which is a benchmark protocol for measuring ROD performance in low-data and few-shot settings. We then propose HeROD (Heuristic-inspired ROD), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, which are interpretable signals derived from the referring phrase, into three stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors promise to improve label efficiency and convergence performance. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong grounding baselines in scarce-label regimes. More broadly, our results suggest that integrating simple, interpretable reasoning priors provides a practical and extensible path toward better data-efficient vision–language understanding.
PaperID: 1318,   Poster  https://arxiv.org/pdf/2603.27105    
Authors: Girish Ganesan Ganesan, Yuliang Guo, Liu Ren, Xiaoming Liu
Title: UniDAC: Universal Metric Depth Estimation for Any Camera
Abstract: Monocular metric depth estimation (MMDE) is a core challenge in computer vision, playing a pivotal role in real-world applications that demand accurate spatial understanding. Although prior works have shown promising zero-shot performance in MMDE, they often struggle with generalization across diverse camera types, such as fisheye and 360° cameras. Recent advances have addressed this through unified camera representations or canonical representation spaces, but they require either including large FoV camera data during training or separately trained models for different domains. We propose UniDAC, an MMDE framework that presents universal robustness in all domains and generalizes across diverse cameras using a single model. We achieve this by decoupling metric depth estimation into relative depth prediction and spatially varying scale estimation, enabling robust performance across different domains. We propose a lightweight Depth-Guided Scale Estimation module that upsamples a coarse scale map to high resolution using the relative depth map as guidance to account for local scale variations. Furthermore, we introduce RoPE-φ, a distortion-aware positional embedding that respects the spatial warping in Equi-Rectangular Projections (ERP) via latitude-aware weighting. UniDAC achieves state-of-the-art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets.
PaperID: 1319,   Poster  https://arxiv.org/pdf/2603.12903    
Authors: Yinuo Jiang, Jun Cheng, Yiran Wang, Cheng Cheng
Title: Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis
Abstract: Neural Radiance Fields (NeRF) have shown remarkable success in image novel view synthesis (NVS), inspiring extensions to LiDAR NVS. However, most methods heavily rely on accurate camera poses for scene reconstruction. The sparsity and textureless nature of LiDAR data also present distinct challenges, leading to geometric holes and discontinuous surfaces. To address these issues, we propose SG-NLF, a pose-free LiDAR NeRF framework that integrates spectral information with geometric consistency. Specifically, we design a hybrid representation based on spectral priors to reconstruct smooth geometry. For pose optimization, we construct a confidence-aware graph based on feature compatibility to achieve global alignment. In addition, an adversarial learning strategy is introduced to enforce cross-frame consistency, thereby enhancing reconstruction quality. Comprehensive experiments demonstrate the effectiveness of our framework, especially in challenging low-frequency scenarios. Compared to previous state-of-the-art methods, SG-NLF improves reconstruction quality and pose accuracy by over 35.8% and 68.8%, respectively. Our work can provide a novel perspective for LiDAR view synthesis.
PaperID: 1320,   Poster  https://arxiv.org/pdf/2512.03126    
Authors: Shan Zhang, Aotian Chen, Kai Zou, Jindong Gu, Yuan Xue, Anton van den Hengel
Title: Hierarchical Process Reward Models are Symbolic Vision Learners
Abstract: Symbolic computer vision represents diagrams through explicit logical rules and structured representations, enabling interpretable understanding in machine vision. This requires fundamentally different learning paradigms from pixel-based visual models. Symbolic visual learners parse diagrams into geometric primitives (points, lines, and shapes), whereas pixel-based learners operate on textures and colors. We propose a novel self-supervised symbolic auto-encoder that encodes diagrams into structured primitives and their interrelationships within the latent space, and decodes them through our executable engine to reconstruct the input diagrams. Central to this architecture is SymHPR (Symbolic Hierarchical Process Reward Modeling), which applies hierarchical step-level parsing rewards to enforce point-on-line, line-on-shape, and shape-on-relation consistency. Since vanilla reinforcement learning exhibits poor exploration in the policy space during diagram reconstruction, we introduce stabilization mechanisms to balance exploration and exploitation. We fine-tune our symbolic encoder on downstream tasks, developing a neuro-symbolic system that integrates the reasoning capabilities of neural networks with the interpretability of symbolic models through reasoning-grounded visual rewards. Evaluations across reconstruction, perception, and reasoning tasks demonstrate the effectiveness of our approach: achieving a 98.2% reduction in MSE for geometric diagram reconstruction, surpassing GPT-4o by 0.6% with a 7B model on chart reconstruction, improving by 13% on the MathGlance perception benchmark, and improving by 3% on the MathVerse and GeoQA reasoning benchmarks.
PaperID: 1321,   Poster  https://arxiv.org/pdf/2603.03564    
Authors: Shengqiong Wu, Lanhu Wu, Mingyang Bao, Wenhao Xu, Hanwang Zhang, Shuicheng Yan, Hao Fei, Tat-seng Chua
Title: Modeling Cross-vision Synergy for Unified Large Vision Model
Abstract: Recent advances in large vision models (LVMs) have shifted from modality-specific designs toward unified architectures that jointly process images, videos, and 3D data. However, existing unified LVMs primarily pursue functional integration, while overlooking the deeper goal of cross-vision synergy: the ability to reason over complementary priors across visual modalities. To address this, we present PolyV, a unified LVM that achieves cross-vision synergy at both the architectural and training levels. Architecturally, PolyV adopts a sparse Mixture-of-Experts LVM coordinated by a dynamic modality router, allowing each expert to specialize in modality-specific priors while enabling bidirectional interaction and mutual refinement across modalities. Training-wise, a synergy-aware paradigm combines modality-specific pretraining with coarse-to-fine synergy tuning via knowledge distillation and object-/relation-level alignment. Extensive experiments on ten benchmarks spanning image, video, and 3D understanding, including synergy-focused datasets requiring spatial or temporal priors, demonstrate that PolyV consistently outperforms existing models, achieving over 10% average improvement over its backbone. Overall, PolyV establishes a unified framework for synesthetic visual reasoning, advancing toward truly synergistic large vision models.
PaperID: 1322,   Poster  https://arxiv.org/pdf/2604.04444    
Authors: Weihao Cao, Runqi Wang, Xiaoyue Duan, Jinchao Zhang, Ang Yang, Liping Jing
Title: Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection
Abstract: Open-vocabulary object detection (OVOD) enables models to detect any object category, including unseen ones. Benefiting from large-scale pre-training, existing OVOD methods achieve strong detection performance on general scenarios (e.g., OV-COCO) but suffer severe performance drops when transferred to downstream tasks with substantial domain shifts. This degradation stems from the scarcity and weak semantics of category labels in domain-specific tasks, as well as the inability of existing models to capture auxiliary semantics beyond coarse-grained category labels. To address these issues, we propose HSA-DINO, a parameter-efficient semantic augmentation framework for enhancing open-vocabulary object detection. Specifically, we propose a multi-scale prompt bank that leverages image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, progressively enriching textual representations from coarse to fine-grained levels. Furthermore, we introduce a semantic-aware router that dynamically selects the appropriate semantic augmentation strategy during inference, thereby preventing parameter updates from degrading the generalization ability of the pre-trained OVOD model.
PaperID: 1323,   Poster  https://arxiv.org/pdf/2603.11525    
Authors: Jian Zou, Xiaoyu Xu, Zhihua Wang, Yilin Wang, Balu Adsumilli, Kede Ma
Title: MDS-VQA: Model-Informed Data Selection for Video Quality Assessment
Abstract: Recent advances in learning-based video quality assessment (VQA) have achieved remarkable progress, yet the two fundamental components, model and data, are often studied in isolation. Model-centric approaches tend to design superior architectures over fixed and repeatedly used datasets, risking overfitting to benchmark-specific characteristics. In contrast, data-centric efforts emphasize constructing large-scale datasets through costly and time-consuming subjective experiments, typically overlooking the strengths and failure modes of existing VQA models. This separation limits progress, leading to brittle generalization and inefficient use of annotation resources. To bridge the gap, we introduce MDS-VQA, a model-informed data selection method that integrates model-centric and data-centric VQA. In its specific instantiation, a learned failure prediction module trained via a learning-to-rank formulation is combined with a content diversity measure based on deep semantic video features. Experiments across multiple VQA datasets demonstrate that MDS-VQA effectively spots diverse and challenging samples that expose model weaknesses. The selected videos are proven to be particularly informative for fine-tuning, offering a principled path toward constructing more challenging datasets and developing more generalizable and robust VQA models.
PaperID: 1324,   Poster  https://arxiv.org/pdf/2604.16388    
Authors: Sebin Lee, Jumin Lee, Taeyeon Kim, Youngju Na, Woobin Im, Sung-Eui Yoon
Title: Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering
Abstract: Rapidly-exploring random trees (RRTs) have been widely adopted for robot motion planning due to their robustness and theoretical guarantees. However, existing RRT-based planners require explicit goal configurations specified as numerical joint angles, while many practical applications provide goal specifications through visual observations such as images or demonstration videos where precise goal configurations are unavailable. In this paper, we propose visual-RRT (vRRT), a motion planner that enables visual-goal planning by unifying gradient-based exploitation from differentiable robot rendering with sampling-based exploration from RRTs. We further introduce (1) a frontier-based exploration-exploitation strategy that adaptively prioritizes visually promising search regions, and (2) inertial gradient tree expansion that inherits optimization states across tree branches for momentum-consistent gradient exploitation. Extensive experiments across various robot manipulators including Franka, UR5e, and Fetch demonstrate that vRRT achieves effective visual-goal planning in both simulated and real-world settings, bridging the gap between sampling-based planning and vision-centric robot applications. Our code will be released publicly.
PaperID: 1325,   Poster  https://arxiv.org/pdf/2602.20689    
Authors: Bedrettin Cetinkaya, Sinan Kalkan, Emre Akbas
Title: MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision
Abstract: Generating crisp, i.e., one-pixel-wide, edge maps remains one of the fundamental challenges in edge detection, affecting both traditional and learning-based methods. To obtain crisp edges, most existing approaches rely on two hand-crafted post-processing algorithms, Non-Maximum Suppression (NMS) and skeleton-based thinning, which are non-differentiable and hinder end-to-end optimization. Moreover, all existing crisp edge detection methods still depend on such post-processing to achieve satisfactory results. To address this limitation, we propose MatchED, a lightweight (only ~21K additional parameters), plug-and-play matching-based supervision module that can be appended to any edge detection model for joint end-to-end learning of crisp edges. At each training iteration, MatchED performs one-to-one matching between predicted and ground-truth edges based on spatial distance and confidence, ensuring consistency between training and testing protocols. Extensive experiments on four popular datasets demonstrate that integrating MatchED substantially improves the performance of existing edge detection models. In particular, MatchED increases the Average Crispness (AC) metric by up to 2–4× compared to baseline models. Under the crispness-emphasized evaluation (CEval), MatchED further boosts baseline performance by up to 20–35% in ODS and achieves similar gains in OIS and AP, achieving SOTA performance that matches or surpasses standard post-processing for the first time. Code and models will be released.
PaperID: 1326,   Poster  https://arxiv.org/pdf/2512.21778    
Authors: Nimrod Berman, Adam Botach, Emanuel Ben Baruch, Shunit Haviv Hakimi, Asaf Gendler, Ilan Naiman, Erez Yosef, Igor Kviatkovsky
Title: Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models
Abstract: Segmenting long-form videos into semantically coherent scenes is a fundamental task in large-scale video understanding. Existing encoder-based methods are limited by visual-centric biases, classify each shot in isolation without leveraging sequential dependencies, and lack both narrative understanding and explainability. In this paper, we present Scene-VLM, the first fine-tuned vision-language model (VLM) framework for video scene segmentation. Scene-VLM jointly processes visual and textual cues including frames, transcriptions, and optional metadata to enable multimodal reasoning across consecutive shots. The model generates predictions sequentially with causal dependencies among shots and introduces a context–focus window mechanism to ensure sufficient temporal context for each shot-level decision. In addition, we propose a scheme to extract confidence scores from the token-level logits of the VLM, enabling controllable precision–recall trade-offs that were previously limited to encoder-based methods. Furthermore, we demonstrate that our model can be aligned to generate coherent natural-language rationales for its boundary decisions through minimal targeted supervision. Our approach achieves state-of-the-art performance on standard scene segmentation benchmarks. On MovieNet, for example, Scene-VLM yields significant improvements of +6 AP and +13.7 F1 over the previous leading method.
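The idea of turning token-level logits into a tunable confidence score can be illustrated with a minimal sketch. It is not Scene-VLM's actual scheme: it assumes, purely for illustration, that each shot-level decision is read from the logits of two hypothetical answer tokens ("yes, boundary" and "no boundary"), and the function names `boundary_confidence` and `predict_boundaries` are invented here.

```python
import math


def boundary_confidence(logit_yes, logit_no):
    """Two-way softmax over the logits of two hypothetical answer tokens.

    Returns the probability mass assigned to the "boundary" token.
    """
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def predict_boundaries(shot_logits, threshold=0.5):
    """Mark a shot as a scene boundary when its confidence reaches the threshold."""
    return [boundary_confidence(y, n) >= threshold for y, n in shot_logits]


# Hypothetical (logit_yes, logit_no) pairs for three consecutive shots.
shot_logits = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0)]

# A higher threshold keeps only the most confident boundaries,
# trading recall for precision.
lenient = predict_boundaries(shot_logits, threshold=0.5)
strict = predict_boundaries(shot_logits, threshold=0.7)
```

Sweeping `threshold` over a validation set is one way such a score could trace out a precision–recall curve.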
PaperID: 1327,   Poster  https://arxiv.org/pdf/2511.21309    
Authors: Chenyu Liu, Hongze CHEN, Jingzhi Bao, Lingting Zhu, Runze Zhang, Weikai Chen, Zeyu HU, Yingda Yin, Keyang Luo, Xin Wang
Title: CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation
Abstract: Despite major advances brought by diffusion-based models, current 3D texture generation systems remain hindered by cross-view inconsistency: textures that appear convincing from one viewpoint often fail to align across others. We find that this issue arises from attention ambiguity, where unstructured full attention is applied indiscriminately across tokens and modalities, causing geometric confusion and unstable appearance-structure coupling. To address this, we introduce CaliTex, a framework of geometry-calibrated attention that explicitly aligns attention with 3D structure. It introduces two modules: Part-Aligned Attention that enforces spatial alignment across semantically matched parts, and Condition-Routed Attention which routes appearance information through geometry-conditioned pathways to maintain spatial fidelity. Coupled with a two-stage diffusion transformer, CaliTex makes geometric coherence an inherent behavior of the network rather than a byproduct of optimization. Empirically, CaliTex produces seamless and view-consistent textures and outperforms both open-source and commercial baselines.
PaperID: 1328,   Poster  https://arxiv.org/pdf/2603.05906    
Authors: Ping Chen, Zezhou Chen, Xingpeng Zhang, Yanlin Qian, Huan Hu, Xiang Liu, Zipeng Wang, Xin Wang, Zhaoxiang Liu, Kai Wang, Shiguo Lian
Title: Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D
Abstract: Current 2D-to-3D conversion methods achieve geometric accuracy but are artistically deficient, failing to replicate the immersive and emotionally resonant experience of professional 3D cinema. This is because "geometric reconstruction" paradigms mistake deliberate artistic intent (such as strategic zero-plane shifts for "pop-out" effects and local depth sculpting) for data "noise" or ambiguity. This paper argues for a new paradigm: Artistic Disparity Synthesis, shifting the goal from physically accurate disparity estimation to artistically coherent disparity synthesis. We propose Art3D, a preliminary framework exploring this paradigm. Art3D uses a dual-path architecture to decouple global depth parameters (macro-intent) from local artistic effects (visual brushstrokes) and learns from professional 3D film data via indirect supervision. We also introduce a preliminary evaluation method to quantify cinematic alignment. Experiments show our approach demonstrates potential in replicating key local out-of-screen effects and aligning with the global depth styles of cinematic 3D content, laying the groundwork for a new class of artistically-driven conversion tools.
PaperID: 1329,   Poster  https://arxiv.org/pdf/2603.22593    
Authors: Javier Ferrando, Enrique Lopez-Cuena, Pablo Agustin Martin-Torres, Daniel Hinjos, Anna Arias Duart, Dario Garcia-Gasulla
Title: Language Models Can Explain Visual Features via Steering
Abstract: Sparse Autoencoders uncover thousands of features in vision models, yet explaining these features without requiring human intervention remains an open challenge. While previous work has proposed generating correlation-based explanations based on top activating input examples, we present a fundamentally different alternative based on causal interventions. We leverage the structure of Vision-Language Models and steer individual SAE features in the vision encoder after providing an empty image. Then, we prompt the language model to explain what it "sees", effectively eliciting the visual concept represented by each feature. Results show that Steering offers a scalable alternative that complements traditional approaches based on input examples, serving as a new axis for automated interpretability in vision models. Moreover, the quality of explanations improves consistently with the scale of the language model, highlighting our method as a promising direction for future research. Finally, we propose Steering-informed Top-k, a hybrid approach that combines the strengths of causal interventions and input-based approaches to achieve state-of-the-art explanation quality without additional computational cost.
PaperID: 1330,   Poster  https://arxiv.org/pdf/2602.10095    
Authors: Xingjian Bai, Guande He, Zhengqi Li, Eli Shechtman, Xun Huang, Zongze Wu
Title: Causality in Video Diffusers is Separable from Denoising
Abstract: Causality, referring to temporal, unidirectional cause–effect relationships between components, underlies many complex generative processes, including videos, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal computation in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early blocks produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper blocks exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and latency while matching or surpassing the generation quality of strong causal diffusion baselines.
PaperID: 1331,   Poster  https://arxiv.org/pdf/2603.19608    
Authors: Ming Hu, Yongsheng Huo, Mingyu Dou, Jianfu Yin, Peng Zhao, Yao Wang, Cong Hu, Bingliang Hu, Quan Wang
Title: FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement
Abstract: Occlusion occurs when one object partially or fully blocks another in a scene, making it difficult for a machine vision system to detect or track objects accurately. In zero-shot anomaly detection (ZSAD), the system needs to detect unseen defects without relying on labeled anomalous samples, which is critical for applications such as industrial inspection and medical imaging. However, normal features in images often occlude anomalous features, leading to coarse localization and limited discriminability. To address this challenge, we propose FB-CLIP, which enhances foreground features while suppressing irrelevant background interference to improve anomaly detection performance. Unlike existing CLIP-based methods that typically rely on a single textual feature, FB-CLIP introduces Multi-Strategy Text Feature Fusion (MSTFF), combining End-of-Text, global pooling, and attention-weighted features to generate rich, task-aware text embeddings. Furthermore, FB-CLIP employs Multi-View Foreground-Background Enhancement (MVFBE), Background Suppression (BS), and Semantic Consistency Regularization (SCR) to achieve foreground reinforcement, background interference mitigation, and reliable visual-text alignment, respectively. Experiments on multiple public industrial and medical datasets show that FB-CLIP effectively captures fine-grained anomalies and outperforms existing zero-shot methods. Code will be released.
PaperID: 1332,   Poster  https://arxiv.org/pdf/2603.18596    
Authors: Xuan Liu, Xiaobin Chang
Title: Elastic Weight Consolidation Done Right for Continual Learning
Abstract: Weight regularization methods in continual learning (CL) alleviate catastrophic forgetting by assessing and penalizing changes to important model weights. Elastic Weight Consolidation (EWC) is a foundational and widely used approach within this framework that estimates weight importance based on gradients. However, it has consistently shown suboptimal performance. In this paper, we conduct a systematic analysis of importance estimation in EWC from a gradient-based perspective. For the first time, we find that EWC’s reliance on the Fisher Information Matrix (FIM) results in gradient vanishing and inaccurate importance estimation in certain scenarios. Our analysis also reveals that Memory Aware Synapses (MAS), a variant of EWC, imposes unnecessary constraints on parameters irrelevant to prior tasks, termed the redundant protection. Consequently, both EWC and its variant exhibit fundamental misalignments in estimating the importance of weights, leading to inferior performance. To tackle these issues, we propose the Logits Reversal (LR) operation, a simple yet effective modification that rectifies the importance estimation of EWC. Specifically, reversing the logit values during the calculation of the FIM can effectively prevent both the gradient vanishing and the redundant protection. Extensive experiments across various CL tasks and datasets show that the proposed method significantly outperforms existing EWC and its variants. Therefore, we refer to it as EWC Done Right (EWC-DR).
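For readers unfamiliar with EWC, a minimal sketch of the standard quadratic penalty and of why a gradient-based importance estimate can vanish may help. This illustrates textbook EWC only, not the Logits Reversal operation proposed in the paper. It assumes a softmax classifier and, for simplicity, measures the squared gradient of the log-likelihood with respect to the logits using the given label (an empirical-Fisher simplification; weight gradients are obtained from these by the chain rule). All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_grad_sq(logits, label):
    """Squared gradient of log p(label | x) w.r.t. each logit: (onehot - p)^2."""
    probs = softmax(logits)
    return [((1.0 if i == label else 0.0) - p) ** 2 for i, p in enumerate(probs)]


def ewc_penalty(theta, theta_star, fisher_diag, lam=1.0):
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * sum(
        f * (t - s) ** 2 for t, s, f in zip(theta, theta_star, fisher_diag)
    )


# When the model is already very confident, p is close to one-hot, so the
# squared gradient (and hence the estimated importance) is close to zero.
uncertain = logit_grad_sq([0.5, 0.0, -0.5], label=0)
confident = logit_grad_sq([12.0, 0.0, -0.5], label=0)

# Penalty for moving two weights away from their old-task values.
penalty = ewc_penalty([1.0, 2.0], [0.0, 0.0], [1.0, 0.5], lam=2.0)
```

In this toy setting the confident prediction contributes almost nothing to the importance estimate, which is one way to picture the gradient-vanishing issue the abstract describes.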
PaperID: 1333,   Poster  https://arxiv.org/pdf/2602.22594    
Authors: Qing Yu, Akihisa Watanabe, Kent Fujiwara
Title: Causal Motion Diffusion Models for Autoregressive Motion Generation
Abstract: Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real-time applicability, or autoregressive models that suffer from instability and cumulative errors. In this work, we present Causal Motion Diffusion Models (CMDM), a unified framework for autoregressive motion generation based on a causal diffusion transformer that operates in a semantically aligned latent space. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. On top of this latent representation, an autoregressive diffusion transformer is trained using causal diffusion forcing to perform temporally ordered denoising across motion frames. To achieve fast inference, we introduce a frame-wise sampling schedule with causal uncertainty, where each subsequent frame is predicted from partially denoised previous frames. The resulting framework supports high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.
PaperID: 1334,   Poster  https://arxiv.org/pdf/2511.07923    
Authors: Bingyu Li, Tao Huo, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Title: Exploring the Underwater World Segmentation without Extra Training
Abstract: Accurate segmentation of marine organisms is vital for biodiversity monitoring and ecological assessment, yet existing datasets and models remain largely limited to terrestrial scenes. To bridge this gap, we introduce AquaOV255, the first large-scale and fine-grained underwater segmentation dataset containing 255 categories and over 20K images, covering diverse marine organisms and man-made objects for open-vocabulary evaluation. Furthermore, we establish the first underwater open-vocabulary segmentation benchmark, UOVSBench, by integrating AquaOV255 with five additional underwater datasets to enable comprehensive cross-domain evaluation. Alongside, we present Earth2Ocean, a training-free open-vocabulary segmentation framework that transfers terrestrial vision–language models (VLMs) to underwater domains without any additional underwater training. Earth2Ocean consists of two core components: a Geometric-guided Visual Mask Generator (GMG) that refines visual features via self-similarity geometric priors for local structure perception, and a Category-visual Semantic Alignment (CSA) module that enhances text embeddings through multimodal large language model reasoning and scene-aware template construction. Extensive experiments on the UOVSBench benchmark demonstrate that Earth2Ocean achieves an average improvement of more than 6 mIoU while maintaining efficient inference.
PaperID: 1335,   Poster  https://arxiv.org/pdf/2511.09681    
Authors: Tairan HUANG, Yulin Jin, Junxu Liu, Qingqing Ye, Haibo Hu
Title: SEBA: Sample-Efficient Black-Box Attacks on Visual Reinforcement Learning
Abstract: Visual reinforcement learning has achieved remarkable progress in visual control and robotics, but its vulnerability to adversarial perturbations remains underexplored. Most existing black-box attacks focus on vector-based or discrete-action RL, and their effectiveness on image-based continuous control is limited by the large action space and excessive environment queries. We propose SEBA, a sample-efficient framework for black-box adversarial attacks on visual RL agents. SEBA integrates a shadow Q model that estimates cumulative rewards under adversarial conditions, a generative adversarial network that produces visually imperceptible perturbations, and a world model that simulates environment dynamics to reduce real-world queries. Through a two-stage iterative training procedure that alternates between learning the shadow model and refining the generator, SEBA achieves strong attack performance while maintaining efficiency. Experiments on MuJoCo and Atari benchmarks show that SEBA significantly reduces cumulative rewards, preserves visual fidelity, and greatly decreases environment interactions compared to prior black-box and white-box methods. Our code is provided in the supplementary material.
PaperID: 1336,   Poster  https://arxiv.org/pdf/2603.02190    
Authors: Divyanshu Daiya, Aniket Bera
Title: Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
Abstract: We present Sketch2Colab, which turns storyboard-style 2D sketches into coherent, object-aware 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts. Conventional diffusion-based motion generators have advanced realism; however, achieving precise adherence to rich interaction constraints typically demands extensive training and/or costly posterior guidance, and performance can degrade under strong multi-entity conditioning. Sketch2Colab instead first learns a sketch-driven diffusion prior and then distills it into an efficient rectified-flow student operating in latent space for fast, stable sampling. Differentiable energies over keyframes, trajectories, and physics-based constraints directly shape the student’s transport field, steering samples toward motions that faithfully satisfy the storyboard while remaining physically plausible. To capture coordinated interaction, we augment the continuous flow with a continuous-time Markov chain (CTMC) planner that schedules discrete events such as touches, grasps, and handoffs, modulating the dynamics to produce crisp, well-phased human–object–human collaborations. Experiments on CORE4D and InterHuman show that Sketch2Colab achieves state-of-the-art constraint adherence and perceptual quality while offering significantly faster inference than diffusion-only baselines.
PaperID: 1337,   Poster  https://arxiv.org/pdf/2511.20095    
Authors: Guangfeng Jiang, Yueru Luo, Jun Liu, Yi Huang, Yiyao Zhu, Zhan Qu, Dave Zhenyu Chen, Bingbing Liu, Xu Yan
Title: WPT: World-to-Policy Transfer via Online World Model Distillation
Abstract: Recent years have witnessed remarkable progress in world models, which primarily aim to capture the spatiotemporal correlations between an agent’s actions and the evolving environment. However, existing approaches often suffer from tight runtime coupling or depend on offline reward signals, resulting in substantial inference overhead or hindering end-to-end optimization. To overcome these limitations, we introduce WPT, a World-to-Policy Transfer training paradigm that enables online distillation under the guidance of an end-to-end world model. Specifically, we develop a trainable reward model that infuses world knowledge into a teacher policy by aligning candidate trajectories with the future dynamics predicted by the world model. Subsequently, we propose policy distillation and world reward distillation to transfer the teacher’s reasoning ability into a lightweight student policy, enhancing planning performance while preserving real-time deployability. Extensive experiments on both open-loop and closed-loop benchmarks show that WPT achieves state-of-the-art performance with a simple policy architecture: it attains a 0.11 collision rate (open-loop) and achieves a 79.23 driving score (closed-loop), surpassing both world-model-based and imitation-learning methods in accuracy and safety. Moreover, the student sustains up to 4.9× faster inference while retaining most of the gains.
PaperID: 1338,   Poster  https://arxiv.org/pdf/2512.02505    
Authors: Jiaqi Liu, Ronghao Fu, Haoran Liu, Lang Sun, Qipeng Wang, Bo Yang
Title: GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding
Abstract: Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data's intrinsic structure is key to unlocking superior performance in complex geospatial analysis.
PaperID: 1339,   Poster  https://arxiv.org/pdf/2510.15510    
Authors: Heeseong Shin, Byeongho Heo, Dongyoon Han, Seungryong Kim, Taekyung Kim
Title: Exploring Conditions for Diffusion models in Robotic Control
Abstract: While pre-trained visual representations have significantly advanced imitation learning, they are often task-agnostic as they remain frozen during policy learning. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the model itself. However, we find that naively applying textual conditions (a successful strategy in other vision domains) yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model's training data and robotic control environments, leading us to argue for conditions that consider the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. Through facilitating task-adaptive representations with our newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.
PaperID: 1340,   Poster  https://arxiv.org/pdf/2602.19140    
Authors: Sijie Mai, Shiqin Han
Title: CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion
Abstract: The modality gap significantly restricts the effectiveness of multimodal fusion. Previous methods often use techniques such as diffusion models and adversarial learning to reduce the modality gap, but they typically focus on one-to-one alignment without exposing the data points of the source modality to the global distribution information of the target modality. To this end, leveraging the characteristic of rectified flow that can map one distribution to another via a straight trajectory, we extend rectified flow for modality distribution mapping. Specifically, we leverage the 'one-to-many mapping' strategy in rectified flow that allows each data point of the source modality to observe the overall target distribution. This also alleviates the issue of insufficient paired data within each sample, enabling a more robust distribution transformation. Moreover, to achieve more accurate distribution mapping and address the ambiguous flow directions in one-to-many mapping, we design 'adaptive relaxed alignment', enforcing stricter alignment for modality pairs belonging to the same sample, while applying relaxed mapping for pairs not belonging to the same sample or category. Additionally, to prevent information loss during distribution mapping, we introduce 'cyclic rectified flow' to ensure the transferred features can be translated back to the original features, allowing multimodal representations to learn sufficient modality-specific information. After distribution alignment, our approach achieves very competitive results on multiple tasks of multimodal affective computing even with a simple fusion method, and visualizations verify that it can effectively reduce the modality gap.
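The straight-trajectory property of rectified flow that the abstract builds on can be sketched in a few lines. This shows only the generic rectified-flow recipe (linear interpolation between a source and a target point, a constant velocity target, and Euler sampling); it does not implement CaReFlow's one-to-many mapping, adaptive relaxed alignment, or cyclic flow. Feature vectors are plain Python lists, and the function names and the oracle velocity field are assumptions made for illustration.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (source) to x1 (target) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity field along the straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(pred_v, x0, x1):
    """Mean squared error between a predicted velocity and the straight-path target."""
    tgt = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred_v, tgt)) / len(tgt)


def euler_sample(x0, velocity_fn, steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 2-D features from a source and a target modality.
source = [0.0, 1.0]
target = [2.0, -1.0]

midpoint = interpolate(source, target, 0.5)

# With an oracle velocity field equal to the straight-path target,
# Euler integration carries the source feature onto the target feature.
oracle = lambda x, t: target_velocity(source, target)
mapped = euler_sample(source, oracle, steps=4)
```

In practice the oracle would be replaced by a learned network trained with `flow_matching_loss` over many source/target pairs.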
PaperID: 1341,   Poster  https://arxiv.org/pdf/2507.18534    
Authors: Xingyu Qiu, Mengying Yang, Xinghua Ma, Dong Liang, Fanding Li, Gongning Luo, Wei Wang, Kuanquan Wang, Shuo Li
Title: Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models
Abstract: Although EDM aims to unify the design space of diffusion models, its reliance on fixed Gaussian noise prevents it from explaining emerging flow-based methods that diffuse arbitrary noise. Moreover, our study reveals that EDM's forcible injection of Gaussian noise has adverse effects on image restoration tasks, as it corrupts the degraded images, overextends the restoration distance, and increases the task's complexity. To interpret diverse methods for handling distinct noise patterns within a unified theoretical framework and to minimize the restoration distance, we propose EDA, which Elucidates the Design space of Arbitrary-noise diffusion models. Theoretically, EDA expands noise pattern flexibility while preserving EDM's modularity, with rigorous proof that increased noise complexity introduces no additional computational overhead during restoration. EDA is validated on three representative medical image denoising and natural image restoration tasks: MRI bias field correction (global smooth noise), CT metal artifact removal (global sharp noise) and natural image shadow removal (local boundary-aware noise). With only 5 sampling steps, competitive results against specialized methods across medical and natural tasks demonstrate EDA's strong generalization capability for image restoration.
PaperID: 1342,   Poster  https://arxiv.org/pdf/2511.14063    
Authors: Dongyang Jin, Ryan Xu, Jianhao Zeng, Rui Lan, Yancheng Bai, Lei Sun, Xiangxiang Chu
Title: Semantic Context Matters: Improving Conditioning for Autoregressive Models
Abstract: Recently, autoregressive (AR) models have shown strong potential in image generation, offering better scalability and easier integration with unified multimodal models compared to diffusion methods. However, extending AR models to controllable image editing remains challenging due to weak and inefficient conditioning strategies, which often lead to suboptimal semantic alignment and visual quality. To address this limitation, we present SCAR, a Semantic-Context-driven method for AutoregRessive models. SCAR introduces Compressed Semantic Prefilling and Semantic Alignment Guidance that jointly enhance contextual understanding and generation coherence. Unlike prior methods that rely on sparse visual tokens or decoding stage injection, SCAR enables strong semantic guidance from the input stage, while remaining model-agnostic and applicable to both next-token and next-scale AR paradigms. Extensive experiments on instruction editing and controllable generation demonstrate that our method significantly improves visual fidelity and semantic alignment, outperforming existing AR-based methods while maintaining controllability. All the code will be released.
PaperID: 1343,   Poster  https://arxiv.org/pdf/2603.03798    
Authors: Yu Sheng, Lidian Wang, Xiaomeng Chu, Jiajun Deng, Min Cheng, Yanyong Zhang, Bei Hua, Houqiang Li, Jianmin Ji
Title: Learning Surgical Robotic Manipulation with 3D Spatial Priors
Abstract: Achieving 3D spatial awareness is crucial for surgical robotic manipulation, where precise and delicate operations are required. Existing methods either explicitly reconstruct the surgical scene prior to manipulation, or enhance multi-view features by adding wrist-mounted cameras to supplement the default stereo endoscopes. However, both paradigms suffer from notable limitations: the former easily leads to error accumulation and prevents end-to-end optimization due to its multi-stage nature, while the latter is rarely adopted in clinical practice since wrist-mounted cameras can interfere with the motion of surgical robot arms. In this work, we introduce the Spatial Surgical Transformer (SST), an end-to-end visuomotor policy that empowers surgical robots with 3D spatial awareness by directly exploring 3D spatial cues embedded in endoscopic images. First, we build Surgical3D, a large-scale photorealistic dataset containing 30K stereo endoscopic image pairs with accurate 3D geometry, addressing the scarcity of 3D data in surgical scenes. Based on Surgical3D, we finetune a powerful geometric transformer to extract robust 3D latent representations from stereo endoscope images. These representations are then seamlessly aligned with the robot's action space via a lightweight multi-level spatial feature connector (MSFC), all within an endoscope-centric coordinate frame. Extensive real-robot experiments demonstrate that SST achieves state-of-the-art performance and strong spatial generalization on complex surgical tasks such as knot tying and ex-vivo organ dissection, representing a significant step toward practical clinical deployment. The dataset and code will be released.
PaperID: 1344,   Poster  https://arxiv.org/pdf/2512.12229    
Authors: Tianyu Zhang, Dong Liu, Chang-Wen Chen
Title: Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder
Abstract: Ultra-low bitrate image compression (below 0.05 bits per pixel) is increasingly critical for bandwidth-constrained and computation-limited encoding scenarios such as edge devices. Existing frameworks typically rely on large pretrained encoders (e.g., VAEs or tokenizer-based models) and perform transform coding within their generative latent space. While these approaches achieve impressive perceptual fidelity, their reliance on heavy encoder networks makes them unsuitable for deployment on weak sender devices. In this work, we explore the feasibility of applying shallow encoders for ultra-low bitrate compression and propose a novel Asymmetric Extreme Image Compression (AEIC) framework that simultaneously pursues encoding simplicity and decoding quality. Specifically, AEIC employs moderate or even shallow encoder networks, while leveraging a one-step diffusion decoder to maintain high-fidelity and high-realism reconstructions under extreme bitrates. To further enhance the efficiency of shallow encoders, we design a dual-side feature distillation scheme that transfers knowledge from AEIC with moderate encoders to its shallow encoder variants. Experiments demonstrate that AEIC not only outperforms existing methods on rate-distortion-perception performance at ultra-low bitrates, but also delivers exceptional encoding efficiency of 35.8 FPS on 1080P input images, while maintaining competitive decoding speed compared to existing methods.
PaperID: 1345,   Poster  https://arxiv.org/pdf/2503.22172    
Authors: Minho Park, Sunghyun Park, Jungsoo Lee, Hyojin Park, Kyuwoong Hwang, Fatih Porikli, Jaegul Choo, Sungha Choi
Title: CA-LoRA: Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation
Abstract: This paper addresses the challenge of data scarcity in semantic segmentation by generating datasets through text-to-image (T2I) generation models, reducing image acquisition and labeling costs. Segmentation dataset generation faces two key challenges: 1) aligning generated samples with the target domain and 2) producing informative samples beyond the training data. Fine-tuning T2I models can help generate samples aligned with the target domain. However, it often overfits and memorizes training data, limiting the models' ability to generate diverse and well-aligned samples. To overcome these issues, we propose Concept-Aware LoRA (CA-LoRA), a novel fine-tuning approach that selectively identifies and updates only the weights associated with necessary concepts (e.g., style or viewpoint) for domain alignment while preserving the pretrained knowledge of the T2I model to produce informative samples. We demonstrate its effectiveness in generating datasets for urban-scene segmentation, outperforming baseline and state-of-the-art methods in in-domain (few-shot and fully-supervised) settings, as well as in domain generalization tasks, especially under challenging conditions such as adverse weather and varying illumination.
PaperID: 1346,   Poster  https://arxiv.org/pdf/2512.20299    
Authors: Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang
Title: KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System
Abstract: Visual–language reasoning, driving knowledge, and value alignment are essential for advanced autonomous driving systems. However, existing approaches largely rely on data-driven learning, making it difficult to capture the complex logic underlying decision-making through imitation or limited reinforcement rewards. To address this, we propose KnowVal, a new autonomous driving system that enables visual–language reasoning through the synergistic integration of open-world perception and knowledge retrieval. Specifically, we construct a comprehensive driving knowledge graph that encodes traffic laws, defensive driving principles, and ethical norms, complemented by an efficient LLM-based retrieval mechanism tailored for driving scenarios. Furthermore, we develop a human-preference dataset and train a Value Model to guide interpretable, value-aligned trajectory assessment. Experimental results show that our method substantially improves planning performance while remaining compatible with existing architectures. Notably, KnowVal achieves the lowest collision rate on nuScenes and state-of-the-art results on Bench2Drive.
PaperID: 1347,   Poster  https://arxiv.org/pdf/2511.17411    
Authors: Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Aleksandar Yanev, Luc Van Gool, Jan-Nico Zaech, Danda Paudel
Title: SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
Abstract: Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. Yet their ability to generalize across new environments, tasks, and embodiments remains limited. We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs). However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world. Bridging this gap directly with large-scale robotic data is costly and difficult to scale. Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image. Building on SPEAR-VLM, we introduce our main contribution, SPEAR-1: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control. Trained on ~45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as π0-FAST and π0.5, while it uses 20× fewer robot demonstrations. This carefully-engineered training strategy unlocks new VLM capabilities and as a consequence boosts the reliability of embodied control beyond what is achievable with only robotic data. We make our model weights and 3D-annotated datasets publicly available.
PaperID: 1348,   Poster  https://arxiv.org/pdf/2512.04221    
Authors: Xiangyu Bai, He Liang, Bishoy Galoaa, Utsav Nandi, Shayda Moezzi, Yuhang He, Sarah Ostadabbas
Title: MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis
Abstract: While text-to-video (T2V) generation has achieved remarkable progress in photorealism, generating intent-aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that integrates multi-agent LLMs, physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object-trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human-annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground-truth trajectories. Using MoReSet, we conduct experiments on existing T2V models, evaluating their physical validity through both our MoRe metrics and existing physics-based evaluators. Our results reveal that state-of-the-art models struggle to maintain physical validity, while MoReGen establishes a principled direction toward physically coherent video synthesis.
PaperID: 1349,   Poster  https://arxiv.org/pdf/2602.21273    
Authors: Jinghao Hu, Yuhe Zhang, GuoHua Geng, Kang Li, Han Zhang
Title: StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
Abstract: Generating multi-frame, action-rich visual narratives without fine-tuning faces a threefold tension: action text faithfulness, subject identity fidelity, and cross-frame background continuity. We propose StoryTailor, a zero-shot pipeline that runs on a single RTX 4090 (24 GB) and produces temporally coherent, identity-preserving image sequences from a long narrative prompt, per-subject references, and grounding boxes. Three synergistic modules drive the system: Gaussian-Centered Attention (GCA) to dynamically focus on each subject core and ease grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR) to amplify action-related directions in the text embedding space; and Selective Forgetting Cache (SFC) that retains transferable background cues, forgets nonessential history, and selectively surfaces the retained cues to build cross-scene semantic ties. Compared with baseline methods, the experiments show that CLIP-T improves by up to 10–15%, with DreamSim lower than strong baselines, while CLIP-I stays in a visually acceptable, competitive range. With a matched resolution and steps on a 24 GB GPU, inference is faster than FluxKontext. Qualitatively, StoryTailor delivers expressive interactions and evolving yet stable scenes.
PaperID: 1350,   Poster  https://arxiv.org/pdf/2512.04085    
Authors: Tengda Han, Sayna Ebrahimi, Dilara Gokay, Li Yang Ku, Maks Ovsjanikov, Iva Babukova, Daniel Zoran, Viorica Patraucean, Joao Carreira, Andrew Zisserman, Dima Damen
Title: Unique Lives, Shared World: Learning from Single-Life Videos
Abstract: We introduce the "single-life" learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets each capturing a different life, both indoors and outdoors, as well as introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life leads to comparable performance to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world both leads to consistency in models trained on individual lives and provides a powerful signal for visual representation learning.
PaperID: 1351,   Poster  https://arxiv.org/pdf/2603.02951    
Authors: Zhenquan Yao, Zitong Huang, yihan zeng, Jianhua Han, Hang Xu, Chun-Mei Feng, Jianwei Ma, Wangmeng Zuo
Title: CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning
Abstract: Graphical User Interface (GUI) Agents, benefiting from recent advances in multimodal large language models (MLLMs), have achieved significant development. However, due to the frequent updates of GUI applications, adapting to new tasks without forgetting old tasks in GUI continual learning remains an open problem. Existing works are generally trained on a fixed set of tasks and adapt to new tasks either through supervised fine-tuning (SFT) or reinforcement learning (RL), suffering from catastrophic forgetting and slow adaptation. In this work, we propose a Continual GUI Learning (CGL) framework that dynamically balances adaptation efficiency and skill retention by enhancing the synergy between SFT and RL. Specifically, we introduce an SFT proportion adjustment mechanism guided by policy entropy to dynamically control the weight allocation between the SFT and RL training phases. Additionally, we propose gradient surgery and entropy-regulated tuning strategies to enable GUI agents to continuously evolve while maintaining competence across previously learned domains. On top of that, we propose an AndroidControl-CL benchmark, which divides GUI applications into distinct task groups to effectively simulate and evaluate the performance of GUI continual learning. Experimental results demonstrate the effectiveness of our proposed CGL framework under the continual learning setting. The benchmark, code and model will be made publicly available.
PaperID: 1352,   Poster  https://arxiv.org/pdf/2603.12138    
Authors: Rui Shao, RUIZE GAO, Bin Xie, Li Yixing, Kaiwen Zhou, Shuai Wang, Weili Guan, Gongwei Chen
Title: HATS: Hardness-Aware Trajectory Synthesis for GUI Agents
Abstract: Graphical user interface (GUI) agents powered by large vision–language models (VLMs) have shown remarkable potential in automating digital tasks, highlighting the need for high-quality trajectory data to support effective agent training. Yet existing trajectory synthesis pipelines often yield agents that fail to generalize beyond simple interactions. We identify this limitation as stemming from the neglect of semantic-ambiguous actions, i.e., interactions whose meanings are context-dependent, sequentially dependent, or visually ambiguous. Such actions are crucial for real-world robustness but are under-represented and poorly processed in current datasets, leading to semantic misalignment between task instructions and execution. To address these issues, we propose HATS, a Hardness-Aware Trajectory Synthesis framework designed to mitigate the impact of semantic ambiguity. We define hardness as the degree of semantic ambiguity associated with an action and develop two complementary modules: (1) a hardness-driven exploration that guides data collection toward ambiguous yet informative interactions, and (2) an alignment-guided refinement that iteratively validates and repairs instruction–execution consistency. The two modules operate in a closed-loop manner: exploration supplies refinement with challenging trajectories, while refinement feedback updates the hardness signal to guide future exploration. Extensive experiments show that agents trained with HATS consistently outperform state-of-the-art baselines across benchmark GUI environments.
PaperID: 1353,   Poster  https://arxiv.org/pdf/2511.08903    
Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Title: LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis
Abstract: Document layout understanding remains data-intensive despite advances in semi-supervised learning. We present a framework that enhances semi-supervised detection by fusing visual predictions with structural priors from text-pretrained LLMs via principled probabilistic weighting. Given unlabeled documents, an OCR-LLM pipeline infers hierarchical regions which are combined with teacher detector outputs through inverse-variance fusion to generate refined pseudo-labels. Our method demonstrates consistent gains across model scales. With a lightweight SwiftFormer backbone (26M params), we achieve 88.2±0.3 AP using only 5% labels on PubLayNet. When applied to document-pretrained LayoutLMv3 (133M params), our fusion framework reaches 89.7±0.4 AP, surpassing LayoutLMv3 with standard semi-supervised learning (89.1±0.4 AP, p=0.02) and matching UDOP (89.8 AP), which requires 100M+ pages of multimodal pretraining. This demonstrates that LLM structural priors are complementary to both lightweight and pretrained architectures. Key findings include: (1) learned instance-adaptive gating improves over fixed weights by +0.9 AP with data-dependent PAC bounds correctly predicting convergence; (2) open-source LLMs enable privacy-preserving deployment with minimal loss (Llama-3-70B: 87.1 AP lightweight, 89.4 AP with LayoutLMv3); (3) LLMs provide targeted semantic disambiguation (18.7% of cases, +3.8 AP gain) beyond simple text heuristics. Total system cost includes $12 for GPT-4o-mini API or 17 GPU-hours for local Llama-3-70B per 50K pages, amortized across training runs.
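Note: the abstract above names inverse-variance fusion without giving the formula. As a minimal sketch, the textbook form is assumed here: each estimate is weighted by the reciprocal of its variance, and the fused variance is the reciprocal of the summed weights. The function name `inverse_variance_fuse` and the scalar inputs are illustrative assumptions; how the paper obtains variances for detector and LLM outputs, and its learned gating, are not modeled.

```python
def inverse_variance_fuse(estimates, variances):
    """Fuse scalar estimates, weighting each by 1 / variance.

    Returns (fused_estimate, fused_variance).
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("estimates and variances must be non-empty and equal in length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * x for w, x in zip(weights, estimates)) / total
    return fused, 1.0 / total


if __name__ == "__main__":
    # Hypothetical example: a detector confidence and an LLM-derived prior
    # for the same region, with the detector assumed to be less noisy.
    fused, var = inverse_variance_fuse([10.0, 20.0], [1.0, 4.0])
    print(fused, var)
```

The lower-variance source dominates the result, and the fused variance is never larger than the smallest input variance.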
PaperID: 1354,   Poster  https://arxiv.org/pdf/2511.12676    
Authors: Subin Varghese, Joshua Gao, Asad Ur Rahman, Vedhus Hoskere
Title: BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections
Abstract: Deploying embodied agents that can answer questions about their surroundings in realistic real-world settings remains difficult, partly due to the scarcity of benchmarks that faithfully capture practical operating conditions. We propose infrastructure inspection as a compelling domain for open-vocabulary Embodied Question Answering (EQA): it naturally demands multi-scale reasoning, long-range spatial understanding, and complex semantic relationships, while offering unique evaluation advantages via standardized National Bridge Inventory (NBI) condition ratings (0-9), professional inspection reports, and egocentric imagery. We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes with 47.93 images on average per scene. Questions require synthesizing visual evidence across multiple images and aligning responses with NBI condition ratings. We further propose a new EQA metric, Image Citation Relevance, to evaluate the ability of a model to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps under episodic memory EQA settings. To address this, we propose Embodied Memory Visual Reasoning (EMVR), which formulates inspection as sequential navigation over an image-based scene graph: images are nodes, and an agent takes actions to traverse views, compare evidence, and reason within a Markov decision process. EMVR shows strong performance over the baselines. We publicly release both the dataset and code.
PaperID: 1355,   Poster  https://arxiv.org/pdf/2603.06289    
Authors: Zhen Wang, Youcan Xu, Jun Xiao, Long Chen
Title: FlowMotion: Training-Free Flow Guidance for Video Motion Transfer
Abstract: Video motion transfer aims to generate a target video that inherits motion patterns from a source video while rendering new scenes. Existing training-free approaches focus on constructing motion guidance based on the intermediate outputs of pre-trained T2V models, which results in heavy computational overhead and limited flexibility. In this paper, we present FlowMotion, a novel training-free framework that enables efficient and flexible motion transfer by directly leveraging the predicted outputs of flow-based T2V models. Our key insight is that early latent predictions inherently encode rich temporal information. Motivated by this, we propose flow guidance, which extracts motion representations based on latent predictions to align motion patterns between source and generated videos. We further introduce a velocity regularization strategy to stabilize optimization and ensure smooth motion evolution. By operating purely on model predictions, FlowMotion achieves superior time and resource efficiency as well as competitive performance compared with state-of-the-art methods.
PaperID: 1356,   Poster  https://arxiv.org/pdf/2602.20497    
Authors: Peiliang Cai, Jiacheng Liu, Haowen Xu, Xinyu Wang, Chang Zou, Linfeng Zhang
Title: LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration
Abstract: Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov–Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-fidelity generation. Experiments demonstrate 5.00× acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25× speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00× acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is included in the supplementary materials and will be released on GitHub.
PaperID: 1357,   Poster  https://arxiv.org/pdf/2512.06868    
Authors: Xingguang Zhong, Liren Jin, Marija Popovic, Jens Behley, Cyrill Stachniss
Title: Dynamic Visual SLAM using a General 3D Prior
Abstract: Reliable incremental estimation of camera poses and 3D reconstruction is key to enabling various applications including robotics, interactive visualization, and augmented reality. However, this task is particularly challenging in dynamic natural environments, where scene dynamics can severely deteriorate camera pose estimation accuracy. In this work, we propose a novel monocular visual SLAM system that can robustly estimate camera poses in dynamic scenes. To this end, we leverage the complementary strengths of geometric patch-based online bundle adjustment and recent feed-forward reconstruction models. Specifically, we propose a feed-forward reconstruction model to precisely filter out dynamic regions, while also utilizing its depth prediction to enhance the robustness of the patch-based visual SLAM. By aligning depth prediction with estimated patches from bundle adjustment, we robustly handle the inherent scale ambiguities of the batch-wise application of the feed-forward reconstruction model. Extensive experiments on multiple tasks show the superior performance of our proposed method compared to state-of-the-art approaches.
PaperID: 1358,   Poster  https://arxiv.org/pdf/2512.02906    
Authors: Fan Yang, Xingping Dong, Xin Yu, Wenhan Luo, Wei Liu, Kaihao Zhang
Title: MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
Abstract: Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent studies address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocabulary object detection (OVD) model that identifies object regions using a sliding-window approach. Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.
PaperID: 1359,   Poster  https://arxiv.org/pdf/2602.18842    
Authors: Jiangling Zhang, Shuxuan Gao, Bofan Liu, Siqiang Feng, Jirui Huang, Yaxiong Chen, Ziyu Chen
Title: Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification
Abstract: The proliferation of highly realistic AI-generated images poses critical challenges for digital forensics, demanding precise pixel-level localization of manipulated regions. Existing methods predominantly learn discriminative patterns of specific forgeries, struggling with novel manipulations as editing techniques evolve. We propose the Iterative Forgery Amplifier Network (IFA-Net), which shifts from learning "what is fake" to modeling "what is real". Grounded in the principle that all manipulations deviate from the natural image manifold, IFA-Net leverages a frozen Masked Autoencoder (MAE) pretrained on real images as a universal realness prior. Our framework operates through a two-stage closed-loop process: an initial Dual-Stream Segmentation Network (DSSN) fuses the original image with MAE reconstruction residuals for coarse localization, then a Task-Adaptive Prior Injection (TAPI) module converts this coarse prediction into guiding prompts that steer the MAE decoder to amplify reconstruction failures in suspicious regions, enabling precise refinement. Extensive experiments on four diffusion-based inpainting benchmarks show that IFA-Net achieves an average improvement of 6.5% in IoU and 8.1% in F1-score over the second-best method, while demonstrating strong generalization to traditional manipulation types.
PaperID: 1360,   Poster  https://arxiv.org/pdf/2403.02877    
Authors: Han Lu, Xiaosong Jia, Yichen Xie, Siyu Sun, Wenlong Liao, Xiaokang Yang, Junchi Yan
Title: ActiveAD: Planning-Oriented Active Learning for End-to-End Autonomous Driving
Abstract: End-to-end differentiable learning has emerged as a prominent paradigm in autonomous driving (AD). A significant bottleneck in this approach is its substantial demand for high-quality labeled data, such as 3D bounding boxes and semantic segmentation, which are especially expensive to annotate manually. This challenge is exacerbated by the long-tailed distribution in AD datasets, where a substantial portion of the collected data might be trivial (e.g., simply driving straight on a straight road) and only a minority of instances are critical to safety. In this paper, we propose ActiveAD, a planning-oriented active learning strategy designed to enhance sampling and labeling efficiency in end-to-end autonomous driving. ActiveAD progressively annotates parts of collected raw data based on our newly developed metrics. We design innovative diversity metrics to enhance initial sample selection, addressing the cold-start problem. Furthermore, we develop uncertainty metrics to select valuable samples for the ultimate purpose of route planning during subsequent batch selection. Empirical results demonstrate that our approach significantly surpasses traditional active learning methods. Remarkably, our method achieves results comparable to state-of-the-art end-to-end AD methods while using only 30% of the data in both open-loop nuScenes and closed-loop CARLA evaluation.
PaperID: 1361,   Poster  https://arxiv.org/pdf/2603.14684    
Authors: Yunsoo Kim, Changki Sung, Dasol Hong, Hyun Myung
Title: E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction
Abstract: The emergence of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS) has advanced novel view synthesis (NVS). These methods, however, require high-quality RGB inputs and accurate corresponding poses, limiting robustness under real-world conditions such as fast camera motion or adverse lighting. Event cameras, which capture brightness changes at each pixel with high temporal resolution and wide dynamic range, enable precise sensing of dynamic scenes and offer a promising solution. However, existing event-based NVS methods still rely on known poses or depend on depth estimation models and auxiliary modalities such as RGB-D. We present E2EGS, a pose-free framework operating solely on event streams. Our key insight is that edge information provides rich structural cues essential for accurate trajectory estimation and high-quality NVS. To extract edges from noisy event streams, we exploit the distinct spatio-temporal characteristics of edges and non-edge regions. The event camera's movement induces consistent events along edges, while non-edge regions produce sparse noise. We leverage this through a patch-based temporal coherence analysis that measures local variance to extract edges while robustly suppressing noise. The extracted edges guide structure-aware Gaussian initialization and enable edge-weighted losses throughout initialization, tracking, and bundle adjustment. Extensive experiments on both synthetic and real datasets demonstrate that E2EGS achieves superior reconstruction quality and trajectory accuracy, establishing a fully pose-free paradigm for event-based 3D reconstruction.
PaperID: 1362,   Poster  https://arxiv.org/pdf/2604.02870    
Authors: Phillip Y. Lee, Chanho Park, Mingue Park, Seungwoo Yoo, Juil Koo, Minhyuk Sung
Title: Token Warping Helps MLLMs Look from Nearby Viewpoints
Abstract: Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from nearby viewpoints? While MLLMs perform well on single-image reasoning, they remain fragile to viewpoint changes because pixel-level warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint warping. We compare two token-level transformation strategies, forward and backward warping, and find that backward token fetching, which selects tokens at target-view grid locations and retrieves their counterparts from the source view, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, while consistently outperforming all baselines, including pixel-warping approaches, MLLMs fine-tuned for spatial reasoning, and a generative warping method.
PaperID: 1363,   Poster  https://arxiv.org/pdf/2603.28319    
Authors: Luke Palmer, Petar Palasek, Hazem Abdelkawy
Title: Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes
Abstract: Accurately modelling human attention is essential for numerous computer vision applications, particularly in the domain of automotive safety. Existing methods often collapse gaze into scanpaths or saliency maps, overlooking the dynamics of natural eye movements and introducing artefacts into training data. We instead propose a dynamical systems approach that treats gaze as an active agent interacting with its environment, enabling the simulation of raw, continuous gaze trajectories. In our approach, driving scenes are represented as gaze-centric spatiotemporal graphs processed by the Affinity Relation Transformer (ART), a heterogeneous graph transformer that models interactions between driver gaze and surrounding traffic objects and road structure. We further introduce an Object Density Network (ODN) to predict next-step gaze distributions, capturing the stochastic, object-centric nature of attentional shifts in complex environments. To support this research, we also present Focus100, a new dataset of gaze recordings from 30 participants viewing ego-centric driving footage. Trained directly on raw gaze, without any fixation filtering, our unified approach produces more natural gaze time series, scanpath dynamics, and saliency maps than existing attention estimation methods, offering valuable insights for the temporal modelling of human attention and automotive safety.
PaperID: 1364,   Poster  https://arxiv.org/pdf/2412.00666    
Authors: Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto
Title: Explaining Object Detectors via Collective Contribution of Pixels
Abstract: Visual explanations for object detectors are crucial for enhancing their reliability. Object detectors identify and localize instances by assessing multiple visual features collectively. When generating explanations, overlooking these collective influences in detections may lead to missing compositional cues or capturing spurious correlations. However, existing methods typically focus solely on individual pixel contributions, neglecting the collective contribution of multiple pixels. To address this limitation, we propose a game-theoretic method based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. Our method provides explanations for both bounding box localization and class determination, highlighting regions crucial for detection. Extensive experiments demonstrate that the proposed method identifies important regions more accurately than state-of-the-art methods. The code will be publicly available soon.
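Illustrative note (not the authors' implementation): the Shapley-value idea mentioned in the abstract above can be sketched on a toy game. The value function `toy_score` is a hypothetical stand-in for a detector's score on an image where only some pixels are visible; the pixel names "a", "b", "c" and the scores 0.6 and 0.2 are made up for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley value of each player for a set function `value`."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

def toy_score(visible):
    # Hypothetical detector score: "a" and "b" only help together, "c" helps alone.
    score = 0.0
    if "a" in visible and "b" in visible:
        score += 0.6
    if "c" in visible:
        score += 0.2
    return score

phi = shapley_values(["a", "b", "c"], toy_score)
```

In this toy game, "a" and "b" contribute nothing individually, so their importance only becomes visible through their joint effect, toy_score({a, b}) - toy_score({a}) - toy_score({b}) + toy_score({}) = 0.6. This is the kind of collective contribution that per-pixel attributions alone would miss.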
PaperID: 1365,   Poster  https://arxiv.org/pdf/2502.05708    
Authors: Kang Yang, Yuning Chen, Wan Du
Title: Generalizable Radio-Frequency Radiance Fields for Spatial Spectrum Synthesis
Abstract: We present GRaF, Generalizable Radio-Frequency (RF) Radiance Fields, a framework that models RF signal propagation to synthesize spatial spectra at arbitrary transmitter or receiver locations, where each spectrum measures signal power across all surrounding directions at the receiver. Unlike state-of-the-art methods that adapt vanilla Neural Radiance Fields (NeRF) to the RF domain with scene-specific training, GRaF generalizes across scenes to synthesize spectra. To enable this, we prove an interpolation theory in the RF domain: the spatial spectrum from a transmitter can be approximated using spectra from geographically proximate transmitters. Building on this theory, GRaF comprises two components: (i) a geometry-aware Transformer encoder that captures spatial correlations from neighboring transmitters to learn a scene-independent latent RF radiance field, and (ii) a neural ray tracing algorithm that estimates spectrum reception at the receiver. Experimental results demonstrate that GRaF outperforms existing methods on single-scene benchmarks and achieves state-of-the-art performance on unseen scene layouts.
PaperID: 1366,   Poster  https://arxiv.org/pdf/2603.12146    
Authors: Quanhao Li, Zhen Xing, Rui Wang, Haidong Cao, Qi Dai, Daoguo Dong, Zuxuan Wu
Title: FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
Abstract: Recent advances in trajectory-controllable video generation have achieved remarkable progress. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories. However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead. While existing video distillation methods successfully distill multi-step generators into few-step ones, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy. To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation. We first train a trajectory adapter on a multi-step video generator for precise trajectory control. Then, we distill the generator into a few-step version to accelerate video generation. Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos. For evaluation, we introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.
PaperID: 1367,   Poster  https://arxiv.org/pdf/2512.01763    
Authors: Xurui Zhou, Gongwei Chen, Yuquan Xie, Zaijing Li, Kaiwen Zhou, Shuai Wang, Shuo Yang, Zhuotao Tian, Rui Shao
Title: HiconAgent: History Context-aware Policy Optimization for GUI Agents
Abstract: Graphical User Interface (GUI) agents require effective utilization of historical context to perform sequential navigation tasks. While incorporating past actions and observations can significantly improve decision-making, naively using full history leads to excessive computational overhead and potential distraction from irrelevant information. In this work, we introduce HiconAgent, a GUI agent trained with History Context-aware Policy Optimization (HCPO) for effective and efficient utilization of historical information. HCPO explicitly optimizes history usage in both sampling and policy updates by integrating two complementary components: (1) Dynamic Context Sampling (DCS) presents the agent with variable-length histories during sampling, enabling adaptive use of the most relevant historical context to improve sequential decision quality; (2) Anchor-guided History Compression (AHC) refines the policy update phase via a dual-branch optimization strategy, where the compressed branch drops history observations while keeping history actions as information flow anchors. The compressed and uncompressed branches are coupled through a history-enhanced alignment loss to enforce consistent history usage, achieving efficiency with minimal performance degradation. Extensive experiments on mainstream GUI navigation benchmarks demonstrate the strong performance of our model. Despite its smaller size, HiconAgent-3B outperforms GUI-R1-7B by +8.46% in grounding and +11.32% in step success rate on GUI-Odyssey, while achieving comparable results on AndroidControl and AITW, with up to 2.47× computational speedup and 60% FLOPs reduction.
PaperID: 1368,   Poster  https://arxiv.org/pdf/2511.11450    
Authors: Maximilian Rokuss, Moritz Langenberg, Yannick Kirchhoff, Fabian Isensee, Benjamin Hamm, Constantin Ulrich, Sebastian Regnery, Lukas Bauer, Efthimios Katsigiannopulos, Tobias Norajitra, Klaus Maier-Hein
Title: VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
Abstract: We introduce VoxTell, a vision–language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning 1K+ anatomical and pathological classes, VoxTell uses multi-stage vision–language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code and model will be published at: www.github.com/anonymous
PaperID: 1369,   Poster  https://arxiv.org/pdf/2511.20592    
Authors: Mingxing Mingxing, Bowen Qu, Daniel Moyer
Title: Latent Diffusion Inversion Requires Understanding the Latent Space
Abstract: The recovery of training data from generative models ("model inversion") has been extensively studied for diffusion models in the data domain. The encoder/decoder pair and corresponding latent codes have largely been ignored by inversion techniques applied to latent space generative models, e.g., Latent Diffusion Models (LDMs). In this work we describe two key findings: (1) The diffusion model exhibits non-uniform memorization across latent codes, tending to overfit samples located in high-distortion regions of the decoder pullback metric. (2) Even within a single latent code, different dimensions contribute unequally to memorization. We introduce a principled method to rank latent dimensions by their per-dimensional contribution to the decoder pullback metric, identifying those most responsible for memorization. Empirically, removing less-memorizing dimensions when computing attack statistics for score-based membership inference attacks significantly improves performance, with average AUROC gains of 2.7% and substantial increases in TPR@1%FPR (6.42%) across diverse datasets including CIFAR-10, CelebA, ImageNet-1K, Pokémon, MS-COCO, and Flickr. This indicates stronger confidence in identifying members under extremely low false-positive tolerance. Our results highlight the overlooked influence of the auto-encoder geometry on LDM memorization and provide a new perspective for analyzing privacy risks in diffusion-based generative models.
PaperID: 1370,   Poster  https://arxiv.org/pdf/2510.10285    
Authors: Haolang Lu, Bolun Chu, WeiYe Fu, Guoshun Nan, Junning Liu, Minghui Pan, Qiankun Li, Yi Yu, Hua Wang, Kun Wang
Title: Reallocating Attention Across Layers to Reduce Multimodal Hallucination
Abstract: Multimodal large reasoning models (MLRMs) often suffer from hallucinations that stem not only from insufficient visual grounding but also from imbalanced allocation between perception and reasoning processes. Building upon recent interpretability findings suggesting a staged division of attention across layers, we analyze how this functional misalignment leads to two complementary failure modes: perceptual bias in shallow layers and reasoning drift in deeper layers. To alleviate these issues, we propose Functional Head Identification and Class-Conditioned Rescaling, a lightweight, training-free plugin that identifies perception- and reasoning-oriented heads and adaptively rebalances their layerwise contributions. Our method improves reasoning consistency and visual faithfulness without retraining or any architectural modification. Evaluations across three representative MLRMs and five multimodal reasoning benchmarks show an average 4.2-percentage-point gain, with less than 1% additional computation and only 9% baseline latency. Beyond empirical improvements, our study provides an interpretable perspective on regulating cross-layer functional dynamics to enhance the reliability of multimodal reasoning.
PaperID: 1371,   Poster  https://arxiv.org/pdf/2602.23359    
Authors: Vaibhav Agrawal, Rishubh Parihar, Pradhaan S Bhat, Ravi Kiran Sarvadevabhatla, R. Venkatesh Babu
Title: SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
Abstract: We identify occlusion reasoning as a fundamental yet overlooked aspect of 3D layout–conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset of diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.
PaperID: 1372,   Poster  https://arxiv.org/pdf/2603.08224    
Authors: Ruixiang Zhao, Zhihao Xu, Bangxiang Lan, Zijie Xin, Jingyu Liu, Xirong Li
Title: SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
Abstract: For video-text retrieval, the use of CLIP has been a de facto standard. However, as CLIP provides only image and text encoders, this consensus has led to a biased paradigm that entirely ignores the sound track of videos. While several attempts have been made to reintroduce audio -- typically by incorporating an audio encoder and fusing its output with visual features -- these methods face two challenges: ineffective representation of speech content and suboptimal vision-audio fusion. To address these issues jointly, we propose SAVE, a Speech-Aware Video rEpresentation learning method. SAVE improves upon AVIGATE, a SOTA audiovisual method, with a dedicated speech branch for more effective speech embedding. Furthermore, we introduce soft-ALBEF for early vision-audio alignment that facilitates fusion. Extensive experiments on five benchmarks show that SAVE compares favorably against the SOTA, outperforming AVIGATE by +4.1% on MSRVTT-1k, +1.9% on MSRVTT-3k, +2.5% on VATEX, +9.8% on Charades, and +2.1% on LSMDC in terms of the SumR metric.
PaperID: 1373,   Poster  https://arxiv.org/pdf/2512.03454    
Authors: Haicheng Liao, Huanming Shen, Bonan Wang, yong kang li, Yihong Tang, Chengyue Wang, Dingyi Zhuang, Kehua Chen, HAI YANG, Cheng-Zhong Xu, Zhenning Li
Title: Think Before You Drive: World Model-Inspired Multimodal Grounding
Abstract: Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods in AD struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses SOTA baselines on DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it also shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data. Our anonymous code submission accompanies this paper, and the dataset will be released publicly.
PaperID: 1374,   Poster  https://arxiv.org/pdf/2510.06783    
Authors: Akshit Singh, Shyam Marjit, Wei Lin, Paul Gavrikov, Serena Yeung, Hilde Kuehne, Rogerio Feris, Sivan Doveh, James Glass, Muhammad Jehanzeb Mirza
Title: TTRV: Test-Time Reinforcement Learning for Vision Language Models
Abstract: Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision–language understanding by adapting the model on-the-fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to Intern-VL-8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.
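Illustrative note (a minimal sketch, not the authors' reward design): the abstract above describes label-free rewards built from how often each answer appears across repeated samples, together with a bonus for low entropy of the empirical output distribution. The functions below show one simple way such signals could be computed. The additive combination and the weight `alpha` are assumptions made for illustration only.

```python
from collections import Counter
from math import log

def frequency_rewards(samples):
    """Reward each sampled answer by its empirical frequency among all samples."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[s] / n for s in samples]

def empirical_entropy(samples):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * log(c / n) for c in counts.values())

def combined_rewards(samples, alpha=0.5):
    # alpha is a hypothetical weight; lower entropy gives a larger shared bonus.
    bonus = -alpha * empirical_entropy(samples)
    return [r + bonus for r in frequency_rewards(samples)]

# Four sampled answers for one unlabeled test image.
samples = ["cat", "cat", "dog", "cat"]
rewards = combined_rewards(samples)
```

Here the majority answer "cat" receives a higher reward than "dog", and a fully consistent set of samples has zero entropy and therefore no entropy penalty.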
PaperID: 1375,   Poster  https://arxiv.org/pdf/2511.12795    
Authors: Boshu Lei, Wen Jiang, Kostas Daniilidis
Title: ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model
Abstract: Grasping in a densely cluttered environment is a challenging task for robots. Previous methods tried to solve this problem by actively gathering multiple views before grasp pose generation. However, they either overlooked the importance of the grasp distribution for information gain estimation or relied on the projection of the grasp distribution, which ignores the structure of grasp poses on the SE(3) manifold. To tackle these challenges, we propose a calibrated energy-based model for grasp pose generation and an active view selection method that estimates information gain from the grasp distribution. Our energy-based model captures the multi-modal nature of the grasp distribution on the SE(3) manifold. The energy level is calibrated to the success rate of grasps so that the predicted distribution aligns with the real distribution. The next best view is selected by estimating the information gain for grasping from the calibrated distribution conditioned on the reconstructed environment, which could efficiently drive the robot to explore affordable parts of the target object. Experiments on simulated environments and real robot setups demonstrate that our model could successfully grasp objects in a cluttered environment with limited view budgets compared to previous state-of-the-art models. Our simulated environment can serve as a reproducible platform for future research on active grasping. The source code will be made public when the paper is released.
PaperID: 1376,   Poster  https://arxiv.org/pdf/2604.20539    
Authors: Mingze Sun, Cheng Zeng, Pei Jiansong, Junhao Chen, Chaoyue Song, Shaohui Wang, Tianyuan Chang, Bin Huang, Zijiao Zeng, Ruqi Huang
Title: Animator-Centric Skeleton Generation on Objects with Fine-Grained Details
Abstract: Skeleton generation (SG) is essential for animating 3D assets, but current deep learning methods remain limited: they cannot handle the growing structural complexity of modern models and offer minimal controllability, creating a major bottleneck for real-world animation workflows. To address this, we propose an animator-centric SG framework that achieves high-quality skeleton prediction on complex inputs while providing intuitive control handles. Our contributions are threefold. First, we curate a large-scale dataset of 82,633 rigged meshes with diverse and complicated structures. Second, we introduce a novel semantic-aware tokenization scheme for auto-regressive modeling. This scheme effectively complements purely geometric prior methods by subdividing bones into semantically meaningful groups, thereby enhancing robustness to structural complexity and enabling a key control mechanism. Third, we design a learnable density interval module that allows animators to exert soft, direct control over bone density. Extensive experiments demonstrate that our framework not only generates high-quality skeletons for challenging inputs but also successfully fulfills two critical requirements from professional animators. Our work paves the way for more flexible and efficient animation pipelines.
PaperID: 1377,   Poster  https://arxiv.org/pdf/2602.19530    
Authors: Omprakash Chakraborty, Jose Dolz, Ismail Ben Ayed
Title: ORION: ORthonormal Text Encoding for Universal VLM AdaptatION
Abstract: Vision-language models (VLMs) have demonstrated remarkable generalization across diverse tasks, yet their performance remains constrained by the quality and geometry of the textual prototypes used to represent classes. Standard zero-shot classifiers, derived from frozen text encoders and handcrafted prompts, may yield correlated or weakly separated embeddings that limit task-specific discriminability. We introduce ORION, a text encoder fine-tuning framework that improves pretrained VLMs using only class names. Our method optimizes, via low-rank adaptation, a novel loss integrating two terms, one promoting pairwise orthogonality between the textual representations of the classes of a given task and the other penalizing deviations from the initial class prototypes. Furthermore, we provide a probabilistic interpretation of our orthogonality penalty, connecting it to the general maximum likelihood estimation (MLE) principle via Huygens’ theorem. We report extensive experiments on 11 benchmarks and three large VLM backbones, showing that the refined textual embeddings yield powerful replacements for the standard CLIP prototypes. Added as a plug-and-play module on top of various state-of-the-art methods, and across different prediction settings (zero-shot, few-shot and test-time adaptation), ORION improves performance consistently and significantly.
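Illustrative note (a toy sketch under stated assumptions, not the ORION loss itself): the abstract above describes a loss with two terms, one encouraging pairwise orthogonality between class text prototypes and one penalizing drift from the initial prototypes. The exact form is not given in the abstract; the squared dot products, the squared Euclidean distance, and the weight `lam` below are assumptions chosen for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonality_penalty(prototypes):
    """Sum of squared dot products over all distinct pairs of class prototypes."""
    total = 0.0
    for i in range(len(prototypes)):
        for j in range(i + 1, len(prototypes)):
            total += dot(prototypes[i], prototypes[j]) ** 2
    return total

def anchor_penalty(prototypes, initial):
    """Sum of squared Euclidean distances to the initial prototypes."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, p0))
        for p, p0 in zip(prototypes, initial)
    )

def toy_loss(prototypes, initial, lam=1.0):
    # lam is a hypothetical trade-off weight, not a value from the paper.
    return orthogonality_penalty(prototypes) + lam * anchor_penalty(prototypes, initial)

initial = [[1.0, 0.0], [0.6, 0.8]]     # correlated unit-norm prototypes
orthogonal = [[1.0, 0.0], [0.0, 1.0]]  # fully orthogonal alternative
```

With these toy vectors, moving to the orthogonal set removes the orthogonality penalty entirely but pays an anchor penalty for drifting from the initial prototypes, which shows the trade-off the two terms are meant to balance.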
PaperID: 1378,   Poster  https://arxiv.org/pdf/2601.16046    
Authors: Junha Lee, Eunha Park, Minsu Cho
Title: DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning
Abstract: Language-driven dexterous grasp generation requires the models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves 67.14% success rate, outperforming state-of-the-art by 3.83%p with 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.
PaperID: 1379,   Poster  https://arxiv.org/pdf/2602.22620    
Authors: Tomoya Tsuchida, Keita Takahashi, Chihiro Tsutake, Toshiaki Fujii, Hajime Nagahara
Title: Coded-E2LF: Coded Aperture Light Field Imaging from Events
Abstract: We propose Coded-E2LF (coded event to light field), a computational imaging method for acquiring a 4-D light field using a coded aperture and a stationary event-only camera. In a previous work, an imaging system similar to ours was adopted, but both events and intensity images were captured and used for light field reconstruction. In contrast, our method is purely event-based, which relaxes restrictions for hardware implementation. We also introduce several advancements from the previous work that enable us to theoretically support and practically improve light field reconstruction from events alone. In particular, we clarify the key role of a black pattern in aperture coding patterns. We finally implemented our method on real imaging hardware to demonstrate its effectiveness in capturing real 3-D scenes. To the best of our knowledge, we are the first to demonstrate that a 4-D light field with pixel-level accuracy can be reconstructed from events alone. Our software is included in the supplementary material.
PaperID: 1380,   Poster  https://arxiv.org/pdf/2603.16936    
Authors: Luchuan Song, Pinxin Liu, Haiyang Liu, Zhenchao Jin, Yunlong Tang, Zichong Xu, Susan Liang, Jing Bi, Jason Corso, Chenliang Xu
Title: Bridging Facial Understanding and Animation via Language Models
Abstract: Text-guided human body animation has advanced rapidly, yet facial animation lags due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. We design a prompt suite covering emotions and head motions, generate about 80 hours of facial videos with multiple generators, and fit per-frame 3D facial parameters, yielding large-scale (prompt and parameter) pairs for training. Building on this dataset, we probe language models for bidirectional competence over facial motion via two complementary tasks: (1) Motion2Language: given a sequence of 3D facial parameters, the model produces natural-language descriptions capturing content, style, and dynamics; and (2) Language2Motion: given a prompt, the model synthesizes the corresponding sequence of 3D facial parameters via quantized motion tokens for downstream animation. Extensive experiments show that in this setting language models can both interpret and synthesize facial motion with strong generalization. To the best of our knowledge, this is the first work to cast facial-parameter modeling as a language problem, establishing a unified path for text-conditioned facial animation and motion understanding.
PaperID: 1381,   Poster  https://arxiv.org/pdf/2511.19278    
Authors: Qianying Liu, Xiao Liang, Zhiqiang Zhang, Yibo Chen, Xu Tang, Zhongfei Qing, Fengfan Zhou, Yao Hu, Paul Henderson
Title: ReMatch: Boosting Representation through Matching for Multimodal Retrieval
Abstract: We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature and underutilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained contextual, mutually orthogonal embeddings with low inference cost. Leveraging our established high-performance baseline, we assemble the ideas mentioned above into a powerful training recipe and achieve a new state-of-the-art on the Massive Multimodal Embedding Benchmark (MMEB). Our experiments show particularly strong zero-shot generalization results on five datasets, highlighting the robustness and transferability of ReMatch.
PaperID: 1382,   Poster  https://arxiv.org/pdf/2602.08029    
Authors: Berthy Feng, Andrew Chael, David Bromley, Aviad Levis, William Freeman, Katie Bouman
Title: Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields
Abstract: With the success of static black-hole imaging, the next frontier is the dynamic and 3D imaging of black holes. Recovering the dynamic 3D gas near a black hole would reveal previously-unseen parts of the universe and inform new physics models. However, only sparse radio measurements from a single viewpoint are possible, making the dynamic 3D reconstruction problem significantly ill-posed. Previously, BH-NeRF addressed the ill-posed problem by assuming Keplerian dynamics of the gas, but this assumption breaks down near the black hole, where the strong gravitational pull of the black hole and increased electromagnetic activity complicate fluid dynamics. To overcome the restrictive assumptions of BH-NeRF, we propose PINeRF, a physics-informed approach that uses differentiable neural rendering to fit a 4D (time + 3D) emissivity field given EHT measurements. Our approach jointly reconstructs the 3D velocity field with the 4D emissivity field and enforces the velocity as a soft constraint on the dynamics of the estimated emissivity. In experiments on simulated data, we find significantly improved reconstruction accuracy over both BH-NeRF and a totally physics-agnostic approach. We demonstrate how our method can be used to estimate other physics parameters of the black hole, such as its spin.
PaperID: 1383,   Poster  https://arxiv.org/pdf/2604.08366    
Authors: Tolga Dimlioglu, Nadine Chang, Maying Shen, Rafid Mahmood, Jose M. Alvarez
Title: Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems
Abstract: Large-scale deep learning models for physical AI applications depend on diverse training data collection efforts. These models, and correspondingly the training data, must address the different evaluation criteria necessary for the models to be deployable in real-world environments. Data selection policies can guide the development of the training set, but current frameworks do not account for the ambiguity in how data points affect different metrics. In this work, we propose Mixture Optimization via Scaling-Aware Iterative Collection (MOSAIC), a general data selection framework that operates by: (i) partitioning the dataset into domains; (ii) fitting neural scaling laws from each data domain to the evaluation metrics; and (iii) optimizing a data mixture by iteratively adding data from domains that maximize the change in metrics. We apply MOSAIC to autonomous driving (AD), where an End-to-End (E2E) planner model is evaluated on the Extended Predictive Driver Model Score (EPDMS), an aggregate of driving rule compliance metrics. Here, MOSAIC outperforms a diverse set of baselines on EPDMS with up to 80% less data.
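Illustrative sketch (not from the paper): steps (ii) and (iii) above can be pictured as a greedy loop over fitted scaling curves. The saturating power-law form, the single scalar metric per domain, and the names power_law and greedy_mixture are all assumptions made for this example; the abstract does not specify MOSAIC's actual curve family or optimizer.

```python
from typing import Callable, Dict


def power_law(a: float, b: float, c: float) -> Callable[[float], float]:
    """Assumed scaling-law form: score(n) = c - a * (n + 1) ** (-b)."""
    def curve(n: float) -> float:
        return c - a * (n + 1.0) ** (-b)
    return curve


def greedy_mixture(
    curves: Dict[str, Callable[[float], float]],
    budget: int,
    step: int,
) -> Dict[str, int]:
    """Spend the sample budget in chunks of `step`, always on the domain
    whose fitted curve predicts the largest metric gain for the next chunk."""
    counts = {domain: 0 for domain in curves}
    for _ in range(budget // step):
        best = max(
            curves,
            key=lambda d: curves[d](counts[d] + step) - curves[d](counts[d]),
        )
        counts[best] += step
    return counts


# Hypothetical domains: "urban" has more headroom than "highway".
curves = {
    "urban": power_law(a=10.0, b=0.5, c=0.0),
    "highway": power_law(a=1.0, b=0.5, c=0.0),
}
mixture = greedy_mixture(curves, budget=1000, step=100)
```

Because each curve has diminishing returns, the loop shifts to another domain once the leading domain's marginal gain drops below the alternatives, which is the intuition behind selecting data by predicted change in metrics.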
PaperID: 1384,   Poster  https://arxiv.org/pdf/2604.19406    
Authors: Fan Li, Chonghuinan Wang, Lina Lei, Yuping Qiu, Jiaqi Xu, Jiaxiu Jiang, Xinran Qin, Zhikai Chen, Fenglong Song, Zhixin Wang, Renjing Pei, Wangmeng Zuo
Title: HP-Edit: A Human-Preference Post-Training Framework for Image Editing
Abstract: Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs. To fill this gap, we propose HP-Edit, a post-training framework for Human Preference-aligned Editing, and introduce RealPref-50K, a real-world dataset across eight common tasks and balancing common object editing. Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer, an automatic, human preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model. We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human-preferred results.
PaperID: 1385,   Poster  https://arxiv.org/pdf/2512.11253    
Authors: Zhiyuan Li, Chi-Man Pun, Chen Fang, Jue Wang, Xiaodong Cun
Title: PersonaLive! Expressive Portrait Image Animation for Live Streaming
Abstract: Current diffusion-based portrait animation models predominantly focus on enhancing visual quality and expression realism, while overlooking generation latency and real-time performance, which restricts their application range in the live streaming scenario. We propose PersonaLive, a novel diffusion-based framework towards streaming real-time portrait animation with multi-stage training recipes. Specifically, we first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Then, a fewer-step appearance distillation strategy is proposed to eliminate appearance redundancy in the denoising process, greatly improving inference efficiency. Finally, we introduce an autoregressive micro-chunk streaming generation paradigm equipped with a sliding training strategy and a historical keyframe mechanism to enable low-latency and stable long-term video generation. Extensive experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to 7-22× speedup over prior diffusion-based portrait animation models. The code will be publicly available.
PaperID: 1386,   Poster  https://arxiv.org/pdf/2506.22881    
Authors: Fumiya Uchiyama, Rintaro Yanagi, Shohei Taniguchi, Shota Takashiro, Masahiro Suzuki, Hirokatsu Kataoka, Yusuke Iwasawa, Yutaka Matsuo
Title: CLIP-like Model as a Foundational Density Ratio Estimator
Abstract: Density ratio estimation is a core concept in statistical machine learning because it provides a unified mechanism for tasks such as importance weighting, divergence estimation, and likelihood-free inference, but its potential in vision and language models has not been fully explored. Modern vision-language encoders such as CLIP and SigLIP are trained with contrastive objectives that implicitly optimize log density ratios between joint and marginal image–text distributions, so their learned similarity scores are proportional to log density ratios. However, prior work has largely focused on their embedding utility, and the density-ratio structure induced by contrastive learning has not been systematically examined or exploited in multimodal applications. To address this gap, we reinterpret CLIP-style models as pretrained and general-purpose density ratio estimators and show that this perspective enables new algorithmic capabilities. We present a unified explanation of how contrastive objectives estimate density ratios and propose two practical applications: Importance Weight Learning and KL divergence estimation. Our Importance Weight Learning method requires only a single additional prompt and improves F1 scores by up to 7 points. We further show that CLIP-based density ratios support estimation of KL divergences that quantify how conditioning on an image or text alters the distribution of the other modality. Through qualitative examples and an N-gram analysis of captions, we find that these divergences capture semantic diversity and mode structure in multimodal data. Leveraging this property, we introduce a simple KL-guided data curation method that achieves performance competitive with LAION2B filtering. Our code will be publicly available.
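Illustrative note (not from the paper): the link between contrastive training and density ratios can be written out for the standard softmax (InfoNCE-style) objective. The notation below, with critic s(x, y) for an image x and text y, is an assumption for this sketch; the paper's own derivation, and the sigmoid loss used by SigLIP, may differ in constants.

```latex
% InfoNCE-style contrastive loss over a batch of N pairs, with critic s(x, y):
\mathcal{L}(s) = -\,\mathbb{E}\left[\log \frac{\exp s(x_i, y_i)}{\sum_{j=1}^{N} \exp s(x_i, y_j)}\right]

% A minimizer has the form below, where c(x) is an arbitrary function of x only:
s^{*}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)} + c(x)

% The KL divergence measuring how conditioning on x alters the text distribution
% is the expected log density ratio under p(y | x):
\mathrm{KL}\big(p(y \mid x)\,\|\,p(y)\big)
  = \mathbb{E}_{y \sim p(y \mid x)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right]
```

Under this reading, exponentiated similarity scores act as importance weights up to the factor exp c(x), which is why a pretrained contrastive model can be reused for importance weighting and divergence estimation.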
PaperID: 1387,   Poster  https://arxiv.org/pdf/2511.17309    
Authors: David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman
Title: MuM: Multi-View Masked Image Modeling for 3D Vision
Abstract: Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensively on downstream tasks including feedforward reconstruction, dense image matching and relative pose estimation, finding that it outperforms the state-of-the-art visual encoders DINOv3 and CroCo v2.
PaperID: 1388,   Poster  https://arxiv.org/pdf/2507.22052    
Authors: ZIREN GONG, Xiaohan Li, Fabio Tosi, Jiawei Han, Stefano Mattoccia, Jianfei Cai, Matteo Poggi
Title: Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
Abstract: We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips alongside object-level semantics; and 2D–3D OVS, a 2D–3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.
PaperID: 1389,   Poster  https://arxiv.org/pdf/2511.18333    
Authors: Xuanke Shi, Boxuan Li, Xiaoyang Han, Zhongang Cai, Lei Yang, Quan Wang, Dahua Lin
Title: ConsistCompose: Unified Multimodal Layout Control for Image Composition
Abstract: Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding (aligning language with image regions), while their generative counterpart, linguistic-embedded layout-grounded generation (LELG) for layout-controllable multi-instance generation, remains underexplored, which limits precise compositional control. We present ConsistCompose, a unified multimodal framework that embeds layout coordinates directly into language prompts, enabling layout-controlled multi-instance image generation from Interleaved Image-Text within a single generative interface. We further construct ConsistCompose3M, a 3.4M multi-instance generation dataset with layout and identity annotations (2.6M text-guided and 0.8M image-guided data pairs) that provides large-scale supervision for layout-conditioned generation. Within this framework, LELG is instantiated through instance–coordinate binding prompts and coordinate-aware classifier-free guidance, which translate linguistic layout cues into precise spatial control without task-specific branches. Experiments on COCO-Position and MS-Bench show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines while preserving identity fidelity and competitive general multimodal understanding, establishing a unified paradigm for layout-controllable multimodal image generation.
PaperID: 1390,   Poster  https://arxiv.org/pdf/2603.26096    
Authors: Hyeongyu Kim, GeonHui Han, Dosik Hwang
Title: AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation
Abstract: Test-time adaptation (TTA) aims to mitigate performance degradation under distribution shifts by updating model parameters during inference. Existing approaches have primarily framed adaptation around affine modulation, focusing on recalibrating normalization layers. This perspective, while effective, overlooks another influential component in representation dynamics: the activation function. We revisit this overlooked space and propose AcTTA, an activation-aware framework that reinterprets conventional activation functions from a learnable perspective and updates them adaptively at test time. AcTTA reformulates conventional activation functions (e.g., ReLU, GELU) into parameterized forms that shift their response threshold and modulate gradient sensitivity, enabling the network to adjust activation behavior under domain shifts. This functional reparameterization enables continuous adjustment of activation behavior without modifying network weights or requiring source data. Despite its simplicity, AcTTA achieves robust and stable adaptation across diverse corruptions. Across CIFAR10-C, CIFAR100-C, and ImageNet-C, AcTTA consistently surpasses normalization-based TTA methods. Our findings highlight activation adaptation as a compact and effective route toward domain-shift–robust test-time learning, broadening the prevailing affine-centric view of adaptation.
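Illustrative sketch (not from the paper): one simple way to picture a parameterized activation with a shiftable threshold and an adjustable slope is shown below. The functional form alpha * max(0, x - tau) and the names adaptive_relu, tau, and alpha are assumptions for this example; the abstract does not give AcTTA's actual parameterization or its test-time update rule.

```python
from typing import List


def adaptive_relu(x: float, tau: float = 0.0, alpha: float = 1.0) -> float:
    """Hypothetical parameterized ReLU: alpha * max(0, x - tau).

    tau shifts the response threshold; alpha scales the slope (and hence
    the gradient) on the active side. tau=0, alpha=1 recovers plain ReLU.
    """
    return alpha * max(0.0, x - tau)


def apply_activation(
    values: List[float], tau: float = 0.0, alpha: float = 1.0
) -> List[float]:
    """Apply the parameterized activation elementwise."""
    return [adaptive_relu(v, tau, alpha) for v in values]


features = [-1.0, 0.5, 2.0]
plain = apply_activation(features)                        # behaves like ReLU
adapted = apply_activation(features, tau=0.5, alpha=2.0)  # shifted and rescaled
```

In a test-time setting, only tau and alpha would be updated while the network weights stay frozen, which matches the abstract's claim of adapting without modifying weights.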
PaperID: 1391,   Poster  https://arxiv.org/pdf/2510.13714    
Authors: Dan Jacobellis, Mateen Ulhaq, Fabien Racapé, Hyomin Choi, Neeraja Yadwadkar
Title: Dedelayed: Deleting remote inference delay via on-device correction
Abstract: Video comprises the vast majority of bits that are generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology. Yet, the most powerful video understanding models are too expensive for the resource-constrained platforms used in these applications. One approach is to offload inference to the cloud; this gives access to GPUs capable of processing high-resolution videos in real time. But even with reliable, high-bandwidth communication channels, the combined latency of video encoding, model inference, and round-trip communication prohibits use for certain real-time applications. The alternative is to use fully local inference; but this places extreme constraints on computational and power costs, requiring smaller models and lower resolution, leading to degraded accuracy. To address these challenges, we propose Dedelayed, a real-time inference system that divides computation between a remote model operating on delayed video frames and a local model with access to the current frame. The remote model is trained to make predictions on anticipated future frames, which the local model incorporates into its prediction for the current frame. The local and remote models are jointly optimized with an autoencoder that limits the transmission bitrate required by the available downlink communication channel. We evaluate Dedelayed on the task of real-time streaming video segmentation using the BDD100k driving dataset. For a round-trip delay of 100 ms, Dedelayed improves performance by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference, an improvement equivalent to using a model ten times larger. We release our training code, pretrained models, and Python library at URL REDACTED FOR ANONYMITY.
PaperID: 1392,   Poster  https://arxiv.org/pdf/2511.10040    
Authors: Xinran Yang, Shuichang Lai, Jiangjing Lyu, Hongjie Li, Bowen Pan, Yuanqi Li, Jie Guo, Zhengkang Zhou, Yanwen Guo
Title: LoG3D: Ultra-High-Resolution 3D Shape Modeling via Local-to-Global Partitioning
Abstract: Generating high-fidelity 3D content remains a fundamental challenge due to the complexity of representing arbitrary topologies (such as open surfaces and intricate internal structures) while preserving geometric details. Prevailing methods based on signed distance fields (SDFs) are hampered by costly watertight preprocessing and struggle with non-manifold geometries, while point-cloud representations often suffer from sampling artifacts and surface discontinuities. To overcome these limitations, we propose a novel 3D variational autoencoder (VAE) framework built upon unsigned distance fields (UDFs), a more robust and computationally efficient representation that naturally handles complex and incomplete shapes. Our core innovation is a local-to-global (LoG) architecture that processes the UDF by partitioning it into uniform subvolumes, termed UBlocks. This architecture couples 3D convolutions for capturing local detail with sparse transformers for enforcing global coherence. A Pad-Average strategy further ensures smooth transitions at subvolume boundaries during reconstruction. This modular design enables seamless scaling to ultra-high resolutions up to 2048^3, a regime previously unattainable for 3D VAEs. Experiments demonstrate state-of-the-art performance in both reconstruction accuracy and generative quality, yielding superior surface smoothness and geometric flexibility.
PaperID: 1393,   Poster  https://arxiv.org/pdf/2602.22013    
Authors: I-Hsiang (Aaron) Chen, Yu-Wei Liu, Tse-Yu Wu, Yu-Chien Chiang, Jen-Chieh Yang, Wei-Ting Chen
Title: RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
Abstract: Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations. Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.
PaperID: 1394,   Poster  https://arxiv.org/pdf/2603.00574    
Authors: Yongbo He, Zirun Guo, Tao Jin
Title: Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation
Abstract: Adapting pretrained multimodal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL divergence regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.
PaperID: 1395,   Poster  https://arxiv.org/pdf/2603.11554    
Authors: Lirong Che, Shuo Wen, Huang Shan, wang chuang, yuzhe yang, Gregory Dudek, Xueqian Wang, Jian Su
Title: MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks
Abstract: Real-world robotic tasks are long-horizon and often span multiple floors, requiring complex spatial reasoning. Existing embodied benchmarks, however, are largely confined to single-floor homes, failing to evaluate agents on realistic, building-scale tasks. We introduce MANSION, a language-driven framework for generating building-scale, multi-floor 3D environments for long-horizon tasks. Using this framework, we release MansionWorld, a large-scale dataset featuring over 1,000 diverse, non-residential buildings. These environments support cross-floor skills and long-horizon task generation on reusable building layouts. Experiments show that current methods degrade sharply on our multi-floor tasks, highlighting both the challenge and the value of this setting for advancing embodied AI.
PaperID: 1396,   Poster  https://arxiv.org/pdf/2511.22249    
Authors: Bolin Lai, XuDong Wang, Sai Saketh Rambhatla, James Rehg, Zsolt Kira, Rohit Girdhar, Ishan Misra
Title: Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective
Abstract: Latent diffusion has become the default paradigm for visual generation, yet we observe a persistent reconstruction–generation tradeoff as latent dimensionality increases: higher-capacity autoencoders improve reconstruction fidelity but generation quality eventually declines. We trace this gap to the different behaviors in high-frequency tokenization and detokenization. Through controlled perturbations in both RGB and latent domains, we analyze encoder/decoder behaviors and find that decoders depend strongly on high-frequency latent components to recover details, whereas encoders under-represent high-frequency contents, yielding insufficient exposure and underfitting in high-frequency bands for diffusion model training. To address this issue, we introduce FreqWarm, a plug-and-play frequency warm-up curriculum that increases early-stage exposure to high-frequency latent signals during diffusion or flow-matching training, without modifying or retraining the autoencoder. Applied across several high-dimensional tokenizers, FreqWarm consistently improves generation quality: decreasing gFID by 14.11 on Wan2.2-VAE, 6.14 on LTX-VAE, and 4.42 on DC-AE-f32, while remaining architecture-agnostic and compatible with diverse backbones. Our study shows that explicitly managing frequency exposure can successfully turn high-dimensional latent spaces into more diffusible targets.
PaperID: 1397,   Poster  https://arxiv.org/pdf/2603.30038    
Authors: Wenyi Li, Renkai Luo, Yue Yu, Huan-ang Gao, Mingju Gao, Li Yuan, Chaoyou Fu, Hao Zhao
Title: Benchmarking PhD-Level Coding in 3D Geometric Computer Vision
Abstract: AI-assisted coding has rapidly reshaped software practice and research workflows, yet today’s models still struggle to produce correct code for complex 3D geometric vision. If models could reliably write such code, the research of our community would change substantially. To measure progress toward that goal, we introduce GeoCodeBench, a PhD-level benchmark that evaluates coding for 3D vision. Each problem is a fill-in-the-function implementation task curated from representative papers at recent venues: we first let a tool propose candidate functions from official repositories, then perform careful human screening to select core 3D geometric components. For every target, we generate diverse, edge-case unit tests, enabling fully automatic, reproducible scoring. We evaluate eight representative open- and closed-source models to reflect the current ecosystem. The best model, GPT-5, attains only 28.0% pass rate, revealing a large gap between current capabilities and dependable 3D scientific coding. GeoCodeBench organizes tasks into a two-level hierarchy: General 3D capability (geometric transformations and mechanics/optics formulation) and Research capability (novel algorithm implementation and geometric logic routing). Scores are positively correlated across these axes, but research-oriented tasks are markedly harder. Context ablations further show that “more paper text” is not always better: cutting off at the Method section statistically outperforms full-paper inputs, highlighting unresolved challenges in long-context scientific comprehension. Together, these findings position GeoCodeBench as a rigorous testbed for advancing from generic coding to trustworthy 3D geometric vision coding.
PaperID: 1398,   Poster  https://arxiv.org/pdf/2511.12578    
Authors: Yukuo Ma, Cong Liu, Junke Wang, Junqi Liu, Haibin Huang, Zuxuan Wu, Chi Zhang, Xuelong Li
Title: TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction
Abstract: We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis. Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.
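Illustrative sketch (not from the paper): the coarse-to-fine ordering described above can be pictured as a schedule of frame indices, where each level fills in frames between those already generated. Doubling the frame rate at every level and the name frame_rate_levels are assumptions for this example; the abstract does not state the actual rate progression.

```python
from typing import List


def frame_rate_levels(num_frames: int, num_levels: int) -> List[List[int]]:
    """Return, per level, the frame indices newly generated at that level.

    Level 0 samples every 2**(num_levels - 1)-th frame (the coarse
    blueprint); each later level halves the stride and adds only the
    frames not produced by earlier levels.
    """
    generated = set()
    levels = []
    for level in range(num_levels):
        stride = 2 ** (num_levels - 1 - level)
        new_frames = [t for t in range(0, num_frames, stride) if t not in generated]
        generated.update(new_frames)
        levels.append(new_frames)
    return levels


schedule = frame_rate_levels(num_frames=9, num_levels=3)
# schedule == [[0, 4, 8], [2, 6], [1, 3, 5, 7]]
```

All frames within one level can be produced together (the bidirectional, parallel part), while each level conditions on the levels before it (the autoregressive part).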
PaperID: 1399,   Poster  https://arxiv.org/pdf/2507.13353    
Authors: Shihao Wang, Guo Chen, De-An Huang, Zhiqi Li, Minghan LI, Guilin Liu, Jan Kautz, Jose M. Alvarez, Lei Zhang, Zhiding Yu
Title: VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
Abstract: While Video Large Language Models (VideoLLMs) have shown significant potential in multimodal understanding and reasoning tasks, how to efficiently select the most informative frames from videos remains a critical challenge. Existing methods attempt to optimize frame sampling by reducing inter-frame redundancy or employing unsupervised event localization. However, these approaches often fall short in handling complex instruction-following tasks and scenarios that demand precise temporal modeling, resulting in limited performance in both semantic alignment and temporal reasoning. To address the above challenges, we introduce Instructed Temporal Grounding for Videos (VideoITG), a framework aiming to adaptively customize frame sampling strategies based on user instructions. Specifically, we design the VidThinker pipeline, which automates annotation by generating instruction-conditioned captions, retrieving relevant video segments, and selecting key frames to enable efficient supervision. Using VidThinker, we build the VideoITG-40K dataset with 40K videos and 500K temporal grounding annotations. Our plug-and-play VideoITG model leverages Video-LLMs’ visual-language alignment and reasoning for discriminative frame selection. VideoITG consistently boosts the performance on multiple multimodal video understanding benchmarks, demonstrating its effectiveness and potential.
PaperID: 1400,   Poster  https://arxiv.org/pdf/2512.09923    
Authors: Or Hirschorn, Omer Sela, Inbar Huberman-Spiegelglas, Netalee Efrat Sela, Eli Alshan, Ianir Ideses, Frederic Devernay, Yochai Zvik, Lior Fritz
Title: Splatent: Splatting Diffusion Latents for Novel View Synthesis
Abstract: Radiance field representations have recently been explored in the latent space of VAEs that are commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: the VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pre-trained diffusion models to recover fine-grained details, at the risk of some hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D space, we recover them in 2D from input views through multi-view attention mechanisms. This approach preserves the reconstruction quality of pretrained VAEs while achieving faithful detail recovery. Evaluated across multiple benchmarks, Splatent establishes a new state-of-the-art for VAE latent radiance field reconstruction. We further demonstrate that integrating our method with existing feed-forward frameworks consistently improves detail preservation, opening new possibilities for high-quality sparse-view 3D reconstruction. We will release our code upon publication.
PaperID: 1401,   Poster  https://arxiv.org/pdf/2411.16750    
Authors: Ziyao Zeng, Jingcheng Ni, Daniel Wang, Patrick Rim, Younjoon Chung, Fengyu Yang, Byung-Woo Hong, Alex Wong
Title: Iris: Integrating Language into Diffusion-based Monocular Depth Estimation
Abstract: Traditional monocular depth estimation suffers from inherent ambiguity and visual nuisances. We demonstrate that language can enhance monocular depth estimation by providing an additional condition (rather than images alone) aligned with plausible 3D scenes, thereby reducing the solution space for depth estimation. This conditional distribution is learned during the text-to-image pre-training of diffusion models. To generate images under various viewpoints and layouts that precisely reflect textual descriptions, the model implicitly models object sizes, shapes, and scales, their spatial relationships, and the overall scene structure. In this paper, we present Iris and investigate the benefits of integrating text descriptions into the training and inference of diffusion-based depth estimation models. We experiment with three different diffusion-based monocular depth estimators (Marigold, Lotus, and E2E-FT) and their variants. By training on HyperSim and Virtual KITTI, and evaluating on NYUv2, KITTI, ETH3D, ScanNet, and DIODE, we find that our strategy improves the overall monocular depth estimation accuracy, especially in small areas. It also improves the model's depth perception of specific regions described in the text. We find that by providing more details in the text, the depth prediction can be iteratively refined. Simultaneously, we find that language can act as a constraint to accelerate the convergence of both training and the inference diffusion trajectory. Code and generated text data will be released upon acceptance.
PaperID: 1402,   Poster  https://arxiv.org/pdf/2603.14610    
Authors: Harel Yadid, Meir Yossef Levi, Roy Betser, Guy Gilboa
Title: Make it SING: Analyzing Semantic Invariants in Classifiers
Abstract: All classifiers, including state-of-the-art vision models, possess invariants, partially rooted in the geometry of their linear mappings. These invariants, which reside in the null-space of the classifier, induce equivalent sets of inputs that map to identical outputs. The semantic content of these invariants remains vague, as existing approaches struggle to provide human-interpretable information. To address this gap, we present Semantic Interpretation of the Null-space Geometry (SING), a method that constructs equivalent images, with respect to the network, and assigns semantic interpretations to the available variations. We use a mapping from network features to multi-modal vision language models. This allows us to obtain natural language descriptions and visual examples of the induced semantic shifts. SING can be applied to a single image, uncovering local invariants, or to sets of images, allowing a breadth of statistical analysis at the class and model levels. For example, our method reveals that ResNet50 leaks relevant semantic attributes to the null space, whereas DINO-ViT, a ViT pretrained with self-supervised DINO, is superior in maintaining class semantics across the invariant space.
PaperID: 1403,   Poster  https://arxiv.org/pdf/2602.22862    
Authors: Enda Xiang, Haoxiang Ma, Xinzhu Ma, Zicheng Liu, Di Huang
Title: GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion
Abstract: This paper focuses on enhancing the grasping precision and generalization of manipulation policies learned via imitation learning. Diffusion-based policy learning methods have recently become the mainstream approach for robotic manipulation tasks. As grasping is a critical subtask in manipulation, the ability of imitation-learned policies to execute precise and generalizable grasps merits particular attention. Existing imitation learning techniques for grasping often suffer from imprecise grasp executions, limited spatial generalization, and poor object generalization. To address these challenges, we incorporate grasp prior knowledge into the diffusion policy framework. In particular, we employ a latent diffusion policy to guide action chunk decoding with a grasp pose prior, ensuring that generated motion trajectories adhere closely to feasible grasp configurations. Furthermore, we introduce a self-supervised reconstruction objective during diffusion to embed the graspness prior: at each reverse diffusion step, we reconstruct wrist-camera images with back-projected graspness from the intermediate representations. Both simulation and real robot experiments demonstrate that our approach significantly outperforms baseline methods and exhibits strong dynamic grasping capabilities.
PaperID: 1404,   Poster  https://arxiv.org/pdf/2603.22918    
Authors: Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang, Xuanyu Zheng, Haonan Duan, Hao Lu, Hanming Deng, Lewei Lu
Title: EVA: Efficient Reinforcement Learning for End-to-End Video Agent
Abstract: Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary–plan–action–reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline, comprising supervised fine-tuning (SFT), Kahneman–Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO), that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods.
PaperID: 1405,   Poster  https://arxiv.org/pdf/2504.13204    
Authors: Dmytro Kotovenko, Olga Grebenkova, Björn Ommer
Title: EDGS: Eliminating Densification for Efficient Convergence of 3DGS
Abstract: 3D Gaussian Splatting reconstructs scenes by starting from a sparse Structure-from-Motion initialization and refining under-reconstructed regions. This process is slow, as it requires multiple densification steps where Gaussians are repeatedly split and adjusted, following a lengthy optimization path. Moreover, this incremental approach often yields suboptimal renderings in high-frequency regions. We propose a fundamentally different approach: eliminate densification with a one-step approximation of scene geometry using triangulated pixels from dense image correspondences. This dense initialization allows us to estimate the rough geometry of the scene while preserving rich details from input RGB images, providing each Gaussian with well-informed color, scale, and position. As a result, we dramatically shorten the optimization path and remove the need for densification. Unlike methods that rely on sparse keypoints, our dense initialization ensures uniform detail across the scene, even in high-frequency regions where other methods struggle. Moreover, since all splats are initialized in parallel at the start of optimization, we remove the need to wait for densification to adjust new Gaussians. EDGS reaches LPIPS and SSIM performance of standard 3DGS significantly faster than existing efficiency-focused approaches. When trained further, it exceeds the reconstruction quality of state-of-the-art models aimed at maximizing fidelity. Our method is fully compatible with other acceleration techniques, making it a versatile and efficient solution that can be integrated with existing approaches.
PaperID: 1406,   Poster  https://arxiv.org/pdf/2511.23055    
Authors: Ruoxuan Zhang, Qiyun Zheng, Zhiyu Zhou, Ziqi Liao, Siyu Wu, Jian-Yu Jiang-Lin, Bin Wen, Hongxia Xie, Jianlong Fu, Wen-Huang Cheng
Title: MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
Abstract: Theory of Mind (ToM) refers to the ability to infer others’ mental states, such as beliefs, desires, and intentions. Current vision–language embodied agents lack ToM-based decision-making, and existing benchmarks focus solely on human mental states while ignoring the agent’s own perspective, hindering coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM Reasoning to model both self and others, and finally generates decisions and actions guided by inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages VLMs to produce consistent ToM Reasoning and behavior. Our model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.
PaperID: 1407,   Poster  https://arxiv.org/pdf/2603.01026    
Authors: Shengpeng Wang, Kuangyu Wang, Wei Wang
Title: RaUF: Learning the Spatial Uncertainty Field of Radar
Abstract: Millimeter-wave radar offers unique advantages in adverse weather but suffers from low spatial fidelity, severe azimuth ambiguity, and clutter-induced spurious returns. Existing methods mainly focus on improving spatial perception effectiveness via coarse-to-fine cross-modal supervision, yet often overlook the ambiguous feature-to-label mapping, which may lead to ill-posed geometric inference and pose fundamental challenges to downstream perception tasks. In this work, we propose RaUF, a spatial uncertainty field learning framework that models radar measurements through their physically grounded anisotropic properties. To resolve conflicting feature-to-label mapping, we design an anisotropic probabilistic model that learns fine-grained uncertainty. To further enhance reliability, we propose a Bidirectional Domain Attention mechanism that exploits the mutual complementarity between spatial structure and Doppler consistency, effectively suppressing spurious or multipath-induced reflections. Extensive experiments on public benchmarks and real-world datasets demonstrate that RaUF delivers highly reliable spatial detections with well-calibrated uncertainty. Moreover, downstream case studies further validate the enhanced reliability and scalability of RaUF under challenging real-world driving scenarios. Our dataset will be released to the community.
PaperID: 1408,   Poster  https://arxiv.org/pdf/2510.19808    
Authors: Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, Zhe Gan
Title: Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
Abstract: Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single-turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.
PaperID: 1409,   Poster  https://arxiv.org/pdf/2604.01570    
Authors: Haochen Niu, Kanyu Zhang, Shuyu Yin, Qinghai Guo, Peilin Liu, Fei Wen
Title: Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior
Abstract: In real-world robotic manipulation, states typically admit a neighborhood of near-equivalent actions. That is, for each state, there exists a feasible action neighborhood (FAN) rather than a single correct action, within which motions yield indistinguishable progress. However, prevalent VLA training methodologies are directly inherited from linguistic settings and do not exploit the FAN property, thus leading to poor generalization and low sample efficiency. To address this limitation, we introduce a FAN-guided regularizer that shapes the model's output distribution to align with the geometry of FAN. Concretely, we introduce a Gaussian prior that promotes locally smooth and unimodal predictions around the preferred direction and magnitude. In extensive experiments across both reinforced finetuning (RFT) and supervised finetuning (SFT), our method achieves significant improvement in sample efficiency and success rate in both in-distribution and out-of-distribution (OOD) scenarios. By aligning with the intrinsic action tolerance of physical manipulation, FAN-guided regularization provides a principled and practical method for sample-efficient and generalizable VLA adaptation. Code is provided in supplemental material.
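One way to picture a Gaussian prior over discretized actions is a soft target centered on the preferred action bin, with a divergence term pulling the model's distribution toward it. The sketch below is only an assumption about how such a regularizer could look; the abstract does not give the exact loss, and `gaussian_target`, `fan_regularizer`, the bin count, and `sigma` are hypothetical names and values.

```python
import math

def gaussian_target(num_bins, center, sigma):
    """Discretized Gaussian over action bins, normalized to sum to 1."""
    weights = [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def fan_regularizer(logits, center, sigma):
    """Assumed regularizer: KL from the Gaussian target to the model output."""
    return kl_divergence(gaussian_target(len(logits), center, sigma), softmax(logits))

# A smooth, unimodal prediction around bin 3 versus a spike on a distant bin.
smooth_logits = [-0.5 * (i - 3) ** 2 for i in range(7)]
spiky_logits = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

penalty_smooth = fan_regularizer(smooth_logits, center=3, sigma=1.0)
penalty_spiky = fan_regularizer(spiky_logits, center=3, sigma=1.0)
```

Under this assumed form, predictions that spread mass over neighboring bins around the preferred action are penalized far less than a confident spike elsewhere, which matches the "locally smooth and unimodal" behavior the abstract describes.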
PaperID: 1410,   Poster  https://arxiv.org/pdf/2510.23043    
Authors: Joungbin An, Kristen Grauman
Title: HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
Abstract: Video temporal grounding, the task of localizing the start and end times of a natural language query in untrimmed video, requires capturing both global context and fine-grained temporal detail. This challenge is particularly pronounced in long videos, where existing methods often compromise temporal fidelity by over-downsampling or using fixed windows. We present HieraMamba, a hierarchical architecture that preserves temporal structure and semantic richness across scales. At its core are Anchor-MambaPooling (AMP) blocks, which utilize Mamba’s selective scanning to produce compact anchor tokens summarizing video content across scales. We further introduce anchor-conditioned and segment-pooled contrastive losses, two complementary objectives that encourage anchors to retain local detail while remaining globally discriminative. HieraMamba sets a new state-of-the-art on Ego4D-NLQ, MAD, and TACoS, demonstrating precise, temporally faithful localization in long, untrimmed videos.
PaperID: 1411,   Poster  https://arxiv.org/pdf/2510.08316    
Authors: Yu Huang, Zelin Peng, Changsong Wen, Xiaokang Yang, Wei Shen
Title: Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge
Abstract: Affordance segmentation aims to decompose 3D objects into parts that serve distinct functional roles, enabling models to reason about object interactions rather than mere recognition. Existing methods, mostly following the paradigm of 3D semantic segmentation or prompt-based frameworks, struggle when geometric cues are weak or ambiguous, as sparse point clouds provide limited functional information. To overcome this limitation, we leverage the rich semantic knowledge embedded in large-scale 2D Vision Foundation Models (VFMs) to guide 3D representation learning through a cross-modal alignment mechanism. Specifically, we propose Cross-Modal Affinity Transfer (CMAT), a pretraining strategy that compels the 3D encoder to align with the semantic structures induced by lifted 2D features. CMAT is driven by a core affinity alignment objective, supported by two auxiliary losses, geometric reconstruction and feature diversity, which together encourage structured and discriminative feature learning. Built upon the CMAT-pretrained backbone, we employ a lightweight affordance segmentor that injects text or visual prompts into the learned 3D space through an efficient cross-attention interface, enabling dense and prompt-aware affordance prediction while preserving the semantic organization established during pretraining. Extensive experiments demonstrate consistent improvements over previous state-of-the-art methods in both accuracy and efficiency.
PaperID: 1412,   Poster  https://arxiv.org/pdf/2602.20363    
Authors: Sheyang Tang, Armin Shafiee Sarvestani, Jialu Xu, Xiaoyu Xu, Zhou Wang
Title: Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field
Abstract: The aesthetic quality of a scene depends strongly on camera viewpoint. Existing approaches for aesthetic viewpoint suggestion are either single-view adjustments, predicting limited camera adjustments from a single image without understanding scene geometry, or 3D exploration approaches, which rely on dense captures or prebuilt 3D environments coupled with costly reinforcement learning (RL) searches. In this work, we introduce the notion of a 3D aesthetic field that enables geometry-grounded aesthetic reasoning in 3D with sparse captures, allowing efficient viewpoint suggestions in contrast to costly RL searches. We opt to learn this 3D aesthetic field using a feedforward 3D Gaussian Splatting network that distills high-level aesthetic knowledge from a pretrained 2D aesthetic model into 3D space, enabling aesthetic prediction for novel viewpoints from only sparse input views. Building on this field, we propose a two-stage search pipeline that combines coarse viewpoint sampling with gradient-based refinement, efficiently identifying aesthetically appealing viewpoints without dense captures or RL exploration. Extensive experiments show that our method consistently suggests viewpoints with superior framing and composition compared to existing approaches, establishing a new direction toward 3D-aware aesthetic modeling.
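The two-stage idea of coarse sampling followed by gradient-based refinement can be sketched on a one-dimensional toy problem. Everything here is a stand-in: `aesthetic_score` is a hypothetical smooth function of a single camera angle rather than a learned 3D aesthetic field, and the gradient is estimated by finite differences instead of backpropagation.

```python
import math

def aesthetic_score(theta):
    """Hypothetical score over a camera angle: a main peak at 1.3 rad and a
    weaker secondary peak at -2.0 rad."""
    return math.exp(-(theta - 1.3) ** 2) + 0.5 * math.exp(-((theta + 2.0) ** 2) / 0.5)

def coarse_search(score, num_samples=12):
    """Stage 1: evaluate evenly spaced angles in [-pi, pi) and keep the best."""
    candidates = [-math.pi + i * 2.0 * math.pi / num_samples for i in range(num_samples)]
    return max(candidates, key=score)

def refine(score, theta, lr=0.1, steps=200, h=1e-4):
    """Stage 2: gradient ascent using a central finite-difference gradient."""
    for _ in range(steps):
        grad = (score(theta + h) - score(theta - h)) / (2.0 * h)
        theta += lr * grad
    return theta

coarse_theta = coarse_search(aesthetic_score)
best_theta = refine(aesthetic_score, coarse_theta)
```

The coarse stage keeps the search from settling on the weaker secondary peak, and the refinement stage then moves the selected sample to the nearby maximum at a cost of a few hundred score evaluations.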
PaperID: 1413,   Poster  https://arxiv.org/pdf/2510.14981    
Authors: Hadi Alzayer, Yunzhi Zhang, Chen Geng, Jia-Bin Huang, Jiajun Wu
Title: Coupled Diffusion Sampling for Training-free Multi-view Image Editing
Abstract: Given a collection of multi-view images, we perform consistent multi-view editing with a training-free framework using pre-trained 2D editing models and a generative multi-view model. While 2D editing models can independently edit each image in a set of multi-view images of a 3D scene, they do not maintain consistency across views. Existing approaches typically rely on explicit 3D representations to average out the inconsistencies, but they suffer from a lengthy optimization, instability under sparse view settings, and can produce blurry results. We address the problem from a different lens, where we use the 2D editing model to steer a multi-view generative model in the diffusion sampling process. This is achieved through our novel coupled diffusion sampling process. We concurrently sample two trajectories from both a multi-view image distribution and a 2D edited image distribution, and connect the samples with a coupling term. Effectively, the two models guide each other during sampling, and the resulting sample from the multi-view model remains consistent while satisfying the desired edit. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, and demonstrate its applicability across various model architectures. We further illustrate the effects of coupling on SoTA image and video generation models, highlighting the potential of our method beyond multi-view editing.
PaperID: 1414,   Poster  https://arxiv.org/pdf/2511.22055    
Authors: JING HAO, Yuci Liang, Lin Lizhuo, Yuxuan Fan, Wenkai Zhou, Kaixin Guo, Zanting Ye, Yanpeng Sun, Xinyu Zhang, Yanqi Yang, Qiankun Li, Hao Tang, James Tsoi, Linlin Shen, Kuo Hung
Title: OralGPT-Omni: A Versatile Dental Multimodal Large Language Model
Abstract: Multimodal Large Language Models (MLLMs) have exhibited immense potential across numerous medical specialties, yet dentistry remains underexplored, in part due to limited domain-specific data, scarce dental expert annotations, insufficient modality-specific modeling, and challenges in reliability. In this paper, we present OralGPT-Omni, the first dental-specialized MLLM designed for comprehensive and trustworthy analysis across diverse dental imaging modalities and clinical tasks. To explicitly capture dentists’ diagnostic reasoning, we construct TRACE-CoT, a clinically grounded chain-of-thought dataset that mirrors dental radiologists’ decision-making processes. This reasoning supervision, combined with our proposed four-stage training paradigm, substantially strengthens the model’s capacity for dental image understanding and analysis. In parallel, we introduce MMOral-Uni, the first unified multimodal benchmark for dental image analysis. It comprises 2,809 open-ended question–answer pairs spanning five modalities and five tasks, offering a comprehensive evaluation suite to date for MLLMs in digital dentistry. OralGPT-Omni achieves an overall score of 51.84 on the MMOral-Uni benchmark and 45.31 on the MMOral-OPG benchmark, dramatically outperforming the scores of GPT-5. Our work promotes intelligent dentistry and paves the way for future advances in dental image analysis. All code, benchmark, and models will be made publicly available.
PaperID: 1415,   Poster  https://arxiv.org/pdf/2603.05629    
Authors: Merve Tapli, Quentin Bouniot, Wolfgang Stammer, Zeynep Akata, Emre Akbas
Title: Rethinking Concept Bottleneck Models: From Pitfalls to Solutions
Abstract: Concept Bottleneck Models (CBMs) ground predictions in human-understandable concepts but face fundamental limitations: the absence of a metric to pre-evaluate concept relevance, the "linearity problem" causing recent CBMs to bypass the concept bottleneck entirely, an accuracy gap compared to opaque models, and finally the lack of systematic study on the impact of different visual backbones and VLMs. We introduce CBM-Suite, a methodological framework that systematically addresses these challenges. First, we propose an entropy-based metric to quantify the intrinsic suitability of a concept set for a given dataset. Second, we resolve the linearity problem by inserting a non-linear layer between concept activations and the classifier, which ensures that model accuracy faithfully reflects concept relevance. Third, we narrow the accuracy gap by leveraging a distillation loss guided by a linear teacher probe. Finally, we provide comprehensive analyses on how different vision encoders, vision-language models, and concept sets interact to influence accuracy and interpretability in CBMs. Extensive evaluations show that CBM-Suite yields more accurate models and provides insights for improving concept-based interpretability.
PaperID: 1416,   Poster  https://arxiv.org/pdf/2604.17195    
Authors: Junjia Huang, Binbin Yang, Pengxiang Yan, JiyangLiu JiyangLiu, Bin Xia, Zhao Wang, Yitong Wang, Liang Lin, Guanbin Li
Title: DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior
Abstract: Storyboard synthesis plays a crucial role in visual storytelling, aiming to generate coherent shot sequences that visually narrate cinematic events with consistent characters, scenes, and transitions. However, existing approaches are mostly adapted from text-to-image diffusion models, which struggle to maintain long-range temporal coherence, consistent character identities, and narrative flow across multiple shots. In this paper, we introduce DreamShot, a storyboard framework based on a video generative model that fully exploits powerful video diffusion priors for controllable multi-shot synthesis. DreamShot supports both Text-to-Shot and Reference-to-Shot generation, as well as story continuation conditioned on previous frames, enabling flexible and context-aware storyboard generation. By leveraging the spatial-temporal consistency inherent in video generative models, DreamShot produces visually and semantically coherent sequences with improved narrative fidelity and character continuity. Furthermore, DreamShot incorporates a multi-reference role conditioning module that accepts multiple character reference images and enforces identity alignment via a Role-Attention Consistency Loss, explicitly constraining attention between reference and generated roles. Extensive experiments demonstrate that DreamShot achieves superior scene coherence, role consistency, and generation efficiency compared to state-of-the-art text-to-image storyboard models, establishing a new direction toward controllable video model-driven visual storytelling.
PaperID: 1417,   Poster  https://arxiv.org/pdf/2604.03716    
Authors: Haimin Luo, Srinjay Sarkar, Albert Mosella-Montoro, Francisco Vicente Carrasco, Fernando De la Torre
Title: CGHair: Compact Gaussian Hair Reconstruction with Card Clustering
Abstract: We present a compact pipeline for high-fidelity hair reconstruction from multi-view images. While recent 3D Gaussian Splatting (3DGS) methods achieve realistic results, they often require millions of primitives, leading to high storage and rendering costs. Observing that hair exhibits structural and visual similarities across a hairstyle, we cluster strands into representative hair cards and group these into shared texture codebooks. Our approach integrates this structure with 3DGS rendering, significantly reducing reconstruction time and storage while maintaining comparable visual quality. In addition, we propose a method, accelerated by a generative prior, to reconstruct the initial strand geometry from a set of images. Our experiments demonstrate a 4-fold reduction in strand reconstruction time and comparable rendering performance with an over 200× lower memory footprint.
PaperID: 1418,   Poster  https://arxiv.org/pdf/2603.07545    
Authors: Jinzhou Tang, Fan Feng, Minghao Fu, Wenjun Lin, Jing Yang, Biwei Huang, Keze Wang
Title: DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration
Abstract: Learned world models excel at interpolative generalization but fail at extrapolative generalization to novel physical properties. This limitation arises because they learn statistical correlations rather than the environment's underlying generative rules, such as physical invariances and conservation laws. We argue that learning these invariances is key to robust extrapolation. To achieve this, we first introduce Symmetry Exploration, an unsupervised exploration strategy where an agent is intrinsically motivated by a Hamiltonian-based curiosity bonus to actively probe and challenge its understanding of conservation laws, thereby collecting physically informative data. Second, we design a Hamiltonian-based world model that learns from the collected data, using a novel self-supervised contrastive objective to identify the invariant physical state from raw, view-dependent pixel observations. Our framework, DreamSAC, trained on this actively curated data, significantly outperforms state-of-the-art baselines in 3D physics simulations on tasks requiring extrapolation.
PaperID: 1419,   Poster  https://arxiv.org/pdf/2603.05898    
Authors: Yuxin Qin, Ke Cao, Haowei Liu, Ao Ma, Fengheng Li, Honghe Zhu, Zheng Zhang, Run Ling, Wei Feng, Xuanhua He, Zhanjie Zhang, Zhen Guo, Haoyi Bian, Jingjing Lv, Junjie Shen, Ching Law
Title: InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation
Abstract: E-commerce product poster generation aims to automatically synthesize a single image that effectively conveys product information by presenting a subject, text, and a designed style. Recent diffusion models with fine-grained and efficient controllability have advanced product poster synthesis, yet they typically rely on multi-stage pipelines, and simultaneous control over subject, text, and style remains underexplored. Such naive multi-stage pipelines also show three issues: poor subject fidelity, inaccurate text, and inconsistent style. To address these issues, we propose InnoAds-Composer, a single-stage framework that enables efficient tri-conditional control tokens over subject, text, and style. To alleviate the quadratic overhead introduced by naive tri-conditional token concatenation, we perform importance analysis over layers and timesteps and route each condition only to the most responsive positions, thereby shortening the active token sequence. In addition, to improve the accuracy of Chinese text rendering, we design a Text Feature Enhancement Module (TFEM) that integrates features from both glyph images and glyph crops. To support training and evaluation, we also construct a high-quality e-commerce product poster dataset and benchmark, which is the first dataset that jointly contains subject, text, and style conditions. Extensive experiments demonstrate that InnoAds-Composer significantly outperforms existing product poster methods without noticeably increasing inference latency.
PaperID: 1420,   Poster  https://arxiv.org/pdf/2407.01007    
Authors: Yihao Zhen, Mingyue Xu, Qiang Wang, Baojie Fan, Jiahua Dong, Tinghui Zhao, Huijie Fan
Title: GMT: Effective Global Framework for Multi-Target Multi-Camera Tracking
Abstract: A frequently cited advantage of Multi-Camera Multi-Target (MCMT) Tracking is that the introduction of multiple views provides rich discriminative visual representations for each target. Existing MCMT models typically adopt a two-stage framework, involving single-camera tracking followed by inter-camera tracking. However, in this paradigm, the use of multiple views is confined to recovering missed matches in the first stage, providing a limited contribution to overall tracking. To address this issue, we propose a novel global MCMT tracking framework termed GMT, which effectively leverages the advantage of multi-view by performing global-level trajectory-target matching. Specifically, instead of assigning trajectories independently for each view, we propose a Cross-View Feature Consistency Enhancement (CFCE) module to reduce the feature discrepancies across different views, and encode the same historical targets across different views as global trajectories. The Global Trajectory Associate (GTA) module is then introduced to associate new targets to global trajectories, allowing the model to jointly exploit both intra-view and inter-view cues during tracking. Compared with the two-stage framework, GMT achieves significant improvements on existing datasets, with gains of up to 13.1% in CVMA and 19.2% in CVIDF1. Moreover, we present VisionTrack, a high-quality, large-scale MCMT dataset encompassing diverse scenes with varying illumination and target distributions, providing significantly greater diversity than existing datasets. Our code and dataset will be released.
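Global-level trajectory-target matching can be pictured as assigning detections from every camera to one shared pool of trajectories. The sketch below is a deliberately simple stand-in, not the GTA module: it assumes each trajectory and detection already has a feature vector, uses cosine similarity with greedy assignment, and allows each trajectory at most one detection per camera. All names, features, and the `min_sim` threshold are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def associate(trajectories, detections, min_sim=0.5):
    """Greedily match detections from all cameras to global trajectories.

    trajectories: dict of trajectory id -> feature vector
    detections:   list of (camera, index, feature) tuples
    Returns a dict mapping (camera, index) -> trajectory id.
    """
    scored = sorted(
        ((cosine(t_feat, d_feat), t_id, cam, idx)
         for t_id, t_feat in trajectories.items()
         for cam, idx, d_feat in detections),
        reverse=True,
    )
    matches, taken_slots = {}, set()
    for sim, t_id, cam, idx in scored:
        if sim < min_sim:
            break  # remaining pairs are even less similar
        if (cam, idx) in matches or (t_id, cam) in taken_slots:
            continue  # detection already matched, or trajectory used in this view
        matches[(cam, idx)] = t_id
        taken_slots.add((t_id, cam))
    return matches

trajectories = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
detections = [
    ("cam1", 0, [0.9, 0.1]),
    ("cam1", 1, [0.2, 0.8]),
    ("cam2", 0, [0.8, 0.3]),
    ("cam2", 1, [-1.0, -0.2]),  # resembles neither trajectory
]
matches = associate(trajectories, detections)
```

In this toy setup trajectory "A" collects detections from both cameras in a single pass, while the detection that resembles no trajectory stays unmatched and could seed a new one.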
PaperID: 1421,   Poster  https://arxiv.org/pdf/2512.14236    
Authors: Nando Metzger, Prune Truong, Goutam Bhat, Konrad Schindler, Federico Tombari
Title: Controllable Stereo Video Conversion with Guided Latent Decoding
Abstract: The growing demand for immersive 3D content calls for automated monocular-to-stereo video conversion. We present a controllable, direct end-to-end method for upgrading a conventional video to a binocular one. Our approach, based on (conditional) latent diffusion, avoids artifacts due to explicit depth estimation and warping. The key to its high-quality stereo video output is a novel, guided VAE decoder that ensures sharp and epipolar-consistent stereo video output. Moreover, our method gives the user control over the strength of the stereo effect (respectively, the disparity range) at inference time, via an intuitive, scalar tuning knob. Experiments on three different datasets of real-world stereo videos show that our method outperforms both traditional warping-based and recent warping-free baselines and sets a new standard for reliable, controllable stereo video conversion.
PaperID: 1422,   Poster  https://arxiv.org/pdf/2511.15164    
Authors: Songze Li, Mingyu Gao, Tonghua Su, Xu-Yao Zhang, Zhongjie Wang
Title: Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
Abstract: Multimodal continual instruction tuning enables multimodal large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual learning paradigm faces the significant challenge of catastrophic forgetting, where learning new tasks leads to performance degradation on previous ones. In this paper, we introduce a novel insight into catastrophic forgetting by conceptualizing it as a problem of missing gradients from old tasks during new task learning. Our approach approximates these missing gradients by leveraging the geometric properties of the parameter space, specifically using the directional vector between current parameters and previously optimal parameters as gradient guidance. This approximated gradient can be further integrated with real gradients from a limited replay buffer and regulated by a Bernoulli sampling strategy that dynamically balances model stability and plasticity. Extensive experiments on multimodal continual instruction tuning datasets demonstrate that our method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while maintaining a compact architecture.
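The idea of using the vector between current and previously optimal parameters as a surrogate for the missing old-task gradient can be sketched in a few lines. This is an assumed reading of the abstract rather than the authors' algorithm: the surrogate is taken as the gradient of a quadratic pull toward the old optimum, the replay-buffer gradients are omitted, and `guided_step`, `guidance_prob`, and the toy values are hypothetical.

```python
import random

def guided_step(theta, theta_old, new_task_grad, lr, guidance_prob, rng):
    """One SGD step on the new task, optionally adding a pseudo-gradient.

    The pseudo-gradient (theta - theta_old) is the gradient of
    0.5 * ||theta - theta_old||^2, so descending it pulls the parameters
    back toward the previous optimum. A Bernoulli draw with probability
    guidance_prob decides whether it is applied on this step.
    """
    use_guidance = rng.random() < guidance_prob
    updated = []
    for t, t_old, g in zip(theta, theta_old, new_task_grad):
        pseudo_grad = (t - t_old) if use_guidance else 0.0
        updated.append(t - lr * (g + pseudo_grad))
    return updated, use_guidance

rng = random.Random(0)
theta = [1.0, -2.0]       # current parameters (toy values)
theta_old = [0.0, 0.0]    # parameters that were optimal for the old task
zero_grad = [0.0, 0.0]    # pretend the new task exerts no pull here

pulled, used = guided_step(theta, theta_old, zero_grad, 0.5, 1.0, rng)
frozen, skipped = guided_step(theta, theta_old, zero_grad, 0.5, 0.0, rng)
```

Raising `guidance_prob` favors stability (staying near the old solution), while lowering it favors plasticity (following only the new-task gradient), which is the trade-off the Bernoulli sampling is said to balance.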
PaperID: 1423,   Poster  https://arxiv.org/pdf/2603.25994    
Authors: Zhuan Shi, Alireza Dehghanpour Farashah, Rik de Vries, Golnoosh Farnadi
Title: Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models
Abstract: Concept erasure in text-to-image diffusion models seeks to remove undesired concepts while preserving overall generative capability. Localized erasure methods aim to restrict edits to the spatial region occupied by the target concept. However, we observe that suppressing a concept can unintentionally weaken semantically related neighbor concepts, reducing fidelity in fine-grained domains. We propose Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework designed to better preserve neighboring concepts while removing target concepts. It operates in three stages: (1) a spectrally-weighted embedding modulation that attenuates target concept directions while stabilizing neighbor concept representations, (2) an attention-guided spatial gate that identifies regions exhibiting residual concept activation, and (3) a spatially-gated hard erasure that eliminates remaining traces only where necessary. This neighbor-aware pipeline enables localized concept removal while maintaining the surrounding concept neighborhood structure. Experiments on fine-grained datasets (Oxford Flowers, Stanford Dogs) show that our method effectively removes target concepts while better preserving closely related categories. Additional results on celebrity identity, explicit content and artistic style demonstrate robustness and generalization to broader erasure scenarios.
PaperID: 1424,   Poster  https://arxiv.org/pdf/2601.15221    
Authors: Hanlei Guo, Jiahao Shao, Xinya Chen, Xiyang Tan, Sheng Miao, Yujun Shen, Yiyi Liao
Title: ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation
Abstract: Recent advancements in 3D object generation using diffusion models have achieved remarkable success, but generating realistic 3D urban scenes remains challenging. Existing methods relying solely on 3D diffusion models tend to suffer a degradation in appearance details, while those utilizing only 2D diffusion models typically compromise camera controllability. To overcome this limitation, we propose ScenDi, a method for urban scene generation that integrates both 3D and 2D diffusion models. We first train a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution. To enable controllable synthesis, this 3DGS generation process can be optionally conditioned by specifying inputs such as 3D bounding boxes, road maps, or text prompts. Then, we train a 2D video diffusion model to enhance appearance details conditioned on rendered images from the 3D Gaussians. By leveraging the coarse 3D scene as guidance for 2D video diffusion, ScenDi generates desired scenes based on input conditions and successfully adheres to accurate camera trajectories. Experiments on two challenging real-world datasets, Waymo and KITTI-360, demonstrate the effectiveness of our approach.
PaperID: 1425,   Poster  https://arxiv.org/pdf/2602.20871    
Authors: Wenbo Yu, Wenke Xia, Weitao Zhang, Di Hu
Title: GeCo-SRT: Geometry-aware Continual Adaptation for Cross-Task Sim-to-Real Transfer
Abstract: Bridging the sim-to-real gap is important for applying low-cost simulation data to real-world robotic systems. However, previous methods are severely limited by treating each transfer as an isolated endeavor, demanding repeated, costly tuning and wasting prior transfer experience. To move beyond isolated sim-to-real, we build a continual cross-task sim-to-real transfer paradigm centered on knowledge accumulation across iterative transfers, thereby enabling effective and efficient adaptation to novel tasks. Thus, we propose GeCo-SRT, a geometry-aware continual adaptation method. It utilizes domain-invariant and task-invariant knowledge from local geometric features as a transferable foundation to accelerate adaptation during subsequent sim-to-real transfers. This method starts with a geometry-aware mixture-of-experts module, which dynamically activates experts to specialize in distinct geometric knowledge to bridge the observation sim-to-real gap. Further, the geometry-expert-guided prioritized experience replay module preferentially samples from underutilized experts, refreshing specialized knowledge to combat forgetting and maintain robust cross-task performance. Leveraging knowledge accumulated during iterative transfer, GeCo-SRT not only achieves a 52% average performance improvement over the baseline, but also demonstrates significant data efficiency for new task adaptation with only 1/6 of the data. We hope this work inspires approaches for efficient, low-cost cross-task sim-to-real transfer.
PaperID: 1426,   Poster  https://arxiv.org/pdf/2603.25203    
Authors: Ruichao Yang, Wei Gao, Xiaobin Zhu, Jing Ma, Hongzhan Lin, Ziyang Luo, Bo-Wen Zhang, Xu-Cheng Yin
Title: Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection
Abstract: Multimodal misinformation poses an escalating challenge that often evades traditional detectors, which are opaque black boxes and fragile against new manipulation tactics. We present Probabilistic Concept Graph Reasoning (PCGR), an interpretable, modular, and evolvable framework that reframes multimodal misinformation detection (MMD) as structured, concept-based reasoning. PCGR follows a build-then-infer paradigm, which first constructs a graph of human-understandable concept nodes, including novel high-level concepts automatically discovered and validated by multimodal large language models (MLLMs), and then applies hierarchical attention over this concept graph to infer claim veracity. This design produces interpretable reasoning chains linking evidence to conclusions. Experiments demonstrate that PCGR achieves state-of-the-art MMD accuracy and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition.
PaperID: 1427,   Poster  https://arxiv.org/pdf/2512.06358    
Authors: Mingjia Li, Jin Hu, Hainuo Wang, Qiming Hu, Jiarui Wang, Xiaojie Guo
Title: Rectifying Latent Space for Generative Single-Image Reflection Removal
Abstract: Single-image reflection removal is a highly ill-posed problem, where existing methods struggle to reason about the composition of corrupted regions, causing them to fail at recovery and generalization in the wild. This work reframes an editing-purpose latent diffusion model to effectively perceive and process highly ambiguous, layered image inputs, yielding high-quality outputs. We argue that the challenge of this conversion stems from a critical yet overlooked issue, i.e., the latent space of semantic encoders lacks the inherent structure to interpret a composite image as a linear superposition of its constituent layers. Our approach is built on three synergistic components, including a reflection-equivariant VAE that aligns the latent space with the linear physics of reflection formation, a learnable task-specific text embedding for precise guidance that bypasses ambiguous language, and a depth-guided early-branching sampling strategy to harness generative stochasticity for an optimal result. Extensive experiments reveal that our model achieves new state-of-the-art performance on multiple benchmarks and generalizes well to challenging real-world images. Code will be made publicly available.
PaperID: 1428,   Poster  https://arxiv.org/pdf/2511.18900    
Authors: Xiuchao Wu, Pengfei Zhu, Jiangjing Lyu, Xinguo Liu, Jie Guo, Yanwen Guo, Weiwei Xu, Chengfei Lv
Title: MatMart: Material Reconstruction of 3D Objects via Diffusion
Abstract: Applying diffusion models to physically-based material estimation and generation has recently gained prominence. In this paper, we propose MatMart, a novel material reconstruction framework for 3D objects, offering the following advantages. First, MatMart adopts a two-stage reconstruction, starting with accurate material prediction from inputs and followed by prior-guided material generation for unobserved views, yielding high-fidelity results. Second, by utilizing progressive inference alongside the proposed view-material cross-attention (VMCA), MatMart enables reconstruction from an arbitrary number of input images, demonstrating strong scalability and flexibility. Finally, MatMart achieves both material prediction and generation capabilities through end-to-end optimization of a single diffusion model, without relying on additional pre-trained models, thereby exhibiting enhanced stability across various types of objects. Extensive experiments demonstrate that MatMart achieves superior performance in material reconstruction compared to existing methods.
PaperID: 1429,   Poster  https://arxiv.org/pdf/2512.16906    
Authors: Xiaoyan Cong, Haotian Yang, Angtian Wang, Yizhi Wang, Yiding Yang, Canyu Zhang, Chongyang Ma
Title: VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
Abstract: Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video–instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods.
PaperID: 1430,   Poster  https://arxiv.org/pdf/2601.15283    
Authors: Ruofan Liang, Norman Müller, Ethan Weber, Duncan Zauss, Nandita Vijaykumar, Peter Kontschieder, Christian Richardt
Title: LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes
Abstract: We present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. Our method leverages a generative image-based light decomposition model that factorizes complex indoor scene illumination into its constituent light sources. This factorization enables independent manipulation of individual light sources, specifically allowing control over their state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure consistent propagation of the lighting decomposition across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, providing real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting outcomes across diverse indoor scenes. We evaluate our method on both synthetic and real-world datasets and provide a quantitative and qualitative comparison to state-of-the-art techniques.
PaperID: 1431,   Poster  https://arxiv.org/pdf/2603.15656    
Authors: Peiyu Yang, Naveed Akhtar, Jiantong Jiang, Ajmal Mian
Title: Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors
Abstract: The performance of neural network models deteriorates due to their unreliable behavior on non-robust features of corrupted samples. Owing to their opaque nature, rectifying models to address this problem often necessitates arduous data cleaning and model retraining, resulting in huge computational and manual overhead. In this work, we leverage rank-one model editing to establish an attribution-guided model rectification framework that effectively locates and corrects unreliable model behaviors. We first distinguish our rectification setting from existing model editing, yielding a formulation that corrects unreliable behavior while preserving model performance and reducing reliance on large budgets of cleansed samples. We further reveal a bottleneck of model rectification arising from heterogeneous editability across layers. To target the primary source of misbehavior, we introduce an attribution-guided layer localization method that quantifies layer-wise editability and identifies the layer most responsible for unreliabilities. Extensive experiments demonstrate the effectiveness of our method in correcting unreliabilities observed for neural Trojans, spurious correlations and feature leakage. Our method shows remarkable performance by achieving its editing objective with as few as a single cleansed sample, which makes it appealing for practice.
PaperID: 1432,   Poster  https://arxiv.org/pdf/2511.16317    
Authors: Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Xin Yang, Xin Huang, Jingwei Huang, Xiangyu Yue, Chunchao Guo
Title: NaTex: Seamless Texture Generation as Latent Color Diffusion
Abstract: We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. In contrast to previous approaches that rely on baking 2D multi-view images synthesized by geometry-conditioned Multi-View Diffusion models (MVDs), NaTex avoids several inherent limitations of the MVD pipeline. These include difficulties in handling occluded regions that require inpainting, achieving precise mesh-texture alignment along boundaries, and maintaining cross-view consistency and coherence in both content and color intensity. NaTex features a novel paradigm that addresses the aforementioned issues by viewing texture as a dense color point cloud. Driven by this idea, we propose latent color diffusion, which comprises a geometry-aware color point cloud VAE and a multi-control diffusion transformer (DiT), entirely trained from scratch using 3D data, for texture reconstruction and generation. To enable precise alignment, we introduce native geometry control that conditions the DiT on direct 3D spatial information via positional embeddings and geometry latents. We co-design the VAE–DiT architecture, where the geometry latents are extracted via a dedicated geometry branch tightly coupled with the color VAE, providing fine-grained surface guidance that maintains strong correspondence with the texture. With these designs, NaTex demonstrates strong performance, significantly outperforming previous methods in texture coherence and alignment. Moreover, NaTex also exhibits strong generalization capabilities, either training-free or with simple tuning, for various downstream applications, e.g., material generation, texture refinement, and part segmentation and texturing.
PaperID: 1433,   Poster  https://arxiv.org/pdf/2603.18561    
Authors: Jiacheng Tang, Zhiyuan Zhou, Zhuolin He, Jia Zhang, Kai Zhang, Jian Pu
Title: CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
Abstract: Planning-oriented end-to-end driving models show great promise, yet they fundamentally learn statistical correlations instead of true causal relationships. This vulnerability leads to causal confusion, where models exploit dataset biases as shortcuts, critically harming their reliability and safety in complex scenarios. To address this, we introduce CausalVAD, a de-confounding training framework that leverages causal intervention. At its core, we design the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module to instantiate the backdoor adjustment theory in neural networks. SCIS first constructs a dictionary of prototypes representing latent driving contexts. It then uses this dictionary to intervene on the model's sparse vectorized queries. This step actively eliminates spurious associations induced by confounders, thereby purifying the representations for downstream tasks. Extensive experiments on benchmarks like nuScenes show CausalVAD achieves state-of-the-art planning accuracy and safety. Furthermore, our method also demonstrates superior robustness against both data bias and noisy scenarios specifically configured to induce causal confusion. We will release our code upon paper acceptance.
PaperID: 1434,   Poster  https://arxiv.org/pdf/2604.01749    
Authors: Jiayun Jin, Haolong Chai, Xueying Huang, Xiaoqing Guo, Zengwei Zheng, Zhan Zhou, Junmei Wang, Xinyu Wang, Jie Liu, Binbin Zhou
Title: Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding
Abstract: Ultrasound imaging is widely used in clinical diagnostics due to its real-time capability and radiation-free nature. However, existing vision-language pre-training models, such as CLIP, are primarily designed for modalities like CT and MRI, and are difficult to directly apply to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. To bridge this gap, we construct US-365K, a large-scale ultrasound image–text dataset containing 365k paired samples across 52 anatomical categories. We establish Ultrasonographic Diagnostic Taxonomy (UDT) containing two hierarchical knowledge frameworks, Ultrasonographic Hierarchical Anatomical Taxonomy (UHAT) and Ultrasonographic Diagnostic Attribute Framework (UDAF). UHAT standardizes anatomical organization, and UDAF formalizes nine diagnostic dimensions, including body system, organ, diagnosis, shape, margins, echogenicity, internal characteristics, posterior acoustic phenomena, and vascularity. Building upon these foundations, we propose Ultrasound-CLIP, a semantic-aware contrastive learning framework that introduces semantic soft labels and semantic loss to refine sample discrimination. Moreover, we construct a heterogeneous graph modality derived from UDAF's textual representations, enabling structured reasoning over lesion–attribute relations. Extensive experiments with patient-level data splitting demonstrate that our approach achieves state-of-the-art performance on classification and retrieval benchmarks built upon US-365K, while also delivering strong generalization to zero-shot, linear probing, and fine-tuning tasks.
PaperID: 1435,   Poster  https://arxiv.org/pdf/2506.02015    
Authors: Yoonjin Oh, Yongjin Kim, Hyomin Kim, Donghwan Chi, Sungwoong Kim
Title: OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled unified multimodal understanding and generation. However, they still struggle with fine-grained text–image alignment, often failing to faithfully depict objects with correct attributes such as color, shape, and spatial relations. To mitigate this issue, previous studies have explored preference optimization methods such as DPO and GRPO, but these approaches incur substantial computational cost, both in constructing preference data and in performing optimization. This has motivated self-improving preference optimization approaches, in which the MLLM autonomously generates its own training data, self-estimates preference feedback, and self-optimizes using the resulting self-constructed preference pairs. However, existing self-improving methods still overlook fine-grained, object-level semantics, allowing object hallucination to persist. To tackle this problem, we propose Object-centric Self-improving Preference Optimization (OSPO), a self-improving framework designed to enhance object-level text–image alignment. OSPO explicitly constructs object-centric preference data without relying on any external data or external models. We also introduce a new approach that leverages attention-based object masks together with an object-weighted SimPO loss to enhance object-specific fidelity. Extensive experiments on three compositional image generation benchmarks demonstrate that OSPO significantly improves fine-grained alignment and reduces object hallucination, outperforming prior self-improving methods and even specialized diffusion-based text-to-image models.
PaperID: 1436,   Poster  https://arxiv.org/pdf/2601.20894    
Authors: Jiangyang Li, Chenhao Ding, SongLin Dong, Qiang Wang, Jianchao Zhao, Yuhang He, Yihong Gong
Title: Is Parameter Isolation Better for Prompt-Based Continual Learning?
Abstract: Prompt-based continual learning methods effectively mitigate catastrophic forgetting. However, most existing methods assign a fixed set of prompts to each task, completely isolating knowledge across tasks and resulting in suboptimal parameter utilization. To address this, we consider the practical needs of continual learning and propose a prompt-sharing framework. This framework constructs a global prompt pool and introduces a task-aware gated routing mechanism that sparsely activates a subset of prompts to achieve dynamic decoupling and collaborative optimization of task-specific feature representations. Furthermore, we introduce a history-aware modulator that leverages cumulative prompt activation statistics to protect frequently used prompts from excessive updates, thereby mitigating inefficient parameter usage and knowledge forgetting. Extensive analysis and empirical results demonstrate that our approach consistently outperforms existing static allocation strategies in effectiveness and efficiency. Code and models will be released.
PaperID: 1437,   Poster  https://arxiv.org/pdf/2603.04733    
Authors: Xingyu Wang, Tao Wang
Title: FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation
Abstract: Test-Time Adaptation (TTA) is essential for enabling deep learning models to handle real-world data distribution shifts. However, current approaches face significant limitations: backpropagation-based methods are not suitable for low-end deployment devices, due to their high computation and memory requirements, as well as their tendency to modify model weights during adaptation; while traditional backpropagation-free techniques exhibit constrained adaptation capabilities. In this work, we propose Forward-Only Zeroth-Order Optimization (FOZO), a novel and practical backpropagation-free paradigm for TTA. FOZO leverages a memory-efficient zeroth-order prompt optimization, which is led by objectives optimizing both intermediate feature statistics and prediction entropy. To ensure efficient and stable adaptation over the out-of-distribution data stream, we introduce a dynamically decaying perturbation scale during zeroth-order gradient estimation and theoretically prove its convergence under the TTA data stream assumption. Extensive continual adaptation experiments on ImageNet-C, ImageNet-R, and ImageNet-Sketch demonstrate FOZO's superior performance, achieving 59.52% Top-1 accuracy on ImageNet-C (5K, level 5) and outperforming main gradient-based methods and SOTA forward-only FOA (58.13%). Furthermore, FOZO exhibits strong generalization on quantized (INT8) models. These findings demonstrate that FOZO is a highly competitive solution for TTA deployment in resource-limited scenarios.
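The zeroth-order update with a decaying perturbation scale described for FOZO can be illustrated with a minimal sketch. It uses a generic two-point (SPSA-style) gradient estimate on a toy quadratic objective, not the feature-statistics and entropy objectives of the paper. The names `toy_loss`, `zo_step`, and `adapt`, the decay schedule `eps0 / sqrt(1 + step)`, and the learning rate are all assumptions made for illustration.

```python
import random


def toy_loss(prompt):
    """Stand-in objective (assumption): squared distance to a fixed target."""
    target = [1.0, -2.0, 0.5]
    return sum((p - t) ** 2 for p, t in zip(prompt, target))


def zo_step(prompt, loss_fn, eps, lr, rng):
    """One forward-only update using a two-point zeroth-order estimate."""
    # Random +/-1 direction; only two forward evaluations are needed.
    u = [rng.choice((-1.0, 1.0)) for _ in prompt]
    plus = loss_fn([p + eps * d for p, d in zip(prompt, u)])
    minus = loss_fn([p - eps * d for p, d in zip(prompt, u)])
    scale = (plus - minus) / (2.0 * eps)  # directional derivative estimate
    return [p - lr * scale * d for p, d in zip(prompt, u)]


def adapt(steps=200, eps0=0.1, lr=0.05, seed=0):
    """Run the toy adaptation loop and return the final prompt vector."""
    rng = random.Random(seed)
    prompt = [0.0, 0.0, 0.0]
    for step in range(steps):
        # Hypothetical decaying perturbation scale (schedule is an assumption).
        eps = eps0 / (1.0 + step) ** 0.5
        prompt = zo_step(prompt, toy_loss, eps, lr, rng)
    return prompt
```

No gradient is ever backpropagated here: each step costs two forward evaluations of the loss, which is the property that makes this family of methods attractive on memory-limited devices.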
PaperID: 1438,   Poster  https://arxiv.org/pdf/2604.01310    
Authors: Omid Nejatimanzari, Hojat Asgariandehkordi, Taha Koleilat, Yiming Xiao, Hassan Rivaz
Title: Sparse Spectral LoRA: Routed Experts for Medical VLMs
Abstract: Large vision–language models (VLMs) excel on general benchmarks but often lack robustness in medical imaging, where heterogeneous supervision induces cross-dataset interference and sensitivity to data regime (i.e., how the supervisory signals are mixed). In realistic clinical workflows, data and tasks arrive sequentially, so naive continual training further leads to catastrophic forgetting. To address these challenges, we propose MedQwen, a parameter-efficient medical VLM that couples a spectrally routed Mixture-of-Experts (MoE) with a theoretically grounded scaling rule that aligns low-rank updates with a full-rank, fully fine-tuned MoE, without changing the base architecture. Concretely, we initialize each expert from non-overlapping singular value decomposition (SVD) segments of the pretrained weight and introduce a residual compensation and scaling scheme to enable stable expert specialization and consistent routing under distribution shift. Across 23 medical datasets covering visual question answering, report generation, radiology classification, and hallucination mitigation, MedQwen achieves strong, reliable performance: it approaches full fine-tuning on zero-shot classification with 339× fewer trainable parameters, and reduces sequential forgetting to ~5% where strong baselines degrade by >20–50%.
PaperID: 1439,   Poster  https://arxiv.org/pdf/2604.12518    
Authors: Kang He, Yuzhe Ding, Xinrong Wang, Fei Li, Chong Teng, Donghong Ji
Title: Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis
Abstract: Multimodal sentiment analysis (MSA) seeks to infer human emotions by integrating heterogeneous signals from text, audio, and visual modalities. Although recent approaches attempt to leverage cross-modal complementarity, they often struggle to fully utilize weaker modalities. In practice, the expressive power across modalities is inherently imbalanced: dominant modalities tend to overshadow non-verbal ones, which not only limits their contribution but also induces modality competition during training. This imbalance leads to degraded fusion performance and poor robustness under noisy or missing modalities. To address these challenges, we propose a novel model, Enhance-then-Balance Modality Collaboration framework (EBMC). EBMC first improves representational quality via modality semantic disentanglement (MSD) and cross-modal complementary enhancement (CCE), which strengthens weaker modalities using information from other modalities. To prevent dominant modalities from overwhelming others during joint optimization, EBMC introduces an Energy-guided Modality Coordination (EMC) mechanism that models modality contributions via energy potentials and achieves implicit gradient rebalancing through a differentiable equilibrium objective. Further, an Instance-aware Modality Trust Distillation (IMTD) module estimates sample-level modality reliability and adaptively modulates fusion weights, ensuring robustness against noise and modality incompleteness. Extensive experiments on multiple MSA benchmarks demonstrate that EBMC achieves state-of-the-art or competitive results. Moreover, EBMC maintains strong performance under missing-modality settings, highlighting its effectiveness and robustness.
PaperID: 1440,   Poster  https://arxiv.org/pdf/2603.08155    
Authors: Jiayang Gao, Tianyi Zheng, Jiayang Zou, Fengxiang Yang, Shice Liu, Luyao Fan, Zheyu Zhang, Hao Zhang, Jinwei Chen, Peng-Tao Jiang, Bo Li, Jia Wang
Title: C$^2$FG: Control Classifier-Free Guidance via Score Discrepancy Analysis
Abstract: Classifier-Free Guidance (CFG) is a cornerstone of modern conditional diffusion models, yet its reliance on a fixed or heuristic dynamic guidance weight is predominantly empirical and overlooks the inherent dynamics of the diffusion process. In this paper, we provide a rigorous theoretical analysis of Classifier-Free Guidance. Specifically, we establish strict upper bounds on the score discrepancy between conditional and unconditional distributions at different timesteps based on the diffusion process. This finding explains the limitations of fixed-weight strategies and establishes a principled foundation for time-dependent guidance. Motivated by this insight, we introduce Control Classifier-Free Guidance (C^2FG), a novel, training-free, and plug-in method that aligns the guidance strength with the diffusion dynamics via an exponential decay control function. Extensive experiments demonstrate that C^2FG is effective and broadly applicable across diverse generative tasks, while also exhibiting orthogonality to existing strategies.
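The idea of a time-dependent guidance weight can be sketched as follows. The `cfg_combine` function is the standard classifier-free guidance mix of an unconditional and a conditional noise prediction. The `guidance_weight` schedule is only a hypothetical exponential decay: the abstract does not give the functional form, the direction of decay, or any constants, so `w_start`, `w_end`, and `rate` are assumptions for illustration.

```python
import math


def guidance_weight(progress, w_start=7.5, w_end=1.0, rate=3.0):
    """Hypothetical exponentially decaying guidance weight.

    progress runs from 0.0 (first sampling step) to 1.0 (last step).
    The functional form and all constants are assumptions.
    """
    return w_end + (w_start - w_end) * math.exp(-rate * progress)


def cfg_combine(eps_uncond, eps_cond, weight):
    """Standard classifier-free guidance mix of two noise predictions."""
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def guided_predictions(eps_uncond, eps_cond, num_steps=5):
    """Apply the time-dependent weight at each of num_steps sampling steps."""
    outputs = []
    for i in range(num_steps):
        progress = i / (num_steps - 1)
        w = guidance_weight(progress)
        outputs.append(cfg_combine(eps_uncond, eps_cond, w))
    return outputs
```

Because the schedule only changes the scalar passed to `cfg_combine`, a scheme of this kind needs no retraining and can be dropped into an existing sampler loop.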
PaperID: 1441,   Poster  https://arxiv.org/pdf/2603.11617    
Authors: Lu Niu, Cheng Xue
Title: Noise-aware few-shot learning through bi-directional multi-view prompt alignment
Abstract: Vision-language models offer strong few-shot capability through prompt tuning but remain vulnerable to noisy labels, which can corrupt prompts and degrade cross-modal alignment. Existing approaches struggle because they often lack the ability to model fine-grained semantic cues and to adaptively separate clean from noisy signals. To address these challenges, we propose NA-MVP, a framework for Noise-Aware few-shot learning through bi-directional Multi-View Prompt alignment. NA-MVP is built upon a key conceptual shift: robust prompt learning requires moving from global matching to region-aware alignment that explicitly distinguishes clean cues from noisy ones. To realize this, NA-MVP employs (1) multi-view prompts combined with unbalanced optimal transport to achieve fine-grained patch-to-prompt correspondence while suppressing unreliable regions; (2) a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues, enabling the model to focus on stable semantics; and (3) an alignment-guided selective refinement strategy that uses optimal transport to correct only mislabeled samples while retaining reliable data. Experiments on synthetic and real-world noisy benchmarks demonstrate that NA-MVP consistently outperforms state-of-the-art baselines, confirming its effectiveness in enabling robust few-shot learning under noisy supervision.
PaperID: 1442,   Poster  https://arxiv.org/pdf/2602.20933    
Authors: Shuangkang Fang, I-Chao Shen, Xuanyang Zhang, Zesheng Wang, Yufeng Wang, Wenrui Ding, Gang Yu, Takeo Igarashi
Title: Dropping Anchor and Spherical Harmonics for Gaussian Splatting
Abstract: Recent 3D Gaussian Splatting (3DGS) dropout methods address overfitting under sparse-view conditions by randomly nullifying Gaussian opacities. However, we identify a neighbor compensation effect in these approaches: dropped Gaussians are often compensated by their neighbors, weakening the intended regularization. Moreover, these methods overlook the contribution of high-degree spherical harmonic coefficients (SH) to overfitting. To address these issues, we propose DropAnSH-GS, a novel anchor-based dropout strategy. Rather than dropping Gaussians independently, our method randomly selects certain Gaussians as anchors and simultaneously removes their spatial neighbors. This effectively disrupts local redundancies and encourages the model to learn more robust, globally informed representations. Furthermore, we extend the dropout to color attributes by randomly dropping higher-degree SH coefficients to concentrate appearance information in lower-degree SH. This strategy further mitigates overfitting and enables flexible post-training model compression via SH truncation. Experimental results demonstrate that DropAnSH-GS substantially outperforms existing dropout methods with negligible computational overhead and can be readily integrated into various 3DGS variants to enhance their performance.
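The two dropout ideas in this abstract, dropping an anchor together with its spatial neighbors and dropping higher-degree SH coefficients, can be sketched in a few lines. This is a toy illustration on plain Python lists, not the authors' implementation: the function names, the brute-force nearest-neighbor search, the anchor probability, and the choice of a single random cutoff degree are all assumptions.

```python
import random


def anchor_dropout(points, anchor_prob=0.1, k=2, rng=None):
    """Return indices to drop: sampled anchors plus their k nearest neighbors."""
    rng = rng or random.Random(0)
    dropped = set()
    for i, p in enumerate(points):
        if rng.random() >= anchor_prob:
            continue  # this Gaussian is not selected as an anchor
        # Brute-force nearest neighbors by squared Euclidean distance.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])),
        )
        dropped.add(i)
        dropped.update(others[:k])
    return dropped


def sh_dropout(sh_coeffs, max_degree, rng=None):
    """Zero out all SH coefficients above a randomly chosen degree.

    sh_coeffs is a flat list of (max_degree + 1) ** 2 values ordered by
    degree, so degrees 0..d occupy the first (d + 1) ** 2 entries.
    """
    rng = rng or random.Random(0)
    keep_degree = rng.randint(0, max_degree)
    keep = (keep_degree + 1) ** 2
    return [c if idx < keep else 0.0 for idx, c in enumerate(sh_coeffs)]
```

Dropping a whole neighborhood at once means the surviving Gaussians near an anchor cannot simply absorb its contribution, which is the compensation effect the abstract describes for independent dropout.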
PaperID: 1443,   Poster  https://arxiv.org/pdf/2603.17979    
Authors: Jinho Park, Se Young Chun, Mingoo Seok
Title: AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception
Abstract: Radar is a critical perception modality in autonomous driving systems due to its all-weather characteristics and ability to measure range and Doppler velocity. However, the sheer volume of high-dimensional raw radar data saturates the communication link to the computing engine (e.g., an NPU), which is often a low-bandwidth interface with data rate provisioned only for a few low-resolution range-Doppler frames. A generalized codec for utilizing high-dimensional radar data is notably absent, while existing image-domain approaches are unsuitable, as they typically operate at fixed compression ratios and fail to adapt to varying or adversarial conditions. In light of this, we propose radar data compression with adaptive feedback. It dynamically adjusts the compression ratio by performing gradient descent from the proxy gradient of detection confidence with respect to the compression rate. We employ a zeroth-order gradient approximation as it enables gradient computation even with non-differentiable core operations, namely pruning and quantization. This also avoids transmitting the gradient tensors over the band-limited link, which, if estimated, would be as large as the original radar data. In addition, we have found that radar feature maps are heavily concentrated on a few frequency components. Thus, we apply the discrete cosine transform to the radar data cubes and selectively prune out the coefficients effectively. We preserve the dynamic range of each radar patch through scaled quantization. Combining those techniques, our proposed online adaptive compression scheme achieves over 100x feature size reduction at minimal performance drop (~1%p). We validate our results on the RADIal, CARRADA, and Radatron datasets.
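The transform, prune, and quantize path mentioned in this abstract can be illustrated with a small 1D sketch. It applies an orthonormal DCT-II, keeps the largest-magnitude coefficients, and quantizes them with a per-patch scale. It does not cover the adaptive rate feedback loop, and the function names, the top-k pruning rule, and the 8-bit symmetric quantizer are assumptions rather than the paper's codec.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1D list."""
    n = len(x)
    out = []
    for k in range(n):
        s = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(s * sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                           for i, v in enumerate(x)))
    return out


def idct(coeffs):
    """Inverse of dct (orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * c
            * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]


def compress(x, keep, bits=8):
    """Keep the `keep` (>= 1) largest-magnitude coefficients and quantize them."""
    coeffs = dct(x)
    order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = sorted(order[:keep])
    levels = 2 ** (bits - 1) - 1
    # Per-patch scale so the largest kept coefficient maps to the top level.
    scale = max(abs(coeffs[k]) for k in kept) / levels or 1.0
    codes = [round(coeffs[k] / scale) for k in kept]
    return kept, codes, scale, len(x)


def decompress(kept, codes, scale, n):
    """Rebuild the signal from kept indices, integer codes, and the scale."""
    coeffs = [0.0] * n
    for k, q in zip(kept, codes):
        coeffs[k] = q * scale
    return idct(coeffs)
```

In this sketch `keep` plays the role of the compression rate: a feedback controller like the one described in the abstract would adjust it online, while the pruning and rounding steps stay non-differentiable.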
PaperID: 1444,   Poster  https://arxiv.org/pdf/2512.18215    
Authors: Rui Liu, Dian Yu, Lei Ke, Haolin Liu, Yujun Zhou, Zhenwen Liang, Haitao Mi, Pratap Tokekar, Dong Yu
Title: Stable and Efficient Single-Rollout RL for Multimodal Reasoning
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this sample efficiency-stability trade-off, we introduce MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior rollout sample efficiency, achieving similar validation accuracy with half the training steps. When trained for the same number of steps, MSSR's performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, sample-efficient, and effective RLVR for complex multimodal reasoning tasks. We will release code and checkpoints upon acceptance.
PaperID: 1445,   Poster  https://arxiv.org/pdf/2601.02036    
Authors: Yiyang Wang, Xi Chen, Xiaogang Xu, Yu Liu, Hengshuang Zhao
Title: GDRO: Group-level Reward Post-training Suitable for Diffusion Models
Abstract: Recent advancements adopt online reinforcement learning (RL) from LLMs to text-to-image rectified flow diffusion models for reward alignment. The use of group-level rewards successfully aligns the model with the targeted reward. However, it faces challenges including low efficiency, dependency on stochastic samplers, and reward hacking. The problem is that rectified flow models are fundamentally different from LLMs: 1) For efficiency, online image sampling takes much more time and dominates the time of training. 2) For stochasticity, rectified flow is deterministic once the initial noise is fixed. Aiming at these problems and inspired by the effects of group-level rewards from LLMs, we design Group-level Direct Reward Optimization (GDRO). GDRO is a new post-training paradigm for group-level reward alignment that combines the characteristics of rectified flow models. Through rigorous theoretical analysis, we point out that GDRO supports full offline training that saves the large time cost for image rollout sampling. Also, it is diffusion-sampler-independent, which eliminates the need for the ODE-to-SDE approximation to obtain stochasticity. We also empirically study the reward hacking trap that may mislead the evaluation, and involve this factor in the evaluation using a corrected score that not only considers the original evaluation reward but also the trend of reward hacking. Extensive experiments demonstrate that GDRO effectively and efficiently improves the reward score of the diffusion model through group-wise offline optimization across the OCR and GenEval tasks, while demonstrating strong stability and robustness in mitigating reward hacking.
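As background for the "group-level rewards" this abstract refers to, the sketch below shows the group-relative normalization commonly used in GRPO-style methods: each sample's reward is centered and scaled by the statistics of its own group. The abstract does not state GDRO's objective, so this is not the GDRO loss; the function name and the `eps` stabilizer are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group (all samples drawn for one prompt).

    Returns (r - mean) / (std + eps) for each reward. `eps` is an assumed
    stabilizer so that a group with identical rewards yields zeros instead
    of a division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the normalization depends only on rewards within the group, it can in principle be computed on pre-generated samples, which is consistent with the offline setting the abstract describes.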
PaperID: 1446,   Poster  https://arxiv.org/pdf/2512.10571    
Authors: Haojie Zheng, Shuchen Weng, Jingqi Liu, Siqi Yang, Boxin Shi, Xinlong Wang
Title: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
Abstract: Recent advancements in video generation highlight that realistic audio-visual synchronization is crucial for engaging content creation. However, existing video editing methods largely overlook audio-visual synchronization and lack the fine-grained spatial and temporal controllability required for precise instance-level edits. In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We further design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control. To facilitate this task, we additionally construct a large-scale dataset with instance-centric correspondence and comprehensive annotations. Extensive experiments demonstrate that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization.
PaperID: 1447,   Poster  https://arxiv.org/pdf/2508.05423    
Authors: Yixuan Zhang, Jinhao Sheng, Wenxin Zhang, Quyu Kong, Feng Zhou
Title: Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling
Abstract: Although artificial neural networks are often described as brain-inspired, their representations typically rely on continuous activations, such as the continuous latent variables in variational autoencoders (VAEs), which limits their biological plausibility compared to the discrete spike-based signaling in real neurons. Extensions like the Poisson VAE introduce discrete count-based latents, but their equal mean-variance assumption fails to capture overdispersion in neural spikes, leading to less expressive and informative representations. To address this, we propose NegBio-VAE, a negative-binomial latent-variable model with a dispersion parameter for flexible spike count modeling. NegBio-VAE preserves interpretability while improving representation quality and training feasibility via novel KL estimation and reparameterization. Experiments on four datasets demonstrate that NegBio-VAE consistently achieves superior reconstruction and generation performance, and yields robust, informative latent representations for downstream tasks. Extensive ablation studies are performed to verify the model’s robustness w.r.t. various components.
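The overdispersion argument in this abstract rests on a standard property: a Poisson count has variance equal to its mean, while a negative binomial count with mean mu and dispersion r has variance mu + mu^2 / r. The sketch below checks this numerically with a gamma-Poisson mixture. It illustrates the distribution only, not NegBio-VAE's KL estimator or reparameterization; all function names are hypothetical.

```python
import math
import random


def nb_variance(mean, dispersion):
    """Variance of a negative binomial with the given mean and dispersion r.
    As r grows, this approaches the Poisson case (variance == mean)."""
    return mean + mean * mean / dispersion


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_negative_binomial(mean, dispersion, rng):
    """Gamma-Poisson mixture: rate ~ Gamma(shape=r, scale=mean/r),
    count ~ Poisson(rate)."""
    rate = rng.gammavariate(dispersion, mean / dispersion)
    return sample_poisson(rate, rng)


def empirical_mean_var(samples):
    """Population mean and variance of a list of counts."""
    n = len(samples)
    m = sum(samples) / n
    return m, sum((s - m) ** 2 for s in samples) / n
```

For example, with mean 4 and dispersion 2 the analytic variance is 12, three times the Poisson variance at the same mean, which is the extra spread a Poisson latent cannot represent.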
PaperID: 1448,   Poster  https://arxiv.org/pdf/2603.19121    
Authors: Weilin Chen, Jiahao Rao, Wenhao Wang, Xinyang Li, Xuan Cheng, Liujuan Cao
Title: CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization
Abstract: The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they lack the precision for fine-grained, instance-level control, and often produce textures with insufficient quality, artifacts, and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene and a set of reference images specifying the desired appearance for each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that separates semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with an instances cross attention, to ensure semantic plausibility and "reference-instance" alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.
PaperID: 1449,   Poster  https://arxiv.org/pdf/2512.08923    
Authors: Angela van Sprang, Laurens Samson, Ana Lucic, Erman Acar, Sennay Ghebreab, Yuki M Asano
Title: Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
Abstract: We introduce two new benchmarks REST and REST+ (Render-Equivalence Stress Tests) to enable systematic evaluation of cross-modal inconsistency in multimodal large language models (MLLMs). MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks in both modalities. Our benchmarks contain samples with the same semantic information in three modalities (image, text, mixed) and we show that state-of-the-art MLLMs cannot consistently reason over these different modalities. We evaluate 15 MLLMs and find that the degree of modality inconsistency varies substantially, even when accounting for problems with text recognition (OCR). Neither rendering text as image nor rendering an image as text solves the inconsistency. Even if OCR is correct, we find that visual characteristics (text colour and resolution, but not font) and the number of vision tokens have an impact on model performance. Finally, we find that our consistency score correlates with the modality gap between text and images, highlighting a mechanistic interpretation of cross-modal inconsistent MLLMs.
PaperID: 1450,   Poster  https://arxiv.org/pdf/2602.23823    
Authors: Henghui Du, Chang Zhou, Xi Chen, Di Hu
Title: APPO: Attention-guided Perception Policy Optimization for Video Reasoning
Abstract: Complex video reasoning actually relies heavily on fine-grained perception rather than on expert-level (e.g., Ph.D., Science) reasoning. Through extensive empirical observation, we have recognized the critical impact of perception. In particular, when perception ability is almost fixed, enhancing reasoning from Qwen3-8B to OpenAI-o3 yields only a 0.7% performance improvement. Conversely, even a minimal change in perception model scale (from 7B to 32B) boosts performance by 1.4%, indicating that enhancing perception, rather than reasoning, is more critical to improving performance. Therefore, exploring how to enhance perception ability through reasoning without the need for expensive fine-grained annotation information is worthwhile. To achieve this goal, we propose APPO, the Attention-guided Perception Policy Optimization algorithm, which leverages token-level dense rewards to improve the model's fine-grained perception. The core idea behind APPO is to optimize those tokens from different responses that primarily focus on the same crucial video frame (called intra-group perception tokens). Experimental results on diverse video benchmarks and models with different scales (3/7B) demonstrate that APPO consistently outperforms GRPO and DAPO (0.5% ~ 4%). We hope our work provides a promising approach to effectively enhance a model's perception abilities through reasoning in a low-cost manner, serving diverse scenarios and demands.
PaperID: 1451,   Poster  https://arxiv.org/pdf/2601.22275    
Authors: Cheng Liang, Haoxian Chen, Liang Hou, Qi Fan, Gangshan Wu, Xin Tao, Limin Wang
Title: VMonarch: Efficient Video Diffusion Transformers with Structured Attention
Abstract: The quadratic complexity of the attention mechanism severely limits the context scalability of Video Diffusion Transformers (DiTs). We find that the highly sparse spatiotemporal attention patterns exhibited in Video DiTs can be naturally represented by the Monarch matrix. It is a class of structured matrices with flexible sparsity, enabling sub-quadratic attention via an alternating minimization algorithm. Accordingly, we propose VMonarch, a novel attention mechanism for Video DiTs that enables efficient computation over the dynamic sparse patterns with structured Monarch matrices. First, we adapt spatio-temporal Monarch factorization to explicitly capture the intra-frame and inter-frame correlations of the video data. Second, we introduce a recomputation strategy to mitigate artifacts arising from instabilities during alternating minimization of Monarch matrices. Third, we propose a novel online entropy algorithm fused into FlashAttention, enabling fast Monarch matrix updates for long sequences. Extensive experiments demonstrate that VMonarch achieves comparable or superior generation quality to full attention on VBench after minimal fine-tuning. It overcomes the attention bottleneck in Video DiTs, reduces attention FLOPs by a factor of 17.5, and achieves a speedup of over 5× in attention computation for long videos, surpassing state-of-the-art sparse attention methods at 90% sparsity.
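To make the "Monarch matrix" structure concrete, the sketch below multiplies a vector by a matrix of the form M = P L P^T R, the factorization commonly given for Monarch matrices, where L and R are block-diagonal and P is the reshape-and-transpose permutation. It assumes n = m * m with m blocks of size m x m, and it does not implement VMonarch's attention, alternating minimization, or FlashAttention fusion; the function names are hypothetical.

```python
def block_diag_matvec(blocks, x):
    """Multiply a block-diagonal matrix (m blocks, each m x m) by x."""
    m = len(blocks)
    out = []
    for b, block in enumerate(blocks):
        seg = x[b * m:(b + 1) * m]
        out.extend(sum(w * v for w, v in zip(row, seg)) for row in block)
    return out


def transpose_perm(x, m):
    """View x as an m x m row-major grid and read it out column by column.
    For a square grid this permutation is its own inverse."""
    return [x[i * m + j] for j in range(m) for i in range(m)]


def monarch_matvec(left_blocks, right_blocks, x):
    """Compute (P L P^T R) x without materializing the n x n matrix.

    Each block-diagonal product costs m * m^2 = n * sqrt(n) multiplications,
    so the total is about 2 * n^1.5 instead of n^2 for a dense matrix.
    """
    m = len(right_blocks)
    y = block_diag_matvec(right_blocks, x)
    y = transpose_perm(y, m)
    y = block_diag_matvec(left_blocks, y)
    return transpose_perm(y, m)
```

The permutation between the two block-diagonal factors is what lets information mix across blocks, so the product can be dense even though each factor is sparse.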
PaperID: 1452,   Poster  https://arxiv.org/pdf/2511.12267    
Authors: Ruixun Liu, Bowen Fu, Jiayi Song, Kaiyu Li, Wanchen Li, Lanxuan Xue, Hui Qiao, Weizhan Zhang, Deyu Meng, Xiangyong Cao
Title: ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks
Abstract: Ultra-high-resolution (UHR) remote sensing (RS) images offer rich fine-grained information but also present challenges in effective processing. Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm, suffering from increased redundancy when obtaining finer visual inputs. In this work, we explore a new active perception paradigm that enables models to revisit information-rich regions. First, we present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing, encompassing 17 question types across global, region, and object levels, annotated via a semi-automatic pipeline. Building on LRS-GRO, we propose ZoomEarth, an adaptive cropping–zooming framework with a novel Region-Guided reward that provides fine-grained guidance. Trained via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), ZoomEarth achieves state-of-the-art performance on LRS-GRO and, in the zero-shot setting, on three public UHR remote sensing benchmarks. Furthermore, ZoomEarth can be seamlessly integrated with downstream models for tasks such as cloud removal, denoising, segmentation, and image editing through simple tool interfaces, demonstrating strong versatility and extensibility.
PaperID: 1453,   Poster  https://arxiv.org/pdf/2603.17995    
Authors: Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero, Chun-Hao P. Huang, Duygu Ceylan, Niloy J. Mitra, Xuelin Chen
Title: LoST: Level of Semantics Tokenization for 3D Shapes
Abstract: Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss, inspired by relational knowledge distillation, that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%–10% of the tokens needed by prior 3D AR models. Code will be released to facilitate future research.
PaperID: 1454,   Poster  https://arxiv.org/pdf/2603.24570    
Authors: Hong Duc Vu, Anh Nguyen, Chi Tran, Anh Tran
Title: Anti-I2V: Safeguarding your photos from malicious image-to-video generation
Abstract: Advances in diffusion-based video generation models, while significantly improving human animation, pose threats of misuse through the creation of fake videos from a specific person's photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation, while relatively few explicitly address image-to-video diffusion models (VDMs), and most primarily focus on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention and stronger temporal consistency due to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the Lab and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process to design appropriate training objectives that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.
PaperID: 1455,   Poster  https://arxiv.org/pdf/2512.17911    
Authors: Hongji Li, Manjiang Yu, Junchi Yao, PRIYANKA SINGH, Xue Li, Di Wang, Lijie Hu
Title: Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models
Abstract: Machine unlearning aims to erase requested data from trained models without full retraining. For Reasoning Multimodal Large Language Models (RMLLMs), this is especially challenging: intermediate chain-of-thought steps can still leak sensitive information even when final answers are forgotten, and aggressive interventions easily damage general reasoning ability. However, existing benchmarks do not jointly evaluate how well unlearning methods suppress reasoning-level leakage while preserving reasoning competence. We address this gap with RMLLMU-Bench, the first benchmark for RMLLM unlearning that extends standard forgetting metrics with explicit reasoning traces and dedicated measures of reasoning leakage and reasoning retention. A systematic evaluation on RMLLMU-Bench shows that current unlearning methods for MLLMs and Large (Language) Reasoning Models (LRMs) either leave substantial leakage in the reasoning process or severely degrade reasoning performance. To overcome these limitations, we propose R-MUSE (Reasoning-preserving MLLM Unlearning via Subspace guidance and Adaptive Steering), a training-free, inference-time intervention framework that steers internal representations to forget both answers and reasoning traces while explicitly preserving general reasoning ability. Experiments on RMLLMU-Bench demonstrate that R-MUSE achieves a substantially better balance between effective forgetting and reasoning retention than existing approaches. Our code and data will be released upon acceptance.
PaperID: 1456,   Poster  https://arxiv.org/pdf/2603.18782    
Authors: Jiatong Xia, Zicheng Duan, Anton van den Hengel, Lingqiao Liu
Title: Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors
Abstract: Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, visible-region point clouds are easy to obtain, from active sensors such as LiDAR or from feed-forward predictors like VGGT, offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on the latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization with an input formulation tailored to point cloud priors. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to learn global structural inpainting, is then used for inference with a staged sampling strategy (structural inpainting followed by boundary refinement), completing the global geometry while preserving the visible regions of the input priors. In practice, Points-to-3D can take either accurate point-cloud priors or VGGT-estimated point clouds from single images as input. Experiments on both single-object and multi-object scenarios consistently demonstrate superior performance over state-of-the-art baselines in terms of rendering quality and geometric fidelity, highlighting the effectiveness of explicitly embedding point-cloud priors for achieving more accurate and structurally controllable 3D generation.
PaperID: 1457,   Poster  https://arxiv.org/pdf/2602.22096    
Authors: Wenhua Wu, Huai Guan, Zhe Liu, Hesheng Wang
Title: WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation
Abstract: Editable high-fidelity 4D scenes are crucial for autonomous driving, as they can be applied to end-to-end training and closed-loop simulation. However, existing reconstruction methods are primarily limited to replicating observed scenes and lack the capability for diverse weather simulation, while image-level weather editing methods tend to introduce scene artifacts and offer poor controllability over the weather effects. To address these limitations, we propose WeatherCity, a novel framework for 4D urban scene reconstruction and weather editing. Specifically, we leverage a text-guided image editing model to achieve flexible editing of image weather backgrounds. To tackle the challenge of multi-weather modeling, we introduce a novel weather Gaussian representation based on shared scene features and dedicated weather-specific decoders. This representation is further enhanced with a content consistency optimization, ensuring coherent modeling across different weather conditions. Additionally, we design a physics-driven model that simulates dynamic weather effects through particles and motion patterns. Extensive experiments on multiple datasets and various scenes demonstrate that WeatherCity achieves flexible controllability, high fidelity, and temporal consistency in 4D reconstruction and weather editing. Our framework not only enables fine-grained control over weather conditions (e.g., light rain and heavy snow) but also supports object-level manipulation within the scene.
PaperID: 1458,   Poster  https://arxiv.org/pdf/2603.18093    
Authors: Haoxiang Rao, Zhao Wang, Chenyang Si, Yan LYU, Yuanyi Duan, Fang Zhao, Caifeng Shan
Title: One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control
Abstract: Industrial anomaly detection (AD) is characterized by an abundance of normal images but a scarcity of anomalous ones. Although numerous few-shot anomaly synthesis methods have been proposed to augment anomalous data for downstream AD tasks, most existing approaches require time-consuming training and struggle to learn distributions that are faithful to real anomalies, thereby restricting the efficacy of AD models trained on such data. To address these limitations, we propose a training-free few-shot anomaly generation method, namely O2MAG, which leverages the self-attention in One reference anomalous image to synthesize More realistic anomalies, supporting effective downstream anomaly detection. Specifically, O2MAG manipulates three parallel diffusion processes via self-attention grafting and incorporates the anomaly mask to mitigate foreground-background query confusion, synthesizing text-guided anomalies that closely adhere to real anomalous distributions. To bridge the semantic gap between the encoded anomaly text prompts and the true anomaly semantics, Anomaly-Guided Optimization is further introduced to align the synthesis process with the target anomalous distribution, steering the generation toward realistic and text-consistent anomalies. Moreover, to mitigate faint anomaly synthesis inside anomaly masks, Dual-Attention Enhancement is adopted during generation to reinforce both self- and cross-attention on masked regions. Extensive experiments validate the effectiveness of O2MAG, demonstrating its superior performance over prior state-of-the-art methods on downstream AD tasks.
PaperID: 1459,   Poster  https://arxiv.org/pdf/2511.18960    
Authors: Lei Xiao, Jifeng Li, Juntao Gao, Feiyang Ye, Yan Jin, Jingjing Qian, Jing Zhang, Yong Wu, Xiaoyuan Yu
Title: AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
Abstract: Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). However, this history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage the context of history. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP principle that action generation should be conditioned on the belief state, AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, which is a neural approximation of the agent's belief state derived from the previous decision step. Specifically, the AVA module uses the recurrent state to compute the soft weights to actively process task-relevant visual tokens based on its historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework's practical applicability and robust sim-to-real transferability.
PaperID: 1460,   Poster  https://arxiv.org/pdf/2510.10181    
Authors: Shaokai Wu, Yanbiao Ji, Qiuchang Li, Zhiyi Zhang, Qichen He, Wenyuan XIE, Guodong Zhang, Bayram Bayramli, Yue Ding, Hongtao Lu
Title: Dejavu: Towards Experience Feedback Learning for Embodied Intelligence
Abstract: Embodied agents face a fundamental limitation: once deployed in real-world environments to perform specific tasks, they are unable to acquire additional knowledge to enhance task performance. In this paper, we propose a general post-deployment learning framework Dejavu, which employs an Experience Feedback Network (EFN) and augments the frozen Vision-Language-Action (VLA) policy with retrieved execution memories. EFN identifies contextually prior action experiences and conditions action prediction on this retrieved guidance. We adopt reinforcement learning with semantic similarity rewards to train EFN, ensuring that the predicted actions align with past behaviors under current observations. During deployment, EFN continually enriches its memory with new trajectories, enabling the agent to exhibit “learning from experience”. Experiments across diverse embodied tasks show that EFN improves adaptability, robustness, and success rates over frozen baselines. We provide code and demo in our supplementary material.
PaperID: 1461,   Poster  https://arxiv.org/pdf/2510.15104    
Authors: Guofeng Zhang, Angtian Wang, Jacob Fang Fang, Liming Jiang, Haotian Yang, Bo Liu, Yiding Yang, Guang Chen, Longyin Wen, Alan L. Yuille, Chongyang Ma
Title: TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
Abstract: Text-to-video generation has advanced rapidly in visual fidelity, whereas standard methods still have limited ability to control the subject composition of generated scenes. Prior work shows that adding localized text control signals, such as bounding boxes or segmentation masks, can help. However, these methods struggle in complex scenarios and degrade in multi-object settings, offering limited precision and lacking a clear correspondence between individual trajectories and visual entities as the number of controllable objects increases. We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. We propose Location-Aware Cross-Attention (LACA) to integrate these signals and adopt a dual-CFG scheme to separately modulate local and global text guidance. In addition, we develop a data processing pipeline that produces trajectories with localized descriptions of tracked entities, and we annotate two million high-quality video clips to train TGT. Together, these components enable TGT to use point trajectories as intuitive motion handles, pairing each trajectory with text to control both appearance and motion. Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches.
PaperID: 1462,   Poster  https://arxiv.org/pdf/2603.03765    
Authors: Qihao Sun, Jiarun Liu, Ziqian Ni, Jianyun Xu, Sheng Yang, Tao Xie, lijun zhao, Ruifeng Li
Title: LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving
Abstract: Accurate metric depth is critical for autonomous driving perception and simulation, yet current approaches struggle to achieve high metric accuracy, multi-view and temporal consistency, and cross-domain generalization. To address these challenges, we present MVS-Pro, a novel multi-view stereo framework that reconciles these competing objectives through two key insights: (1) Sparse but metrically accurate LiDAR observations can serve as geometric prompts to anchor depth estimation in absolute scale, and (2) deep fusion of diverse cues is essential for resolving ambiguities and enhancing robustness, while a spatio-temporal decoder ensures consistency across frames. Built upon these principles, MVS-Pro embeds the LiDAR prompt in two ways: as a hard geometric prior anchoring the cost volume, and as soft feature-wise guidance fused by a triple cues combiner. As for temporal consistency, MVS-Pro leverages a spatio-temporal decoder that jointly exploits geometric cues from the MVS cost volume and temporal context from neighboring frames. Experiments show that MVS-Pro achieves state-of-the-art performance on multiple benchmarks, excelling in metric accuracy, temporal stability, and zero-shot cross-domain transfer, demonstrating its practical value for scalable, reliable autonomous driving systems. Code will be made publicly available.
PaperID: 1463,   Poster  https://arxiv.org/pdf/2603.25420    
Authors: George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Yang Bai, Liudi Yang, Ziyuan Liu
Title: VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
Abstract: Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferable to new environments without additional data collection. However, prior works can only operate on a single view at a time, while embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings due to the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To achieve an extension to the multi-view regime, we propose to ground all views in a shared 4D latent space derived from a feed-forward spatial foundation model, namely, Pi3. This encourages view-consistent appearance even under wide baselines and dynamic camera motion. To scale beyond a fixed number of cameras, we train views at distinct diffusion timesteps, enabling the model to learn both joint and conditional view distributions. This in turn allows autoregressive synthesis of new viewpoints conditioned on existing ones. Experiments show superior or similar performance to the state-of-the-art on the single-view translation benchmarks and, for the first time, physically and stylistically consistent multi-view translations, including challenging egocentric and heterogeneous-camera setups central to world randomization for robot learning.
PaperID: 1464,   Poster  https://arxiv.org/pdf/2508.09456    
Authors: Junxian Li, Beining Xu, Simin Chen, Jiatong LI, Jingdi Lei, Haodong Zhao, Di Zhang
Title: IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
Abstract: Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the best ASRs compared with other baselines on almost all settings without compromising clean accuracy, maintaining robustness against existing defenses, and exhibiting transferability across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding. Code is in the supplementary material.
PaperID: 1465,   Poster  https://arxiv.org/pdf/2512.05016    
Authors: Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, Siwei Ma
Title: Generative Neural Video Compression via Video Diffusion Prior
Abstract: We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.
PaperID: 1466,   Poster  https://arxiv.org/pdf/2603.02785    
Authors: Zihao Peng, Nan Zou, Jiandian Zeng, Guo Li, Ke Chen, Boyuan Li, Tian Wang
Title: HiLoRA: Hierarchical Low-Rank Adaptation for Personalized Federated Learning
Abstract: Vision Transformers (ViTs) have been widely adopted in vision tasks due to their strong transferability. In Federated Learning (FL), where full fine-tuning is communication-heavy, Low-Rank Adaptation (LoRA) provides an efficient and communication-friendly way to adapt ViTs. However, existing LoRA-based federated tuning methods overlook latent client structures in real-world settings, limiting shared representation learning and hindering generalization to unseen clients. To address this, we propose HiLoRA, a hierarchical LoRA framework that places adapters at three levels: root, cluster, and leaf, each designed to capture global, subgroup, and client-specific knowledge, respectively. Through cross-tier orthogonality and cascaded optimization, HiLoRA separates update subspaces and aligns each tier with its residual personalized objective. In particular, we develop a LoRA-Subspace Adaptive Clustering mechanism that infers latent client groups via subspace similarity analysis, thereby facilitating knowledge sharing across structurally aligned clients. Theoretically, we establish a tier-wise generalization analysis that supports HiLoRA’s design. Experiments on ViT backbones with CIFAR-100 and DomainNet demonstrate consistent improvements in both personalization and generalization.
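Illustrative sketch (not from the paper): one plausible way to read "adapters at three levels" is that a client's effective weight is the frozen base weight plus one low-rank update per tier. The additive composition, the helper names `matmul`, `add`, and `effective_weight`, and the toy matrices are all assumptions made for illustration; the cross-tier orthogonality constraint and cascaded optimization described in the abstract are not modeled here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def effective_weight(w0, adapters):
    """Return w0 + sum(B @ A) over the given (B, A) low-rank pairs.

    `adapters` is assumed to hold the root, cluster, and leaf tiers in order.
    """
    w = w0
    for b, a in adapters:
        w = add(w, matmul(b, a))
    return w


# Toy 2x2 base weight with rank-1 adapters for each hypothetical tier.
w0 = [[1, 0], [0, 1]]
root = ([[1], [0]], [[1, 1]])        # shared by all clients
cluster = ([[0], [1]], [[2, 0]])     # shared within one client group
leaf = ([[1], [1]], [[0, 0.5]])      # specific to one client
w_client = effective_weight(w0, [root, cluster, leaf])
```

In this reading, only the small `(B, A)` pairs would need to be communicated, which is the usual motivation for LoRA in federated settings.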
PaperID: 1467,   Poster  https://arxiv.org/pdf/2602.20328    
Authors: Romario Gualdrón-Hurtado, Roman Jacome, Rafael S. Suárez, Henry Arguello
Title: GSNR: Graph Smooth Null-Space Representation for Inverse Problems
Abstract: Inverse problems in imaging are undetermined, leading to infinitely many solutions consistent with the measurements due to the nontrivial null-space of the sensing matrix. Common image priors promote solutions on the general image manifold, such as sparsity, smoothness, or score function. However, as these priors do not constrain the null‑space component, they can bias the reconstruction. Thus, we aim to incorporate meaningful null-space information into the reconstruction framework. Inspired by smooth image representation on graphs, we propose Graph-Smooth Null-Space Representation (GSNR), a mechanism that imposes structure only into the invisible component. Particularly, given a graph Laplacian, we construct a null-restricted Laplacian that encodes similarity between neighboring pixels in the null-space signal, and we design a low-dimensional projection matrix from the p-smoothest spectral graph modes (lowest graph frequencies). This approach has strong theoretical and practical implications: i) improved convergence via a null-only graph regularizer, ii) better coverage, how much null‑space variance is captured by p modes, and iii) high predictability, how well these modes can be inferred from the measurements. GSNR is incorporated into well-known inverse problem solvers, e.g., PnP, DIP, and diffusion solvers, in four scenarios: image deblurring, compressed sensing, demosaicing, and image super-resolution, providing consistent improvement of up to 4.3 dB over baseline formulations and up to 1 dB compared with end-to-end learned models in terms of PSNR.
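Illustrative note (not from the paper): the "invisible component" mentioned above can be made concrete with the standard range/null-space decomposition. The symbols $A$ (sensing matrix), $x$ (image), $y$ (measurements), and $A^{\dagger}$ (Moore-Penrose pseudoinverse) are notation assumed here, since the abstract names none.

```latex
\begin{aligned}
x &= \underbrace{A^{\dagger} A\, x}_{\text{row-space component}}
   + \underbrace{(I - A^{\dagger} A)\, x}_{\text{null-space component}}, \\
A\,(I - A^{\dagger} A)\, x &= (A - A A^{\dagger} A)\, x = 0
   \quad \text{since } A A^{\dagger} A = A, \\
y = A x &= A\, A^{\dagger} A\, x .
\end{aligned}
```

The measurements therefore carry no information about the null-space component, which is why any constraint on it has to come from a prior.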
PaperID: 1468,   Poster  https://arxiv.org/pdf/2603.02893    
Authors: Kaiqiang Xiong, Rui Peng, Jiahao Wu, Zhanke Wang, Jie Liang, Xiaoyun Zheng, Feng Gao, Ronggang Wang
Title: Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting
Abstract: 3D Gaussian Splatting (3DGS) represents scenes through primitives with coupled intrinsic properties: geometric attributes (position, covariance, opacity) and appearance attributes (view-dependent color). Faithful reconstruction requires intrinsic geometry-appearance consistency, where geometry accurately captures 3D structure while appearance reflects photometry. However, sparse observations lead to appearance overfitting and underconstrained geometry, causing severe novel-view artifacts. We present ICO-GS (Intrinsic Geometry-Appearance Consistency Optimization for 3DGS), a principled framework that enforces this consistency through tightly coupled geometric regularization and appearance learning. Our approach first regularizes geometry via feature-based multi-view photometric constraints by employing pixel-wise top-k selection to handle occlusions and edge-aware smoothness to preserve sharp structures. Then appearance is coupled with geometry through cycle-consistency depth filtering, which identifies reliable regions to synthesize virtual views that propagate geometric correctness into appearance optimization. Experiments on LLFF, DTU, and Blender show that ICO-GS substantially improves geometry and photometry, consistently outperforming existing sparse-view baselines, particularly in challenging weakly-textured regions.
PaperID: 1469,   Poster  https://arxiv.org/pdf/2509.25896    
Authors: Guolei Huang, Qinzhi Peng, Gan Xu, Yao Huang, Yuxuan Lu, Yongjun Shen
Title: LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
Abstract: As Vision-Language Models (VLMs) move into interactive, multi-turn use, safety concerns intensify for multimodal multi-turn dialogue, which is characterized by concealment of malicious intent, contextual risk accumulation, and cross-modal joint risk. As a result, these characteristics limit the effectiveness of content moderation approaches designed for single-turn or single-modality settings. To address these limitations, we first construct the Multimodal Multi-turn Dialogue Safety (MMDS) dataset, comprising 4,484 annotated dialogues and a comprehensive risk taxonomy with 8 primary and 60 subdimensions. As part of MMDS construction, we introduce Multimodal Multi-turn Red Teaming (MMRT), an automated framework for generating unsafe multimodal multi-turn dialogues. We further propose LLaVAShield, which audits the safety of both user inputs and assistant responses under specified policy dimensions in multimodal multi-turn dialogues. Extensive experiments show that LLaVAShield significantly outperforms state-of-the-art VLMs and existing content moderation tools while demonstrating strong generalization and flexible policy adaptation. Additionally, we analyze vulnerabilities of mainstream VLMs to harmful inputs and evaluate the contribution of key components, advancing understanding of safety mechanisms in multimodal multi-turn dialogues.
PaperID: 1470,   Poster  https://arxiv.org/pdf/2603.03969    
Authors: Zhiwen Chen, Junhui Hou, Zhiyu Zhu, Jinjian Wu, Guangming Shi
Title: Scaling Dense Event-Stream Pretraining from Visual Foundation Models
Abstract: Learning versatile, fine-grained representations from irregular event streams is pivotal yet nontrivial, primarily due to the heavy annotation that hinders scalability in dataset size, semantic richness, and application scope. To mitigate this dilemma, we launch a novel self-supervised pretraining method that distills visual foundation models (VFMs) to push the boundaries of event representation at scale. Specifically, we curate an extensive synchronized image-event collection to amplify cross-modal alignment. Nevertheless, due to inherent mismatches in sparsity and granularity between image-event domains, existing distillation paradigms are prone to semantic collapse in event representations, particularly at high resolutions. To bridge this gap, we propose to extend the alignment objective to semantic structures provided off-the-shelf by VFMs, indicating a broader receptive field and stronger supervision. The key ingredient of our method is a structure-aware distillation loss that grounds higher-quality image-event correspondences for alignment, optimizing dense event representations. Extensive experiments demonstrate that our approach takes a great leap in downstream benchmarks, significantly surpassing traditional methods and existing pretraining techniques. This breakthrough manifests in enhanced generalization, superior data efficiency and elevated transferability. The source code will be available.
PaperID: 1471,   Poster  https://arxiv.org/pdf/2604.07884    
Authors: Xuemei Jia, Jiawei Du, Hui Wei, Jun Chen, Joey Tianyi Zhou, Zheng Wang
Title: Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition
Abstract: High-fidelity generative models are increasingly needed in privacy-sensitive scenarios, where access to data is severely restricted due to regulatory and copyright constraints. This scarcity hampers model development, ironically in settings where generative models are most needed to compensate for the lack of data. This creates a self-reinforcing challenge: limited data leads to poor generative models, which in turn fail to mitigate data scarcity. To break this cycle, we propose a reinforcement-guided synthetic data generation framework that adapts general-domain generative priors to privacy-sensitive identity recognition tasks. We first perform a cold-start adaptation to align a pretrained generator with the target domain, establishing semantic relevance and initial fidelity. Building on this foundation, we introduce a multi-objective reward that jointly optimizes semantic consistency, coverage diversity, and expression richness, guiding the generator to produce both realistic and task-effective samples. During downstream training, a dynamic sample selection mechanism further prioritizes high-utility synthetic samples, enabling adaptive data scaling and improved domain alignment. Extensive experiments on benchmark datasets demonstrate that our framework significantly improves both generation fidelity and classification accuracy, while also exhibiting strong generalization to novel categories in small-data regimes.
PaperID: 1472,   Poster  https://arxiv.org/pdf/2503.22179    
Authors: Dailan He, Xiahong Wang, Shulun Wang, Hao Shao, Bingqi Ma, Guanglu Song, Yu Liu, Hongsheng Li
Title: High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning
Abstract: Face swapping aims to seamlessly transfer a source facial identity onto a target while preserving target attributes such as pose and expression. Diffusion models, known for their superior generative capabilities, have recently shown promise in advancing face-swapping quality. This paper addresses two key challenges in diffusion-based face swapping: the prioritized preservation of identity over target attributes and the inherent conflict between identity and attribute conditioning. To tackle these issues, we introduce an identity-constrained attribute-tuning framework for face swapping that first ensures identity preservation and then fine-tunes for attribute alignment, achieved through a decoupled condition injection. We further enhance fidelity by incorporating identity and adversarial losses in a post-training refinement stage. Our proposed identity-constrained diffusion-based face-swapping model outperforms existing methods in both qualitative and quantitative evaluations, demonstrating superior identity similarity and attribute consistency, achieving a new state-of-the-art performance in high-fidelity face swapping.
PaperID: 1473,   Poster  https://arxiv.org/pdf/2511.22107    
Authors: Chen Zhang, Yilu An, Ying Chen, Hao Li, Xitong Ling, Lihao Liu, Junjun He, Yuxiang Lin, Zihui Wang, Rongshan Yu
Title: HyperST: Hierarchical Hyperbolic Learning for Spatial Transcriptomics Prediction
Abstract: Spatial Transcriptomics (ST) merges the benefits of pathology images and gene expression, linking molecular profiles with tissue structure to analyze spot-level function comprehensively. Predicting gene expression from histology images is a cost-effective alternative to expensive ST technologies. However, existing methods mainly focus on spot-level image-to-gene matching but fail to leverage the full hierarchical structure of ST data, especially on the gene expression side, leading to incomplete image-gene alignment. Moreover, a challenge arises from the inherent information asymmetry: gene expression profiles contain more molecular details that may lack salient visual correlates in histological images, demanding a sophisticated representation learning approach to bridge this modality gap. We propose HyperST, a framework for ST prediction that learns multi-level image-gene representations by modeling the data's inherent hierarchy within hyperbolic space, a natural geometric setting for such structures. First, we design Multi-Level Representation Extractors to capture both spot-level and niche-level representations from each modality, providing context-aware information beyond individual spot-level image-gene pairs. Second, a Hierarchical Hyperbolic Alignment module is introduced to unify these representations, performing spatial alignment while hierarchically structuring image and gene embeddings. This alignment strategy enriches the image representations with molecular semantics, significantly improving cross-modal prediction. HyperST achieves state-of-the-art performance on four public datasets from different tissues, paving the way for more scalable and accurate spatial transcriptomics prediction.
PaperID: 1474,   Poster  https://arxiv.org/pdf/2510.18269    
Authors: Xueyi Chen, Keda Tao, Kele Shao, Huan Wang
Title: StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Abstract: Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves 15.7× kv-cache compression, 1.2× lower peak memory and 2× faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of 63.8% on offline benchmarks and 55.8%/3.7 on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
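Illustrative sketch (not from the paper): the "4-bit format" storage and later dequantization can be pictured with a generic uniform quantizer that packs two 4-bit codes per byte. The min/max scaling scheme and the function names `quantize_4bit`, `pack_nibbles`, `unpack_nibbles`, and `dequantize_4bit` are assumptions for illustration; the abstract does not state which quantization scheme StreamingTOM uses.

```python
def quantize_4bit(values):
    """Uniformly map floats to integer codes in [0, 15] using min/max scaling."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def pack_nibbles(codes):
    """Pack two 4-bit codes into each byte, padding with 0 if the count is odd."""
    padded = codes + [0] * (len(codes) % 2)
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))


def unpack_nibbles(packed, n):
    """Recover the first n 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return codes[:n]


def dequantize_4bit(codes, lo, scale):
    """Map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]


# Round trip on a toy feature vector (values chosen arbitrarily).
token = [0.0, 0.5, 1.0, -1.0, 0.25]
codes, lo, scale = quantize_4bit(token)
packed = pack_nibbles(codes)  # 5 codes occupy 3 bytes
restored = dequantize_4bit(unpack_nibbles(packed, len(token)), lo, scale)
```

With this scheme each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the trade-off paid for the smaller footprint.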
PaperID: 1475,   Poster  https://arxiv.org/pdf/2511.04283    
Authors: Shiwei Ren, Tianci Wen, Yongchun Fang, Biao Lu
Title: FastGS: Training 3D Gaussian Splatting in 100 Seconds
Abstract: The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.29× training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45× acceleration compared with vanilla 3DGS on the Deep Blending dataset. We demonstrate that FastGS exhibits strong generality, delivering 2-6× training acceleration across various tasks, including dynamic scene reconstruction, surface reconstruction, sparse-view reconstruction, large-scale reconstruction, and simultaneous localization and mapping.
PaperID: 1476,   Poster  https://arxiv.org/pdf/2602.23956    
Authors: Qianxun Xu, Chenxi Song, Yujun Cai, Chi Zhang
Title: SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
Abstract: Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, without explicit temporal grounding, such models often produce blended or collapsed scenes that break the intended narrative. To address this limitation, we present SwitchCraft, a training-free framework for multi-event video generation. Our key insight is that uniform prompt injection across time ignores the correspondence between events and frames. To this end, we introduce Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with relevant event prompts. Furthermore, we propose Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity. Extensive experiments demonstrate that SwitchCraft substantially improves prompt alignment, event clarity, and scene consistency compared with existing baselines, offering a simple yet effective solution for multi-event video generation.
PaperID: 1477,   Poster  https://arxiv.org/pdf/2602.21835    
Authors: Jianhui Wei, Xiaotian Zhang, Yichen Li, Yuan Wang, Yan Zhang, Ziyi Chen, Zhihang Tang, Wei Xu, Zuozhu Liu
Title: UniVBench: Towards Unified Evaluation for Video Foundation Models
Abstract: Video foundation models aim to integrate video understanding, generation, editing, and instruction following within a single framework, making them a central direction for next-generation multimodal systems. However, existing evaluation benchmarks remain fragmented and limited in scope, as they each target a single task, rely on task-specific metrics, and typically use short or simple video clips. As a result, they do not capture the unified capabilities that these models are designed to deliver. To address this gap, we introduce UniVBench, a benchmark purpose-built for evaluating video foundation models across four core abilities: video understanding, video generation, video editing, and a newly proposed task, video reconstruction, which assesses how faithfully a model can reproduce video content it has encountered. Our benchmark substantially expands the complexity of evaluation by incorporating 200 high-quality, diverse and multi-shot videos, each paired with detailed captions, multi-format editing instructions, and reference images. All videos are entirely human-created and carefully validated, offering richer cinematic information than prior benchmarks. In addition, we develop a unified agentic evaluation system (UniV-Eval) that standardizes prompting, instruction parsing, and scoring across all tasks, enabling fair, scalable, and reproducible comparisons of unified video models. By grounding evaluation in instruction-based multi-shot video tasks, UniVBench provides the first framework for measuring the integrated capabilities that video foundation models aim to achieve. Extensive human annotations ensure our evaluation aligns with human judgment, enabling rigorous assessment and accelerating progress toward robust video intelligence.
PaperID: 1478,   Poster  https://arxiv.org/pdf/2603.06034    
Authors: Chunjiang Li, Jianbo Ma, Li Shen, Yanru Chen, Liangyin Chen
Title: Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking
Abstract: Multi-object tracking (MOT) in computer vision involves analyzing object trajectories and counting the number of objects in video sequences. However, 2D MOT faces challenges due to cost confusion arising from partial occlusion. To address this issue, we present the Occlusion-Aware SORT (OA-SORT) framework, which introduces three innovations: the Occlusion-Aware Module (OAM), the Occlusion-Aware Offset (OAO), and the Bias-Aware Momentum (BAM). First, OAM assesses the occlusion status (i.e., occlusion severity) of objects and introduces a Gaussian Map (GM) to reduce background influence. Two plug-and-play, training-free components, OAO and BAM, are further proposed. Specifically, OAO leverages the OAM-derived bias from the Kalman Filter's position estimations to compensate positional cost, thereby mitigating confusion. Next, BAM utilizes the OAM-derived bias from the latest trajectory observations to optimize the Kalman Filter’s motion parameters, suppressing estimation fluctuations. Comprehensive evaluations on the DanceTrack, SportsMOT, and MOT17 datasets demonstrate the importance of occlusion handling in MOT. On the DanceTrack test set, OA-SORT achieves 63.1% and 64.2% in HOTA and IDF1, respectively. Furthermore, integrating the Occlusion-Aware framework into four additional trackers improves HOTA and IDF1 by an average of 2.08% and 3.05% on DanceTrack, demonstrating the reusability of the occlusion-aware framework and its components. Ablation studies further validate the effectiveness of the three components, highlighting the key role of the Gaussian Map.
PaperID: 1479,   Poster  https://arxiv.org/pdf/2603.13960    
Authors: Chenru Wang, Yunyi Chen, Zijun Yang, Joey Tianyi Zhou, Chi Zhang
Title: IMS3: Breaking Distributional Aggregation in Diffusion-Based Dataset Distillation
Abstract: Dataset Distillation aims to synthesize compact datasets that can approximate the training efficacy of large-scale real datasets, offering an efficient solution to the increasing computational demands of modern deep learning. Recently, diffusion-based dataset distillation methods have shown great promise by leveraging the strong generative capacity of diffusion models to produce diverse and structurally consistent samples. However, a fundamental goal misalignment persists: diffusion models are optimized for generative likelihood rather than discriminative utility, resulting in over-concentration in high-density regions and inadequate coverage of boundary samples crucial for classification. To address this issue, we propose two complementary strategies. Inversion-Matching (IM) introduces an inversion-guided fine-tuning process that aligns denoising trajectories with their inversion counterparts, broadening distributional coverage and enhancing diversity. Selective Subgroup Sampling (S^3) is a training-free sampling mechanism that improves inter-class separability by selecting synthetic subsets that are both representative and distinctive. Extensive experiments demonstrate that our approach significantly enhances the discriminative quality and generalization of distilled datasets, achieving state-of-the-art performance among diffusion-based methods.
PaperID: 1480,   Poster  https://arxiv.org/pdf/2511.18121    
Authors: Ming Zhong, Yuanlei Wang, Liuzhou Zhang, Ruichuan An, Renrui Zhang, Hao Liang, Ming Lu, Ying Shen, Wentao Zhang
Title: VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging
Abstract: While Multimodal Large Language Models (MLLMs) excel on benchmarks, their processing paradigm differs from the human ability to integrate visual information. Unlike humans, who naturally bridge details and high-level concepts, models tend to treat these elements in isolation. Prevailing evaluation protocols often decouple low-level perception from high-level reasoning, overlooking their semantic and causal dependencies, which yields non-diagnostic results and obscures performance bottlenecks. We present VCU-Bridge, a framework that operationalizes a human-like hierarchy of visual connotation understanding: multi-level reasoning that advances from foundational perception through semantic bridging to abstract connotation, with an explicit evidence-to-inference trace from concrete cues to abstract conclusions. Building on this framework, we construct HVCU-Bench, a benchmark for hierarchical visual connotation understanding with explicit, level-wise diagnostics. Comprehensive experiments demonstrate a consistent decline in performance as reasoning progresses to higher levels. We further develop a data generation pipeline for instruction tuning guided by Monte Carlo Tree Search (MCTS) and show that strengthening low-level capabilities yields measurable gains at higher levels. Interestingly, it not only improves on HVCU-Bench but also brings benefits on general benchmarks (average +2.53%), especially with substantial gains on MMStar (+7.26%), demonstrating the significance of the hierarchical thinking pattern and its effectiveness in enhancing MLLM capabilities.
PaperID: 1481,   Poster  https://arxiv.org/pdf/2512.02015    
Authors: YaoChih Lee, Zhoutong Zhang, Gabriel Huang, Jui-Hsien Wang, Joon-Young Lee, Jia-Bin Huang, Eli Shechtman, Zhengqi Li
Title: Generative Video Motion Editing with 3D Point Tracks
Abstract: Camera and object motions are central to a video's narrative. However, precisely editing these captured motions remains a significant challenge, especially under complex object movements. Current motion-controlled image-to-video (I2V) approaches often lack full-scene context for consistent video editing, while video-to-video (V2V) methods provide viewpoint changes or basic object translation, but offer limited control over fine-grained object motion. We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a video generation model on a source video and paired 3D point tracks representing source and target motions. These 3D tracks establish sparse correspondences that transfer rich context from the source video to new motions while preserving spatiotemporal coherence. Crucially, compared to 2D tracks, 3D tracks provide explicit depth cues, allowing the model to resolve depth order and handle occlusions for precise motion editing. Trained in two stages on synthetic and real data, our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation, unlocking new creative potential in video editing.
PaperID: 1482,   Poster  https://arxiv.org/pdf/2512.10719    
Authors: Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, Andreas Zell
Title: SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
Abstract: End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating digit-by-digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark over existing VLM-based methods. Code will be released upon acceptance.
PaperID: 1483,   Poster  https://arxiv.org/pdf/2603.13162    
Authors: Junqi Shi, Ming Lu, Xingchen Li, Anle Ke, Ruiqi Zhang, Zhan Ma
Title: DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression
Abstract: Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead and high memory usage. Most existing diffusion codecs employ UNet architectures, where hierarchical downsampling forces diffusion to operate in shallow latent spaces (typically with only 8× spatial downscaling), resulting in excessive computation. In contrast, conventional VAE-based codecs work in much deeper latent domains (16×–64× downscaled), motivating a key question: Can diffusion operate effectively in such compact latent spaces without compromising reconstruction quality? To address this, we introduce DiT-IC, an Aligned Diffusion Transformer for Image Compression, which replaces the UNet with a Diffusion Transformer capable of performing diffusion in latent space entirely at 32× downscaled resolution. DiT-IC adapts a pretrained text-to-image multi-step DiT into a single-step reconstruction model through three key alignment mechanisms: (1) a variance-guided reconstruction flow that adapts denoising strength to latent uncertainty for efficient reconstruction; (2) a self-distillation alignment that enforces consistency with encoder-defined latent geometry to enable one-step diffusion; and (3) a latent-conditioned guidance that replaces text prompts with semantically aligned latent conditions, enabling text-free inference. With these designs, DiT-IC achieves state-of-the-art perceptual quality while offering up to 30× faster decoding and drastically lower memory usage than existing diffusion-based codecs. Remarkably, it can reconstruct 2048×2048 images on a 16 GB laptop GPU. Code will be released.
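Illustrative arithmetic (not from the paper): the gap between 8× and 32× spatial downscaling can be made concrete by counting latent positions for the 2048×2048 image size mentioned above. The helper `latent_positions` is hypothetical, and the count assumes one position per downscaled pixel with no further patchification inside the transformer.

```python
def latent_positions(height, width, downscale):
    """Count spatial latent positions when each side is divided by `downscale`."""
    return (height // downscale) * (width // downscale)


positions_8x = latent_positions(2048, 2048, 8)    # 256 * 256 = 65536
positions_32x = latent_positions(2048, 2048, 32)  # 64 * 64 = 4096
ratio = positions_8x // positions_32x             # 16 times fewer positions at 32x
```

Because self-attention cost grows roughly with the square of the sequence length, a 16-fold reduction in positions implies a much larger reduction in attention compute under this simplified count.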
PaperID: 1484,   Poster  https://arxiv.org/pdf/2512.21867    
Authors: Divyansh Srivastava, Akshay Mehra, Pranav Maneriker, Debopam Sanyal, Vishnu Raj, Vijay Kamarshi, Fan Du, Joshua Kimball
Title: DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation
Abstract: Decoder-only autoregressive image generation typically relies on fixed-length tokenization schemes whose token counts grow quadratically with resolution, substantially increasing the computational and memory demands of attention. We present DPAR, a novel decoder-only autoregressive model that dynamically aggregates image tokens into a variable number of patches for efficient image generation. Our work is the first to demonstrate that next-token prediction entropy from a lightweight and unsupervised autoregressive model provides a reliable criterion for merging tokens into larger patches based on information content. DPAR makes minimal modifications to the standard decoder architecture, ensuring compatibility with multimodal generation frameworks and allocating more compute to the generation of high-information image regions. Further, we demonstrate that training with dynamically sized patches yields representations that are robust to patch boundaries, allowing DPAR to scale to larger patch sizes at inference. DPAR reduces token count by 1.81x and 2.06x on ImageNet at 256 and 384 generation resolution, respectively, reducing training FLOPs by up to 40%. Further, our method exhibits faster convergence and improves FID by up to 27.1% relative to baseline models.
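Illustrative sketch (not from the paper): the entropy criterion described in the abstract above can be pictured as grouping runs of low-entropy tokens into one patch. The function name `merge_by_entropy`, the fixed `threshold`, and the `max_patch` cap are assumptions made for illustration; the abstract does not specify DPAR's actual merging rule.

```python
from typing import List, Tuple


def merge_by_entropy(
    entropies: List[float],
    threshold: float,
    max_patch: int,
) -> List[Tuple[int, int]]:
    """Group consecutive tokens into patches, returned as half-open (start, end) spans.

    Hypothetical rule: a token opens a new patch when its predicted entropy
    is at or above `threshold`, or when the current patch is already full.
    """
    if max_patch < 1:
        raise ValueError("max_patch must be at least 1")
    patches: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(entropies)):
        if entropies[i] >= threshold or i - start >= max_patch:
            patches.append((start, i))
            start = i
    if entropies:
        patches.append((start, len(entropies)))
    return patches


# Toy per-token entropies (made-up values, not model outputs).
entropies = [2.1, 0.2, 0.1, 0.3, 1.9, 0.2, 2.5, 0.1]
patches = merge_by_entropy(entropies, threshold=1.5, max_patch=4)
reduction = len(entropies) / len(patches)  # tokens per patch on average
```

In this toy input, high-entropy tokens start their own patches while the low-entropy tokens that follow are absorbed into them, so the sequence the decoder attends over becomes shorter.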
PaperID: 1485,   Poster  https://arxiv.org/pdf/2602.24161    
Authors: Chao Xu, Xiaochen Zhao, Xiang Deng, Jingxiang Sun, Donglin Di, Zhuo Su, Yebin Liu
Title: GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction
Abstract: Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diffusion models have enabled remarkable progress in image and video generation for avatar reconstruction, existing methods primarily rely on 2D priors and struggle to achieve consistent 3D geometry. We propose a novel framework that leverages geometry-aware diffusion to distill strong geometry priors for high-fidelity head avatar reconstruction. Our approach jointly synthesizes portrait images and corresponding surface normals, while a pose-free expression encoder captures implicit expression representations. Both synthesized images and expression latents are distilled into 3D Gaussian-based avatars, enabling photorealistic rendering with accurate geometry. Extensive experiments demonstrate that our method substantially outperforms state-of-the-art approaches in visual quality, expression fidelity, and cross-identity generalization, while supporting real-time rendering.
PaperID: 1486,   Poster  https://arxiv.org/pdf/2603.05330    
Authors: Andrew Y Guo, Anagh Malik, SaiKiran Tedla, Yutong Dai, Yiqian Qin, Zach Salehe, Benjamin Attal, Sotiris Nousias, Kiriakos Kutulakos, David B. Lindell
Title: Dark3R: Learning Structure from Motion in the Dark
Abstract: We introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below -4 dB, a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher–student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy-clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson–Gaussian noise model applied to well-exposed raw images. To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes ~42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structure from motion in the low-SNR regime. Further, we demonstrate state-of-the-art novel view synthesis in the dark using Dark3R's predicted poses and a coarse-to-fine radiance field optimization procedure.
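Illustrative sketch (not from the paper): a generic Poisson–Gaussian noise model of the kind the abstract above mentions for synthesizing noisy raw images. It adds signal-dependent shot noise and signal-independent read noise. The parameter names `gain` and `read_std`, their values, and the SNR definition (mean signal power over mean noise power) are assumptions for illustration, not the authors' calibration.

```python
import math
import random
from typing import List


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for the small means of low light."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def add_noise(clean: List[float], gain: float, read_std: float,
              rng: random.Random) -> List[float]:
    """Corrupt clean pixel values, given in digital numbers (DN).

    gain:     DN per photoelectron (hypothetical sensor parameter)
    read_std: standard deviation of read noise in DN (hypothetical)
    """
    return [
        gain * sample_poisson(value / gain, rng) + rng.gauss(0.0, read_std)
        for value in clean
    ]


def snr_db(clean: List[float], noisy: List[float]) -> float:
    """SNR in dB as mean signal power divided by mean noise power."""
    signal = sum(c * c for c in clean) / len(clean)
    noise = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal / noise)


rng = random.Random(0)
clean = [2.0] * 20000  # a flat, very dark patch
noisy = add_noise(clean, gain=1.0, read_std=3.0, rng=rng)
measured = snr_db(clean, noisy)
print(f"SNR: {measured:.1f} dB")
```

With these toy parameters the noise variance is gain × signal + read_std² = 2 + 9 = 11, so the expected SNR is 10·log10(4/11) ≈ -4.4 dB, which falls in the low-SNR regime the abstract refers to.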
PaperID: 1487,   Poster  https://arxiv.org/pdf/2602.19449    
Authors: Jason Wu, Tianchen Zhao, Chang Liu, Jiarui Cai, Zheng Zhang, Zhuowei Li, Aaditya Singh, Xiang Xu, Mani Srivastava, Jonathan Wu
Title: Decoupling Vision and Language: Codebook Anchored Visual Adaptation
Abstract: Large Vision–Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 14.98% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM’s linguistic capabilities and outperforming peer methods that operate on continuous tokens.
PaperID: 1488,   Poster  https://arxiv.org/pdf/2511.18834    
Authors: Lei Ke, Hubery Yin, Gongye Liu, Zhengyao Lv, Jingcai Guo, Chen Li, Wenhan Luo, Yujiu Yang, Jing LYU
Title: FlowSteer: Guiding Few-Step Image Synthesis with Authentic Trajectories
Abstract: With the success of flow matching in visual generation, sampling efficiency remains a critical bottleneck for its practical application. Among acceleration methods for flow models, ReFlow has been somewhat overlooked although it is theoretically consistent with flow matching. This is primarily due to its suboptimal performance in practical scenarios compared to consistency distillation and score distillation. In this work, we investigate this issue within the ReFlow framework and propose FlowSteer, a method that unlocks the potential of ReFlow-based distillation by guiding the student along the teacher's authentic generation trajectories. We first identify that Piecewised ReFlow's performance is hampered by a critical distribution mismatch during training and propose Online Trajectory Alignment (OTA) to resolve it. Then, we introduce an adversarial distillation objective applied directly on the ODE trajectory, improving the student's adherence to the teacher's generation trajectory. Furthermore, we find and fix a previously undiscovered flaw in the widely used FlowMatchEulerDiscreteScheduler that largely degrades few-step inference quality. Our experimental results on SD3 demonstrate our method's efficacy.
PaperID: 1489,   Poster  https://arxiv.org/pdf/2602.19605    
Authors: Chunlei Meng, Guanhong Huang, Rong Fu, Runmin.JIAN Runmin.JIAN, Zhongxue Gan, Chun Ouyang
Title: CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning
Abstract: information from multiple modalities. However, existing methods that project all modalities into a single latent space for fusion often overlook the asynchronous, multi-level semantic structure of multimodal data. This oversight induces semantic misalignment and error propagation, thereby degrading representation quality. To address this issue, we propose Cross-Level Co-Representation (CLCR), which explicitly organizes each modality's features into a three-level semantic hierarchy and specifies level-wise constraints for cross-modal interactions. First, a semantic hierarchy encoder aligns shallow, mid, and deep features across modalities, establishing a common basis for interaction. Then, at each level, an Intra-Level Co-Exchange Domain (IntraCED) factorizes features into shared and private subspaces and restricts cross-modal attention to the shared subspace via a learnable token budget. This design ensures that only shared semantics are exchanged and prevents leakage from private channels. To integrate information across levels, the Inter-Level Co-Aggregation Domain (InterCAD) synchronizes semantic scales using learned anchors, selectively fuses the shared representations, and gates private cues to form a compact task representation. We further introduce regularization terms to enforce separation of shared and private features and to minimize cross-level interference. Experiments on six benchmarks spanning emotion recognition, event localization, sentiment analysis, and action recognition show that CLCR achieves strong performance and generalizes well across tasks.
PaperID: 1490,   Poster  https://arxiv.org/pdf/2603.20236    
Authors: Mingchen Song, Xiang Deng, Wei Jie, Dongmei Jiang, Liqiang Nie, Weili Guan
Title: EnergyAction: Unimanual to Bimanual Composition with Energy-Based Models
Abstract: Recent advances in unimanual manipulation policies have achieved remarkable success across diverse robotic tasks through abundant training data and well-established model architectures. However, extending these capabilities to bimanual manipulation remains challenging due to the lack of bimanual demonstration data and the complexity of coordinating dual-arm actions. Existing approaches either rely on extensive bimanual datasets or fail to effectively leverage pre-trained unimanual policies. To address this limitation, we propose EnergyAction, a novel framework that compositionally transfers unimanual manipulation policies to bimanual tasks through Energy-Based Models (EBMs). Specifically, our method incorporates three key innovations. First, we model individual unimanual policies as EBMs and leverage their compositional properties to compose left and right arm actions, enabling the fusion of unimanual policies into a bimanual policy. Second, we introduce an energy-based temporal-spatial coordination mechanism through energy constraints, ensuring the generated bimanual actions are both temporally coherent and spatially feasible. Third, we propose two different energy-aware denoising strategies that dynamically adapt denoising steps based on action quality assessment. These strategies ensure the generation of high-quality actions while maintaining superior computational efficiency compared to fixed-step denoising approaches. Experimental results demonstrate that EnergyAction effectively transfers unimanual knowledge to bimanual tasks, achieving superior performance on both simulated and real-world tasks with minimal bimanual data.
PaperID: 1491,   Poster  https://arxiv.org/pdf/2512.14698    
Authors: Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Limin Wang
Title: TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Abstract: This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain underexplored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free Reinforcement Learning with Verifiable Rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in TimeLens-7B, an MLLM that achieves state-of-the-art performance among open-source models and even surpasses proprietary models such as GPT-4o and GPT-5. All code, data, and models will be released to facilitate future research.
PaperID: 1492,   Poster  https://arxiv.org/pdf/2512.03182    
Authors: Yasser Taha, Grégoire Montavon, Nils Körber
Title: Drainage: A Unifying Framework for Addressing Class Uncertainty
Abstract: Modern deep learning faces significant challenges with noisy labels, class ambiguity, and the need to robustly reject out-of-distribution or corrupted samples. In this work, we propose a unified framework based on the concept of a "drainage node" which we add at the output of the network. The node serves to reallocate probability mass toward uncertainty, while preserving desirable properties such as end-to-end training and differentiability. This mechanism provides a natural escape route for highly ambiguous, anomalous, or noisy samples, and is particularly relevant for instance-dependent and asymmetric label noise. In systematic experiments involving the addition of varying proportions of instance-dependent noise or asymmetric noise to CIFAR-10/100 labels, our drainage formulation achieves an accuracy increase of up to 9% over existing approaches in the high-noise regime. Our results on real-world datasets, such as mini-WebVision, mini-ImageNet and Clothing-1M, match or surpass existing state-of-the-art methods. Qualitative analysis reveals a denoising effect, where the drainage neuron consistently absorbs corrupt, mislabeled, or outlier data, leading to more stable decision boundaries. Furthermore, our drainage formulation enables applications well beyond classification, with immediate benefits for web-scale, semi-supervised dataset cleaning, and open-set applications.
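Illustrative sketch (not from the paper): one way to picture an extra output node that absorbs probability mass is a softmax over K class logits plus one additional logit. The layout (drainage logit last), the function name `predict_with_drainage`, and the rejection rule are assumptions for illustration; the abstract does not describe the training loss or the inference rule the authors use.

```python
import math
from typing import List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_drainage(logits: List[float]) -> Optional[int]:
    """Treat the last logit as the drainage node (assumed layout).

    Returns the predicted class index, or None when the drainage node
    holds the most probability mass (hypothetical rejection rule).
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return None if best == len(probs) - 1 else best


# Three real classes followed by one drainage logit (made-up values).
confident = predict_with_drainage([4.0, 0.5, 0.1, -1.0])
ambiguous = predict_with_drainage([0.6, 0.5, 0.4, 2.0])
```

Because the extra node sits inside the same softmax, any mass it takes is removed from the real classes, and the whole output stays differentiable.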
PaperID: 1493,   Poster  https://arxiv.org/pdf/2508.04182    
Authors: Peizheng Guo, Jingyao Wang, Wenwen Qiang, Jiahuan Zhou, Changwen Zheng, Gang Hua
Title: COPO: Causal-Oriented Policy Optimization for Hallucinations of MLLMs
Abstract: Although Multimodal Large Language Models (MLLMs) have shown impressive capabilities, they may suffer from hallucinations. Empirically, we find that MLLMs attend disproportionately to task-irrelevant background regions compared with text-only LLMs, implying spurious background-answer correlations. We claim and analyze that (i) outcome-based rewards can be an important factor leading to spurious correlations, and (ii) spurious correlations can be an important factor leading to hallucinations. Based on these results, we propose Causal-Oriented Policy Optimization (COPO) to mitigate these spurious correlations, thus addressing the issue of hallucinations. It imposes token-level sufficiency and necessity constraints to measure each inference token's causal contribution, thus ensuring correct and evidence-grounded output. Specifically, we first evaluate each token's causal contribution via a newly proposed causal completeness reward. This reward is then used to construct a causally informed advantage function within the GRPO optimization framework, encouraging the model to focus on tokens that are causally sufficient and necessary for accurate generation. Experimental results across various benchmarks demonstrate the advantages of COPO.
PaperID: 1494,   Poster  https://arxiv.org/pdf/2603.12013    
Authors: Zhengdong Zhu, Weiyi Xue, Zuyuan Yang, Wenlve Zhou, Zhiheng Zhou
Title: Pano360: Perspective to Panoramic Vision with Geometric Consistency
Abstract: Prior panorama stitching approaches heavily rely on pairwise feature correspondences and are unable to leverage geometric consistency across multiple views. This leads to severe distortion and misalignment, especially in challenging scenes with weak textures, large parallax, and repetitive patterns. Given that multi-view geometric correspondences can be directly constructed in 3D space, making them more accurate and globally consistent, we extend the 2D alignment task to the 3D photogrammetric space. We adopt a novel transformer-based architecture to achieve 3D awareness and aggregate global information across all views. It directly utilizes camera poses to guide image warping for global alignment in 3D space and employs a multi-feature joint optimization strategy to compute the seams. Additionally, to establish an evaluation benchmark and train our network, we collected a large-scale dataset of real-world scenes. Extensive experiments show that our method significantly outperforms existing alternatives in alignment accuracy and perceptual quality.
PaperID: 1495,   Poster  https://arxiv.org/pdf/2601.05640    
Authors: Jingyu Li, Junjie Wu, Dongnan Hu, Xiangkai Huang, Bin Sun, Zhihui Hao, XianPeng Lang, Xiatian Zhu, Li Zhang
Title: SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving
Abstract: Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.
PaperID: 1496,   Poster  https://arxiv.org/pdf/2511.13648    
Authors: Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu
Title: PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
Abstract: 3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193×, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2× and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.
PaperID: 1497,   Poster  https://arxiv.org/pdf/2512.22009    
Authors: Sarthak Mehrotra, Sairam Rebbapragada, Mani Bonthu, Vineeth Balasubramanian
Title: iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception
Abstract: Multimodal Large Language Models (MLLMs) show strong potential for interpreting and interacting with complex, pixel-rich Graphical User Interface (GUI) environments. However, building agents that are both efficient for high-level tasks and precise for fine-grained interactions remains challenging. GUI agents must perform routine actions efficiently while also handling tasks that demand exact visual grounding, yet existing approaches struggle when accuracy depends on identifying specific interface elements. These MLLMs also remain large and cannot adapt their reasoning depth to the task at hand. In this work, we introduce iSHIFT: Implicit Slow–fast Hybrid Inference with Flexible Tokens, a lightweight agent that integrates latent thinking (implicit chain-of-thought) with a perception control module. iSHIFT enables an MLLM to switch between a slow mode, which leverages detailed visual grounding for high precision, and a fast mode, which uses global cues for efficiency. Special perception tokens guide attention to relevant screen regions, allowing the model to decide both how to reason and where to focus. Despite its compact 2.5B size, iSHIFT matches state-of-the-art performance on multiple benchmark datasets.
PaperID: 1498,   Poster  https://arxiv.org/pdf/2601.10716    
Authors: Xuweiyi Chen, Wentao Zhou, Zezhou Cheng
Title: WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments
Abstract: We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments, where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, causing ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-view transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.
PaperID: 1499,   Poster  https://arxiv.org/pdf/2505.20279    
Authors: Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Peihao Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, Rakesh Ranjan
Title: VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Abstract: The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has sparked interest in extending these models to 3D scenes, with the goal of human-like visual-spatial intelligence. However, achieving deep spatial understanding comparable to human capabilities remains challenging for both model design and data acquisition. Existing methods often rely on external depth sensors for geometry capture or off-the-shelf algorithms for pre-constructing 3D maps, which limits their scalability. In this work, we introduce VLM-3R, a framework for Vision-Language Models that couples 3D reconstructive instruction tuning with scalable training data curation and a new benchmark for temporal reasoning. Specifically, VLM-3R processes monocular video frames with a geometry encoder that derives implicit 3D tokens representing scene context (spatial tokens) and camera motion (view tokens). In parallel, we build a scalable data creation pipeline with over 200K 3D reconstructive instruction-tuning question-answer pairs. To evaluate temporal reasoning, we further introduce the Vision-Spatial-Temporal Intelligence benchmark (VSTI-Bench), which contains over 138.6K question-answer pairs across five distinct tasks focused on evolving spatial relationships. Extensive experiments show that VLM-3R supports robust visual-spatial reasoning and improves the understanding of temporal 3D context changes, enabling monocular 3D spatial assistance and embodied reasoning.
PaperID: 1500,   Poster  https://arxiv.org/pdf/2511.17340    
Authors: Yue Yin, Enze Tao, Dylan Campbell
Title: Refracting Reality: Generating Images with Realistic Transparent Objects
Abstract: Generative image models can produce convincingly real images, with plausible shapes, textures, layouts and lighting. However, one domain in which they perform notably poorly is the synthesis of transparent objects, which exhibit refraction, reflection, absorption and scattering. Refraction is a particular challenge, because refracted pixel rays often intersect with surfaces observed in other parts of the image, providing a constraint on the color. It is clear from inspection that generative models have not distilled the laws of optics sufficiently well to accurately render refractive objects. In this work, we consider the problem of generating images with accurate refraction, given a text prompt. We synchronize the pixels within the object's boundary with those outside by warping and merging the pixels using Snell's Law of Refraction, at each step of the generation trajectory. For those surfaces that are not directly observed in the image, but are visible via refraction or reflection, we recover their appearance by synchronizing the image with a second generated image, a panorama centered at the object, using the same warping and merging procedure. We demonstrate that our approach generates much more optically plausible images that respect the physical constraints.
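Illustrative sketch (not from the paper): the abstract above relies on Snell's law, n1·sin(θ1) = n2·sin(θ2), to decide where a refracted pixel ray lands. The standard vector form of that law is shown below for a single ray. The function name `refract` and the example refractive indices (1.0 for air, 1.5 for glass) are assumptions for illustration; the paper's warping and merging procedure is not reproduced here.

```python
import math
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def refract(direction: Vec3, normal: Vec3, n1: float, n2: float) -> Optional[Vec3]:
    """Refract a unit ray direction at a surface with a unit normal.

    The normal must point toward the side the ray arrives from.
    Returns None when total internal reflection occurs.
    """
    eta = n1 / n2
    cos_i = -sum(d * n for d, n in zip(direction, normal))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)  # Snell's law, squared
    if sin2_t > 1.0:
        return None
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(
        eta * d + (eta * cos_i - cos_t) * n for d, n in zip(direction, normal)
    )


# A ray hitting a horizontal air-to-glass surface at 45 degrees.
angle = math.pi / 4
incoming = (math.sin(angle), 0.0, -math.cos(angle))
up = (0.0, 0.0, 1.0)
bent = refract(incoming, up, n1=1.0, n2=1.5)

# A steep ray leaving glass for air is totally internally reflected.
steep = (math.sin(math.pi / 3), 0.0, -math.cos(math.pi / 3))
trapped = refract(steep, up, n1=1.5, n2=1.0)
```

The x-component of `bent` equals sin(θ2), so multiplying it by 1.5 recovers sin(45°), which is a quick check that the sketch satisfies Snell's law.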
PaperID: 1501,   Poster  https://arxiv.org/pdf/2509.09501    
Authors: Yingxuan Li, Jiafeng Mao, Qianru Qiu, Yusuke Matsui
Title: Region-Wise Correspondence Prediction between Manga Line Art Images
Abstract: Understanding region-wise correspondences between manga line art images is fundamental for high-level manga processing, supporting downstream tasks such as line art colorization and in-between frame generation. Unlike natural images that contain rich visual cues, manga line art consists only of sparse black-and-white strokes, making it challenging to determine which regions correspond across images. In this work, we introduce a new task: predicting region-wise correspondence between raw manga line art images without any annotations. To address this problem, we propose a Transformer-based framework trained on large-scale, automatically generated region correspondences. The model learns to suppress noisy matches and strengthen consistent structural relationships, resulting in robust patch-level feature alignment within and across images. During inference, our method segments each line art and establishes coherent region-level correspondences through edge-aware clustering and region matching. We construct manually annotated benchmarks for evaluation, and experiments across multiple datasets demonstrate both high patch-level accuracy and strong region-level correspondence performance, achieving 78.4-84.4% region-level accuracy. These results highlight the potential of our method for real-world manga and animation applications.
PaperID: 1502,   Poster  https://arxiv.org/pdf/2601.01720    
Authors: Xijie Huang, Chengming Xu, Donghao Luo, Xiaobin Hu, Peng Tang, Xu Peng, Jiangning Zhang, Chengjie Wang, Yanwei Fu
Title: FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing
Abstract: First-Frame Propagation (FFP) offers a promising paradigm for controllable video editing, but existing methods are hampered by a reliance on cumbersome run-time guidance. We identify the root cause of this limitation as the inadequacy of current training datasets, which are often too short, low-resolution, and lacking the task diversity required to teach robust temporal priors. To address this foundational data gap, we first introduce FFP-300K, a new large-scale dataset comprising 300K high-fidelity video pairs at 720p resolution and 81 frames in length, constructed via a principled two-track pipeline for diverse local and global edits. Building on this dataset, we propose a novel framework designed for true guidance-free FFP that resolves the critical tension between maintaining first-frame appearance and preserving source video motion. Architecturally, we introduce Adaptive Spatio-Temporal RoPE (AST-RoPE), which dynamically remaps positional encodings to disentangle appearance and motion references. At the objective level, we employ a self-distillation strategy where an identity propagation task acts as a powerful regularizer, ensuring long-term temporal stability and preventing semantic drift. Comprehensive experiments on the EditVerseBench benchmark demonstrate that our method significantly outperforms existing academic and commercial models, with improvements of about 0.2 in PickScore and 0.3 in VLM score over these competitors.
PaperID: 1503,   Poster  https://arxiv.org/pdf/2604.12270    
Authors: Huang yuan, Sijie Zhao, Jing Cheng, HaoXu HaoXu, Shaohui Jiao
Title: DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
Abstract: Stereo video inpainting, which aims to fill the occluded regions of warped videos with visually coherent content while maintaining temporal consistency, remains a challenging open problem. The regions to be filled are scattered along object boundaries and occupy only a small fraction of each frame, leading to two key challenges. First, existing approaches perform poorly on such tasks due to the scarcity of high-quality stereo inpainting datasets, which limits their ability to learn effective inpainting priors. Second, these methods apply equal processing to all regions of the frame, even though most pixels require no modification, resulting in substantial redundant computation. To address these issues, we introduce three interconnected components. We first propose Gradient-Aware Parallax Warping (GAPW), which leverages backward warping and the gradient of the coordinate mapping function to obtain continuous edges and smooth occlusion regions. Then, a Parallax-Based Dual Projection (PBDP) strategy is introduced, which incorporates GAPW to produce geometrically consistent stereo inpainting pairs and accurate occlusion masks without requiring stereo videos. Finally, we present Sparsity-Aware Stereo Inpainting (SASI), which reduces over 70% of redundant tokens, achieving a 10.7× speedup during diffusion inference and delivering results comparable to its full-computation counterpart, enabling real-time processing of HD (768×1280) videos at 25 FPS on a single A100 GPU.
PaperID: 1504,   Poster  https://arxiv.org/pdf/2511.20525    
Authors: Yayuan Li, Aadit Jain, Filippos Bellos, Jason Corso
Title: Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos
Abstract: We introduce Mistake Attribution (MATT), a task for fine-grained understanding of human mistakes in egocentric video. Unlike prior mistake understanding work, which lacks fine-grained output, MATT concretely attributes mistakes to the input instruction text or the attempt video. MATT determines what part of the instruction is violated (semantic role), when the deviation becomes irreversible (the Point-of-No-Return, PNR), and where the mistake appears in the PNR frame. We develop MisEngine, a data engine that automatically constructs attribution-rich mistake samples from existing datasets and inherits their annotations. Applied to large egocentric corpora, MisEngine yields EPIC-KITCHENS-M and Ego4D-M, two datasets that are up to two orders of magnitude larger than prior mistake datasets. We then present MisFormer, a unified attention-based model for mistake attribution across semantic (what), temporal (when), and spatial (where) dimensions, trained using MisEngine supervision. Experiments on our new datasets and prior benchmarks show that MisFormer outperforms strong video-language, temporal localization, hand-object interaction, and mistake-detection baselines.
PaperID: 1505,   Poster  https://arxiv.org/pdf/2603.23463    
Authors: Hong Duc Vu, Kien Nguyen, Trong-Tung Nguyen, Ngan Nguyen, Phong Nguyen, Khoi Nguyen, Cuong Pham, Anh Tran
Title: InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting
Abstract: Recent diffusion-based models achieve photorealism in image inpainting but require many sampling steps, limiting practical use. Few-step text-to-image models offer faster generation, but naively applying them to inpainting yields poor harmonization and artifacts between the background and inpainted region. We trace this problem to random Gaussian noise initialization, which with few function evaluations causes semantic misalignment and reduced fidelity. To overcome this, we propose InverFill, a one-step inversion method tailored for inpainting that injects semantic information from the target masked image into the initial noise, enabling high-fidelity few-step inpainting. Instead of training inpainting models, InverFill leverages few-step text-to-image models in a blended sampling pipeline with semantically aligned noise as inputs, significantly improving vanilla blended sampling and even matching specialized inpainting models at low NFEs. Moreover, InverFill requires no real-image supervision and adds only minimal inference overhead. Extensive experiments show that InverFill consistently boosts baseline few-step models, improving image quality and text coherence without costly retraining or heavy iterative optimization.
PaperID: 1506,   Poster  https://arxiv.org/pdf/2601.03357    
Authors: Yingyan Xu, Pramod Rao, Sebastian Weiss, Gaspard Zoss, Markus Gross, Christian Theobalt, Marc Habermann, Derek Bradley
Title: RelightAnyone: A Generalized Relightable 3D Gaussian Head Model
Abstract: 3D Gaussian Splatting (3DGS) has become a standard approach to reconstruct and render photorealistic 3D head avatars. A major challenge is to relight the avatars to match any scene illumination. For high-quality relighting, existing methods require subjects to be captured under complex time-multiplexed illumination, such as one-light-at-a-time (OLAT). We propose a new generalized relightable 3D Gaussian head model that can relight any subject observed in single- or multi-view images without requiring OLAT data for that subject. Our core idea is to learn a mapping from flat-lit 3DGS avatars to the corresponding relightable Gaussian parameters for that avatar. Our model consists of two stages: a first stage that models flat-lit 3DGS avatars without OLAT lighting, and a second stage that learns the mapping to physically-based reflectance parameters for high-quality relighting. This two-stage design allows us to train the first stage across diverse existing multi-view datasets without OLAT lighting, ensuring cross-subject generalization, where we learn a dataset-specific lighting code for self-supervised lighting alignment. Subsequently, the second stage can be trained on a significantly smaller dataset of subjects captured under OLAT illumination. Together, this allows our method to generalize well and relight any subject from the first stage as if we had captured them under OLAT lighting. Furthermore, we can fit our model to unseen subjects from as little as a single image, allowing several applications in novel view synthesis and relighting for digital avatars.
PaperID: 1507,   Poster  https://arxiv.org/pdf/2511.18801    
Authors: Yichen Yang, Hong Li, Haodong Zhu, Linlin Yang, guojun lei, Sheng Xu, Baochang Zhang
Title: PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion
Abstract: Existing autoregressive (AR) methods for generating artist-designed meshes struggle to balance global structural consistency with high-fidelity local details, and are susceptible to error accumulation. To address this, we propose PartDiffuser, a novel semi-autoregressive diffusion framework for point-cloud-to-mesh generation. The method first performs semantic segmentation on the mesh and then operates in a "part-wise" manner: it employs autoregression between parts to ensure global topology, while utilizing a parallel discrete diffusion process within each semantic part to precisely reconstruct high-frequency geometric features. PartDiffuser is based on the DiT architecture and introduces a part-aware cross-attention mechanism, using point clouds as hierarchical geometric conditioning to dynamically control the generation process, thereby effectively decoupling the global and local generation tasks. Experiments demonstrate that this method significantly outperforms state-of-the-art (SOTA) models in generating 3D meshes with rich detail, exhibiting exceptional detail representation suitable for real-world applications.
PaperID: 1508,   Poster  https://arxiv.org/pdf/2511.19236    
Authors: Yuxuan Wang, Haobin Jiang, Shiqing Yao, Ziluo Ding, Zongqing Lu
Title: End-to-End Language-Action Model for Humanoid Whole Body Control
Abstract: Existing humanoid control systems often rely on teleoperation or modular generation pipelines that separate language understanding from physical execution. However, the former is entirely human-driven, and the latter lacks tight alignment between language commands and physical behaviors. In this paper, we present SENTINEL, a fully end-to-end language–action model for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation using a pretrained whole-body controller, combined with their text annotations. The model directly maps language commands and proprioceptive inputs to low-level actions without any intermediate representation. The model generates action chunks using flow matching, which can be subsequently refined by a residual action head for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and real-world deployment, and also supports multi-modal extensions by converting inputs into text.
PaperID: 1509,   Poster  https://arxiv.org/pdf/2603.10538    
Authors: Julian Lorenz, Vladyslav Kovganko, Elias Kohout, Mrunmai Phatak, Daniel Kienzle, Rainer Lienhart
Title: DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime
Abstract: Scene Graph Generation (SGG) aims to extract a detailed graph structure from an image, a representation that holds significant promise as a robust intermediate step for complex downstream tasks like reasoning for embodied agents. However, practical deployment in real-world applications, especially on resource-constrained edge devices, requires speed and resource efficiency, challenges that have received limited attention in existing research. To bridge this gap, we introduce DSFlash, a low-latency model for panoptic scene graph generation designed to overcome these limitations. DSFlash can process a video stream at 56 frames per second on a standard RTX 3090 GPU, without compromising performance against existing state-of-the-art methods. Crucially, unlike prior approaches that often restrict themselves to salient relationships, DSFlash computes comprehensive scene graphs, offering richer contextual information while maintaining its superior latency. Furthermore, DSFlash is light on resources, requiring less than 24 hours to train on a single, nine-year-old GTX 1080 GPU. This accessibility makes DSFlash particularly well-suited for researchers and practitioners operating with limited computational resources, empowering them to adapt and fine-tune SGG models for specialized applications.
PaperID: 1510,   Poster  https://arxiv.org/pdf/2512.17302    
Authors: Kyeongmin Yeo, Yunhong Min, Jaihoon Kim, Minhyuk Sung
Title: MatLat: Material Latent Space for PBR Texture Generation
Abstract: We propose a generative framework for producing high-quality PBR textures on a given 3D mesh. As large-scale PBR texture datasets are scarce, our approach focuses on effectively leveraging the embedding space and diffusion priors of pretrained latent image generative models while learning a material latent space, MatLat, through targeted fine-tuning. Unlike prior methods that freeze the embedding network and thus lead to distribution shifts when encoding additional PBR channels and hinder subsequent diffusion training, we fine-tune the pretrained VAE so that new material channels can be incorporated with minimal latent distribution deviation. We further show that correspondence-aware attention alone is insufficient for cross-view consistency unless the latent-to-image mapping preserves locality. To enforce this locality, we introduce a regularization in the VAE fine-tuning that crops latent patches, decodes them, and aligns the corresponding image regions to maintain strong pixel–latent spatial correspondence. Ablation studies and comparisons with previous baselines demonstrate that our framework improves PBR texture fidelity and that each component is critical for achieving state-of-the-art performance.
PaperID: 1511,   Poster  https://arxiv.org/pdf/2603.18541    
Authors: Yongwei Jiang, Yixiong Zou, Yuhua Li, Ruixuan Li
Title: Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection
Abstract: Cross-domain few-shot object detection (CD-FSOD) aims to adapt pretrained detectors from a source domain to target domains with limited annotations, suffering from severe domain shifts and data scarcity problems. In this work, we find a previously overlooked phenomenon: models exhibit dispersed and unfocused attention in target domains, leading to imprecise localization and redundant predictions, much like a human who cannot focus on visual objects. Therefore, we call it the target-domain Astigmatism problem. Analysis of attention distances across transformer layers reveals that regular fine-tuning inherently shows a trend to remedy this problem, but the results remain far from satisfactory, which we aim to improve in this paper. Biologically inspired by the human fovea-style visual system, we enhance the fine-tuning's inherent trend through a center-periphery attention refinement framework, which contains (1) a Positive Pattern Refinement module to reshape attention toward semantic objects using class-specific prototypes, simulating the visual center region; (2) a Negative Context Modulation module to enhance boundary discrimination by modeling background context, simulating the visual periphery region; and (3) a Textual Semantic Alignment module to strengthen center-periphery distinction through cross-modal cues. Our bio-inspired approach transforms astigmatic attention into focused patterns, substantially improving adaptation to target domains. Experiments on six challenging CD-FSOD benchmarks consistently demonstrate improved detection accuracy and establish new state-of-the-art results.
PaperID: 1512,   Poster  https://arxiv.org/pdf/2511.06702    
Authors: Yifan Wang, Yian Zhao, Fanqi Pu, Xiaochen Yang, YANG TANG, Xi Chen, Wenming Yang
Title: SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection
Abstract: Existing monocular 3D detectors typically tame the pronounced nonlinear regression of 3D bounding boxes through a decoupled prediction paradigm, which employs multiple branches to estimate geometric center, depth, dimensions, and rotation angle separately. Although this decoupling strategy simplifies the learning process, it inherently ignores the geometric collaborative constraints between different attributes, resulting in the lack of a geometric consistency prior and thereby leading to suboptimal performance. To address this issue, we propose a novel Spatial-Projection Alignment (SPAN) with two pivotal components: (i) Spatial Point Alignment enforces an explicit global spatial constraint between the predicted and ground-truth 3D bounding boxes, thereby rectifying spatial drift caused by decoupled attribute regression. (ii) 3D-2D Projection Alignment ensures that the projected 3D box is aligned tightly within its corresponding 2D detection bounding box on the image plane, mitigating projection misalignment overlooked in previous works. To ensure training stability, we further introduce a Hierarchical Task Learning strategy that progressively incorporates spatial-projection alignment as 3D attribute predictions refine, preventing early-stage error propagation across attributes. Extensive experiments demonstrate that the proposed method can be easily integrated into any established monocular 3D detector and delivers significant performance improvements.
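The 3D-2D projection alignment idea can be illustrated with a toy computation. The sketch below is not the authors' loss. It assumes a pinhole camera, camera coordinates with z pointing forward, a box given by its center, (length, height, width) dimensions, and a yaw angle about the vertical axis, and a plain mean L1 gap between box edges. All function names and conventions here are hypothetical.

```python
import math


def box_corners(center, dims, yaw):
    """Return the 8 corners of a 3D box in camera coordinates.

    Assumed convention (not from the abstract): dims = (length, height, width),
    length along x, height along y, width along z, yaw rotates about the y axis.
    """
    cx, cy, cz = center
    length, height, width = dims
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                x, y, z = sx * length, sy * height, sz * width
                # Rotate the offset about the vertical (y) axis, then translate.
                xr = math.cos(yaw) * x + math.sin(yaw) * z
                zr = -math.sin(yaw) * x + math.cos(yaw) * z
                corners.append((cx + xr, cy + y, cz + zr))
    return corners


def project_to_2d_box(corners, fx, fy, u0, v0):
    """Project corners with a pinhole model and return the enclosing 2D box."""
    if any(z <= 0.0 for _, _, z in corners):
        raise ValueError("all corners must lie in front of the camera (z > 0)")
    us = [fx * x / z + u0 for x, _, z in corners]
    vs = [fy * y / z + v0 for _, y, z in corners]
    return (min(us), min(vs), max(us), max(vs))


def projection_alignment_l1(projected_box, detected_box):
    """Mean absolute gap between the edges of two (u_min, v_min, u_max, v_max) boxes."""
    return sum(abs(a - b) for a, b in zip(projected_box, detected_box)) / 4.0
```

A value of zero means the projected 3D box and the 2D detection box share the same edges; any mismatch in depth, size, or rotation shows up as a positive gap.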
PaperID: 1513,   Poster  https://arxiv.org/pdf/2511.16546    
Authors: Xiaoyue Chen, Yuling Shi, kaiyuan Li, Huandong Wang, Yong Li, Xiaodong Gu, Xinlei Chen, Mingbao Lin
Title: Progressive Supernet Training for Efficient Visual Autoregressive Modeling
Abstract: Visual Autoregressive (VAR) models have demonstrated competitive performance with diffusion models in image generation by adopting a "next-scale" prediction paradigm that significantly reduces inference steps. However, VAR's progressive multi-scale generation leads to severe memory overhead due to KV cache accumulation across all scales, limiting practical deployment. Existing solutions either require training and deploying multiple specialized models or sacrifice generation quality. We observe a critical scale-depth asymmetric dependency in VAR: small scales (low-resolution tokens) are highly sensitive to network depth and require deep layers to capture global semantic information, while large scales (high-resolution tokens) exhibit remarkable robustness to depth reduction. Motivated by this insight, we propose VARiant, a unified supernet framework that enables dynamic depth adjustment within a single model through parameter sharing. Our approach employs an even-spacing layer selection strategy to construct quality-preserving subnetworks, and introduces a dynamic-ratio progressive training strategy that gradually transitions from joint optimization (full network to subnetwork ratio 2:8) to subnetwork optimization (ratio 10:0), effectively resolving the inherent optimization conflicts between full network and subnetworks in supernet training. Extensive experiments on ImageNet demonstrate that our method achieves Pareto-optimal trade-offs between generation quality and inference efficiency: by using full depth (30 layers) for the first 7 scales and a 16-layer subnetwork (50% depth) for subsequent scales, we obtain 50% cache reduction and 1.8× inference speedup with minimal quality loss (FID increases by only 0.3). Unlike approaches requiring deployment of multiple models, our single-model solution eliminates deployment complexity, supports zero-cost runtime depth switching, and seamlessly integrates into standard transformer inference frameworks, making it highly practical for resource-constrained scenarios.
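As a rough illustration of the even-spacing layer selection and the per-scale depth switch described above, the sketch below picks 16 of 30 layer indices at roughly equal intervals and uses the full stack for the first 7 scales. The abstract does not give the exact selection rule, so the rounding scheme, the choice to always keep the first and last layers, the zero-based scale index, and both function names are assumptions.

```python
def even_spaced_layers(total_layers, keep):
    """Pick `keep` layer indices spread evenly over range(total_layers).

    Assumed rule: always keep the first and last layers and round the
    intermediate positions to the nearest index.
    """
    if not 1 <= keep <= total_layers:
        raise ValueError("keep must be between 1 and total_layers")
    if keep == 1:
        return [0]
    step = (total_layers - 1) / (keep - 1)
    # int(x + 0.5) rounds half up; with step >= 1 the indices stay distinct.
    return [int(i * step + 0.5) for i in range(keep)]


def layers_for_scale(scale_index, total_layers=30, subnet_layers=16, full_depth_scales=7):
    """Full depth for the early (small) scales, the shallower subnetwork afterwards."""
    if scale_index < full_depth_scales:
        return list(range(total_layers))
    return even_spaced_layers(total_layers, subnet_layers)
```

Because the subnetwork reuses a subset of the full network's layers, switching depth between scales needs no second set of weights, which is the property the abstract refers to as parameter sharing.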
PaperID: 1514,   Poster  https://arxiv.org/pdf/2603.01398    
Authors: junwei zeng, Dong Liang, Sheng-Jun Huang, Kun Zhan, Songcan Chen
Title: Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis
Abstract: Atmospheric turbulence significantly degrades long-range imaging by introducing geometric warping and exposure-time-dependent blur, which adversely affects both visual quality and the performance of high-level vision tasks. Existing methods for synthesizing turbulence effects often oversimplify the relationship between blur and exposure time, typically assuming fixed or binary exposure settings. This leads to unrealistic synthetic data and limited generalization capability of trained models. To address this gap, we revisit the modulation transfer function (MTF) formulation and propose a novel Exposure-Time-dependent MTF (ET-MTF) that models blur as a continuous function of exposure time. For blur synthesis, we derive a tilt-invariant point spread function (PSF) from the ET-MTF, which, when integrated with a spatially varying blur-width field, provides a comprehensive and physically accurate characterization of turbulence-induced blur. Building on this synthesis pipeline, we construct ET-Turb, a large-scale synthetic turbulence dataset that explicitly incorporates continuous exposure-time modeling across diverse optical and atmospheric conditions. The dataset comprises 5,083 videos (2,005,835 frames), partitioned into 3,988 training and 1,095 test videos. Extensive experiments demonstrate that models trained on ET-Turb produce more realistic restorations and achieve superior generalization on real-world turbulence data compared to those trained on other datasets. Our dataset will be publicly released upon acceptance.
PaperID: 1515,   Poster  https://arxiv.org/pdf/2512.02982    
Authors: Xiang Xu, Alan Liang, Youquan Liu, Linfeng Li, Lingdong Kong, Ziwei Liu, Qingshan Liu
Title: U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences
Abstract: Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present U4D, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a "hard-to-easy" manner through two sequential stages: (1) uncertainty-region modeling, which reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion, which synthesizes the remaining areas under learned structural priors. To further ensure temporal coherence, U4D incorporates a Mixture of Spatio-Temporal (MoST) block that adaptively fuses spatial and temporal representations during diffusion. Extensive experiments show that U4D produces geometrically faithful and temporally consistent LiDAR sequences, advancing the reliability of 4D world modeling for autonomous perception and simulation.
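One common way to turn segmentation outputs into an uncertainty map is the Shannon entropy of the per-point class probabilities. The sketch below assumes that reading of "high-entropy regions"; the abstract does not state the exact estimator, and the threshold and function names are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Entropy (in nats) of one class-probability vector; zero entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertain_mask(per_point_probs, threshold):
    """Flag points whose predictive entropy exceeds a chosen threshold (hypothetical value)."""
    return [shannon_entropy(p) > threshold for p in per_point_probs]
```

Under this reading, the flagged points would form the "hard" regions generated first, and the unflagged remainder would be completed in the second stage.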
PaperID: 1516,   Poster  https://arxiv.org/pdf/2506.05046    
Authors: Guangzhao Li, Yanming Yang, Chenxi Song, Xiaohong Liu, Chi Zhang
Title: FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing
Abstract: Text-driven video editing aims to modify video content based on natural language instructions. While recent training-free methods have leveraged pretrained diffusion models, they often rely on an inversion-editing paradigm. This paradigm maps the video to a latent space before editing. However, the inversion process is not perfectly accurate, often compromising appearance fidelity and motion consistency. To address this, we introduce FlowDirector, a novel training-free and inversion-free video editing framework. Our framework models the editing process as a direct evolution in the data space. It guides the video to transition smoothly along its inherent spatio-temporal manifold using an ordinary differential equation (ODE), thereby avoiding the inaccurate inversion step. From this foundation, we introduce three flow correction strategies for appearance, motion, and stability: 1) Direction-aware flow correction amplifies components that oppose the source direction and removes irrelevant terms, breaking conservative streamlines and enabling stronger structural and textural changes. 2) Motion–appearance decoupling optimizes motion agreement as an energy term at each timestep, significantly improving consistency and motion transfer. 3) Differential averaging guidance strategy leverages differences among multiple candidate flows to approximate a low-variance regime at low cost, suppressing artifacts and stabilizing the trajectory. Extensive experiments across various editing tasks and benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction following, temporal consistency, and background preservation, establishing an efficient new paradigm for coherent video editing without inversion.
PaperID: 1517,   Poster  https://arxiv.org/pdf/2601.13664    
Authors: Tiancheng Fang, Bowen Pan, Lingxi Chen, Jiangjing Lyu, Chengfei Lv, Chaoyue Niu, Fan Wu
Title: VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement
Abstract: We propose VIAFormer, a Voxel-Image Alignment Transformer model designed for Multi-view Conditioned Voxel Refinement, the task of repairing incomplete noisy voxels using calibrated multi-view images as guidance. Its effectiveness stems from a synergistic design: an Image Index that provides explicit 3D spatial grounding for 2D image tokens, a Correctional Flow objective that learns a direct voxel-refinement trajectory, and a Hybrid Stream Transformer that enables robust cross-modal fusion. Experiments show that VIAFormer establishes a new state of the art in correcting both severe synthetic corruptions and realistic artifacts on the voxel shapes obtained from powerful Vision Foundation Models. Beyond benchmarking, we demonstrate VIAFormer as a practical and reliable bridge in real-world 3D creation pipelines, paving the way for voxel-based methods to thrive in the large-model, big-data wave.
PaperID: 1518,   Poster  https://arxiv.org/pdf/2602.19112    
Authors: Qinfeng Xiao, Guofeng Mei, Bo Yang, Zhang Liying, Jian Zhang, YICK Kit-lun
Title: Universal 3D Shape Matching via Coarse-to-Fine Language Guidance
Abstract: Establishing dense correspondences between shapes is a crucial task in computer vision and graphics, but prior approaches depend on near-isometric assumptions and homogeneous subject types (i.e., they operate only on human shapes). Building semantic correspondences for cross-category objects remains challenging and has received relatively little attention. To address this, we propose UniMatch, a semantic-aware, coarse-to-fine framework for constructing dense semantic correspondences between strongly non-isometric shapes without restricting object categories. The key insight is to lift "coarse" semantic cues into "fine" correspondence, which is achieved through two stages. In the "coarse" stage, we perform class-agnostic 3D segmentation to obtain non-overlapping semantic parts and prompt multimodal large language models (MLLMs) to identify part names. Then, we employ pretrained vision-language models (VLMs) to extract text embeddings, enabling the construction of matched semantic parts. In the "fine" stage, we leverage these coarse correspondences to guide the learning of dense correspondences through a dedicated rank-based contrastive scheme. Thanks to class-agnostic segmentation, language guidance, and rank-based contrastive learning, our method is versatile across object categories and requires no predefined part proposals, enabling universal matching for inter-class and non-isometric shapes. Extensive experiments demonstrate that UniMatch consistently outperforms competing methods in various challenging scenarios.
PaperID: 1519,   Poster  https://arxiv.org/pdf/2603.22852    
Authors: Chengxin Lv, Yihui Li, Hongyu Yang, Yunhong Wang
Title: Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction
Abstract: 3D semantic occupancy prediction is crucial for autonomous driving, yet vision-only approaches suffer from weak geometric cues, and existing multi-modal frameworks often depend on dense voxel or BEV tensors that impose heavy computational cost. We present Gau-Occ, a multi-modal framework that models the scene as a compact collection of semantic 3D Gaussians, enabling geometry-guided fusion without dense volumetric processing. To enhance geometric completeness, a learned LiDAR Completion Diffuser (LCD) trained on real-world priors recovers missing structures from sparse LiDAR, and the completed points are encoded as semantic Gaussian anchors. To further integrate multi-view image semantics, we introduce Gaussian Anchor Fusion (GAF), a geometry-aligned aggregation module that performs anchor-guided 2D sampling, local neighborhood encoding, and cross-modal alignment. By constructing locally aggregated Gaussian descriptors that capture spatial consistency and semantic discriminability, GAF facilitates accurate feature association across modalities. Through anchor-driven refinement of Gaussian attributes, Occ-GS supports detailed 3D occupancy prediction. Extensive experiments across challenging benchmarks demonstrate that Occ-GS achieves state-of-the-art performance.
PaperID: 1520,   Poster  https://arxiv.org/pdf/2604.05354    
Authors: Haochen Yang, Baolu Li, Lei Li, Delin Ren, Jiacheng Guo, Minghai Qin, Tianyun Zhang, Hongkai Yu
Title: Unsupervised Multi-agent and Single-agent Perception from Cooperative Views
Abstract: LiDAR-based multi-agent and single-agent perception has shown promising performance in environmental understanding for robots and automated vehicles. However, no existing method simultaneously solves both multi-agent and single-agent perception in an unsupervised way. By sharing sensor data between multiple agents via communication, this paper identifies two key insights: 1) the improved point cloud density obtained by sharing data from cooperative views can benefit unsupervised object classification, and 2) the cooperative view of multiple agents can be used as unsupervised guidance for 3D object detection in the single view. Based on these two insights, we propose an Unsupervised Multi-agent and Single-agent (UMS) perception framework that leverages multi-agent cooperation without human annotations to simultaneously solve multi-agent and single-agent perception. UMS combines a learning-based Proposal Purifying Filter, which better classifies the candidate proposals after multi-agent point cloud density cooperation, with a Progressive Proposal Stabilizing module that yields reliable pseudo labels through easy-to-hard curriculum learning. Furthermore, we design Cross-View Consensus Learning to use the multi-agent cooperative view to guide detection in the single-agent view. Experimental results on two public datasets, V2V4Real and OPV2V, show that our UMS method achieves significantly higher 3D detection performance than state-of-the-art methods on both multi-agent and single-agent perception tasks in an unsupervised way.
PaperID: 1521,   Poster  https://arxiv.org/pdf/2511.21317    
Authors: Weitian Wang, Lukas Meiner, Shubham rai, Cecilia Parra, Akash Kumar
Title: HTTM: Head-wise Temporal Token Merging for Faster VGGT
Abstract: The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal token merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens at multi-head granularity, which preserves the uniqueness of feature tokens after head concatenation. Additionally, this enables HTTM to leverage the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios with lower merging costs compared to existing methods. Thus, HTTM achieves up to 7× acceleration with negligible performance drops in GPU-based inference.
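The contrast between uniform and head-wise merging can be seen in a toy example. The sketch below is only an illustration of the claim about identical tokens after head concatenation: the abstract does not describe how HTTM chooses which tokens to merge, so the merged pairs are hand-picked, merging is modeled as averaging a pair and writing the mean back to both slots, and all names are hypothetical.

```python
def merge_pair(tokens, head, a, b):
    """Average tokens a and b within one head and write the mean back to both slots."""
    merged = [(x + y) / 2 for x, y in zip(tokens[a][head], tokens[b][head])]
    tokens[a][head] = merged
    tokens[b][head] = list(merged)


def concat_heads(tokens):
    """Concatenate each token's per-head vectors into one flat feature tuple."""
    return [tuple(v for head_vec in tok for v in head_vec) for tok in tokens]


def make_tokens():
    """Four tokens, each with two heads of two features (toy values)."""
    return [
        [[1.0, 0.0], [0.0, 1.0]],
        [[3.0, 0.0], [0.0, 5.0]],
        [[0.0, 2.0], [7.0, 0.0]],
        [[0.0, 4.0], [9.0, 0.0]],
    ]


# Uniform merging: the same pair (0, 1) is merged in every head.
uniform = make_tokens()
for head in (0, 1):
    merge_pair(uniform, head, 0, 1)
uniform_out = concat_heads(uniform)

# Head-wise merging: each head merges a different pair, one merge per head.
headwise = make_tokens()
merge_pair(headwise, 0, 0, 1)
merge_pair(headwise, 1, 2, 3)
headwise_out = concat_heads(headwise)
```

Both variants perform one merge per head, but only the uniform variant leaves two tokens with identical concatenated features; in the head-wise variant all four outputs stay distinct.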
PaperID: 1522,   Poster  https://arxiv.org/pdf/2511.13285    
Authors: Yunjie Yu, Jingchen Wu, Junchen Zhu, Chunze Lin, Guibin Chen
Title: Skyreels-Text: Fine-grained Font-Controllable Text Editing for Poster Design
Abstract: Artistic design such as poster design often demands rapid yet precise modification of textual content while preserving visual harmony and typographic intent, especially across diverse font styles. Although modern image editing models have grown increasingly powerful, they still fall short in fine-grained, font-aware text manipulation, limiting their utility in professional design workflows such as poster editing. To address this issue, we present Skyreels-Text, a novel font-controllable framework for precise poster text editing. Our method supports simultaneous editing of multiple text regions with distinct typographic styles: users simply provide cropped glyph patches from reference images, and our model synthesizes the desired content in a visually matching font, without requiring font labels or fine-tuning. In extensive experiments on multiple datasets, including handwritten text benchmarks, Skyreels-Text achieves state-of-the-art performance in both text fidelity and visual realism, offering unprecedented control over font families and stylistic nuances. This work bridges the gap between general-purpose image editing and professional-grade typographic design.
PaperID: 1523,   Poster  https://arxiv.org/pdf/2510.06638    
Authors: Zhihao Wen, Wenkang Wei, Yuan Fang, Xingtong Yu, hui zhang, Weicheng Zhu, Xin Zhang
Title: Implicit-Knowledge Visual Question Answering with Structured Reasoning Traces
Abstract: Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. Recent work has introduced its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source and answers are produced without external retrieval. Existing IK-KVQA approaches, however, are typically trained with answer-only supervision: reasoning remains implicit, justifications are often weak or inconsistent, and generalization after standard supervised fine-tuning (SFT) can be brittle. We propose StaR-KVQA, a framework that equips IK-KVQA with dual-path structured reasoning traces (symbolic relation paths over text and vision together with path-grounded natural-language explanations) to provide a stronger inductive bias than generic answer-only supervision. These traces act as modality-aware scaffolds that guide the model toward relevant entities and attributes, offering more structure than generic chain-of-thought supervision while not constraining reasoning to any single fixed path. With a single open-source MLLM, StaR-KVQA constructs and selects traces to build an offline trace-enriched dataset and then performs structure-aware self-distillation; no external retrievers, verifiers, or curated knowledge bases are used, and inference is a single autoregressive pass. Across benchmarks, StaR-KVQA consistently improves both answer accuracy and the transparency of intermediate reasoning, achieving up to 11.3% higher answer accuracy on OK-VQA over the strongest baseline.
PaperID: 1524,   Poster  https://arxiv.org/pdf/2603.25872    
Authors: Runsheng Bai, Chengyu Zhang, Yangdong Deng
Title: DRiffusion: Draft-and-Refine Process Parallelizes Diffusion Models with Ease
Abstract: Diffusion models have achieved remarkable success in generating high-fidelity content but suffer from slow, iterative sampling, resulting in high latency that limits their use in interactive applications. We introduce DRiffusion, a parallel sampling framework that parallelizes diffusion inference through a draft-and-refine process. DRiffusion employs skip connections to generate multiple draft states for future timesteps and computes their corresponding noises in parallel, which are then used in the standard denoising process to produce refined results. Theoretically, our method achieves an acceleration rate of $\tfrac{1}{n}$ or $\tfrac{2}{n+1}$, depending on whether the conservative or aggressive mode is used, where $n$ denotes the number of devices. Empirically, DRiffusion attains 1.5×–4× speedup on Stable Diffusion 2.1 with minimal degradation in generation quality. On the MS-COCO dataset, both FID and CLIP score remain close to those of the original sampler: averaged across configurations, DRiffusion even improves FID by 0.45 and incurs a negligible 0.06 drop in CLIP score. These results show that DRiffusion delivers substantial acceleration while largely preserving perceptual quality.
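As a rough arithmetic illustration of the two theoretical rates quoted in this abstract, the sketch below evaluates 1/n and 2/(n+1) for a few device counts. It assumes that "acceleration rate" means the fraction of sequential sampling time that remains, so the implied ideal speedup is the reciprocal. That reading, the interpretation of the second rate as 2/(n+1), and all function names are assumptions for illustration, not definitions taken from the paper.

```python
from fractions import Fraction


def rate_one_over_n(n: int) -> Fraction:
    """First quoted rate, 1/n, for n devices (n >= 1)."""
    if n < 1:
        raise ValueError("n must be a positive device count")
    return Fraction(1, n)


def rate_two_over_n_plus_one(n: int) -> Fraction:
    """Second quoted rate, read here as 2/(n+1), for n devices (n >= 1)."""
    if n < 1:
        raise ValueError("n must be a positive device count")
    return Fraction(2, n + 1)


def implied_speedup(rate: Fraction) -> Fraction:
    """Assumption: the rate is a remaining-time fraction, so speedup = 1/rate."""
    return 1 / rate


if __name__ == "__main__":
    for n in (2, 3, 4, 8):
        r1 = rate_one_over_n(n)
        r2 = rate_two_over_n_plus_one(n)
        print(f"n={n}: 1/n={r1} (x{implied_speedup(r1)}), "
              f"2/(n+1)={r2} (x{implied_speedup(r2)})")
```

Under this reading the two rates coincide at n = 1 (no parallelism), and the gap between them widens as more devices are added.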
PaperID: 1525,   Poster  https://arxiv.org/pdf/2602.23791    
Authors: Hyejin Park, Jiwon Yoon, Sumin Park, Suree Kim, Sinae Jang, Eunsoo Lee, Dongmin Kang, Dongbo Min
Title: FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy
Abstract: Accurate focus quality assessment (FQA) in fluorescence microscopy remains challenging, as the stain-dependent optical properties of fluorescent dyes cause abrupt and heterogeneous focus shifts. However, existing datasets and models overlook this variability, treating focus quality as a stain-agnostic problem. In this work, we formulate the task of stain-aware FQA, emphasizing that focus behavior in fluorescence microscopy must be modeled as a function of staining characteristics. Through quantitative analysis of existing datasets (FocusPath, BBBC006) and our newly curated FluoMix, we demonstrate that focus–rank relationships vary substantially across stains, underscoring the need for stain-aware modeling in fluorescence microscopy. To support this new formulation, we propose FluoMix, the first dataset for stain-aware FQA that encompasses multiple tissues, fluorescent stains, and focus variations. Building on this dataset, we propose FluoCLIP, a two-stage vision-language framework that leverages CLIP's alignment capability to interpret focus quality in the context of biological staining. In the stain-grounding phase, FluoCLIP learns general stain representations by aligning textual stain tokens with visual features, while in the stain-guided ranking phase, it optimizes stain-specific rank prompts for ordinal focus prediction. Together, our formulation, dataset, and framework establish the first foundation for stain-aware FQA, and FluoCLIP achieves strong generalization across diverse fluorescence microscopy conditions.
PaperID: 1526,   Poster  https://arxiv.org/pdf/2508.16557    
Authors: Tianyi Zhang, Zheng-Peng Duan, Chun-Le Guo, Peng-Tao Jiang, Bo Li, Ming-Ming Cheng, Chongyi Li
Title: Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution
Abstract: Diffusion-based real-world image super-resolution (Real-ISR) methods have demonstrated impressive performance. To achieve efficient Real-ISR, many works employ Variational Score Distillation (VSD) to distill a pre-trained Stable Diffusion (SD) model for one-step SR with a fixed timestep. However, since SD exhibits different generative priors at different timesteps, a fixed timestep makes it difficult for these methods to fully leverage the generative priors in SD, leading to suboptimal performance. To address this, we propose a Time-Aware one-step Diffusion Network for Real-ISR (TADSR). We first introduce a Time-Aware VAE Encoder, which projects the same image into different latent features based on timesteps. Through joint dynamic variation of timesteps and latent features, the student model can better align with the input pattern distribution of the pre-trained SD, thereby enabling more effective utilization of SD's generative capabilities. To better activate the generative prior of SD at different timesteps, we propose a Time-Aware VSD loss that bridges the timesteps of the student model and those of the teacher model, thereby producing more consistent generative prior guidance conditioned on timesteps. Additionally, by utilizing the generative prior in SD at different timesteps, our method can naturally achieve controllable trade-offs between fidelity and realism by changing the timestep. Experimental results demonstrate that our method achieves both state-of-the-art performance and controllable SR results with only a single step.
PaperID: 1527,   Poster  https://arxiv.org/pdf/2511.12982    
Authors: Xuankun Rong, Wenke Huang, Tingfeng Wang, Daiguo Zhou, Bo Du, Mang Ye
Title: SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization
Abstract: Multimodal large language models (MLLMs) have demonstrated impressive reasoning and instruction-following capabilities, yet their expanded modality space introduces new compositional safety risks that emerge from complex text–image interactions. Such cross-modal couplings can produce unsafe semantics even when individual inputs are benign, exposing the fragile safety awareness of current MLLMs. While recent works enhance safety by guiding models to reason about potential risks, unregulated reasoning traces may compromise alignment; although Group Relative Policy Optimization (GRPO) offers self-rewarded refinement without human supervision, it lacks verifiable signals for reasoning safety. To address this, we propose SafeGRPO, a self-rewarded multimodal safety alignment framework that integrates rule-governed reward construction into GRPO, enabling interpretable and verifiable optimization of reasoning safety. Built upon the constructed SafeTag-VL-3K dataset with explicit visual, textual, and combined safety tags, SafeGRPO performs step-guided safety thinking to enforce structured reasoning and behavior alignment, substantially improving multimodal safety awareness, compositional robustness, and reasoning stability across diverse benchmarks without sacrificing general capabilities.
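For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage step that gives the method its name: several responses to the same prompt are scored, and each reward is standardized against its own group. The rule-governed reward itself is not specified in this abstract, so the rewards here are placeholder numbers; the function name, the `eps` constant, and the use of the population standard deviation are assumptions of this sketch rather than details of SafeGRPO.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against the mean and spread of its own group.

    `rewards` holds one scalar per sampled response to the same prompt.
    `eps` avoids division by zero when every response gets the same reward.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Placeholder rewards for four sampled responses (hypothetical values).
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Responses that score above their group's average receive positive advantages and those below receive negative ones, which is why no separately learned value model is needed.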
PaperID: 1528,   Poster  https://arxiv.org/pdf/2603.00667    
Authors: Wentao Huang, Weimin Lyu, Peiliang Lou, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Wenchao Han, Ruifeng Guo, Jiawei Zhou, Chao Chen, Chen Wang
Title: Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning
Abstract: Computational pathology has advanced rapidly in recent years, driven by domain-specific image encoders and growing interest in using vision–language models to answer natural-language questions about diseases. Yet, the core problem behind pathology question-answering remains unsolved, considering that a gigapixel slide contains far more information than necessary for a given question. Pathologists naturally navigate tissue and morphology complexity by scanning broadly and zooming in selectively according to the clinical questions. Current models, in contrast, rely on uniform patch sampling or broad attention maps, often attending equally to irrelevant regions while overlooking key visual evidence. In this work, we try to bring models closer to how humans actually examine slides. We propose a question-guided, tissue-aware, and coarse-to-fine retrieval framework, HistoSelect, that consists of two key components: a group sampler that identifies question-relevant tissue regions, followed by a patch selector that retrieves the most informative patches within those regions. By selecting only the most informative patches, our method becomes significantly more efficient, reducing visual token usage by 70% on average while improving accuracy across three pathology QA tasks. Evaluated on 356,000 question–answer pairs, our approach outperforms existing methods and produces answers grounded in interpretable, pathologist-consistent regions. Our results suggest that bringing human-like search and attention patterns into WSI reasoning is a promising direction for building practical and reliable pathology VLMs.
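The coarse-to-fine idea in this abstract (pick question-relevant regions first, then pick patches inside them) can be sketched as a two-stage top-k selection. This is only a schematic stand-in: the relevance scores are assumed to be given, plain top-k replaces whatever the group sampler and patch selector actually learn, and every name below is hypothetical.

```python
from typing import Dict, List, Tuple


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring keys, breaking ties by key name."""
    return sorted(scores, key=lambda key: (-scores[key], key))[:k]


def coarse_to_fine_select(
    region_scores: Dict[str, float],
    patch_scores: Dict[str, Dict[str, float]],
    num_regions: int,
    patches_per_region: int,
) -> List[Tuple[str, str]]:
    """Stage 1: keep the most relevant regions. Stage 2: keep top patches in each."""
    selected: List[Tuple[str, str]] = []
    for region in top_k(region_scores, num_regions):
        for patch in top_k(patch_scores.get(region, {}), patches_per_region):
            selected.append((region, patch))
    return selected


if __name__ == "__main__":
    # Hypothetical question-conditioned relevance scores.
    regions = {"tumor": 0.9, "stroma": 0.4, "background": 0.1}
    patches = {
        "tumor": {"p1": 0.2, "p2": 0.8, "p3": 0.5},
        "stroma": {"p4": 0.7},
        "background": {"p5": 0.9},
    }
    print(coarse_to_fine_select(regions, patches, num_regions=2, patches_per_region=2))
```

Note that the high-scoring background patch is never considered because its region is filtered out in the first stage, which is where the token savings in such a scheme would come from.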
PaperID: 1529,   Poster  https://arxiv.org/pdf/2508.04565    
Authors: Yunbi Liu, Enqi Tang, Shiyu Li, hui shuai, Lei Ma, Juncheng Li, Kuai Yu, Shu Lou, Yongchu Pan, Qingshan Liu
Title: TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning
Abstract: Orthodontic treatment hinges on tooth alignment, which significantly affects occlusal function, facial aesthetics, and patients' quality of life. Current deep learning approaches predominantly predict transformation matrices for the misaligned tooth point cloud via point-to-point geometric constraints to achieve tooth alignment. Nevertheless, these matrices are likely to exhibit clinical-specific distributions, which deterministic constraints fail to capture. To address this, we introduce a new automatic tooth alignment method named TAlignDiff, which is assisted by diffusion-based transformation learning. TAlignDiff comprises two main components: a primary point cloud-based regression network (PRN) and a diffusion-based transformation matrix denoising module (DTMD). Geometry-constrained losses supervise PRN learning for point cloud-level alignment. DTMD, as an auxiliary module, learns the latent distribution of transformation matrices from clinical data. We integrate point cloud-based transformation regression and diffusion-based transformation modeling into a unified framework, allowing bidirectional feedback between geometric constraints and diffusion refinement. We validate our method on a challenge dataset from clinical practice and an extra orthodontic dataset. Its efficacy is confirmed through ablation studies and comparative analyses, highlighting its potential for application in orthodontic treatment.
PaperID: 1530,   Poster  https://arxiv.org/pdf/2604.19386    
Authors: Zhiheng Fu, Yupeng Hu, Qianyun Yang, Shiqi Zhang, Zhiwei Chen, Zixu Li
Title: Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval
Abstract: Composed Image Retrieval (CIR) has attracted significant attention due to its flexible multimodal query method, yet its development is severely constrained by the Noisy Triplet Correspondence (NTC) problem. Most existing robust learning methods rely on the "small loss hypothesis", but the unique semantic ambiguity in NTC, such as "partial matching", invalidates this assumption, leading to unreliable noise identification. This entraps the model in a self-dependent vicious cycle where the learner is intertwined with the arbiter, ultimately causing catastrophic "representation pollution". To address this critical challenge, we propose a novel "Expert-Proxy-Diversion" decoupling paradigm, named Air-Know (ArbIteR calibrated Knowledge iNternalizing rObust netWork). Air-Know incorporates three core modules: (1) the External Prior Arbitration (EPA) module, which utilizes Multimodal Large Language Models (MLLMs) as an offline expert to construct a high-precision anchor dataset; (2) the Expert Knowledge Internalization (EKI) module, which efficiently guides a lightweight proxy "arbiter" to internalize the expert's discriminative logic; and (3) the Dual Stream Reconciliation (DSR) module, which leverages the EKI's matching confidence to divert the training data, achieving a clean alignment stream and a representation feedback reconciliation stream. Extensive experiments on multiple CIR benchmark datasets demonstrate that Air-Know significantly outperforms existing SOTA methods under the NTC setting, while also showing strong competitiveness in traditional CIR.
PaperID: 1531,   Poster  https://arxiv.org/pdf/2603.04890    
Authors: Min Tan, Junchao Ma, Yinfu FENG, Jiajun Ding, Wenwen Pan, Tingting Han, Qian Zheng, Zhenzhong Kuang, Zhou Yu
Title: FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation
Abstract: Multimodal Federated Learning (MFL) enables clients with heterogeneous data modalities to collaboratively train models without sharing raw data, offering a privacy-preserving framework that leverages complementary cross-modal information. However, existing methods often overlook personalized client performance and struggle with modality/task discrepancies, as well as model heterogeneity. To address these challenges, we propose FedAFD, a unified MFL framework that enhances client and server learning. On the client side, we introduce a bi-level adversarial alignment strategy to align local and global representations within and across modalities, mitigating modality and task gaps. We further design a granularity-aware fusion module to integrate global knowledge into the personalized features adaptively. On the server side, to handle model heterogeneity, we propose a similarity-guided ensemble distillation mechanism that aggregates client representations on shared public data based on feature similarity and distills the fused knowledge into the global model. Extensive experiments conducted under both IID and non-IID settings demonstrate that FedAFD achieves superior performance and efficiency for both the client and the server.
PaperID: 1532,   Poster  https://arxiv.org/pdf/2604.03134    
Authors: Meihua Li, Yang Zhang, Weizhao He, Hu Qu, Yisong Li
Title: SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation
Abstract: Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel object classes in medical images using only minimal annotated examples, addressing the critical challenges of data scarcity and domain shifts prevalent in medical imaging. While Diffusion Models (DM) excel in visual tasks, their potential for FSMIS remains largely unexplored. We propose that the rich visual priors learned by large-scale DMs offer a powerful foundation for a more robust and data-efficient segmentation approach. In this paper, we introduce SD-FSMIS, a novel framework designed to effectively adapt the powerful pre-trained Stable Diffusion (SD) model for the FSMIS task. Our approach repurposes its conditional generative architecture by introducing two key components: a Support-Query Interaction (SQI) and a Visual-to-Textual Condition Translator (VTCT). Specifically, SQI provides a straightforward yet powerful means of adapting SD to the FSMIS paradigm. The VTCT module translates visual cues from the support set into an implicit textual embedding that guides the diffusion model, enabling precise conditioning of the generation process. Extensive experiments demonstrate that SD-FSMIS achieves competitive results compared to state-of-the-art methods in standard settings. Surprisingly, it also demonstrates excellent generalization ability in more challenging cross-domain scenarios. These findings highlight the immense potential of adapting large-scale generative models to advance data-efficient and robust medical image segmentation.
PaperID: 1533,   Poster  https://arxiv.org/pdf/2511.22850    
Authors: Keliang Liu, Zizhi Chen, Mingcheng Li, Jingqun Tang, Dingkang Yang, Lihua Zhang
Title: Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding
Abstract: Document understanding is a long-standing practical task. Vision-Language Models (VLMs) have gradually become a primary approach in this domain, demonstrating effective performance on single-page tasks. However, their effectiveness diminishes when handling long documents. In such scenarios, clues are often scattered across multiple pages and modalities, and redundancy from lengthy inputs can impair the model's judgment. While retrieval-augmented generation mitigates this issue by filtering for question-relevant content, the retrieved results still contain substantial redundancy. To address these limitations, we propose SLEUTH, a multi-agent framework. Concretely, SLEUTH orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy. It ultimately synthesizes a distilled, evidence-dense multimodal context to generate the final prediction. The SLEUTH framework is model-agnostic and scalable. When paired with advanced VLM backbones, it consistently improves performance on multiple long-document benchmarks, achieving SOTA results. Ablation studies verify each module's effectiveness and confirm the benefits of our hierarchical refinement paradigm.
PaperID: 1534,   Poster  https://arxiv.org/pdf/2603.24176    
Authors: Wanying Qu, Jianxiong Gao, Wei Wang, Yanwei Fu
Title: Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamic
Abstract: Capturing dynamic spatiotemporal neural activity is essential for understanding large-scale brain mechanisms. Functional magnetic resonance imaging (fMRI) provides high-resolution cortical representations that form a strong basis for characterizing fine-grained brain activity patterns. The high acquisition cost of fMRI limits large-scale applications, making high-quality fMRI reconstruction a crucial task. Electroencephalography (EEG) offers millisecond-level temporal cues that complement fMRI. Leveraging this complementarity, we present an EEG-conditioned framework for reconstructing dynamic fMRI as continuous neural sequences with high spatial fidelity and strong temporal coherence at the cortical-vertex level. To address sampling irregularities common in real fMRI acquisitions, we incorporate a null-space intermediate-frame reconstruction, enabling measurement-consistent completion of arbitrary intermediate frames and improving sequence continuity and practical applicability. Experiments on the CineBrain dataset demonstrate superior voxel-wise reconstruction quality and robust temporal consistency across whole-brain and functionally specific regions. The reconstructed fMRI also preserves essential functional information, supporting downstream visual decoding tasks. This work provides a new pathway for estimating high-resolution fMRI dynamics from EEG and advances multimodal neuroimaging toward more dynamic brain activity modeling.
PaperID: 1535,   Poster  https://arxiv.org/pdf/2506.09082    
Authors: Zheda Mai, Arpita Chowdhury, Zihe Wang, Sooyoung Jeon, Lemeng Wang, Jiacheng Hou, Jihyung Kil, Wei-Lun Chao
Title: AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
Abstract: The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than VFMs' visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities in a single question, making it difficult to determine whether errors arise from the lack of all required abilities or just one key ability. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs), foundational skills such as localization, depth estimation, and spatial understanding, which collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive "ability fingerprints," turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields VFM rankings similar to those of a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.
PaperID: 1536,   Poster  https://arxiv.org/pdf/2603.27970    
Authors: Nghia Huu Vu, Tuong Do, Khang Nguyen, Baoru Huang, Nhat Le, Binh Xuan Nguyen, Erman Tjiputra, Quang D. Tran, Ravi Prakash, Te-Chuan Chiu, Anh Nguyen
Title: AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers
Abstract: Affordance learning is a complex challenge in many applications, where existing approaches primarily focus on the geometric structures, visual knowledge, and affordance labels of objects to determine interactable regions. However, extending this learning capability to a scene is significantly more complicated, as incorporating object- and scene-level semantics is not straightforward; for example, 3D instance identification often struggles with small, interactable, functional parts (e.g., knobs and handles). In this work, we introduce AffordBridge, a large-scale dataset with 291,637 functional interaction annotations across 685 high-resolution indoor scenes in the form of point clouds. Our affordance annotations are complemented by RGB images that are linked to the same instances within scenes. Building upon our dataset, we propose AffordMatcher, an affordance learning method that establishes coherent semantic correspondences between image-based and point cloud-based instances for keypoint matching, enabling a more precise identification of affordance regions based on cues, so-called visual signifiers. Experimental results on our dataset demonstrate the effectiveness of our approach against other methods. Our code and dataset will be made publicly available.
PaperID: 1537,   Poster  https://arxiv.org/pdf/2603.25107    
Authors: Yuqiao Zeng, Xu Wang, Tengfei Liang, Yiqing Hao, Yi Jin, Hui Yu
Title: Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning
Abstract: Multimodal learning integrates complementary information from different modalities such as image, text, and audio to improve model performance, but its success relies on large-scale labeled data, which is costly to obtain. Active learning (AL) mitigates this challenge by selectively annotating informative samples. In multimodal settings, many approaches implicitly assume that modality importance is stable across rounds and keep selection rules fixed at the fusion stage, which leaves them insensitive to the dynamic nature of multimodal learning, where the relative value of modalities and the difficulty of instances shift as training proceeds. To address this issue, we propose RL-MBA, a reinforcement-learning framework for modality-balanced, difficulty-aware multimodal active learning. RL-MBA models sample selection as a Markov Decision Process, where the policy adapts to modality contributions, uncertainty, and diversity, and the reward encourages accuracy gains and balance. Two key components drive this adaptability: (1) Adaptive Modality Contribution Balancing (AMCB), which dynamically adjusts modality weights via reinforcement feedback, and (2) Evidential Fusion for Difficulty-Aware Policy Adjustment (EFDA), which estimates sample difficulty via uncertainty-based evidential fusion to prioritize informative samples. Experiments on Food101, KineticsSound, and VGGSound demonstrate that RL-MBA consistently outperforms strong baselines, improving both classification accuracy and modality fairness under limited labeling budgets.
PaperID: 1538,   Poster  https://arxiv.org/pdf/2603.02390    
Authors: Hymalai Bello, Lala Shakti Swarup Ray, Joanna Sorysz, Sungho Suh, Paul Lukowicz
Title: OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments
Abstract: Smart factories use advanced technologies to optimize production and increase efficiency. To this end, the recognition of worker activity allows for accurate quantification of performance metrics, improving efficiency holistically while contributing to worker safety. OpenMarcie is, to the best of our knowledge, the largest multimodal dataset designed for human action monitoring in manufacturing environments. It includes data from wearable sensing modalities and cameras distributed in the surroundings. The dataset is structured around two experimental settings, involving a total of 36 participants. In the first setting, twelve participants perform a bicycle assembly and disassembly task under semi-realistic conditions without a fixed protocol, promoting divergent and goal-oriented problem-solving. The second experiment involves twenty-five volunteers (24 with valid data) engaged in a 3D printer assembly task, with the 3D printer manufacturer's instructions provided to guide the volunteers in acquiring procedural knowledge. This setting also includes sequential collaborative assembly, where participants assess and correct each other's progress, reflecting real-world manufacturing dynamics. OpenMarcie includes over 37 hours of egocentric and exocentric, multimodal, and multipositional data, featuring eight distinct data types and more than 200 independent information channels. The dataset is benchmarked across three human activity recognition tasks: activity classification, open vocabulary captioning, and cross-modal alignment. The dataset and code are available at (Removed for Anonymous CVPR Submission).
PaperID: 1539,   Poster  https://arxiv.org/pdf/2511.18810    
Authors: Yuxia Fu, Zhizhen Zhang, Yuqi Zhang, Zijian Wang, Zi Huang, Yadan Luo
Title: MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent
Abstract: Recent Vision-Language-Action (VLA) models reformulate vision-language models by tuning them with millions of robotic demonstrations. While they perform well when fine-tuned for a single embodiment or task family, extending them to multi-skill settings remains challenging: directly merging VLA experts trained on different tasks results in near-zero success rates. This raises a fundamental question: what prevents VLAs from mastering multiple skills within one model? With an empirical decomposition of learnable parameters during VLA fine-tuning, we identify two key sources of non-mergeability: (1) Finetuning drives LoRA adapters in the VLM backbone toward divergent, task-specific directions beyond the capacity of existing merging methods to unify. (2) Action experts develop inter-block dependencies through self-attention feedback, causing task information to spread across layers and preventing modular recombination. To address these challenges, we present MergeVLA, a merging-oriented VLA architecture that preserves mergeability by design. MergeVLA introduces sparsely activated LoRA adapters via task masks to retain consistent parameters and reduce irreconcilable conflicts in the VLM. Its action expert replaces self-attention with cross-attention-only blocks to keep specialization localized and composable. When the task is unknown, it uses a test-time task router to adaptively select the appropriate task mask and expert head from the initial observation, enabling unsupervised task inference. Across LIBERO, LIBERO-Plus, RoboTwin, and multi-task experiments on the real SO101 robotic arm, MergeVLA achieves performance comparable to or even exceeding individually finetuned experts, demonstrating robust generalization across tasks, embodiments, and environments. The source code can be found in the supplementary material for reference.
PaperID: 1540,   Poster  https://arxiv.org/pdf/2603.26109    
Authors: Jiaming Liang, Yifeng Zhan, Chunlin Liu, Weihua Zheng, bingye Peng, Qiwei Liang, Boyang Cai, Xiaochun Mai, Qiang Nie
Title: SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection
Abstract: Open-vocabulary object detection (OVOD) aims to detect known and unknown objects in the open world by leveraging text prompts. Benefiting from the emergence of large-scale vision-language pre-trained models, OVOD has demonstrated strong zero-shot generalization capabilities. However, when dealing with camouflaged objects, the detector often fails to distinguish and localize objects because the visual features of the objects and the background are highly similar. To bridge this gap, we construct a benchmark named OVCOD-D by augmenting carefully selected camouflaged object images with fine-grained textual descriptions. Due to the limited scale of available camouflaged object datasets, we adopt detectors pre-trained on large-scale object detection datasets as our baseline methods, as they possess stronger zero-shot generalization ability. In the specificity-aware sub-descriptions generated by multimodal large models, there still exist confusing and overly decorative modifiers. To mitigate such interference, we design a sub-description principal component contrastive fusion strategy that reduces noisy textual components. Furthermore, to address the challenge that the visual features of camouflaged objects are highly similar to those of their surrounding environment, we propose a specificity-guided regional weak alignment and dynamic focusing method, which aims to strengthen the detector's ability to discriminate camouflaged objects from background objects. Under the open-set evaluation setting, the proposed method achieves an AP of 56.4 on the OVCOD-D benchmark.
PaperID: 1541,   Poster  https://arxiv.org/pdf/2511.14183    
Authors: Jingdong Zhang, Lingzhi Zhang, Qing Liu, Mang Tik Chiu, Connelly Barnes, Yizhou Wang, Haoran You, Xiaoyang Liu, Yuqian Zhou, Zhe Lin, Eli Shechtman, Sohrab Amirghodsi, Xin Li, Wenping Wang, Xiaohang Zhan
Title: UniSER: A Foundation Model for Unified Soft Effects Removal
Abstract: Digital images are often degraded by soft effects such as lens flare, haze, shadows, and reflections, which reduce aesthetics even though the underlying pixels remain partially visible. The prevailing works address these degradations in isolation, developing highly specialized models that lack scalability and fail to exploit the shared underlying essences of these restoration problems. While specialist models are limited, recent large-scale pretrained generalist models offer powerful, text-driven image editing capabilities. However, these general-purpose systems (e.g., GPT-4o, Flux Kontext, Nano Banana) require detailed prompts and often fail to achieve robust removal on these fine-grained tasks or to preserve the identity of the scene. Leveraging the common essence of soft effects, i.e., semi-transparent occlusions, we introduce a foundational versatile model UniSER, capable of addressing diverse degradations caused by soft effects within a single framework. Our methodology centers on curating a massive 3.8M-pair dataset to ensure robustness and generalization, which includes novel, physically-plausible data to fill critical gaps in public benchmarks, and a tailored training pipeline that fine-tunes a Diffusion Transformer to learn robust restoration priors from this diverse data, integrating fine-grained mask and strength controls. This synergistic approach allows UniSER to significantly outperform both specialist and generalist models, achieving robust, high-fidelity restoration in the wild.
PaperID: 1542,   Poster  https://arxiv.org/pdf/2601.12901    
Authors: Hongchen Li, Tianyu Li, Jiazhi Yang, Mingyang Shang, Gaoqiang Wu, Caojun Wang, Haochen Tian, Zengrong Lin, Zhihui Hao, XianPeng Lang, Jia Hu, Hongyang Li
Title: PlannerRFT: Reinforcing Diffusion Planners through Closed-Loop and Sample-Efficient Fine-Tuning
Abstract: Diffusion-based planners have emerged as a promising approach for human-like trajectory generation in autonomous driving. Recent works incorporate reinforcement fine-tuning to enhance the robustness of diffusion planners through reward-oriented optimization in a generation–evaluation loop. However, they struggle to generate multi-modal, scenario-adaptive trajectories, hindering the exploitation efficiency of informative rewards during fine-tuning. To resolve this, we propose PlannerRFT, a sample-efficient reinforcement fine-tuning framework for diffusion-based planners. PlannerRFT adopts a dual-branch optimization that simultaneously refines the trajectory distribution and adaptively guides the denoising process toward more promising exploration, without altering the original inference pipeline. To support parallel learning at scale, we develop nuMax, an optimized simulator that achieves 10 times faster rollout compared to native nuPlan. Extensive experiments show that PlannerRFT yields state-of-the-art performance with distinct behaviors emerging during the learning process. Code and simulator will be released.
PaperID: 1543,   Poster  https://arxiv.org/pdf/2512.10548    
Authors: Yuchen Feng, Zhenyu Zhang, Naibin Gu, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang
Title: Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.
PaperID: 1544,   Poster  https://arxiv.org/pdf/2512.21507    
Authors: Wenshuo Peng, Gongxuan Wang, Tianmeng Yang, Chuanhao Li, Xiaojie Xu, Hui He, Kaipeng Zhang
Title: SVBench: Evaluation of Video Generation Models on Social Reasoning
Abstract: Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text–video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions, including mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.
PaperID: 1545,   Poster  https://arxiv.org/pdf/2602.18936    
Authors: Yu Li, Yujun Cai, Chi Zhang
Title: CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion
Abstract: Personalized image generation requires effectively balancing content fidelity with stylistic consistency when synthesizing images based on text and reference examples. Low-Rank Adaptation (LoRA) offers an efficient personalization approach, with potential for precise control through combining LoRA weights on different concepts. However, existing combination techniques face persistent challenges: entanglement between content and style representations, insufficient guidance for controlling elements' influence, and unstable weight fusion that often requires additional training. We address these limitations through CRAFT-LoRA, with three complementary components: (1) rank-constrained backbone fine-tuning that injects low-rank projection residuals to encourage learning decoupled content and style subspaces; (2) a prompt-guided approach featuring an expert encoder with specialized branches that enables semantic extension and precise control through selective adapter aggregation; and (3) a training-free, timestep-dependent classifier-free guidance scheme that enhances generation stability by strategically adjusting noise predictions across diffusion steps. Our method significantly improves content-style disentanglement, enables flexible semantic control over LoRA module combinations, and achieves high-fidelity generation without additional retraining overhead.
PaperID: 1546,   Poster  https://arxiv.org/pdf/2503.15633    
Authors: Amelie Royer, Moritz Böhle, Laurent Mazaré, Neil Zeghidour, Alexandre Défossez, Patrick Perez
Title: Vision-Speech Models: Teaching Speech Models to Converse about Images
Abstract: The recent successes of Vision-Language models raise the question of how to equivalently imbue a pretrained speech model with vision understanding, an important milestone towards building a multimodal speech model able to freely converse about images. Building such a conversational Vision-Speech model brings its own challenges: (i) paired image-speech datasets are much scarcer than their image-text counterparts, (ii) ensuring real-time latency at inference is crucial, which imposes compute and memory constraints, and (iii) the model should preserve prosodic features (e.g., speaker tone) which cannot be inferred from text alone. In this work, we introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules. An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics. To reduce training costs, we design a simple one-stage, parameter-efficient fine-tuning pipeline in which we leverage a mixture of image-text (i.e., "speechless") and image-speech samples. We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis. Our inference code will be made available, as well as the image-speech data used for audio evaluation.
PaperID: 1547,   Poster  https://arxiv.org/pdf/2601.05848    
Authors: Nate Gillman, Yinghua Zhou, Zitian Tang, Evan Luo, Arjan Chakravarthy, Daksh Aggarwal, Michael Freeman, Chen Sun
Title: Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
Abstract: Recent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives (such as elastic collisions and falling dominos), teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines.
PaperID: 1548,   Poster  https://arxiv.org/pdf/2512.10953    
Authors: Yiyang Lu, Qiao Sun, Xianbang Wang, Zhicheng Jiang, Hanhong Zhao, Kaiming He
Title: Bidirectional Normalizing Flow: From Data to Noise and Back
Abstract: Normalizing Flows (NFs) are a principled framework for generative modeling, consisting of a forward process and a reverse process. The forward process maps data to a simple prior distribution, while the reverse process generates samples by inverting this mapping. Traditional approaches focus on designing expressive forward transformations under the strict requirement of explicit invertibility, so that the reverse process can serve as their exact analytic inverse. Recent advances such as TARFlow enhance the forward model with Transformers and autoregressive structures, achieving state-of-the-art generation quality, but at the expense of slow sampling due to autoregressive decoding. In this work, we introduce Bidirectional Normalizing Flow (BiFlow), a new framework that removes the need for an exact analytic inverse by learning a flexible, data-driven reverse model to approximate the inverse mapping. This relaxation enables richer architectures and loss formulations while preserving the probabilistic foundation of NFs. BiFlow performs direct, single-forward (1-NFE) generation, eliminating autoregressive bottlenecks and achieving up to two orders of magnitude faster sampling with improved generation quality. We hope this work encourages rethinking Normalizing Flows as direct, flexible, and efficient generative models.
PaperID: 1549,   Poster  https://arxiv.org/pdf/2511.20034    
Authors: Xingyue Lin, Shuai Peng, Xiangyu Xie, Jianhua Zhu, Yuxuan Zhou, Liangcai Gao
Title: Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization
Abstract: Image vectorization aims to convert raster images into editable, scalable vector representations while preserving visual fidelity. Existing vectorization methods struggle to represent complex real-world images, often producing fragmented shapes at the cost of semantic conciseness. In this paper, we propose COVec, an illumination-aware vectorization framework inspired by the Clair-Obscur principle of light–shade contrast. COVec is the first to introduce intrinsic image decomposition in the vector domain, separating an image into albedo, shade, and light layers in a unified vector representation. A semantic-guided initialization and two-stage optimization refine these layers with differentiable rendering. Experiments on various datasets demonstrate that COVec achieves higher visual fidelity and significantly improved editability compared to existing methods.
PaperID: 1550,   Poster  https://arxiv.org/pdf/2602.18882    
Authors: Mohammad Asim, Christopher Wewer, Jan Lenssen
Title: SceneTok: A Compressed, Diffusable Token Space for 3D Scenes
Abstract: We present SceneTok, a novel tokenizer for encoding view sets of scenes into a compressed and diffusable set of unstructured tokens. Existing approaches for 3D scene representation and generation commonly use 3D data structures or view-aligned fields. In contrast, we introduce the first method that encodes scene information into a small set of permutation-invariant tokens that is disentangled from the spatial grid. The scene tokens are predicted by a multi-view tokenizer given many context views and rendered into novel views by employing a lightweight rectified flow decoder. A diffusion transformer enables scene generation on the compressed token space. We show that the compression is two orders of magnitude stronger than for other representations while still reaching state-of-the-art reconstruction quality. Further, our representation can be rendered from novel trajectories, including ones deviating from the input trajectory, and we show that the decoder gracefully handles uncertainty. Finally, the highly compressed set of unstructured latent scene tokens enables simple and efficient scene generation in 8 seconds, achieving a much better quality-speed tradeoff than previous paradigms. Our code and trained models will be released upon acceptance of the paper.
PaperID: 1551,   Poster  https://arxiv.org/pdf/2603.27238    
Authors: Yi Feng, Junwu E, Zizhan Guo, Yu Ma, Hanli Wang, Rui Fan
Title: An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving
Abstract: Panoptic occupancy prediction aims to jointly infer voxel-wise semantics and instance identities within a unified 3D scene representation. Nevertheless, progress in this field remains constrained by the absence of high-quality 3D mesh resources, instance-level annotations, and physically consistent occupancy datasets. Existing benchmarks typically provide incomplete and low-resolution geometry without instance-level annotations, limiting the development of models capable of achieving precise geometric reconstruction, reliable occlusion reasoning, and holistic 3D understanding. To address these challenges, this paper presents an instance-centric benchmark for the 3D panoptic occupancy prediction task. Specifically, we introduce ADMesh, the first unified 3D mesh library tailored for autonomous driving, which integrates over 15K high-quality 3D models with diverse textures and rich semantic annotations. Building upon ADMesh, we further construct CarlaOcc, a large-scale, physically consistent panoptic occupancy dataset generated using the CARLA simulator. This dataset contains 100K frames with fine-grained, instance-level occupancy ground truth at voxel resolutions as fine as 0.05 m. Furthermore, standardized evaluation metrics are introduced to quantify the quality of existing occupancy datasets. Finally, a systematic benchmark of representative models is established on the proposed dataset, which provides a unified platform for fair comparison and reproducible research in the field of 3D panoptic perception. Code and dataset will be released upon publication.
PaperID: 1552,   Poster  https://arxiv.org/pdf/2602.22419    
Authors: Marc-Antoine Lavoie, Anas Mahmoud, Aldo Zaimi, Arsene Fansi Tchango, Steven L. Waslander
Title: CLIP Is Shortsighted: Paying Attention Beyond the First Sentence
Abstract: CLIP models learn transferable multimodal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP’s pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.
PaperID: 1553,   Poster  https://arxiv.org/pdf/2602.21172    
Authors: Ishaan Singh Rawal, Shubh Gupta, Yihan Hu, Wei Zhan
Title: NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
Abstract: Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. Current VLAs face two challenges: (1) they require extensive datasets annotated with reasoning traces, and (2) these traces greatly increase token counts, inflating training and inference costs. We propose NoRD (No Reasoning for Driving), a data- and inference-efficient VLA that addresses both. Compared to existing VLAs, NoRD achieves competitive performance while being fine-tuned on less than 60% of the data and with no reasoning annotations, resulting in 3x fewer tokens. Our approach applies Reinforcement Learning (RL) to fine-tune a supervised fine-tuning (SFT) policy trained on a small, reasoning-free dataset. However, we observe that the standard RL algorithm, Group Relative Policy Optimization (GRPO), fails to yield significant improvements over this data-efficient SFT policy. We find that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NoRD overcomes this limitation by incorporating Dr. GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NoRD achieves competitive performance on Waymo and NAVSIM without large datasets, reasoning, or additional inputs, enabling scalable, data-efficient training and fast inference.
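The difficulty bias mentioned in the NoRD abstract can be illustrated with the group-relative advantage used by GRPO and the variant used by Dr. GRPO. The following is a minimal sketch of those two generic formulas, not NoRD's implementation: the function names and the toy reward groups are hypothetical, and Dr. GRPO's additional removal of response-length normalization is not shown.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: center by the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dr_grpo_advantages(rewards):
    """Dr. GRPO-style advantage: center by the group mean only (no std division)."""
    mu = mean(rewards)
    return [r - mu for r in rewards]

# Hypothetical rollout rewards for two driving scenarios (illustrative values only).
low_variance_group = [0.50, 0.52, 0.48, 0.50]   # rollouts nearly agree
high_variance_group = [0.0, 1.0, 0.0, 1.0]      # rollouts disagree strongly

def peak(values):
    """Largest advantage magnitude in a group."""
    return max(abs(v) for v in values)

if __name__ == "__main__":
    for name, group in [("low-var", low_variance_group), ("high-var", high_variance_group)]:
        print(name, "GRPO peak:", round(peak(grpo_advantages(group)), 3),
              "Dr. GRPO peak:", round(peak(dr_grpo_advantages(group)), 3))
```

With the std division, the nearly uniform group receives advantages at least as large as the group whose rollouts differ widely, so the larger reward gaps in the high-variance scenario are scaled down relative to it. Without the division, advantage magnitude tracks the actual reward gap.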
PaperID: 1554,   Poster  https://arxiv.org/pdf/2512.17312    
Authors: Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang, Yunqing Zhao
Title: CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
Abstract: Recent releases such as o3 highlight human-like “thinking with images” reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a reward for balanced and adaptive tool calling, which balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism of executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models. Our code is available.
PaperID: 1555,   Poster  https://arxiv.org/pdf/2601.05251    
Authors: Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi
Title: Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
Abstract: We propose Mesh4D, a feedforward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object’s complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object’s overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We evaluate Mesh4D on reconstruction and novel view synthesis benchmarks, outperforming prior methods in recovering accurate 3D shape and deformation.
PaperID: 1556,   Poster  https://arxiv.org/pdf/2602.23783    
Authors: Bukun Huang, Benlei Cui, Zhizeng Ye, Xuemei Dong, Tuo Chen, Hui Xue, Dingkang Yang, Longtao Huang, Haiwen Hong, Jingqun Tang
Title: Diffusion Probe: Generated Image Result Prediction Using CNN Probes
Abstract: Text-to-image (T2I) diffusion models currently lack an efficient mechanism for early quality assessment, forcing costly random trial-and-error in scenarios requiring multiple generations (e.g., iterating on prompts, agent-based image generation, Flow-GRPO). To address this, we first reveal a strong correlation between the attention distribution in the early diffusion process and the final image quality. Building upon this insight, we introduce Diffusion Probe, a pioneering framework that leverages the model’s internal cross-attention maps as a predictive signal. We propose a lightweight predictor, trained to establish a direct mapping from statistical properties of these nascent cross-attention distributions, extracted from the initial denoising steps, to the final image’s comprehensive quality. This allows our probe to accurately forecast various aspects of image quality, regardless of the specific ground-truth quality metric, long before full synthesis is complete. We empirically validate the reliability and generalizability of Diffusion Probe through its consistently strong predictive accuracy across a wide spectrum of conditions. On diverse T2I models (e.g., SDXL, FLUX, Qwen-Image), throughout broad early-denoising windows, across various resolutions, and with different quality metrics, it achieves high correlation (PCC > 0.7) and classification performance (AUC-ROC > 0.9). This intrinsic reliability is further demonstrated in practice by successfully optimizing T2I workflows that benefit from early, quality-guided decisions, such as Prompt Optimization, Seed Selection, and Accelerated RL Training. In these applications, the probe's early signal enables more targeted sampling strategies, preempting costly computations on low-potential paths. This yields a dual benefit: a significant reduction in computational overhead and a simultaneous improvement in final outcome quality, establishing Diffusion Probe as a model-agnostic and broadly applicable tool poised to revolutionize T2I efficiency.
PaperID: 1557,   Poster  https://arxiv.org/pdf/2601.00285    
Authors: Jun-Jee Chao, Volkan Isler
Title: SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting
Abstract: Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step. However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object’s motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, our model enables smooth motion interpolation while preserving learned geometric details. Experiments on synthetic datasets show that our method outperforms existing approaches under sparse observations by up to 34% in PSNR, and achieves comparable performance to dense monocular video methods on real-world datasets despite using significantly fewer frames. Moreover, we demonstrate that the input initial static reconstruction can be replaced by a diffusion-based generative prior, making our method more practical for real-world scenarios.
PaperID: 1558,   Poster  https://arxiv.org/pdf/2604.09445    
Authors: Mohammad Omama, Gabriele Berton, Eric Foxlin, Yelin Kim
Title: AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization
Abstract: Precise and real-time visual localization is critical for applications like AR/VR and robotics, especially on resource-constrained edge devices such as smart glasses, where battery life and heat dissipation can be primary concerns. While many efficient models exist, further reducing compute without sacrificing accuracy is essential for practical deployment. To address this, we propose asymmetric visual localization: a large Teacher model processes pre-mapped database images offline, while a lightweight Student model processes the query image online. This creates a challenge in matching features from two different models without resorting to heavy, learned matchers. We introduce AsymLoc, a novel distillation framework that aligns a Student to its Teacher through a combination of a geometry-driven matching objective and a joint detector-descriptor distillation objective, enabling fast, parameter-less nearest-neighbor matching. Extensive experiments on HPatches, ScanNet, IMC2022, and Aachen show that AsymLoc achieves up to 95% of the teacher's localization accuracy using an order of magnitude smaller models, significantly outperforming existing baselines and establishing a new state-of-the-art efficiency-accuracy trade-off.
PaperID: 1559,   Poster  https://arxiv.org/pdf/2603.25072    
Authors: Ma Junpeng, Sashuai Zhou, Guanghao Li, Xin Gao, Yue Cao, Hengyu Zeng, Yuxiang Yan, Zhibin Wang, Jun Song, Bo Zheng, Shanghang Zhang, Jian Pu
Title: GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding
Abstract: Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost from processing dense frames severely limits their practical application. Existing methods alleviate this by selecting keyframes, but their greedy decision-making, combined with a decoupled evaluation of relevance and diversity, often falls into local optima and results in erroneously selecting irrelevant noise frames. To address these challenges, we propose GIFT: Global Irreplaceability Frame Targeting, a novel training-free framework that selects frames by assessing their intrinsic irreplaceability. Specifically, we first introduce Directed Diversity to quantify a frame's uniqueness conditioned on relevance, which allows us to formulate a unified irreplaceability score. Subsequently, our Budget-Aware Refinement strategy employs an adaptive iterative process that first secures a core set of frames with the highest irreplaceability, and then shifts its priority to building crucial temporal context around these selections as the budget expands. Extensive experiments demonstrate that GIFT achieves a maximum average improvement of 12.5% across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling. Code will be released soon.
PaperID: 1560,   Poster  https://arxiv.org/pdf/2603.20850    
Authors: Xinyu Zhang, Ziyi Kou, Chuan Qin, Mia Huang, Ergys Ristani, Ankit Kumar, Lele Chen, Kun He, Abdeslam Boularias, Li Guan
Title: Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves
Abstract: Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information, such as contact forces and motion dynamics, and are prone to frequent occlusions. To address these challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove data in HOI videos into photorealistic bare-hand representations, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures both temporal and multi-view rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we introduce HandSense, the first multi-modal HOI dataset featuring multi-view bare-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.
PaperID: 1561,   Poster  https://arxiv.org/pdf/2602.24181    
Authors: Rishabh Kabra, Maks Ovsjanikov, Drew Hudson, Ye Xia, Skanda Koppula, André Araujo, Joao Carreira, Niloy J. Mitra
Title: A Mixed Diet Makes DINO an Omnivorous Vision Encoder
Abstract: Pretrained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embeddings for an RGB image and the corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, an alignment objective that maximizes the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous" by producing a consistent, powerful embedding for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
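The dual objective described in this abstract (cross-modal alignment plus distillation toward a frozen teacher) can be sketched on toy vectors. This is only an illustration under stated assumptions: the abstract does not give the loss forms, so the cosine-distance terms, the `distill_weight` parameter, and all function names below are hypothetical choices rather than the authors' formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dual_objective(student_rgb, student_depth, teacher_rgb, distill_weight=1.0):
    """Toy loss: alignment term plus weighted distillation term (assumed forms).

    alignment: pulls the student's RGB and depth embeddings of one scene together.
    distill:   keeps the student's RGB embedding close to the frozen teacher's.
    """
    alignment = 1.0 - cosine(student_rgb, student_depth)
    distill = 1.0 - cosine(student_rgb, teacher_rgb)
    return alignment + distill_weight * distill

if __name__ == "__main__":
    teacher = [1.0, 0.0]
    # Student embeddings agree across modalities and match the teacher.
    print(dual_objective([1.0, 0.0], [1.0, 0.0], teacher))
    # Depth embedding is orthogonal to the RGB embedding: alignment term dominates.
    print(dual_objective([1.0, 0.0], [0.0, 1.0], teacher))
```

The distillation term is what prevents the trivial solution of mapping every input to the same vector, which would satisfy the alignment term alone.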
PaperID: 1562,   Poster  https://arxiv.org/pdf/2511.14918    
Authors: Zefan Yang, Ge Wang, James Hendler, Mannudeep Kalra, Pingkun Yan
Title: X-WIN: Building Chest Radiograph World Model via Predictive Sensing
Abstract: Chest X-ray radiography (CXR) is an essential medical imaging technique for disease diagnosis. However, as 2D projectional images, CXRs are limited by structural superposition and hence fail to capture 3D anatomies. This limitation makes representation learning and disease diagnosis challenging. To address this challenge, we propose a novel CXR world model named X-WIN, which distills volumetric knowledge from chest computed tomography (CT) by learning to predict its 2D projections in latent space. The core idea is that a world model with internalized knowledge of 3D anatomical structure can predict CXRs under various transformations in 3D space. During projection prediction, we introduce an affinity-guided contrastive alignment loss that leverages mutual similarities to capture rich, correlated information across projections from the same volume. To improve model adaptability, we incorporate real CXRs into training through masked image modeling and employ a domain classifier to encourage statistically similar representations for real and simulated CXRs. Comprehensive experiments show that X-WIN outperforms existing foundation models on diverse downstream tasks using linear probing and few-shot fine-tuning. X-WIN also demonstrates the ability to render 2D projections for reconstructing a 3D CT volume.
PaperID: 1563,   Poster  https://arxiv.org/pdf/2603.00324    
Authors: Arya Fayyazi, Haleh Akrami
Title: Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees
Abstract: We present Proof-of-Perception (PoP), a tool-using framework that casts multimodal reasoning as an executable graph with explicit reliability guarantees. Each perception or logic node outputs a conformal set $\Gamma^{(t)}_{\delta}(x)$, yielding calibrated, stepwise uncertainty; a lightweight controller uses these certificates to allocate compute under a budget, expanding with extra tool calls only when needed and stopping early otherwise. This grounds answers in verifiable evidence, reduces error compounding and hallucinations, and enables principled accuracy–compute trade-offs. Across document, chart, and multi-image QA benchmarks, PoP improves performance and reliability over strong chain-of-thought, ReAct-style, and program-of-thought baselines while using computation more efficiently.
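As an illustration of the conformal sets mentioned in this abstract, the following is a minimal sketch of standard split conformal prediction for a single classification node. It is not the PoP implementation: the function names, the nonconformity score 1 - p(label), the calibration values, and the "set size greater than one" expansion rule are all assumptions made for this example, and the compositional guarantee across nodes is not modeled.

```python
import math
import numpy as np

def conformal_threshold(cal_scores, delta):
    """Split-conformal quantile of calibration nonconformity scores for miscoverage delta."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1.0 - delta))  # 1-indexed order statistic
    if rank > n:
        # Too few calibration points: only the full label set is valid.
        return float("inf")
    return float(np.sort(np.asarray(cal_scores, dtype=float))[rank - 1])

def conformal_set(probs, threshold):
    """Labels whose nonconformity score 1 - p(label) does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]

def needs_more_compute(pred_set):
    """Hypothetical controller rule: expand only when the set is ambiguous."""
    return len(pred_set) > 1

# Hypothetical calibration scores (1 - probability assigned to the true label).
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
q = conformal_threshold(cal_scores, delta=0.25)

confident = conformal_set([0.90, 0.07, 0.03], q)
uncertain = conformal_set([0.40, 0.35, 0.25], q)
print(q, confident, needs_more_compute(confident))
print(q, uncertain, needs_more_compute(uncertain))
```

A singleton set lets the hypothetical controller stop early, while a larger set signals that an extra tool call may be worth its cost.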
PaperID: 1564,   Poster  https://arxiv.org/pdf/2604.09651    
Authors: Xinyuan An, Tao Luo, gengyun peng, Yaobing Wang, Kui Ren, Dongxia Wang
Title: FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models
Abstract: Vision-Language-Action (VLA) models are emerging as a cornerstone for robotics, with flow-matching policies like $\pi_0$ showing great promise in generating smooth, continuous actions. As these models advance, their unique action generation mechanism, the vector-field dynamics, presents a critical yet unexplored security risk, particularly with respect to backdoor attacks. Existing backdoor attacks designed for autoregressive discretization VLAs cannot be directly applied to these new continuous dynamics. We introduce FlowHijack, the first backdoor attack framework to systematically target the underlying vector-field dynamics of flow-matching VLAs. Our method combines a novel $\tau$-conditioned injection strategy, which manipulates the initial phase of the action generation, with a dynamics mimicry regularizer. Experiments demonstrate that FlowHijack achieves high attack success rates using stealthy, context-aware triggers where prior works failed. Crucially, it preserves benign task performance and, by enforcing kinematic similarity, generates malicious actions that are behaviorally indistinguishable from normal actions. Our findings reveal a significant vulnerability in continuous embodied models, highlighting the urgent need for defenses targeting the model's internal generative dynamics.
PaperID: 1565,   Poster  https://arxiv.org/pdf/2512.00961    
Authors: Qi Wang, Mian Wu, Yuyang Zhang, Mingqi Yuan, Wenyao Zhang, Haoxiang You, Yunbo Wang, Xin Jin, Xiaokang Yang, Wenjun Zeng
Title: Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
Abstract: Reinforcement Learning (RL) has achieved remarkable success in various domains, yet it often relies on carefully designed programmatic reward functions to guide agent behavior. Designing such reward functions can be challenging and may not generalize well across different tasks. To address this limitation, we leverage the rich world knowledge contained in pretrained video diffusion models to provide goal-driven reward signals for RL agents without ad hoc reward design. Our key idea is to exploit off-the-shelf video diffusion models pretrained on large-scale video datasets as informative reward functions in terms of video-level and frame-level goals. For video-level rewards, we first finetune a pretrained video diffusion model on domain-specific datasets and then employ its video encoder to evaluate the alignment between the latent representations of the agent's trajectories and the generated goal videos. To enable more fine-grained goal achievement, we derive a frame-level goal by identifying the most relevant frame from the generated video using CLIP, which serves as the goal state. We then employ a learned forward–backward representation that represents the probability of visiting the goal state from a given state–action pair as the frame-level reward, promoting more coherent and goal-driven trajectories. Experiments on various Meta-World tasks demonstrate the effectiveness of our approach.
PaperID: 1566,   Poster  https://arxiv.org/pdf/2502.00653    
Authors: ZIYI YIN, Yuanpu Cao, Han Liu, Ting Wang, Jinghui Chen, Fenglong Ma
Title: Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
Abstract: While multimodal large language models (MLLMs) have recently achieved remarkable success, their susceptibility to jailbreak attacks has come to light. In such attacks, adversaries exploit carefully crafted prompts to coerce models into generating harmful or undesirable content. Existing defense mechanisms often rely on external inference steps or safety alignment training, both of which are less effective and impractical when facing sophisticated adversarial perturbations in white-box scenarios. To address these challenges and bolster MLLM robustness, we introduce SAFEMLLM, which adopts an adversarial training framework that alternates between an attack step for generating adversarial noise and a model updating step. At the attack step, SAFEMLLM generates adversarial perturbations through a newly proposed contrastive embedding attack (CoE-Attack), which optimizes token embeddings under a contrastive objective. SAFEMLLM then updates model parameters to neutralize the perturbation effects while preserving model utility on benign inputs. We evaluate SAFEMLLM across multiple MLLMs and six jailbreak methods spanning multiple modalities. Experimental results show that SAFEMLLM effectively defends against diverse attacks while maintaining robust performance and utility.
PaperID: 1567,   Poster  https://arxiv.org/pdf/2512.02172    
Authors: Pranav Asthana, Alex Hanson, Allen Tu, Tom Goldstein, Matthias Zwicker, Amitabh Varshney
Title: SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting
Abstract: 3D Gaussian Splatting (3DGS) enables high-quality novel view synthesis, motivating interest in generating higher-resolution renders than those available during training. A natural strategy is to apply super-resolution (SR) to low-resolution (LR) input views, but independently enhancing each image introduces multi-view inconsistencies, leading to blurry renders. Prior methods attempt to mitigate these inconsistencies through learned neural components, temporally consistent video priors, or joint optimization on LR and SR views, but all uniformly apply SR across every image. In contrast, our key insight is that close-up LR views may contain high-frequency information for regions also captured in more distant views, and that we can use the camera pose relative to scene geometry to inform where to add SR content. Building on this insight, we propose SplatSuRe, a method that selectively applies SR content only in undersampled regions lacking high-frequency supervision, yielding sharper and more consistent results. Across Tanks & Temples, Deep Blending and Mip-NeRF 360, our approach surpasses baselines in both fidelity and perceptual quality. Notably, our gains are most significant in localized foreground regions where higher detail is desired.
PaperID: 1568,   Poster  https://arxiv.org/pdf/2511.20647    
Authors: Tahira Kazimi, Connor Dunlop, Pinar Yanardag
Title: Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization
Abstract: While recent text-to-video (T2V) diffusion models have achieved impressive quality and prompt alignment, they often produce low-diversity outputs when sampling multiple videos from a single text prompt. We tackle this challenge by formulating it as a set-level policy optimization problem, with the goal of training a policy that can cover the diverse range of plausible outcomes for a given prompt. To address this, we introduce DPP-GRPO, a novel framework for diverse video generation that combines Determinantal Point Processes (DPPs) and Group Relative Policy Optimization (GRPO) to impose an explicit reward on diverse generations. Our objective turns diversity into an explicit signal by imposing diminishing returns on redundant samples (via DPP) while supplying groupwise feedback over candidate sets (via GRPO). Our framework is plug-and-play and model-agnostic, and encourages diverse generations across visual appearance, camera motions, and scene structure without sacrificing prompt fidelity or perceptual quality. We implement our method on WAN and CogVideoX, and show that our method consistently improves video diversity on state-of-the-art benchmarks such as VBench, VideoScore, and human preference studies. Moreover, we release our code and a new benchmark dataset of 30,000 diverse prompts to support future research.
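The two ingredients named in this abstract can be sketched in a few lines. The example below is a hedged illustration, not the authors' objective: it assumes a cosine-similarity DPP kernel over hypothetical per-video embeddings, uses the kernel log-determinant as the set-level diversity score, and uses the commonly cited GRPO normalization (reward minus group mean, divided by group standard deviation). How DPP-GRPO actually combines the two is not stated in the abstract.

```python
import numpy as np

def dpp_diversity(embeddings, eps=1e-6):
    """Log-determinant of a cosine-similarity kernel over a set of embeddings.

    Near-duplicate items make the kernel almost singular, so the score drops
    sharply: this is the diminishing-returns effect on redundant samples.
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    kernel = x @ x.T + eps * np.eye(len(x))  # jitter keeps the kernel positive definite
    _, logdet = np.linalg.slogdet(kernel)
    return float(logdet)

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantages: rewards standardized within one candidate group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical 3-D embeddings of three sampled videos for one prompt.
diverse = dpp_diversity([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
redundant = dpp_diversity([[1, 0, 0], [1, 0.01, 0], [0, 0, 1]])

# Hypothetical scalar rewards for four candidates in one group.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
print(diverse, redundant, advantages)
```

The set containing two nearly identical embeddings receives a much lower log-determinant than the mutually orthogonal set, while the advantages rank candidates only relative to their own group.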
PaperID: 1569,   Poster  https://arxiv.org/pdf/2512.00076    
Authors: Minghe Gao, Juncheng Li, Yuze Lin, Xuqi Liu, Jiaming Ji, Xiaoran Pan, Zihan Xu, Xian Li, Mingjie Li, Wei Ji, Rong Wei, Rui Tang, Qizhou Wang, Kai Shen, Jun Xiao, Qi Wu, Siliang Tang, Yueting Zhuang
Title: Arcadia: Toward a Full-Lifecycle Framework for Embodied Lifelong Learning
Abstract: We contend that embodied learning is fundamentally a lifecycle problem rather than a single-stage optimization. Systems that optimize only one link (data collection, simulation, learning, or deployment) rarely sustain improvement or generalize beyond narrow settings. We introduce Arcadia, a closed-loop framework that operationalizes embodied lifelong learning by tightly coupling four stages: (1) Self-evolving exploration and grounding for autonomous data acquisition in physical environments, (2) Generative scene reconstruction and augmentation for realistic and extensible scene creation, (3) a Shared embodied representation architecture that unifies navigation and manipulation within a single multimodal backbone, and (4) Sim-from-real evaluation and evolution that closes the feedback loop through simulation-based adaptation. This coupling is non-decomposable: removing any stage breaks the improvement loop and reverts to one-shot training. Arcadia delivers consistent gains on navigation and manipulation benchmarks and transfers robustly to physical robots, indicating that a tightly coupled lifecycle (continuous real-world data acquisition, generative simulation update, and shared-representation learning) supports lifelong improvement and end-to-end generalization. We release standardized interfaces enabling reproducible evaluation and cross-model comparison in reusable environments, positioning Arcadia as a scalable foundation for general-purpose embodied agents.
PaperID: 1570,   Poster  https://arxiv.org/pdf/2602.23040    
Authors: Aashish Rai, Angela Xing, Anushka Agarwal, Xiaoyan Cong, Zekun Li, Tao Lu, Aayush Prakash, Srinath Sridhar
Title: PackUV: Packed Gaussian UV Maps for 4D Volumetric Video
Abstract: Volumetric videos offer immersive 4D experiences, but remain difficult to reconstruct, store, and stream at scale. Existing Gaussian Splatting-based methods achieve high-quality reconstruction but break down on long sequences, suffer from temporal inconsistency, and fail under large motions and disocclusions. Moreover, their outputs are typically incompatible with conventional video coding pipelines, preventing practical applications. We introduce PackUV, a novel 4D Gaussian representation that maps all Gaussian attributes into a sequence of structured, multi-scale UV atlases, enabling compact, image-native storage. To fit this representation from multi-view videos, we propose PackUV-GS, a temporally consistent fitting method that directly optimizes Gaussian parameters in the UV domain. A flow-guided Gaussian labeling and video keyframing module identifies dynamic Gaussians, stabilizes static regions, and preserves temporal coherence even under large motions and disocclusions. The resulting UV atlas format is the first unified volumetric video representation compatible with standard video codecs (e.g., HEVC, FFV1) without quality loss, enabling efficient streaming within existing multimedia infrastructure. To evaluate long-duration volumetric capture, we present PackUV-2B, the largest multi-view 4D dataset to date, featuring more than 50 synchronized 360 cameras, substantial motion, and frequent disocclusions across 100 sequences and 2B (billion) frames. Extensive experiments demonstrate that our method surpasses existing baselines in rendering fidelity while scaling to sequences up to 30 minutes with consistent quality.
PaperID: 1571,   Poster  https://arxiv.org/pdf/2603.05255    
Authors: Gong Chen, Chaokun Zhang, Tao Tang, Pengcheng Lyu, Feng Li, Xin Xie
Title: CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception
Abstract: Cooperative perception significantly enhances scene understanding by integrating complementary information from diverse agents. However, existing research often overlooks critical challenges inherent in real-world multi-source data integration, specifically high temporal latency and multi-source noise. To address these practical limitations, we propose the Collaborative Alignment and Transformation Network (CATNet), an adaptive compensation framework that resolves temporal latency and noise interference in multi-agent systems. Our key innovations can be summarized in three aspects. First, we introduce a Spatio-Temporal Recurrent Synchronization (STSync) module that aligns asynchronous feature streams via adjacent-frame differential modeling, establishing a spatio-temporally unified representation space. Second, we design a Dual-Branch Wavelet Enhanced Denoiser (WTDen) that suppresses global noise and reconstructs localized feature distortions within aligned representations. Third, we construct an Adaptive Feature Selector (AdpSel) that dynamically focuses on critical perceptual features for robust fusion. Extensive experiments on multiple datasets demonstrate that CATNet consistently outperforms existing methods under complex traffic conditions, proving its superior robustness and adaptability.
PaperID: 1572,   Poster  https://arxiv.org/pdf/2511.17209    
Authors: Cris Claessens, Christiaan Viviers, Giacomo D'Amicantonio, Egor Bondarev, Fons van der Sommen
Title: Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
Abstract: We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision–language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets, demonstrating that high-performing, generalizable representations can be achieved without relying on private data. Pretraining combines DINO-style self-distillation with SigLIP-based vision–language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful. Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings, establishing SPECTRE as a scalable, open, and fully transformer-based foundation model for 3D medical imaging.
PaperID: 1573,   Poster  https://arxiv.org/pdf/2603.19053    
Authors: Phuc Pham, Uy Tran, Binh-Son Hua, Phong Nguyen
Title: SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation
Abstract: Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using a garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed via an efficient inverse mapping process, incorporating remeshing and dynamic stitching algorithms, thereby eliminating the need for physical re-simulation. Extensive experiments on the Multimodal GarmentCodeData benchmark demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.
PaperID: 1574,   Poster  https://arxiv.org/pdf/2604.00530    
Authors: Tianren Ma, Mingxiang Liao, Xijin Zhang, Qixiang Ye
Title: AceTone: Bridging Words and Colors for Conditional Image Grading
Abstract: Color affects how we interpret image style and emotion. Previous color grading methods rely on patch-wise recoloring or fixed filter banks, struggling to generalize across creative intents or align with human aesthetic preferences. In this study, we propose AceTone, the first approach that supports multimodal conditioned color grading within a unified framework. AceTone formulates grading as a generative color transformation task, where a model directly produces 3D-LUTs conditioned on text prompts or reference images. We develop a VQ-VAE-based tokenizer which compresses a $3 \times 32^3$ LUT vector to 64 discrete tokens with $\Delta E < 2$ fidelity. We further build a large-scale dataset, AceTone-800K, and train a vision-language model to predict LUT tokens, followed by reinforcement learning to align outputs with perceptual fidelity and aesthetics. Experiments show that AceTone achieves state-of-the-art performance on both text-guided and reference-guided grading tasks, improving LPIPS by up to 50% over existing methods. Human evaluations confirm that AceTone’s results are visually pleasing and stylistically coherent, demonstrating a new pathway toward language-driven, aesthetic-aligned color grading. The models and datasets will be publicly available.
PaperID: 1575,   Poster  https://arxiv.org/pdf/2512.19311    
Authors: Hui Li, Jiayue Lyu, Fu-Yun Wang, Kaihui Cheng, Siyu Zhu, Jingdong Wang
Title: MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture
Abstract: This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem in diffusion models. During training, the input to the prediction network at a given training timestep is the corresponding ground-truth noisy data, an interpolation of the noise and the data, whereas during testing the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving the training performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation that is nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed the slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named the slowed interpolation mixture, for post-training the prediction network at each training timestep. Experiments on class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Applied to RAE models, MixFlow achieves strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256 × 256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512 × 512.
PaperID: 1576,   Poster  https://arxiv.org/pdf/2508.15902    
Authors: Léore Bensabath, Mathis Petrovich, Gul Varol
Title: Text-Driven 3D Hand Motion Generation from Sign Language Data
Abstract: Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, and finger/hand/arm movements. To this end, we automatically build pairs of 3D hand motions and their associated textual labels at unprecedented scale. Specifically, we leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories, which we translate into hand motion descriptions via an LLM that utilizes a dictionary of sign attributes, as well as our complementary motion-script cues. This data enables training a text-conditioned hand motion diffusion model (HandMDM) that is robust across domains, including unseen sign categories from the same sign language, signs from another sign language, and non-sign hand movements. We contribute an extensive experimental investigation of these scenarios and will make our trained models and data publicly available to support future research in this relatively new field.
PaperID: 1577,   Poster  https://arxiv.org/pdf/2511.16957    
Authors: Di Luo, Shuhui Yang, Mingxin Yang, Jiawei Lu, Yixuan Tang, Xintong Han, Zhuo Chen, Beibei Wang, Chunchao Guo
Title: MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis
Abstract: Physically-based rendering (PBR) materials are fundamental to photorealistic graphics, yet their creation remains labor-intensive and requires specialized expertise. While generative models have advanced material synthesis, existing methods lack a unified representation bridging natural image appearance and PBR properties, leading to fragmented task-specific pipelines and an inability to leverage large-scale RGB image data. We present MatPedia, a foundation model built upon a novel joint RGB-PBR representation that compactly encodes materials into two interdependent latents: one for RGB appearance and one for the four PBR maps encoding complementary physical properties. By formulating them as a 5-channel sequence and employing video diffusion architectures, MatPedia naturally captures their correlations while transferring visual priors from RGB generation models. This joint representation enables a unified framework handling multiple material tasks (text-to-material generation, image-to-material generation, and intrinsic decomposition) within a single architecture. Trained on MatHybrid-410K, a mixed corpus combining PBR datasets with large-scale RGB images, MatPedia achieves native 1024×1024 synthesis that substantially surpasses existing approaches in both quality and diversity.
PaperID: 1578,   Poster  https://arxiv.org/pdf/2512.22238    
Authors: Byung-Kwan Lee, Yu-Chiang Frank Wang, Ryo Hachiuma
Title: Masking Teacher and Reinforcing Student for Distilling Vision-Language Models
Abstract: Large-scale vision–language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical for deployment on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from a powerful large teacher. However, distilling knowledge from a large teacher to a small student remains challenging due to their large size gap: the student often fails to reproduce the teacher's complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking teacher and reinforcing student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the masked weights to gradually increase the teacher's capacity during training. This strategy allows the student to learn richer representations of the teacher in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of the generated responses, and a distillation reward that quantifies how easily those responses transfer from teacher to student. Unlike online think–answer RL paradigms that are computationally expensive and generate lengthy responses, our offline RL leverages pre-generated responses from masked teachers. These provide rich yet efficient guidance, enabling the students to achieve strong performance without requiring the think–answer process. Extensive experiments across diverse VLM benchmarks demonstrate that Masters outperforms existing compact VLMs and partially surpasses large ones, while being far more efficient. Moreover, gradually increasing the teacher size during distillation (e.g., from 14B to 38B) yields smoother convergence and stronger generalization than one-shot distillation (e.g., 38B), revealing a scalable path toward efficient and deployable VLMs.
PaperID: 1579,   Poster  https://arxiv.org/pdf/2512.00489    
Authors: Jingyu Guo, Emir Konuk, Fredrik Strand, Christos Matsoukas, Kevin Smith
Title: Learning What Helps: Task-Aligned Context Selection for Vision Tasks
Abstract: Humans often resolve visual uncertainty by comparing an image with relevant examples, but ViTs lack the ability to identify which examples would improve their predictions. We present Task-Aligned Context Selection (TACS), a framework that learns to select paired examples which truly improve task performance rather than those that merely appear similar. TACS jointly trains a selector network with the task model through a hybrid optimization scheme combining gradient-based supervision and reinforcement learning, making retrieval part of the learning objective. By aligning selection with task rewards, TACS enables discriminative models to discover which contextual examples genuinely help. Across 18 datasets covering fine-grained recognition, medical image classification, and medical image segmentation, TACS consistently outperforms similarity-based retrieval, particularly in challenging or data-limited settings.
PaperID: 1580,   Poster  https://arxiv.org/pdf/2604.05079    
Authors: zhongyu yang, Zuhao Yang, SHUO ZHAN, Tan Yue, Wei Pang, Yingfang Yuan
Title: SVAgent: Storyline-guided Long Video Understanding via Cross-modal Multi-agent Collaboration
Abstract: Video question answering (VideoQA) is a challenging task that requires integrating spatial, temporal, and semantic information to capture the complex dynamics of video sequences. Although recent advances have introduced various approaches for video understanding, most existing methods still rely on locating relevant frames to answer questions rather than reasoning through the evolving storyline as humans do. Humans naturally interpret videos through coherent storylines, an ability that is crucial for making robust and contextually grounded predictions. To address this gap, we propose SVAgent, a storyline-guided cross-modal multi-agent framework for VideoQA. The storyline agent progressively constructs a narrative representation based on frames suggested by a refinement suggestion agent that analyzes historical failures. In addition, cross-modal decision agents independently predict answers from visual and textual modalities under the guidance of the evolving storyline. Their outputs are then evaluated by a meta-agent to align cross-modal predictions and enhance reasoning robustness and answer consistency. Experimental results demonstrate that SVAgent achieves superior performance and interpretability by emulating human-like storyline reasoning in video understanding.
PaperID: 1581,   Poster  https://arxiv.org/pdf/2511.17699    
Authors: Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian, Mohammad Izadi, Mahdieh Baghshah
Title: Understanding Counting Mechanisms in Large Language and Vision-Language Models
Abstract: This paper examines how large language models (LLMs) and large vision-language models (LVLMs) represent and compute numerical information in counting tasks. We use controlled experiments with repeated textual and visual items and analyze model behavior through causal mediation and activation patching. To this end, we design a specialized tool, CountScope, for mechanistic interpretability of numerical content. Results show that individual tokens or visual features encode latent positional count information that can be extracted and transferred across contexts. Layerwise analyses reveal a progressive emergence of numerical representations, with lower layers encoding small counts and higher layers representing larger ones. We identify an internal counter mechanism that updates with each item, stored mainly in the final token or region and transferable between contexts. In LVLMs, numerical information also appears in visual embeddings, shifting between background and foreground regions depending on spatial composition. Models rely on structural cues such as separators in text, which act as shortcuts for tracking item counts and influence the accuracy of numerical predictions. Overall, counting emerges as a structured, layerwise process in LLMs and follows the same general pattern in LVLMs, shaped by the properties of the vision encoder.
PaperID: 1582,   Poster  https://arxiv.org/pdf/2511.11301    
Authors: Ruoxi Cheng, Hao-Xuan Ma, Teng Ma, Hongyi Zhang
Title: EcoAlign: An Economically Rational Framework for Efficient LVLM Alignment
Abstract: Large Vision-Language Models (LVLMs) exhibit powerful reasoning capabilities but suffer from sophisticated jailbreak vulnerabilities. Fundamentally, aligning LVLMs is not just a safety challenge but a problem of economic efficiency. Current alignment methods struggle with the trade-off between safety, utility, and operational costs. Critically, a focus solely on final outputs (process-blindness) wastes significant computational budget on unsafe deliberation. This flaw allows harmful reasoning to be disguised with benign justifications, thereby circumventing simple additive safety scores. To address this, we propose EcoAlign, an inference-time framework that reframes alignment as an economically rational search by treating the LVLM as a boundedly rational agent. EcoAlign incrementally expands a thought graph and scores actions using a forward-looking function (analogous to net present value) that dynamically weighs expected safety, utility, and cost against the remaining budget. To prevent deception, path safety is enforced via the weakest-link principle. Extensive experiments across 3 closed-source and 2 open-source models on 6 datasets show that EcoAlign matches or surpasses state-of-the-art safety and utility at a lower computational cost, thereby offering a principled, economical pathway to robust LVLM alignment.
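The contrast this abstract draws between additive safety scores and the weakest-link principle can be made concrete with a small sketch. It is an illustration only, not EcoAlign's scoring function: the per-step safety scores in [0, 1] and both function names are hypothetical, and the forward-looking value function is not modeled.

```python
def additive_safety(step_scores):
    """Average of per-step safety scores; one unsafe step can be averaged away."""
    return sum(step_scores) / len(step_scores)

def weakest_link_safety(step_scores):
    """Path safety equals the safety of its least safe step."""
    return min(step_scores)

# Hypothetical reasoning path: one harmful step wrapped in benign-looking ones.
path = [0.95, 0.90, 0.10, 0.95]

print(additive_safety(path))      # stays fairly high despite the unsafe step
print(weakest_link_safety(path))  # exposes the unsafe step
```

Under the additive score the disguised step barely lowers the path's rating, whereas the minimum pins the whole path to its worst step.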
PaperID: 1583,   Poster  https://arxiv.org/pdf/2509.09667    
Authors: Zhengdi Yu, Simone Foti, Linguang Zhang, Amy Zhao, Cem Keskin, Stefanos Zafeiriou, Tolga Birdal
Title: Geometric Neural Distance Fields for Learning Human Motion Priors
Abstract: We introduce Neural Riemannian Motion Fields (NRMF), a novel 3D generative human motion prior that enables robust, temporally consistent, and physically plausible 3D motion recovery. Unlike existing VAE or diffusion-based methods, our higher-order motion prior explicitly models the human motion in the zero level set of a collection of neural distance fields (NDFs) corresponding to pose, transition (velocity), and acceleration dynamics. Our framework is rigorous in the sense that our NDFs are constructed on the product space of joint rotations, their angular velocities, and angular accelerations, respecting the geometry of the underlying articulations. We further introduce: (i) a novel adaptive-step hybrid algorithm for projecting onto the set of plausible motions, and (ii) a novel geometric integrator to “roll out” realistic motion trajectories during test-time optimization and generation. Our experiments show significant and consistent gains: trained on the AMASS dataset, NRMF remarkably generalizes across multiple input modalities and to diverse tasks ranging from denoising to motion in-betweening and fitting to partial 2D / 3D observations.
PaperID: 1584,   Poster  https://arxiv.org/pdf/2603.22070    
Authors: Xingyu Zhu, Yi Liang, Shuo Wang, Wenbo Zhu, Yongliang Wu, Beier Zhu, Hanwang Zhang
Title: Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning
Abstract: Large multimodal 3D vision-language models show strong generalization across diverse 3D tasks, but their performance still degrades notably under domain shifts. This has motivated recent studies on test-time adaptation (TTA), which enables models to adapt online using test-time data. Among existing TTA methods, cache-based mechanisms are widely adopted for leveraging previously observed samples in online prediction refinement. However, they store only limited historical information, leading to progressive information loss as the test stream evolves. In addition, their prediction logits are fused heuristically, making adaptation unstable. To address these limitations, we propose BayesMM, a Multimodal Bayesian Distribution Learning framework for test-time point cloud analysis. BayesMM models textual priors and streaming visual features of each class as Gaussian distributions: textual parameters are derived from semantic prompts, while visual parameters are updated online with arriving samples. The two modalities are fused via Bayesian model averaging, which automatically adjusts their contributions based on posterior evidence, yielding a unified prediction that adapts continually to evolving test-time data without training. Extensive experiments on multiple point cloud benchmarks demonstrate that BayesMM maintains robustness under distributional shifts, yielding over 4% average improvement.
PaperID: 1585,   Poster  https://arxiv.org/pdf/2601.16148    
Authors: Remy Sabathier, David Novotny, Niloy J. Mitra, Tom Monnier
Title: ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion
Abstract: Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dubbed "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs like a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Besides, compared to previous approaches, our method is fast and produces results that are rig-free and topology consistent, hence enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performances on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.
PaperID: 1586,   Poster  https://arxiv.org/pdf/2603.11618    
Authors: Jiin Im, Sisung Liu, Je Hyeong Hong
Title: Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild
Abstract: Establishing semantic correspondence without supervision is essential for handling diverse in-the-wild images where annotations are scarce. While recent 2D foundation models offer powerful features, adapting them for unsupervised learning via nearest-neighbor pseudo-labels has key limitations: it operates locally, ignoring structural relationships, and consequently its reliance on 2D appearance fails to resolve geometric ambiguities arising from symmetries or repetitive features. In this work, we address this by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), leverages a 3D foundation model to define this intra-structure in the geometric space, resolving the aforementioned ambiguity. However, since FGW is a computationally prohibitive quadratic problem, we approximate it through anchor-based linearization. The resulting probabilistic transport plan provides a structurally consistent, yet noisy, supervisory signal. We introduce a soft-target loss, which dynamically blends guidance from this plan with the network's current predictions, to build a learning framework robust to this noise. SoY achieves state-of-the-art performance on the SPair-71k and AP-10k datasets, establishing a new benchmark in unsupervised semantic correspondence. Code is in the supplement.
PaperID: 1587,   Poster  https://arxiv.org/pdf/2512.14234    
Authors: Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi Kowshika Lakshmikanth, Ehsan Adeli
Title: ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
Abstract: Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task (co-speech gesture or text-to-motion that maps a fixed utterance to motion clips) without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue–motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond “speech-conditioned motion generation” toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction.
PaperID: 1588,   Poster  https://arxiv.org/pdf/2406.09293    
Authors: Giuseppe Vecchio
Title: StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning
Abstract: We introduce StableMaterials, a novel approach for generating photorealistic physically based rendering (PBR) materials that integrates semi-supervised learning with Latent Diffusion Models (LDMs). Our method employs adversarial training to distill knowledge from existing large-scale image generation models, minimizing the reliance on annotated data and enhancing the diversity in generation. This distillation approach aligns the distribution of the generated materials with that of image textures from an SDXL model, enabling the generation of novel materials that are not present in the initial training dataset. Furthermore, we employ a diffusion-based refiner model to improve the visual quality of the samples and achieve high-resolution generation. Finally, we distill a latent consistency model for fast generation in just four steps and propose a new tileability technique that removes visual artifacts typically associated with fewer diffusion steps. We detail the architecture and training process of StableMaterials and the integration of semi-supervised training within existing LDM frameworks. Comparative evaluations with state-of-the-art methods show the effectiveness of StableMaterials, highlighting its potential applications in computer graphics and beyond. StableMaterials will be made publicly available.
PaperID: 1589,   Poster  https://arxiv.org/pdf/2602.22140    
Authors: Dhruv Verma, Andrew Qiu, Roberto Rangel, Ayandev Barman, Hao Yang, Chenjia Hu, Fengqi Zhang, Roman Genov, David B. Lindell, Kiriakos Kutulakos, Alex Mariakakis
Title: Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels
Abstract: We present Lumosaic, a compact active hyperspectral video system designed for real-time capture of dynamic scenes. Our approach combines a narrowband LED array with a coded-exposure-pixel (CEP) camera capable of high-speed, per-pixel exposure control, enabling joint encoding of scene information across space, time, and wavelength within each video frame. Unlike passive snapshot systems that divide light across multiple spectral channels simultaneously and assume no motion during a frame’s exposure, Lumosaic actively synchronizes illumination and pixel-wise exposure, improving photon utilization and preserving spectral fidelity under motion. A learning-based reconstruction pipeline then recovers 31-channel hyperspectral (400–700 nm) video at 30 fps and VGA resolution, producing temporally coherent and spectrally accurate reconstructions. Experiments on synthetic and real data demonstrate that Lumosaic significantly improves reconstruction fidelity and temporal stability over existing snapshot hyperspectral imaging systems, enabling robust hyperspectral video across diverse materials and motion conditions.
PaperID: 1590,   Poster  https://arxiv.org/pdf/2602.12617    
Authors: Modi Jin, Yiming Zhang, Bo-Yuan Sun, Dingwen Zhang, Ming-Ming Cheng, Qibin Hou
Title: GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics
Abstract: This paper presents GeoAgent, a model capable of reasoning in a human-like manner and deriving fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability, but concerns remain because of their reliance on AI-generated chain-of-thought (CoT) data and training strategies, which conflict with geographic characteristics. To address these issues, we first introduce GeoSeek, a new geolocation dataset comprising CoT data annotated by geographic experts and professional players. We further thoroughly explore the inherent characteristics of geographic tasks and propose a geo-similarity reward and a consistency reward assessed by a consistency agent to assist training. This encourages the model to converge towards correct answers from a geographic perspective while ensuring the integrity and consistency of its reasoning process. Experimental results show that GeoAgent outperforms existing methods and a series of general VLLMs across multiple granularities, while generating reasoning that closely aligns with that of humans. The pretrained model and data will be openly available.
PaperID: 1591,   Poster  https://arxiv.org/pdf/2512.13080    
Authors: Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Sipeng Zheng, Zongqing Lu
Title: Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Abstract: Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments, creating a significant gap between perception and action grounding. To bridge this gap, we propose a Spatial-Aware VLA Pretraining paradigm that enables models to acquire 3D spatial understanding before robot policy learning. Starting from pretrained vision-language models, we leverage large-scale human demonstration videos to extract 3D visual and 3D action annotations, forming a new source of supervision that aligns 2D visual observations with 3D spatial reasoning. We instantiate this paradigm with VIPA-VLA, a dual-encoder architecture that incorporates a 3D visual encoder to augment semantic visual representations with 3D-aware features, and aligns the two through visual-physical alignment pretraining. When adapted to downstream robot tasks, VIPA-VLA achieves significantly improved grounding between 2D vision and 3D action, resulting in more robust and generalizable robotic policies.
PaperID: 1592,   Poster  https://arxiv.org/pdf/2512.13093    
Authors: Mingqi Yuan, Tao Yu, Haolin Song, Bo Li, Xin Jin, Hua Chen, Wenjun Zeng
Title: PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations
Abstract: Achieving efficient and robust whole-body control (WBC) is essential for enabling humanoid robots to perform complex tasks in dynamic environments. Despite the success of reinforcement learning (RL) in this domain, its sample inefficiency remains a significant challenge due to the intricate dynamics and partial observability of humanoid robots. To address this limitation, we propose PvP, a Proprioceptive-Privileged contrastive learning framework that leverages the intrinsic complementarity between proprioceptive and privileged states. PvP learns compact and task-relevant latent representations without requiring hand-crafted data augmentations, enabling faster and more stable policy learning. To support systematic evaluation, we develop SRL4Humanoid, the first unified and modular framework that provides high-quality implementations of representative state representation learning (SRL) methods for humanoid robot learning. Extensive experiments on the LimX Oli robot across velocity tracking and motion imitation tasks demonstrate that PvP significantly improves sample efficiency and final performance compared to baseline SRL methods. Our study further provides practical insights into integrating SRL with RL for humanoid WBC, offering valuable guidance for data-efficient humanoid robot learning.
PaperID: 1593,   Poster  https://arxiv.org/pdf/2603.03920    
Authors: Yuhan Xie, Chen Lyu
Title: BD-Merging: Bias-Aware Dynamic Model Merging with Evidence-Guided Contrastive Learning
Abstract: Model Merging (MM) has emerged as a scalable paradigm for multi-task learning (MTL), enabling multiple task-specific models to be integrated without revisiting the original training data. Despite recent progress, the reliability of MM under test-time distribution shift remains insufficiently understood. Most existing MM methods assume that test data are clean and distributionally aligned with both the training and auxiliary sources. However, this assumption rarely holds in practice, often resulting in biased predictions with degraded generalization. To address this issue, we present BD-Merging, a bias-aware unsupervised model merging framework that explicitly models uncertainty to achieve adaptive reliability under distribution shift. First, BD-Merging introduces a joint evidential head that learns uncertainty over a unified label space, capturing cross-task semantic dependencies in MM. Second, building upon this evidential foundation, we propose an Adjacency Discrepancy Score (ADS) that quantifies evidential alignment among neighboring samples. Third, guided by ADS, a discrepancy-aware contrastive learning mechanism refines the merged representation by aligning consistent samples and separating conflicting ones. Combined with general unsupervised learning, this process trains a debiased router that adaptively allocates task-specific or layer-specific weights on a per-sample basis, effectively mitigating the adverse effects of distribution shift. Extensive experiments across diverse tasks demonstrate that BD-Merging achieves superior effectiveness and robustness compared to state-of-the-art MM baselines.
PaperID: 1594,   Poster  https://arxiv.org/pdf/2512.21038    
Authors: Yiwen Shan, Haiyu Zhao, Peng Hu, Xi Peng, Yuanbiao Gou
Title: Next-Scale Prediction: A Self-Supervised Approach for Real-World Image Denoising
Abstract: Self-supervised real-world image denoising remains a fundamental challenge, arising from the antagonistic trade-off between decorrelating spatially structured noise and preserving high-frequency details. Existing blind-spot network (BSN) methods rely on pixel-shuffle downsampling (PD) to decorrelate noise, but aggressive downsampling fragments fine structures, while milder downsampling fails to remove correlated noise. To address this, we introduce Next-Scale Prediction (NSP), a novel self-supervised paradigm that decouples noise decorrelation from detail preservation. NSP constructs cross-scale training pairs, where BSN takes low-resolution, fully decorrelated sub-images as input to predict high-resolution targets that retain fine details. As a by-product, NSP naturally supports super-resolution of noisy images without retraining or modification. Extensive experiments demonstrate that NSP achieves state-of-the-art self-supervised denoising performance on real-world benchmarks, significantly alleviating the long-standing conflict between noise decorrelation and detail preservation.
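The pixel-shuffle downsampling (PD) step that this abstract refers to can be sketched as plain strided sub-sampling. This is only an illustration of PD on a nested-list image, not of NSP itself; the function name and the toy 4x4 image are hypothetical.

```python
def pixel_shuffle_downsample(image, stride):
    """Split a 2D image (list of rows) into stride * stride sub-images.

    Sub-image (dy, dx) keeps the pixels at rows dy, dy + stride, ... and
    columns dx, dx + stride, ..., so neighbours inside a sub-image were
    `stride` pixels apart in the original image.
    """
    height, width = len(image), len(image[0])
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by the stride")
    return [
        [row[dx::stride] for row in image[dy::stride]]
        for dy in range(stride)
        for dx in range(stride)
    ]


# Toy 4x4 image whose pixel values equal their raster index.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
sub_images = pixel_shuffle_downsample(image, 2)

for sub_image in sub_images:
    print(sub_image)
```

A larger stride separates the sampled pixels further, which weakens spatial noise correlation but also thins out fine structures; this is the trade-off the abstract describes.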
PaperID: 1595,   Poster  https://arxiv.org/pdf/2603.26211    
Authors: Shrinidhi Kumbhar, Haofu Liao, Srikar Appalaraju, Kunwar Yashraj Singh
Title: Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
Abstract: Autoregressive (AR) vision–language models (VLMs) have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. Recently, discrete diffusion vision–language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative to AR models for GUI grounding. We adapt LLaDA-V for single-turn action and bounding-box prediction, framing the task as text generation from multimodal input. To better capture the hierarchical structure of bounding-box geometry, we propose a hybrid masking schedule that combines linear and deterministic masking, improving grounding accuracy by up to 6.1 points in Step Success Rate (SSR) over the GUI-adapted LLaDA-V trained with linear masking. Evaluations on four datasets spanning web, desktop, and mobile interfaces show that the adapted diffusion model with hybrid masking consistently outperforms the linear-masked variant and performs competitively with autoregressive counterparts despite limited pretraining. Systematic ablations reveal that increasing diffusion steps, generation length, and block length improves accuracy but also increases latency, with accuracy plateauing beyond a certain number of diffusion steps. Expanding the training data with diverse GUI domains further reduces latency by about 1.3 seconds and improves grounding accuracy by an average of 20 points across benchmarks. These results demonstrate that discrete DVLMs are a promising modeling framework for GUI grounding and represent an important step toward diffusion-based GUI agents.
PaperID: 1596,   Poster  https://arxiv.org/pdf/2510.10489    
Authors: Li Jiaye, Baoyou Chen, Hui Li, Zilong Dong, Jingdong Wang, Siyu Zhu
Title: Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation
Abstract: Transformers rely on explicit positional encoding to model structure in data. While Rotary Position Embedding (RoPE) excels in 1D domains, its application to image generation reveals significant limitations in areas such as fine-grained spatial relation modeling, color cues, and object counting. This paper identifies key limitations of standard multi-dimensional RoPE (rigid frequency allocation, axis-wise independence, and uniform head treatment) in capturing the complex structural biases required for fine-grained image generation. We propose HARoPE, a head-wise adaptive extension that inserts a learnable linear transformation parameterized via singular value decomposition (SVD) before the rotary mapping. This lightweight modification enables dynamic frequency reallocation, semantic alignment of rotary planes, and head-specific positional receptive fields while rigorously preserving RoPE’s relative-position property. Extensive experiments on class-conditional ImageNet and text-to-image generation (Flux and MMDiT) demonstrate that HARoPE consistently improves performance over strong RoPE baselines and other extensions. The method serves as an effective drop-in replacement, offering a principled and adaptable solution for enhancing positional awareness in transformer-based image generative models.
PaperID: 1597,   Poster  https://arxiv.org/pdf/2603.29855    
Authors: Jianyu Lai, Sixiang Chen, Jialin Gao, Hengyu Shi, Zhongying Liu, Fuxiang Zhai, Junfeng Luo, Xiaoming Wei, Lujia Wang, Lei Zhu
Title: PosterReward: Unlocking Accurate Evaluation for High-Quality Graphic Design Generation
Abstract: Recent advancements in the text-rendering capabilities of image generation models have made the end-to-end creation of graphic design content, such as posters, increasingly feasible. However, existing reward models fail to accurately assess the quality of graphic design. They primarily focus on global image aesthetics and lack the capacity to evaluate two other core elements of graphic design: typography and layout. Furthermore, current text-to-image preference datasets suffer from a scarcity of data related to graphic design, which hinders the further development of generative models in this domain. To address this gap, we have designed an automated pipeline to construct a high-quality dataset of 70k poster preferences. Subsequently, we have developed PosterReward, a reward model capable of accurately assessing the quality of generated posters, by leveraging a cascaded, multi-stage training pipeline. We also provide multiple variants of the model to cater to different application scenarios. Finally, we introduce two benchmarks to evaluate the performance of existing reward models in poster assessment and the capabilities of current image generation models in poster creation, respectively.
PaperID: 1598,   Poster  https://arxiv.org/pdf/2512.13276    
Authors: Yan Li, Lin Liu, Xiaopeng Zhang, Wei Xue, Wenhan Luo, Yike Guo, Qi Tian
Title: CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing
Abstract: Instruction-based image editing with diffusion models has achieved impressive results, yet existing methods struggle with fine-grained instructions specifying precise attributes such as colors, positions, and quantities. While recent approaches employ Group Relative Policy Optimization (GRPO) for alignment, they optimize only at individual sampling steps, providing sparse feedback that limits trajectory-level control. We propose a unified framework, CogniEdit, combining multi-modal reasoning with dense reward optimization that propagates gradients across consecutive denoising steps, enabling trajectory-level gradient flow through the sampling process. Our method comprises three components: (1) Multi-modal Large Language Models for decomposing complex instructions into actionable directives, (2) Dynamic Token Focus Relocation that adaptively emphasizes fine-grained attributes, and (3) Dense GRPO-based Optimization that propagates gradients across consecutive steps for trajectory-level supervision. Extensive experiments on benchmark datasets demonstrate that CogniEdit achieves state-of-the-art performance in balancing fine-grained instruction following with visual quality and editability preservation.
PaperID: 1599,   Poster  https://arxiv.org/pdf/2411.10639    
Authors: Yunsheng Ma, Burhan Yaman, Xin Ye, Jingru Luo, Feng Tao, Abhirup Mallik, Ziran Wang, Liu Ren
Title: MTA: Multimodal Task Alignment for BEV Perception and Captioning
Abstract: Bird's-eye-view (BEV)-based 3D perception plays a crucial role in autonomous driving applications. The rise of large language models has spurred interest in BEV-based captioning to understand object behavior in the surrounding environment. However, existing approaches treat perception and captioning as separate tasks, focusing on the performance of only one task and overlooking the potential benefits of multimodal alignment. To bridge this gap between modalities, we introduce MTA, a novel multimodal task alignment framework that boosts both BEV perception and captioning. MTA consists of two key components: (1) BEV-Language Alignment (BLA), a contextual learning mechanism that aligns the BEV scene representations with ground-truth language representations, and (2) Detection-Captioning Alignment (DCA), a cross-modal prompting mechanism that aligns detection and captioning outputs. MTA seamlessly integrates into state-of-the-art baselines during training, adding no extra computational complexity at runtime. Extensive experiments on the nuScenes and TOD3Cap datasets show that MTA significantly outperforms state-of-the-art baselines in both tasks, achieving a 10.7% improvement in challenging rare perception scenarios and a 9.2% improvement in captioning. These results underscore the effectiveness of unified alignment in reconciling BEV-based perception and captioning.
PaperID: 1600,   Poster  https://arxiv.org/pdf/2511.16955    
Authors: Dailan He, Guanlin Feng, Xingtong Ge, Yazhe Niu, Yi Zhang, Bingqi Ma, Guanglu Song, Yu Liu, Hongsheng Li
Title: Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models
Abstract: Group Relative Policy Optimization (GRPO) has shown promise in aligning image and video generative models with human preferences. However, applying it to modern flow matching models is challenging because of their deterministic sampling paradigm. Current methods address this issue by converting Ordinary Differential Equations (ODEs) to Stochastic Differential Equations (SDEs), which introduce stochasticity. However, this SDE-based GRPO suffers from inefficient credit assignment and incompatibility with high-order solvers for fewer-step sampling. In this paper, we first reinterpret existing SDE-based GRPO methods from a distance optimization perspective, revealing their underlying mechanism as a form of contrastive learning. Based on this insight, we propose Neighbor GRPO, a novel alignment algorithm that completely bypasses the need for SDEs. Neighbor GRPO generates a diverse set of candidate trajectories by perturbing the initial noise conditions of the ODE and optimizes the model using a softmax distance-based surrogate leaping policy. We establish a theoretical connection between this distance-based objective and policy gradient optimization, rigorously integrating our approach into the GRPO framework. Our method fully preserves the advantages of deterministic ODE sampling, including efficiency and compatibility with high-order solvers. We further introduce symmetric anchor sampling for computational efficiency and group-wise quasi-norm reweighting to address reward flattening. Extensive experiments demonstrate that Neighbor GRPO significantly outperforms SDE-based counterparts in terms of training cost, convergence speed, and generation quality.
PaperID: 1601,   Poster  https://arxiv.org/pdf/2601.17830    
Authors: Mengmeng Wang, Dengyang Jiang, Liuzhuozheng Li, Yucheng Lin, Guojiang Shen, Xiangjie Kong, Yong Liu, Guang Dai, Jingdong Wang
Title: VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training
Abstract: Denoising-based diffusion transformers, despite their strong generation performance, suffer from inefficient training convergence. Existing methods addressing this issue, such as REPA (relying on external representation encoders) or SRA (requiring dual-model setups), inevitably incur heavy computational overhead during training due to external dependencies. To tackle these challenges, this paper proposes VAE-REPA, a lightweight intrinsic guidance framework for efficient diffusion training. VAE-REPA leverages off-the-shelf pre-trained Variational Autoencoder (VAE) features: their reconstruction property ensures inherent encoding of visual priors like rich texture details, structural patterns, and basic semantic information. Specifically, VAE-REPA aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. This design accelerates training without extra representation encoders or dual-model maintenance, resulting in a simple yet effective pipeline. Extensive experiments demonstrate that VAE-REPA improves both generation quality and training convergence speed compared to vanilla diffusion transformers, matches or outperforms state-of-the-art acceleration methods, and incurs merely 4% extra GFLOPs with zero additional cost for external guidance models.
PaperID: 1602,   Poster  https://arxiv.org/pdf/2511.21185    
Authors: Joonhyung Park, Hyeongwon Jang, Joowon Kim, Eunho Yang
Title: Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation
Abstract: Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt then guides subsequent image generation to mitigate the blueprint deficiency. Together, GridAR achieves higher-quality results under limited test-time scaling: with N=4, it even outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. It also generalizes to autoregressive image editing, showing comparable edit quality and a 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines. The source code will be publicly released.
PaperID: 1603,   Poster  https://arxiv.org/pdf/2510.08562    
Authors: Zhiyu Zheng, Shaoyu Chen, Haoran Yin, Xinbang Zhang, Jialv Zou, Xinggang Wang, Qian Zhang, Lefei Zhang
Title: ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving
Abstract: End-to-end autonomous driving (E2EAD) systems, which learn to predict future trajectories directly from sensor data, are fundamentally challenged by the inherent spatio-temporal imbalance of trajectory data. This imbalance creates a significant optimization burden, causing models to learn spurious correlations instead of robust driving logic, while also prioritizing uncertain, distant predictions, thereby compromising immediate safety. To address these issues, we propose ResAD, a novel Normalized Residual Trajectory Modeling framework. Instead of predicting the future trajectory directly, our approach reframes and simplifies the learning task by predicting the residual deviation from a deterministic inertial reference. This inertial reference serves as a strong physical prior, compelling the model to move beyond simple pattern-matching and instead focus its capacity on learning the necessary, context-driven deviations (e.g., traffic rules, obstacles) from this default, inertially-guided path. To mitigate the optimization imbalance caused by uncertain, long-term horizons, ResAD further incorporates Point-wise Normalization of the predicted residual. This technique re-weights the optimization objective, preventing large-magnitude errors associated with distant, uncertain waypoints from dominating the learning signal. On the NAVSIM v1 and v2 benchmarks, ResAD achieves state-of-the-art results of 88.8 PDMS and 85.5 EPDMS with only two denoising steps, demonstrating that ResAD significantly simplifies the learning task and improves planning performance. The code will be released to facilitate further research.
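The idea of predicting a normalized residual from an inertial reference, as described in this abstract, can be sketched as follows. The abstract does not specify how the reference or the normalization is computed, so the constant-velocity extrapolation, the per-waypoint scale values, and all function names below are assumptions made purely for illustration.

```python
def inertial_reference(position, velocity, horizon, dt):
    """Assumed constant-velocity extrapolation of the current ego state."""
    return [
        (position[0] + velocity[0] * dt * k, position[1] + velocity[1] * dt * k)
        for k in range(1, horizon + 1)
    ]


def residual_targets(trajectory, reference):
    """Deviation of each future waypoint from the inertial reference."""
    return [(tx - rx, ty - ry) for (tx, ty), (rx, ry) in zip(trajectory, reference)]


def pointwise_normalize(residuals, scales):
    """Divide each waypoint's residual by its own (assumed) scale."""
    return [(dx / s, dy / s) for (dx, dy), s in zip(residuals, scales)]


# Hypothetical example: the ego vehicle drifts left while driving straight.
reference = inertial_reference(position=(0.0, 0.0), velocity=(10.0, 0.0), horizon=3, dt=0.5)
trajectory = [(5.0, 0.5), (10.0, 2.0), (15.0, 4.5)]
residuals = residual_targets(trajectory, reference)

# Assumed per-waypoint scales that grow with the prediction horizon.
normalized = pointwise_normalize(residuals, scales=[0.5, 2.0, 4.5])

print(residuals)
print(normalized)
```

In this toy case the raw residuals grow with the horizon, while the normalized residuals have the same magnitude at every waypoint, so distant waypoints no longer dominate a squared-error objective.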
PaperID: 1604,   Poster  https://arxiv.org/pdf/2603.20403    
Authors: Maxime Fontana, Michael Spratling, Miaojing Shi
Title: FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection
Abstract: Adapting models pretrained on large-scale datasets is a proven way to reach strong performance quickly for downstream tasks. However, the constant growth of state-of-the-art models makes traditional full fine-tuning unsuitable and difficult, especially for multi-task learning (MTL) where cost scales with the number of tasks. As a result, recent studies investigate parameter-efficient fine-tuning (PEFT) using low-rank adaptation to significantly reduce the number of trainable parameters. However, these existing methods use a single, fixed rank, which may not be optimal for different tasks or positions in the MTL architecture. Moreover, these methods fail to learn spatial information that captures inter-task relationships and helps to improve diverse task predictions. This paper introduces Frequency-Aware and Automatic Rank (FAAR) for efficient MTL fine-tuning. Our method introduces Performance-Driven Rank Shrinking (PDRS) to allocate the optimal rank per adapter location and per task. Moreover, by analyzing the image frequency spectrum, FAAR proposes a Task-Spectral Pyramidal Decoder (TS-PD) that injects input-specific context into spatial bias learning to better reflect cross-task relationships. Experiments performed on dense visual task benchmarks show the superiority of our method in terms of both accuracy and efficiency compared to other PEFT methods in MTL. FAAR reduces the number of parameters by up to 10.3 times compared to traditional MTL fine-tuning whilst boosting performance on all tasks.
PaperID: 1605,   Poster  https://arxiv.org/pdf/2511.19773    
Authors: Meng Lu, Ran Xu, Yi Fang, Wenxuan Zhang, Yue Yu, Gaurav Srivastava, Yuchen Zhuang, Mohamed Elhoseiny, Charles Fleming, Carl Yang, Zhengzhong Tu, Yang Xie, Guanghua Xiao, Di Jin, Wenqi Shi, Xuan Wang
Title: Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs
Abstract: While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to “think with images,” i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. While recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning. Extensive experiments across 11 public reasoning-intensive VQA benchmarks show that VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72%, demonstrating VISTA-Gym as an effective training ground to unlock the tool-integrated reasoning capabilities for VLMs.
PaperID: 1606,   Poster  https://arxiv.org/pdf/2512.08441    
Authors: Luca Cogo, Marco Buzzelli, Simone Bianco, Javier Vazquez-Corral, Raimondo Schettini
Title: Leveraging Multispectral Sensors for Color Correction in Mobile Cameras
Abstract: Recent advances in snapshot multispectral (MS) imaging have enabled compact, low-cost spectral sensors for consumer and mobile devices. By capturing richer spectral information than conventional RGB sensors, these systems can enhance key imaging tasks, including color correction. However, most existing methods treat the color correction pipeline in separate stages, often discarding MS data early in the process. We propose a unified, learning-based framework that (i) performs end-to-end color correction and (ii) jointly leverages data from a high-resolution RGB sensor and an auxiliary low-resolution MS sensor. Our approach integrates the full pipeline within a single model, producing coherent and color-accurate outputs. We demonstrate the flexibility and generality of our framework by refactoring two different state-of-the-art image-to-image architectures. To support training and evaluation, we construct a dedicated dataset by aggregating and repurposing publicly available spectral datasets, rendering under multiple RGB camera sensitivities. Extensive experiments show that our approach improves color accuracy and stability, reducing error by up to 50% compared to RGB-only and MS-driven baselines. Datasets, code, and models will be made available upon acceptance.
PaperID: 1607,   Poster  https://arxiv.org/pdf/2510.24078    
Authors: William Yang, Xindi Wu, Zhiwei Deng, Esin Tureci, Olga Russakovsky
Title: Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
Abstract: Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating synthetic training data to improve fine-grained classification performance remains challenging. Fine-tuning the T2I model with a few real examples can help generate more appropriate synthetic training data; however, this fine-tuning may also introduce overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (Beyond OBjects) for mitigating these concerns. Given a small set of real examples, we first describe them using class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, thus preserving the T2I model’s generative prior and reducing estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets demonstrate state-of-the-art performance in low-shot fine-grained classification when augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with 5 real images augmented with 100 synthetic images). Additionally, in three of the four datasets, fine-tuning downstream models with synthetic data generated by BOB and five real images achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with over 2% accuracy improvements in 14 of these settings.
PaperID: 1608,   Poster  https://arxiv.org/pdf/2512.22065    
Authors: Zhiyao Sun, Ziqiao Peng, Yifeng Ma, Yi Chen, zhengguang zhou, Zixiang Zhou, Guozhen Zhang, Youliang Zhang, Yuan Zhou, qinglin lu, Yong-Jin Liu
Title: StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars
Abstract: Real-time, streaming interactive avatars represent a critical yet challenging goal in digital human research. Although diffusion-based human avatar generation methods achieve remarkable success, their non-causal architecture and high computational costs make them unsuitable for streaming. Moreover, existing interactive approaches are typically limited to the head-and-shoulder region, restricting their ability to produce gestures and body motions. To address these challenges, we propose a two-stage autoregressive adaptation and acceleration framework that applies autoregressive distillation and adversarial refinement to adapt a high-fidelity human video diffusion model for real-time, interactive streaming. To ensure long-term stability and consistency, we introduce three key components: a Reference Sink, a Reference-Anchored Positional Re-encoding (RAPR) strategy, and a Consistency-Aware Discriminator. Building on this framework, we develop a one-shot, interactive, human avatar model capable of generating both natural talking and listening behaviors with coherent gestures. Extensive experiments demonstrate that our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness.
PaperID: 1609,   Poster  https://arxiv.org/pdf/2511.20994    
Authors: Yuxiao Xiang, Junchi Chen, Zhenchao Jin, Changtao Miao, Haojie Yuan, Qi Chu, Tao Gong, Nenghai Yu
Title: GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety Supervision
Abstract: Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question–Thinking–Answer (QTA) pipeline via joint image–text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via an MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, the GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods. The code will be made publicly available.
PaperID: 1610,   Poster  https://arxiv.org/pdf/2602.23645    
Authors: Tongyan Hua, Haoran Gong, Yuan Liu, Di Wang, Ying-Cong Chen, Wufan Zhao
Title: BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds
Abstract: We introduce BuildAnyPoint, a novel generative framework for structured 3D building reconstruction from point clouds with diverse distributions, such as those captured by airborne LiDAR and Structure-from-Motion. To recover artist-created building abstraction in this highly underconstrained setting, we capitalize on the role of explicit 3D generative priors in autoregressive mesh generation. Specifically, we design a Loosely Cascaded Diffusion Transformer (Loca-DiT) that initially recovers the underlying distribution from noisy or sparse points, followed by autoregressively encapsulating them into compact meshes. We first formulate distribution recovery as a conditional generation task by training latent diffusion models conditioned on input point clouds, and then tailor a decoder-only transformer for conditional autoregressive mesh generation based on the recovered point clouds. Our method delivers substantial qualitative and quantitative improvements over prior building abstraction methods. Furthermore, the effectiveness of our approach is evidenced by the strong performance of its recovered point clouds on building point cloud completion benchmarks, which exhibit improved surface accuracy and distribution uniformity.
PaperID: 1611,   Poster  https://arxiv.org/pdf/2503.10125    
Authors: Yi Wu, Shengju Qian, Lingting Zhu, Lei Liu, Wandi Qiao, Ziqiang Li, Lequan Yu, Bin Li
Title: Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation
Abstract: Multimodal autoregressive (AR) models, based on next-token prediction and transformer architecture, have demonstrated remarkable capabilities in various multimodal tasks including text-to-image (T2I) generation. Despite their strong performance in general T2I tasks, our research reveals that these models initially struggle with subject-driven image generation compared to dominant diffusion models. To address this limitation, we introduce Proxy-Tuning, leveraging diffusion models to enhance AR models' capabilities in subject-specific image generation. Our method reveals a striking weak-to-strong phenomenon: fine-tuned AR models consistently outperform their diffusion model supervisors in both subject fidelity and prompt adherence. We analyze this performance shift and identify scenarios where AR models excel, particularly in multi-subject compositions and contextual understanding. This work not only demonstrates impressive results in subject-driven AR image generation, but also unveils the potential of weak-to-strong generalization in the image generation domain, contributing to a deeper understanding of different architectures' strengths and limitations.
PaperID: 1612,   Poster  https://arxiv.org/pdf/2511.20649    
Authors: Hidir Yesiltepe, Tuna Han Salih Meral, Kaan Akan, Kaan Oktay, Pinar Yanardag
Title: Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
Abstract: Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce ∞-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two anchor tokens, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish ∞-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that ∞-RoPE consistently surpasses previous autoregressive models in overall VBench scores while requiring only a fraction of their KV cache budget.
PaperID: 1613,   Poster  https://arxiv.org/pdf/2603.05530    
Authors: Wei Xue, Mingcheng Li, Xuecheng Wu, Jingqun Tang, Dingkang Yang, Lihua Zhang
Title: ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation
Abstract: Vision-and-Language Navigation (VLN) requires agents to accurately perceive complex visual environments and reason over navigation instructions and histories. However, existing methods passively process redundant visual inputs and treat all historical contexts indiscriminately, resulting in inefficient perception and unfocused reasoning. To address these challenges, we propose ProFocus, a training-free progressive framework that unifies Proactive Perception and Focused Reasoning through collaboration between large language models (LLMs) and vision-language models (VLMs). For proactive perception, ProFocus transforms panoramic observations into structured ego-centric semantic maps, enabling the orchestration agent to identify missing visual information needed for reliable decision-making, and to generate targeted visual queries with corresponding focus regions that guide the perception agent to acquire the required observations. For focused reasoning, we propose Branch-Diverse Monte Carlo Tree Search (BD-MCTS) to identify top-k high-value waypoints from extensive historical candidates. The decision agent focuses reasoning on the historical contexts associated with these waypoints, rather than considering all historical waypoints equally. Extensive experiments validate the effectiveness of ProFocus, achieving state-of-the-art performance among zero-shot methods on R2R and REVERIE benchmarks.
PaperID: 1614,   Poster  https://arxiv.org/pdf/2603.07989    
Authors: Teng Wang, Yanting Lu, Ruize Wang
Title: AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models
Abstract: We present AutoTraces, an autoregressive vision-language-trajectory model for robot trajectory forecasting in human-populated environments, which harnesses the inherent reasoning capabilities of large language models (LLMs) to model complex human behaviors. In contrast to prior works that rely solely on textual representations, our key innovation lies in a novel trajectory tokenization scheme, which represents physical waypoints as discrete tokens, seamlessly integrated into the LLM’s space through a lightweight encoder-decoder architecture. This design preserves the LLM’s native autoregressive generation mechanism while extending it to physical coordinate spaces, and facilitates modeling of long-term interactions in trajectory data. We further introduce an automated chain-of-thought (CoT) generation mechanism that leverages a multimodal LLM to infer spatio-temporal relationships from visual observations and trajectory data, eliminating reliance on manual annotation. Through a two-stage training strategy, AutoTraces achieves SOTA forecasting accuracy, particularly in long-horizon prediction, while exhibiting strong cross-scene generalization and supporting flexible-length forecasting.
PaperID: 1615,   Poster  https://arxiv.org/pdf/2512.19243    
Authors: Meng Chu, Senqiao Yang, Haoxuan Che, Suiyun Zhang, Xichen Zhang, Shaozuo Yu, Haokun GUI, Zhefan Rao, Dandan Tu, Rui Liu, Jiaya Jia
Title: VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis
Abstract: Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate models’ performance in real-world settings, we introduce Long Goal Bench (LGBench), a 2000-task suite (1000 T2I, 1000 I2I) whose average instruction contains 18–22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find that even state-of-the-art commercial APIs satisfy fewer than 72% of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present VisionDirector, a training-free, vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling plus semantic verification/rollback after every edit, and (iv) logs goal-level rewards. We further fine-tune the planner with Group Relative Policy Optimization, yielding shorter edit trajectories (3.1 vs. 4.2 steps) and stronger alignment. VisionDirector achieves a new state of the art on GenEval (+7% overall) and ImgEdit (+0.07 absolute) while producing consistent qualitative improvements on typography, multi-object scenes, and pose editing. The code, benchmark, and evaluation scripts will be released.
PaperID: 1616,   Poster  https://arxiv.org/pdf/2603.06181    
Authors: Mingzhe Li, Mengyin Liu, Zekai Wu, Xincheng Lin, Junsheng Zhang, Ming Yan, Zengye Xie, Changwang Zhang, Chenglu Wen, Lan Xu, Siqi Shen, Cheng Wang
Title: Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots
Abstract: Humanoid robots have achieved significant progress in motion generation and control, exhibiting movements that appear increasingly natural and human-like. Inspired by the Turing Test, we propose the Motion Turing Test, a framework that evaluates whether human observers can discriminate between humanoid robot and human poses using only kinematic information. To facilitate this evaluation, we present the Human-Humanoid Motion (HHMotion) dataset, which consists of 1,000 motion sequences spanning 15 action categories, performed by 11 humanoid models and 10 human subjects. All motion sequences are converted into SMPL-X representations to eliminate the influence of visual appearance. We recruited 30 annotators to rate the human-likeness of each pose on a 0–5 scale, resulting in over 500 hours of annotation. Analysis of the collected data reveals that humanoid motions still exhibit noticeable deviations from human movements, particularly in dynamic actions such as jumping, boxing, and running. Building on HHMotion, we formulate a human-likeness evaluation task that aims to automatically predict human-likeness scores from motion data. Despite recent progress in multimodal large language models, we find that they remain inadequate for assessing motion human-likeness. To address this gap, we propose a simple baseline model and demonstrate that it outperforms several contemporary LLM-based approaches. The dataset, code, and benchmark will be publicly released to support future research in the community.
PaperID: 1617,   Poster  https://arxiv.org/pdf/2601.06338    
Authors: Binxu Wang, Jingxuan Fan, Xu Pan
Title: Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers
Abstract: Diffusion Transformers (DiTs) have greatly advanced text-to-image generation, but models still struggle to generate the correct spatial relations between objects as specified in the text prompt. In this study, we adopt a mechanistic interpretability approach to investigate how a DiT can generate correct spatial relations between objects. We train, from scratch, DiTs of different sizes with different text encoders to learn to generate images containing two objects whose attributes and spatial relations are specified in the text prompt. We find that, although all the models can learn this task to near-perfect accuracy, the underlying mechanisms differ drastically depending on the choice of text encoder. When using random text embeddings, we find that the spatial-relation information is passed to image tokens through a two-stage circuit, involving two cross-attention heads that separately read the spatial relation and single-object attributes in the text prompt. When using a pretrained text encoder (T5), we find that the DiT uses a different circuit that leverages information fusion in the text tokens, reading spatial-relation and single-object information together from a single text token. We further show that, although the in-domain performance is similar for the two settings, their robustness to out-of-domain perturbations differs, potentially suggesting the difficulty of generating correct relations in real-world scenarios.
PaperID: 1618,   Poster  https://arxiv.org/pdf/2511.22147    
Authors: Yanping LI, Zhening Liu, Zijian Li, Zehong Lin, Jun Zhang
Title: RemedyGS: Defend 3D Gaussian Splatting Against Computation Cost Attacks
Abstract: As a mainstream technique for 3D reconstruction, 3D Gaussian splatting (3DGS) has been applied in a wide range of applications and services. Recent studies have revealed critical vulnerabilities in this pipeline and introduced computation cost attacks that lead to malicious resource occupancies and even denial-of-service (DoS) conditions, thereby hindering the reliable deployment of 3DGS. In this paper, we propose the first effective and comprehensive black-box defense framework, named RemedyGS, against such computation cost attacks, safeguarding 3DGS reconstruction systems and services. Our pipeline comprises two key components: a detector to identify the attacked input images with poisoned textures and a purifier to recover the benign images from their attacked counterparts, mitigating the adverse effects of these attacks. Moreover, we incorporate adversarial training into the purifier to enforce distributional alignment between the recovered and original natural images, thereby enhancing the defense efficacy. Experimental results demonstrate that our framework effectively defends against white-box, black-box, and adaptive attacks in 3DGS systems, achieving state-of-the-art performance in both safety and utility.
PaperID: 1619,   Poster  https://arxiv.org/pdf/2601.03928    
Authors: Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, Hwee Tou Ng
Title: FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
Abstract: Vision-Language Models (VLMs) have shown strong performance on User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4,700 for 2K resolution), which incurs significant computational overhead and dilutes attention. In contrast, humans typically focus on regions of interest when interacting with UIs. In this work, we pioneer the task of efficient UI grounding. We propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) eliminating redundant tokens in visual encoding by constructing patch-level supervision that fuses an instruction-conditioned score with a rule-based UI-graph score, down-weighting large homogeneous regions to select distinct and instruction-relevant visual tokens; and (2) preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer from severe accuracy degradation on UI grounding tasks due to broken positional information. To address this, we introduce a PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence’s last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a 3.7% performance improvement over GUI-Actor-7B. Even with only 30% visual token retention, the performance of FocusUI-7B drops by just 3.2%, while achieving up to 1.44x faster inference and 17% lower peak GPU memory.
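Editor's sketch (not from the paper): the PosPad rule stated in the FocusUI abstract above, where each contiguous run of dropped tokens collapses into one marker at the run's last index, can be illustrated on a 1D token sequence. The function name `pospad_select`, the boolean `keep` mask, and the `"tok"`/`"pad"` labels are hypothetical; the actual method operates on visual tokens inside a VLM and may differ in detail.

```python
# Toy illustration of run-collapsing with position preservation.
# Kept tokens retain their original index; every maximal run of dropped
# tokens is replaced by one marker located at the run's last index.

def pospad_select(keep):
    """Map a boolean keep-mask to a list of (original_index, kind) pairs.

    kind is "tok" for a retained token and "pad" for the single marker
    that stands in for a contiguous run of dropped tokens.
    """
    out = []
    n = len(keep)
    for i, kept in enumerate(keep):
        if kept:
            out.append((i, "tok"))
        elif i == n - 1 or keep[i + 1]:
            # Index i is the last position of a dropped run.
            out.append((i, "pad"))
    return out


# Example: 8 tokens, of which indices 0, 3 and 5 are retained.
mask = [True, False, False, True, False, True, False, False]
selected = pospad_select(mask)
```

In this example the eight input positions shrink to six entries, and the surviving indices remain monotonically increasing, so the position information seen by later layers stays ordered.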
PaperID: 1620,   Poster  https://arxiv.org/pdf/2603.14750    
Authors: CailingHan CailingHan, Zhangbin Li, Jinxing Zhou, Wei Qian, Jingjing Hu, Yanghao Zhou, Zhangling Duan, Dan Guo
Title: Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization
Abstract: Point-level weakly-supervised temporal sentiment localization (P-WTSL) aims to detect sentiment-relevant segments in untrimmed multimodal videos using timestamp sentiment annotations, which greatly reduces the need for costly frame-level labeling. To tackle the intrinsic challenges of imprecise sentiment boundaries in P-WTSL, we propose the Face-guided Sentiment Boundary Enhancement Network (FSENet), a unified framework that leverages fine-grained facial features to guide sentiment localization. Specifically, our approach first introduces the Face-guided Sentiment Discovery (FSD) module, which integrates facial features into multimodal interaction via dual-branch modeling for effective sentiment stimuli clues. We then propose the Point-aware Sentiment Semantics Contrast (PSSC) strategy to discriminate sentiment semantics of candidate points (frame-level) near annotation points via contrastive learning, thereby enhancing the model's ability to recognize sentiment boundaries. Finally, we design the Boundary-aware Sentiment Pseudo-label Generation (BSPG) approach to convert sparse point annotations into temporally smooth supervisory pseudo-labels. Extensive experiments and visualizations on the benchmark demonstrate the effectiveness of our framework, achieving state-of-the-art performance under full supervision, video-level, and point-level weak supervision, thereby showcasing the strong generalization ability of our FSENet across different annotation settings. The code will be open source to the public.
PaperID: 1621,   Poster  https://arxiv.org/pdf/2502.01572    
Authors: Yiren Song, Cheng Liu, Mike Zheng Shou
Title: MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
Abstract: A hallmark of human intelligence is the ability to create complex artifacts through structured multi-step processes. Generating procedural tutorials with AI is a longstanding but challenging goal, facing three key obstacles: (1) scarcity of multi-task procedural datasets, (2) maintaining logical continuity and visual consistency between steps, and (3) generalizing across multiple domains. To address these challenges, we propose a multi-domain dataset covering 21 tasks with over 24,000 procedural sequences. Building upon this foundation, we introduce MakeAnything, a framework based on the diffusion transformer (DIT), which leverages fine-tuning to activate the in-context capabilities of DIT for generating consistent procedural sequences. We introduce asymmetric low-rank adaptation (LoRA) for image generation, which balances generalization capabilities and task-specific performance by freezing encoder parameters while adaptively tuning decoder layers. Additionally, our ReCraft model enables image-to-process generation through spatiotemporal consistency constraints, allowing static images to be decomposed into plausible creation sequences. Extensive experiments demonstrate that MakeAnything surpasses existing methods, setting new performance benchmarks for procedural generation tasks.
PaperID: 1622,   Poster  https://arxiv.org/pdf/2603.11566    
Authors: Zhongyu Xia, Yousen Tang, Yongtao Wang, Zhifeng Wang, Weijun Qin
Title: R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection
Abstract: The 4D radar–camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D radar and camera data confront several challenges. First, their absolute depth estimation module is not robust and accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion module will degrade dramatically or even fail when the ego vehicle's pose is missing or inaccurate. Third, for some small objects, the sparse radar point clouds may completely fail to reflect from their surfaces. In such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose. In addition, we build an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets.
PaperID: 1623,   Poster  https://arxiv.org/pdf/2501.06138    
Authors: Arkaprava Sinha, Monish Raj, Pu Wang, Ahmed Helmy, Hieu Le, Srijan Das
Title: MS-Temba: Multi-Scale Temporal Mamba for Understanding Long Untrimmed Videos
Abstract: Temporal Action Detection (TAD) in untrimmed videos poses significant challenges, particularly for Activities of Daily Living (ADL) requiring models to (1) process long-duration videos, (2) capture temporal variations in actions, and (3) simultaneously detect dense overlapping actions. Existing CNN and Transformer-based approaches struggle to jointly capture fine-grained detail and long-range structure at scale. State-space Model (SSM) based Mamba offers powerful long-range modeling, but naive application to TAD collapses fine-grained temporal structure and fails to account for the challenges inherent to TAD. To this end, we propose Multi-Scale Temporal Mamba (MS-Temba), which extends Mamba to TAD with newly introduced dilated SSMs. Each Temba block, comprising dilated SSMs coupled with our proposed additional losses, enables the learning of discriminative representations across temporal scales. A lightweight Multi-scale Mamba Fuser then unifies these multi-scale features via SSM-based aggregation, yielding precise action-boundary localization. With only 17M parameters, MS-Temba achieves state-of-the-art performance on densely labeled ADL benchmarks TSU & Charades, and further generalizes to long-form video summarization, setting new state-of-the-art results on TVSum & SumMe.
PaperID: 1624,   Poster  https://arxiv.org/pdf/2512.06332    
Authors: Jeffrey Gu, Minkyu Jeon, Ambri Ma, Serena Yeung, Ellen Zhong
Title: CryoHype: Reconstructing a thousand cryo-EM structures with transformer-based hypernetworks
Abstract: Cryo-electron microscopy (cryo-EM) is an indispensable technique for determining the 3D structures of dynamic biomolecular complexes. While typically applied to image a single molecular species, cryo-EM holds great potential for structure determination of many targets simultaneously in a high-throughput fashion. However, existing methods typically focus on modeling conformational heterogeneity within a single or very few structures and are not designed to resolve compositional heterogeneity arising from mixtures of many distinct molecular species. To address this challenge, we propose CryoHype, a transformer-based hypernetwork for cryo-EM reconstruction that dynamically adjusts the weights of an implicit neural representation based on the input structure. Using CryoHype, we successfully reconstruct 1,000 distinct structures from cryo-EM imaging in the fixed-pose setting without any pre-existing knowledge of the structures present, which is beyond the capabilities of any existing algorithm.
PaperID: 1625,   Poster  https://arxiv.org/pdf/2509.12546    
Authors: Yingxin Lai, Zitong YU, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao
Title: Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection
Abstract: Face forgery detection faces a critical challenge: a persistent gap between offline benchmarks and real-world efficacy, which we attribute to the ecological invalidity of training data. This work introduces Agent4FaceForgery to address two fundamental problems: (1) how to capture the diverse intents and iterative processes of human forgery creation, and (2) how to model the complex, often adversarial, text-image interactions that accompany forgeries in social media. To solve these problems, we propose a multi-agent framework where LLM-powered agents, equipped with profile and memory modules, simulate the forgery creation process. Crucially, these agents interact in a simulated social environment to generate samples labeled for nuanced text-image consistency, moving beyond simple binary classification. An Adaptive Rejection Sampling (ARS) mechanism ensures data quality and diversity. Extensive experiments validate that the data generated by our simulation-driven approach brings significant performance gains to detectors of multiple architectures, demonstrating the effectiveness and value of our framework.
PaperID: 1626,   Poster  https://arxiv.org/pdf/2603.03792    
Authors: Haowei Zhu, Tingxuan Huang, XING WANG, Tianyu Zhao, Jiexi Wang, Weifeng Chen, Xurui Peng, Fangmin Chen, Jun-Hai Yong, Bin Wang
Title: TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration
Abstract: Diffusion models achieve strong generative performance but remain slow at inference due to the need for repeated full-model denoising passes. We present Token-Adaptive Predictor (TAP), a training-free, probe-driven framework that adaptively selects a predictor for each token at every sampling step. TAP uses a single full evaluation of the model's first layer as a low-cost probe to compute proxy losses for a compact family of candidate predictors (instantiated primarily with Taylor expansions of varying order and horizon), then assigns each token the predictor with the smallest proxy error. This per-token “probe-then-select” strategy exploits heterogeneous temporal dynamics, requires no additional training, and is compatible with various predictor designs. TAP incurs negligible overhead while enabling large speedups with little or no perceptual quality loss. Extensive experiments across multiple diffusion architectures and generation tasks show that TAP substantially improves the accuracy–efficiency frontier compared to fixed global predictors and caching-only baselines.
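Illustrative sketch (not from the paper): the probe-then-select idea above can be mimicked in a few lines of NumPy. Everything named here is an assumption made for illustration: the function names `taylor_predict` and `probe_then_select`, the use of backward finite differences as a stand-in for Taylor expansions, equally spaced steps, and the mean squared error as the proxy loss. The probe features choose a predictor order per token, and that order is then applied to the cached full-model outputs.

```python
import numpy as np

def taylor_predict(history, order):
    """Extrapolate one step ahead from cached features.

    history: array (T, N, D), oldest first, assumed equally spaced in time.
    order 0 reuses the last value; each higher order adds one more
    backward finite difference (a discrete Taylor-style correction).
    """
    pred = history[-1].copy()
    diff = history
    for _ in range(order):
        diff = diff[1:] - diff[:-1]   # next-order backward difference
        pred = pred + diff[-1]
    return pred

def probe_then_select(probe_history, probe_now, output_history, orders=(0, 1, 2)):
    """Per token, pick the order with the smallest proxy error on the probe,
    then apply that order to the cached full-model outputs."""
    probe_preds = np.stack([taylor_predict(probe_history, k) for k in orders])
    proxy = ((probe_preds - probe_now[None]) ** 2).mean(axis=-1)   # (K, N)
    best = proxy.argmin(axis=0)                                    # (N,)
    out_preds = np.stack([taylor_predict(output_history, k) for k in orders])
    tokens = np.arange(out_preds.shape[1])
    return best, out_preds[best, tokens]                           # (N,), (N, D)
```

In this toy form, a token whose probe feature is constant over time keeps the cheapest order-0 predictor, while tokens with linear or curved trajectories are assigned higher orders.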
PaperID: 1627,   Poster  https://arxiv.org/pdf/2509.21029    
Authors: Runqi Lin, Alasdair Paren, Suqin Yuan, Muyang Li, Philip H.S. Torr, Adel Bibi, Tongliang Liu
Title: FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction
Abstract: The integration of new modalities enhances the capabilities of multimodal large language models (MLLMs) but also introduces additional vulnerabilities. In particular, simple visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these underdeveloped attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. In this work, we analyse the loss landscape of these jailbreaking attacks and find that the generated attacks tend to reside in high-sharpness regions, whose effectiveness is highly sensitive to even minor parameter changes during transfer. To further explain the high-sharpness localisations, we analyse their feature representations in both the intermediate layers and the spectral domain, revealing an improper reliance on narrow layer representations and semantically poor frequency components. Building on this, we propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions across layer features and rescales the influence of frequency features according to their semantic content. By eliminating non-generalizable reliance on both layer and spectral features, our method discovers flattened feasible regions for visual jailbreaking attacks, thereby improving cross-model transferability. Extensive experiments demonstrate that our approach effectively facilitates visual red-teaming evaluations against closed-source MLLMs.
PaperID: 1628,   Poster  https://arxiv.org/pdf/2601.00393    
Authors: Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, Zhaoxiang Zhang
Title: NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
Abstract: In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks.
PaperID: 1629,   Poster  https://arxiv.org/pdf/2511.22177    
Authors: Peiyu Yu, Suraj Kothawade, Sirui Xie, Ying Nian Wu, Hongliang Fei
Title: Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage
Abstract: Most post-training methods for text-to-image samplers focus on the model weights: either fine-tuning the backbone for alignment or distilling it for few-step efficiency. We take a different route: rescheduling the sampling timeline of a frozen sampler. Instead of a fixed, global schedule, we learn instance-level (prompt- and noise-conditioned) schedules through a single-pass Dirichlet policy. To ensure accurate gradient estimates in high-dimensional policy learning, we introduce a novel reward baseline based on a principled James–Stein estimator; it provably achieves lower estimation errors than commonly used variants and leads to superior results. Our rescheduled samplers consistently improve text–image alignment, including text rendering and compositional control, across modern Stable Diffusion and Flux model families. Additionally, a 5-step Flux-Dev sampler with our schedules can attain generation quality comparable to deliberately distilled samplers like Flux-Schnell. We thus position our scheduling framework as an emerging model-agnostic post-training lever that unlocks additional generative potential in pretrained samplers.
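Illustrative sketch (not from the paper): one generic way to build a shrinkage-based reward baseline is the textbook positive-part James-Stein estimator, which pulls per-prompt mean rewards toward the grand mean. The function name `james_stein_baseline`, the grouping of rewards by prompt, and the pooled noise-variance estimate are assumptions made here for illustration; the paper's own estimator may differ.

```python
import numpy as np

def james_stein_baseline(rewards):
    """Shrink per-prompt mean rewards toward the grand mean.

    rewards: array (G, S) with S >= 2 sampled rewards for each of G >= 4 prompts.
    Returns a (G,) array of baselines (positive-part James-Stein form).
    """
    G, S = rewards.shape
    means = rewards.mean(axis=1)
    grand = means.mean()
    # Estimated variance of each per-prompt mean, pooled over prompts.
    noise_var = rewards.var(axis=1, ddof=1).mean() / S
    spread = ((means - grand) ** 2).sum()
    shrink = 1.0 - (G - 3) * noise_var / max(spread, 1e-12)
    shrink = max(shrink, 0.0)   # positive part: never overshoot the grand mean
    return grand + shrink * (means - grand)

def advantages(rewards):
    """REINFORCE-style advantages: reward minus the shrunk baseline."""
    return rewards - james_stein_baseline(rewards)[:, None]
```

When per-prompt rewards are noisy relative to the spread between prompts, the baselines collapse toward a single global value; when they are clean, each prompt keeps its own mean.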
PaperID: 1630,   Poster  https://arxiv.org/pdf/2602.19430    
Authors: Dong-Guw Lee, Tai Hyoung Rhee, Hyunsoo Jang, Young-Sik Shin, Ukcheol Shin, Ayoung Kim
Title: TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation
Abstract: Despite the inherent advantages of thermal infrared (TIR) imaging, large-scale data collection and annotation remain a major bottleneck for TIR-based perception. A practical alternative is to synthesize pseudo TIR data via image translation; however, most RGB-to-TIR approaches heavily rely on RGB-centric priors that overlook thermal physics, yielding implausible heat distributions. In this paper, we introduce TherA, a controllable RGB-to-TIR translation framework that produces diverse and thermally plausible images at both scene and object level. TherA couples TherA-VLM with a latent-diffusion-based translator. Given a single RGB image and a user-prompted condition pair, TherA-VLM yields a thermal-aware embedding that encodes scene, object, material, and heat-emission context reflecting the input scene-condition pair. Conditioning the diffusion model on this embedding enables realistic TIR synthesis and fine-grained control across time of day, weather, and object state. Compared to other baselines, TherA achieves state-of-the-art translation performance, improving zero-shot translation performance by up to 33%, averaged across all metrics.
PaperID: 1631,   Poster  https://arxiv.org/pdf/2512.06802    
Authors: Yutong Wang, Haiyu Zhang, Tianfan Xue, Yu Qiao, Yaohui Wang, Chang Xu, Xinyuan Chen
Title: VDOT: Efficient Unified Video Creation via Optimal Transport Distillation
Abstract: The rapid development of generative models has significantly advanced image and video applications. Among these, video creation, aimed at generating videos under various conditions, has gained substantial attention. However, existing video creation models either focus solely on a few specific conditions or suffer from excessively long generation times due to complex model inference, making them impractical for real-world applications. To mitigate these issues, we propose an efficient unified video creation model, named VDOT. Concretely, we model the training process with the distribution matching distillation (DMD) paradigm. Instead of using Kullback-Leibler (KL) minimization, we additionally employ a novel computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions. The OT distance inherently imposes geometric constraints, mitigating potential zero-forcing or gradient collapse issues that may arise during KL-based distillation within the few-step generation scenario, and thus enhances the efficiency and stability of the distillation process. Further, we integrate a discriminator to enable the model to perceive real video data, thereby enhancing the quality of generated videos. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering that accommodates multiple video creation tasks. Meanwhile, we curate a unified testing benchmark, UVCBench, to advance the field. Experiments demonstrate that our 4-step VDOT outperforms or matches the performance of other baselines with 100 denoising steps.
PaperID: 1632,   Poster  https://arxiv.org/pdf/2602.21105    
Authors: Jiaxing Yu, Dongyang Ren, Hangyu Xu, Zhouyuxiao Yang, Yuanqi Li, Jie Guo, Zhengkang Zhou, Yanwen Guo
Title: BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting
Abstract: The boundary representation (B-rep) models a 3D solid as its explicit boundaries: trimmed corners, edges, and faces. Recovering a B-rep representation from unstructured data is a challenging and valuable task in computer vision and graphics. Recent advances in deep learning have greatly improved the recovery of 3D shape geometry, but still depend on dense and clean point clouds and struggle to generalize to novel shapes. We propose B-rep Gaussian Splatting (BrepGaussian), a novel framework that learns 3D parametric representations from 2D images. We employ a Gaussian Splatting renderer with learnable features, followed by a specific fitting strategy. To disentangle geometry reconstruction and feature learning, we introduce a two-stage learning framework that first captures geometry and edges and then refines patch features to achieve clean geometry and coherent instance representations. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods. We will release our code and datasets upon acceptance.
PaperID: 1633,   Poster  https://arxiv.org/pdf/2603.08174    
Authors: Junyu Shen, Zhendong She, Chenghanyu Zhang, Yuchuang Sun, Luqing Luo, Dingwei Tan, Zonghao Guo, Bo Guo, Zehua Han, Wupeng Xie, Yaxin Mu, Peng Zhang, Pei Pei Li, Fengxiang Wang, Yangang Sun, Maosong Sun
Title: MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals
Abstract: The paradigm of Multimodal Large Language Models (MLLMs) offers a promising blueprint for advancing the electromagnetic (EM) domain. However, prevailing approaches often deviate from the native MLLM paradigm, instead using task-specific or pipelined architectures that lead to fundamental limitations in model performance and generalization. Fully realizing the MLLM potential in the EM domain requires overcoming three main challenges: (1) Data. The scarcity of high-quality datasets with paired EM signals and descriptive text annotations for MLLM pre-training; (2) Benchmark. The absence of comprehensive benchmarks to systematically evaluate and compare the performance of models on EM signal-to-text tasks; (3) Model. A critical fragility in low Signal-to-Noise Ratio (SNR) environments, where critical signal features can be obscured, leading to significant performance degradation. To address these challenges, we introduce a tripartite contribution to establish a foundation for MLLMs in the EM domain. First, to overcome data scarcity, we construct and release EM-100k, a large-scale dataset comprising over 100,000 EM signal-text pairs. Second, to enable rigorous and standardized evaluation, we propose EM-Bench, the most comprehensive benchmark featuring diverse downstream tasks spanning from perception to reasoning. Finally, to tackle the core modeling challenge, we present MERLIN, a novel training framework designed not only to align low-level signal representations with high-level semantic text, but also to explicitly enhance model robustness and performance in challenging low-SNR environments. Comprehensive experiments validate our method, showing that MERLIN achieves state-of-the-art results on EM-Bench and exhibits remarkable robustness in low-SNR settings.
PaperID: 1634,   Poster  https://arxiv.org/pdf/2510.23299    
Authors: HAOCHEN ZHAO, Yuyao Kong, Yongxiu Xu, Gaopeng Gou, Hongbo Xu, Yubin Wang, Haoliang Zhang
Title: MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection
Abstract: Despite progress in multimodal sarcasm detection, existing datasets and methods predominantly focus on single-image scenarios, overlooking potential semantic and affective relations across multiple images. This leaves a gap in modeling cases where sarcasm is triggered by multi-image cues in real-world settings. To bridge this gap, we introduce MMSD3.0, a new benchmark composed entirely of multi-image samples curated from tweets and Amazon reviews. We further propose a Cross-Image Reasoning Model (CIRM), integrating a Dual-Stage Bridge Module and Relevance-Guided Fusion Module to model inter-image dependencies and cross-modal correspondences. Complementarily, we establish a comprehensive suite of strong and representative baselines and conduct extensive experiments, showing that MMSD3.0 is an effective and reliable benchmark that better reflects real-world conditions. Moreover, CIRM demonstrates state-of-the-art performance across MMSD, MMSD2.0, and MMSD3.0, validating its effectiveness in both single-image and multi-image scenarios.
PaperID: 1635,   Poster  https://arxiv.org/pdf/2603.21528    
Authors: Gensheng Pei, Xiruo Jiang, Xinhao Cai, Tao Chen, Yazhou Yao, Byeungwoo Jeon
Title: PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation
Abstract: Training-free open-vocabulary semantic segmentation (OVSS) promises rapid adaptation to new label sets without retraining. Yet, many methods rely on heavy post-processing or handle text and vision in isolation, leaving cross-modal geometry underutilized. Others introduce auxiliary vision backbones or multi-model pipelines, which increase complexity and latency while compromising design simplicity. We present PEARL, Procrustes alignment with text-aware Laplacian propagation, a compact two-step inference that follows an align-then-propagate principle. The Procrustes alignment step performs an orthogonal projection inside the last self-attention block, rotating keys toward the query subspace via a stable polar iteration. The text-aware Laplacian propagation then refines per-pixel logits on a small grid through a confidence-weighted, text-guided graph solve: text provides both a data-trust signal and neighbor gating, while image gradients preserve boundaries. Our method is fully training-free, plug-and-play, and uses only fixed constants, adding minimal latency with a small per-head projection and a few conjugate-gradient steps. Our approach, PEARL, sets a new state-of-the-art in training-free OVSS without extra data or auxiliary backbones across standard benchmarks, achieving superior performance under both with-background and without-background protocols.
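Illustrative sketch (not from the paper): rotating keys toward queries with an orthogonal matrix is an instance of the classical orthogonal Procrustes problem, whose solution is the orthogonal polar factor of K^T Q. The code below approximates that factor with a Newton-Schulz iteration. The function names, the choice of Newton-Schulz as the polar iteration, the Frobenius pre-scaling, and the iteration count are assumptions made for illustration, and the sketch assumes K^T Q is non-singular.

```python
import numpy as np

def polar_orthogonal(M, iters=100):
    """Orthogonal polar factor of a square, non-singular matrix via Newton-Schulz."""
    X = M / np.linalg.norm(M)   # Frobenius scaling keeps all singular values <= 1
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # pushes every singular value toward 1
    return X

def procrustes_rotate_keys(K, Q, iters=100):
    """Rotate keys K (n, d) toward queries Q (n, d).

    Solves min_R ||K R - Q||_F over orthogonal R, whose minimizer is the
    orthogonal polar factor of K^T Q. Returns the rotated keys and R.
    """
    R = polar_orthogonal(K.T @ Q, iters)
    return K @ R, R
```

Because the identity is itself an orthogonal matrix, the rotated keys are never farther from the queries (in Frobenius norm) than the original keys once the iteration has converged.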
PaperID: 1636,   Poster  https://arxiv.org/pdf/2512.02369    
Authors: Qingmei Li, Yang Zhang, peifeng zhang, Haohuan Fu, Juepeng Zheng
Title: SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains
Abstract: Domain generalization for semantic segmentation aims to mitigate the degradation in model performance caused by domain shifts. However, in many real-world scenarios, we are unable to access the model parameters and architectural details due to privacy concerns and security constraints. Traditional fine-tuning or adaptation is hindered, leading to the demand for input-level strategies that can enhance generalization without modifying model weights. To this end, we propose a Style-Adaptive GEneralization framework (SAGE), which improves the generalization of frozen models under privacy constraints. SAGE learns to synthesize visual prompts that implicitly align feature distributions across styles instead of directly fine-tuning the backbone. Specifically, we first utilize style transfer to construct a diverse style representation of the source domain, thereby learning a set of style characteristics that can cover a wide range of visual features. Then, the model adaptively fuses these style cues according to the visual context of each input, forming a dynamic prompt that harmonizes the image appearance without touching the interior of the model. Through this closed-loop design, SAGE effectively bridges the gap between frozen model invariance and the diversity of unseen domains. Extensive experiments on five benchmark datasets demonstrate that SAGE achieves competitive or superior performance compared to state-of-the-art methods under privacy constraints and outperforms full fine-tuning baselines in all settings.
PaperID: 1637,   Poster  https://arxiv.org/pdf/2603.09673    
Authors: Anh Thuan Tran, Jana Kosecka
Title: VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM
Abstract: Simultaneous Localization and Mapping (SLAM) with 3D Gaussian Splatting (3DGS) enables fast, differentiable rendering and high-fidelity reconstruction across diverse real-world scenes. However, existing 3DGS-SLAM approaches handle measurement reliability implicitly, making pose estimation and global alignment susceptible to drift in low-texture regions, transparent surfaces, or areas with complex reflectance properties. To this end, we introduce VarSplat, an uncertainty-aware 3DGS-SLAM system that explicitly learns per-splat appearance variance. Using the law of total variance with alpha compositing, we then compute a corresponding differentiable per-pixel uncertainty map. This variance map guides tracking, submap registration, and loop detection toward reliable regions and contributes to more stable optimization. Experimental results on Replica (synthetic) and TUM-RGBD, ScanNet, and ScanNet++ (real-world) show that VarSplat improves robustness and achieves competitive or superior tracking, mapping, and novel view synthesis rendering compared to existing studies for dense RGB-D SLAM.
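Illustrative sketch (not from the paper): one way to read "law of total variance with alpha compositing" is to treat the normalized front-to-back blending weights along a ray as mixture probabilities over splats. The per-pixel variance is then the expected per-splat variance plus the variance of the per-splat means. The function name, the single-channel scalar appearance, and the weight normalization are assumptions made for illustration.

```python
import numpy as np

def composite_mean_and_variance(alphas, means, variances):
    """Alpha-composite one ray, front to back, and return (mean, variance).

    alphas, means, variances: (S,) arrays of per-splat opacity, appearance
    mean, and appearance variance, ordered from nearest to farthest.
    """
    # Transmittance reaching each splat: product of (1 - alpha) of those in front.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas[:-1])])
    w = alphas * transmittance
    w = w / max(w.sum(), 1e-12)           # normalize so weights act as probabilities
    mu = (w * means).sum()
    within = (w * variances).sum()        # E[Var(c | splat)]
    between = (w * (means - mu) ** 2).sum()  # Var(E[c | splat])
    return mu, within + between
```

A pixel covered by a single opaque splat simply inherits that splat's variance, while a pixel blending splats with very different appearance gets a large between-splat term even if each splat is individually certain.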
PaperID: 1638,   Poster  https://arxiv.org/pdf/2603.01111    
Authors: Yiming Ma, Hongkun Yang, Lionel WANG, BIN CHEN, Weizhi Xian, Jianzhi Teng
Title: DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles
Abstract: Prompt learning is a dominant paradigm for adapting pretrained Vision-Language Models (VLMs) to downstream tasks. However, existing methods often rely on a simplistic, layer-centric view, assuming shallow layers capture general features while deep layers handle task-specific knowledge. This assumption results in uncontrolled interactions between learnable tokens and original tokens. Task-specific knowledge can then degrade the model's core generalization and create a trade-off between task adaptation and the preservation of zero-shot generalization. To address this, we challenge the layer-centric view and propose DeAR, a framework that achieves fine-grained VLM adaptation by Decomposing Attention head Roles. We posit that the functional specialization within VLMs occurs not between layers, but at the finer-grained level of individual attention heads in the deeper layers. Based on this insight, we introduce a novel metric, Concept Entropy, to systematically classify attention heads into distinct functional roles: Attribute, Generalization, and Mixed. Guided by these roles, we introduce specialized attribute tokens and a Role-Based Attention Mask mechanism to precisely control information flow, ensuring generalization heads remain isolated from task-specific knowledge. We further incorporate a Task-Adaptive Fusion Strategy for inference. Extensive experiments on fifteen datasets show that DeAR achieves a strong balance between task adaptation and generalization, outperforming previous methods across various tasks.
PaperID: 1639,   Poster  https://arxiv.org/pdf/2602.12769    
Authors: Phuc Lai, Phong Nguyen, Anh Tran
Title: PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion
Abstract: Pretrained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution. Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often requiring more than five minutes to produce a single 4K image. In this paper, we present PixelRush, the first tuning-free framework for practical high-resolution text-to-image generation. Our method builds upon the established patch-based inference paradigm but eliminates the need for multiple inversion and regeneration cycles. Instead, PixelRush enables efficient patch-based denoising within a low-step regime. To address artifacts introduced by patch blending in few-step generation, we propose a seamless blending strategy. Furthermore, we mitigate over-smoothing effects through a noise injection mechanism. PixelRush delivers exceptional efficiency, generating 4K images in approximately 20 seconds, a 10× to 35× speedup over state-of-the-art methods, while maintaining superior visual fidelity. Extensive experiments validate both the performance gains and the quality of outputs achieved by our approach.
PaperID: 1640,   Poster  https://arxiv.org/pdf/2602.21499    
Authors: Shimin Hu, Yuanyi Wei, Fei Zha, Yudong Guo, Juyong Zhang
Title: Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow
Abstract: Existing 3D editing methods rely on computationally intensive scene-by-scene iterative optimization and suffer from multi-view inconsistency. We propose an effective and fully feed-forward 3D editing framework based on the TRELLIS generative backbone, capable of modifying 3D models from a single editing view. Our framework addresses two key issues: adapting training-free 2D editing to structured 3D representations, and overcoming the bottleneck of appearance fidelity in compressed 3D features. To ensure geometric consistency, we introduce Voxel FlowEdit, an edit-driven flow in the sparse voxel latent space that achieves globally consistent 3D deformation in a single pass. To restore photorealistic details, we develop a normal-guided single-view to multi-view generation module as an external appearance prior, successfully recovering high-frequency textures. Experiments demonstrate that our method enables fast, globally consistent, and high-fidelity 3D model editing.
PaperID: 1641,   Poster  https://arxiv.org/pdf/2602.23295    
Authors: Ayush Roy, Wei-Yang Lee, Rudrasis Chakraborty, Vishnu Lokhande
Title: ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation
Abstract: Large datasets hinder efficient model training and often contain redundant concepts. Dataset distillation aims to synthesize compact datasets that preserve the knowledge of large-scale training sets while drastically reducing storage and computation. Recent advances in diffusion models have enabled training-free distillation by leveraging pre-trained generative priors; however, existing guidance strategies remain limited. Current score-based methods either perform unguided denoising or rely on simple mode-based guidance toward instance prototype centroids (IPC centroids), which are often rudimentary and suboptimal. We propose Manifold-Guided Distillation (ManifoldGD), a training-free diffusion-based framework that integrates manifold-consistent guidance at every denoising timestep. Our method employs IPCs computed via a hierarchical, divisive clustering of VAE latent features, yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. Using a local neighborhood of the extracted IPC centroids, we create the latent manifold for each diffusion denoising timestep. At each denoising step, we project the mode-alignment vector onto the local tangent space of the estimated latent manifold, thus constraining the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without requiring any model retraining. Empirical results demonstrate consistent gains over existing training-free and training-based baselines in terms of FID, ℓ2 distance between real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free data distillation framework.
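Illustrative sketch (not from the paper): projecting a guidance vector onto a local tangent space can be approximated with local PCA, where the top principal directions of a neighborhood of points serve as a tangent basis. The function name `project_to_tangent`, the use of an SVD-based PCA, and the fixed tangent dimension `dim` are assumptions made for illustration.

```python
import numpy as np

def project_to_tangent(vector, neighbors, dim):
    """Project `vector` (D,) onto the span of the top-`dim` principal
    directions of `neighbors` (k, D), used as a local tangent-space estimate."""
    centered = neighbors - neighbors.mean(axis=0)
    # Rows of Vt are orthonormal principal directions, sorted by singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = Vt[:dim]                      # (dim, D)
    return basis.T @ (basis @ vector)     # drop the off-manifold component
```

For neighbors lying on a plane, the projection removes exactly the component of the vector that points off that plane, which is the sense in which a guided update stays close to the estimated manifold.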
PaperID: 1642,   Poster  https://arxiv.org/pdf/2502.01117    
Authors: Yunchuan Guan, Yu Liu, Ke Zhou, Zhiqi Shen, Jenq-Neng Hwang, Lei Li
Title: Learn to Learn Weight Generation via Local Consistency Diffusion
Abstract: generation. However, existing solutions are limited by two challenges: generalizability and missing local supervision targets. The first challenge stems from the inherent lack of cross-task transferability in existing single-level optimization methods, which limits model performance on new tasks. The latter challenge lies in existing research modeling only global optimal weights, neglecting the supervision signals in local target weights. Furthermore, naively assigning local target weights leads to inconsistency between local and global objectives. To address these issues, we propose Mc-Di, which integrates the diffusion algorithm with meta-learning for better generalizability. Additionally, we extend the vanilla diffusion into a local consistency diffusion algorithm. Our theoretical analysis and experimental results demonstrate that the model can learn from local targets while preserving consistency with the global optimum. We validate Mc-Di's superior accuracy and inference efficiency on tasks that require frequent weight updates, including transfer learning, few-shot learning, domain generalization, and language model fine-tuning.
PaperID: 1643,   Poster  https://arxiv.org/pdf/2505.17015    
Authors: Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Matt Feiszli, Kevin Liang
Title: Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
Abstract: Multimodal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with multi-frame spatial understanding by integrating fundamental spatial skills, including depth perception, visual correspondence, and dynamic perception. We design a novel data pipeline and collect the MultiSPA dataset of more than 27 million samples spanning diverse 3D and 4D scenes to enable training. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable and generalizable multi-frame perception. We further observe multi-task benefits and emergent spatial capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.
PaperID: 1644,   Poster  https://arxiv.org/pdf/2512.18314    
Authors: Philipp Langsteiner, Jan-Niklas Dihlmann, Hendrik Lensch
Title: MatSpray: Fusing 2D Material World Knowledge on 3D Geometry
Abstract: Manual modeling of material parameters and 3D geometry is a time-consuming yet essential task in the gaming and film industries. While recent advances in 3D reconstruction have enabled accurate approximations of scene geometry and appearance, these methods often fall short in relighting scenarios due to the lack of precise, spatially varying material parameters. At the same time, diffusion models operating on 2D images have shown strong performance in predicting physically based rendering (PBR) properties such as albedo, roughness, and metallicity. However, transferring these 2D material maps onto reconstructed 3D geometry remains a significant challenge. We propose a framework for fusing 2D material data into 3D geometry using a combination of novel learning-based and projection-based approaches. We begin by reconstructing scene geometry via Gaussian Splatting. From the input images, a diffusion model generates 2D maps for albedo, roughness, and metallic parameters. Any existing diffusion model that can convert images or videos to PBR materials can be applied. The predictions are further integrated into the 3D representation either by optimizing an image-based loss or by directly projecting the material parameters onto the Gaussians using Gaussian ray tracing. To enhance fine-scale accuracy and multi-view consistency, we further introduce a lightweight neural refinement step (Neural Merger), which takes ray-traced material features as input and produces detailed adjustments. Our results demonstrate that the proposed methods outperform existing techniques in both quantitative metrics and perceived visual realism. This enables more accurate, relightable, and photorealistic renderings from reconstructed scenes, significantly improving the realism and efficiency of asset creation workflows in content production pipelines.
PaperID: 1645,   Poster  https://arxiv.org/pdf/2512.16234    
Authors: Zichen Geng, Zeeshan Hayder, Wei Liu, Hesheng Wang, Ajmal Mian
Title: ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
Abstract: 3D human reaction generation faces three main challenges: (1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, which encodes generated history instead of ground-truth history, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant ReMFlow, achieving state-of-the-art performance with the fastest inference among offline methods. Our ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency in a single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions. Code is available in the supplementary.
PaperID: 1646,   Poster  https://arxiv.org/pdf/2602.22695    
Authors: Yu Chen, Zewei He, Xingyu Liu, Zixuan Chen, Zhe-Ming Lu
Title: GFRRN: Explore the Gaps in Single Image Reflection Removal
Abstract: Prior dual-stream methods with the feature interaction mechanism have achieved remarkable performance in single image reflection removal (SIRR). However, they often struggle with (1) a semantic understanding gap between the features of pre-trained models and those of reflection removal models, and (2) reflection label inconsistencies between synthetic and real-world training data. In this work, we first adopt the parameter-efficient fine-tuning (PEFT) strategy by integrating several learnable Mona layers into the pre-trained model to align the training directions. Then, a label generator is designed to unify the reflection labels for both synthetic and real-world data. In addition, a Gaussian-based Adaptive Frequency Learning Block (G-AFLB) is proposed to adaptively learn and fuse the frequency priors, and a Dynamic Agent Attention (DAA) is employed as an alternative to window-based attention by dynamically modeling the significance levels across windows (inter-) and within an individual window (intra-). These components constitute our proposed Gap-Free Reflection Removal Network (GFRRN). Extensive experiments demonstrate the effectiveness of our GFRRN, achieving superior performance against state-of-the-art SIRR methods.
PaperID: 1647,   Poster  https://arxiv.org/pdf/2512.01426    
Authors: Yiyang Ma, Feng Zhou, Xuedan Yin, Pu Cao, Yonghao Dang, Jianqin Yin
Title: ResDiT: Evoking the Intrinsic Resolution Scalability in Diffusion Transformers
Abstract: Leveraging pretrained Diffusion Transformers (DiTs) for high-resolution (HR) image synthesis often leads to spatial layout collapse and degraded texture fidelity. Prior work mitigates these issues with complex pipelines that first perform a base-resolution (i.e., training-resolution) denoising process to guide HR generation. We instead explore the intrinsic generative mechanisms of DiTs and propose ResDiT, a training-free method that scales resolution efficiently. We identify the core factor governing spatial layout, position embeddings (PEs), and show that the original PEs encode incorrect positional information when extrapolated to HR, which triggers layout collapse. To address this, we introduce a PE scaling technique that rectifies positional encoding under resolution changes. To further remedy low-fidelity details, we develop a local-enhancement mechanism grounded in base-resolution local attention. We design a patch-level fusion module that aggregates global and local cues, together with a Gaussian-weighted splicing strategy that eliminates grid artifacts. Comprehensive evaluations demonstrate that ResDiT consistently delivers high-fidelity, high-resolution image synthesis and integrates seamlessly with downstream tasks, including spatially controlled generation.
PaperID: 1648,   Poster  https://arxiv.org/pdf/2512.19402    
Authors: Yujie Zhao, Hongwei Fan, Di Chen, Shengcong Chen, Liliang Chen, Xiaoqi Li, Guanghui Ren, Hao Dong
Title: Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface
Abstract: Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally, we propose a multi-conditional video generation model guided by depth as the primary control signal, together with action, edge, and ray maps, to synthesize spatially augmented multi-view manipulation videos. Experiments on four real-world manipulation tasks demonstrate that policies trained on data generated from only 1–5 source demonstrations can match or outperform those trained on 50 real-world demonstrations, improving data efficiency by up to 10–50×. Moreover, experimental results on height and texture editing demonstrate the framework’s flexibility and extensibility, indicating its potential to serve as a unified data generation framework.
PaperID: 1649,   Poster  https://arxiv.org/pdf/2512.03350    
Authors: Yu Yuan, Tharindu Wickremasinghe, Zeeshan Nadir, Xijun Wang, Yiheng Chi, Stanley H. Chan
Title: SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation
Abstract: Images and videos are discrete 2D projections of the 4D world (3D space + time). Most visual understanding, prediction, and generation methods operate directly on 2D observations, leading to suboptimal performance. We propose SeeU, a novel approach that learns continuous 4D dynamics and generates unseen visual content. The principle behind SeeU is a new 2D→4D→2D learning framework. SeeU first reconstructs the 4D world from sparse and monocular 2D frames (2D→4D). It then learns the continuous 4D dynamics using a low-rank representation and physical constraints (discrete 4D→continuous 4D). Finally, SeeU rolls the world forward in time, re-projects it back to 2D at sampled times and viewpoints, and generates unseen regions based on spatial-temporal context awareness (4D→2D). By modeling dynamics in 4D, SeeU achieves continuous and physically consistent novel visual generation, demonstrating strong potential in multiple tasks including unseen temporal generation, unseen spatial generation, and video editing.
PaperID: 1650,   Poster  https://arxiv.org/pdf/2511.19972    
Authors: Yun Xing, Xiaobin Hu, Qingdong He, Jiangning Zhang, Shuicheng Yan, Shijian Lu, Yu-Gang Jiang
Title: Boosting Reasoning in Large Multimodal Models via Activation Replay
Abstract: Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-training paradigm are poorly understood. We begin by exploring how input activations are affected by RLVR through the perspective of logit lens. Our systematic investigations across multiple post-trained LMMs suggest that RLVR shifts low-entropy activations unexpectedly, while high-entropy ones are less affected. We further demonstrate that such phenomena are associated with LMM reasoning by controlled experiments, suggesting a potentially beneficial role of modulating low-entropy activations. To this end, we propose Activation Replay, a novel, simple yet effective training-free approach that boosts multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design involves manipulation of visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulate their RLVR counterparts. Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates the narrower reasoning coverage of RLVR. Our design is compared against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or direct cross-model intervention instead of manipulating input tokens, demonstrating the superiority of our implementation. Code will be made publicly available.
PaperID: 1651,   Poster  https://arxiv.org/pdf/2603.17828    
Authors: Qianlong Xiang, Miao Zhang, Haoyu Zhang, Kun Wang, Junhui Hou, Liqiang Nie
Title: TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models
Abstract: Although text-to-image diffusion models exhibit remarkable generative power, concept erasure techniques are essential for their safe deployment to prevent the creation of harmful content. This has fostered a dynamic interplay between the development of erasure defenses and the adversarial probes designed to bypass them, and this co-evolution has progressively enhanced the efficacy of erasure methods. However, this adversarial co-evolution has converged on a narrow, text-centric paradigm that equates erasure with severing the text-to-image mapping, ignoring that the underlying visual knowledge related to undesired concepts still persists. To substantiate this claim, we investigate from a visual perspective, leveraging DDIM inversion to probe whether a generative pathway for the erased concept can still be found. However, identifying such a visual generative pathway is challenging because standard text-guided DDIM inversion is actively resisted by text-centric defenses within the erased model. To address this, we introduce TINA, a novel Text-free INversion Attack, which enforces this visual-only probe by operating under a null-text condition, thereby avoiding existing text-centric defenses. Moreover, TINA integrates an optimization procedure to overcome the accumulating approximation errors that arise when standard inversion operates without its usual textual guidance. Our experiments demonstrate that TINA successfully regenerates erased concepts from models treated with state-of-the-art unlearning. The success of TINA proves that current methods merely obscure concepts, highlighting an urgent need for paradigms that operate directly on internal visual knowledge. Code will be released upon acceptance.
PaperID: 1652,   Poster  https://arxiv.org/pdf/2512.01738    
Authors: Pedro Curvo, Jan-Willem van de Meent, Maksim Zhdanov
Title: MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention
Abstract: A key scalability challenge in neural solvers for industrial-scale physics simulations is efficiently capturing both fine-grained local interactions and long-range global dependencies across millions of spatial elements. We introduce the Multi-Scale Patch Transformer (MSPT), an architecture that combines local point attention within patches with global attention to coarse patch-level representations. To partition the input domain into spatially coherent patches, we employ ball trees, which handle irregular geometries efficiently. This dual-scale design enables MSPT to scale to millions of points on a single GPU. We validate MSPT on standard PDE benchmarks (elasticity, plasticity, fluid dynamics, porous flow) and large-scale aerodynamic datasets (ShapeNet-Car, Ahmed-ML), achieving state-of-the-art accuracy with substantially lower memory footprint and computational cost.
PaperID: 1653,   Poster  https://arxiv.org/pdf/2604.07965    
Authors: Gyanendra Das, Sai Jena
Title: DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
Abstract: Model editing aims to update knowledge to add new concepts and change relevant information without retraining. Lifelong editing is a challenging task, prone to disrupting previously learned concepts, especially for Vision Language Models (VLMs), because sequential edits can lead to degraded reasoning and cross-modal misalignment. Existing VLM knowledge editing methods based on gated adapters, activation edits, and parameter merging techniques address catastrophic forgetting seen in full fine-tuning; however, they still operate in the shared representation space of the VLM, where concepts are entangled, so edits interfere with other non-relevant concepts. We hypothesize that this instability persists because current methods algorithmically control edits via optimization rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA), which by design mitigates this limitation by decomposing the representation space into a set of orthogonal semantic subspaces and proposing edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision-language representations. This process structurally isolates concepts, enabling precise, non-interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi-term loss function for maintaining task fidelity, edit locality, and cross-modal alignment. With the base model frozen, our method achieves 98% single-edit success, remains over 95% after 1,000 sequential edits, lowers hallucination by 3-5%, and achieves the best backward transfer (BWT) scores on continual instruction-tuning benchmarks. Extensive experiments demonstrate DSCA’s state-of-the-art stability and knowledge retention capability in continual lifelong editing across various datasets and benchmarks.
PaperID: 1654,   Poster  https://arxiv.org/pdf/2505.22344    
Authors: Nikhil Behari, Aaron Young, Tzofi Klinghoffer, Akshat Dave, Ramesh Raskar
Title: Task-Driven Implicit Representations for Automated Design of LiDAR Systems
Abstract: Imaging system design is a complex, time-consuming, and largely manual process; LiDAR design, ubiquitous in mobile devices, autonomous vehicles, and aerial imaging platforms, adds further complexity through unique spatial and temporal sampling requirements. In this work, we propose a framework for automated, task-driven LiDAR system design under arbitrary constraints. To achieve this, we represent LiDAR configurations in a continuous six-dimensional design space and learn task-specific implicit densities in this space via flow-based generative modeling. We then synthesize new LiDAR systems by modeling sensors as parametric distributions in 6D space and fitting these distributions to our learned implicit density using expectation-maximization, enabling efficient, constraint-aware LiDAR system design. We validate our method on diverse tasks in 3D vision, enabling automated LiDAR system design across real-world-inspired applications in face scanning, robotic tracking, and object detection.
PaperID: 1655,   Poster  https://arxiv.org/pdf/2511.16786    
Authors: Yaoxin Yang, Peng Ye, Xudong Tan, Chongjun Tu, Maosen Zhao, Jia Hao, Tao Chen
Title: Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach
Abstract: Multimodal large language models suffer from substantial inference overhead since the multimodal KV cache grows proportionally with the visual input length. Existing multimodal KV cache compression methods mostly rely on attention scores to reduce cache size, which makes them incompatible with established efficient attention kernels (e.g., FlashAttention) and ignores the contribution of value vectors to the attention output. In this work, we revisit multimodal KV cache compression from the perspective of the KV matrices’ distribution. First, we observe that the frequency-domain energy of multimodal KV matrices is predominantly concentrated in low frequencies, and we extract this principal energy via a low-pass filter. Further, we find that removing KV pairs that deviate substantially from this principal energy, which we define as Outlier KVs, leads to a pronounced performance drop. Considering that Outlier KVs are more likely to encode features critical for inference, we propose FlashCache, a frequency-domain-guided, Outlier-KV-aware KV cache compression framework. First, we introduce an Outlier KV Recognition Module that models the principal component of multimodal KV matrices in the frequency domain and preferentially retains KV pairs that significantly deviate from it. Furthermore, a Dynamic Budget Allocation Module is designed to adaptively determine the per-layer KV cache size to retain more Outlier KVs. Experiments on multiple MLLMs and benchmarks demonstrate that FlashCache outperforms state-of-the-art multimodal KV compression methods, achieving up to 1.69× faster decoding with 80% lower KV memory usage while maintaining task performance.
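The outlier-retention idea described in this abstract can be illustrated with a minimal sketch. It is not the FlashCache implementation: the choice of a plain DFT along the token axis, the `cutoff` and `budget` parameters, and the function names are all assumptions made for illustration.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n)
                for f in range(n)).real / n
            for t in range(n)]

def low_pass(x, cutoff):
    """Keep only frequencies whose index distance from DC is <= cutoff."""
    n = len(x)
    spectrum = dft(x)
    kept = [spectrum[f] if min(f, n - f) <= cutoff else 0 for f in range(n)]
    return idft(kept)

def outlier_indices(keys, cutoff, budget):
    """Return indices of the `budget` tokens that deviate most from the
    low-frequency component (hypothetical stand-in for outlier selection).

    keys: list of per-token vectors, each a list of floats.
    """
    n, d = len(keys), len(keys[0])
    # Low-pass each feature channel along the token axis.
    smooth = [low_pass([keys[t][j] for t in range(n)], cutoff) for j in range(d)]
    # Squared deviation of each token from its smoothed version.
    dev = [sum((keys[t][j] - smooth[j][t]) ** 2 for j in range(d))
           for t in range(n)]
    ranked = sorted(range(n), key=lambda t: dev[t], reverse=True)
    return sorted(ranked[:budget])

# Toy usage: one token carries a spike that the low-pass component cannot explain.
keys = [[1.0], [1.0], [1.0], [9.0], [1.0], [1.0], [1.0], [1.0]]
print(outlier_indices(keys, cutoff=1, budget=1))  # [3]
```

Because the selection depends only on the key values and not on attention scores, a scheme of this kind does not need access to the attention matrix, which is the compatibility point the abstract raises.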
PaperID: 1656,   Poster  https://arxiv.org/pdf/2511.14270    
Authors: Yiming Zeng, Xile Zhao, Wei-Hao Wu, Teng-Yu Ji, Chao Wang
Title: Gaussian Splatting-based Low-Rank Tensor Representation for Multi-Dimensional Image Recovery
Abstract: Tensor singular value decomposition (t-SVD) is a promising tool for multi-dimensional image representation, which decomposes a multi-dimensional image into a latent tensor and an accompanying transform matrix. However, two critical limitations of t-SVD methods persist: (1) the approximation of the latent tensor (e.g., tensor factorizations) is coarse and fails to accurately capture spatial local high-frequency information; (2) the transform matrix is composed of fixed basis atoms (e.g., complex exponential atoms in DFT and cosine atoms in DCT) and cannot precisely capture local high-frequency information along the mode-3 fibers. To address the two limitations, we propose a Gaussian Splatting-based Low-rank tensor Representation (GSLR) framework, which compactly and continuously represents multi-dimensional images. Specifically, we leverage tailored 2D Gaussian splatting and 1D Gaussian splatting to generate the latent tensor and transform matrix, respectively. The 2D and 1D Gaussian splatting are indispensable and complementary under this representation framework, which enjoys a powerful representation capability, especially for local high-frequency information. To evaluate the representation ability of the GSLR, we develop an unsupervised GSLR-based multi-dimensional image recovery model. Extensive experiments on multi-dimensional image recovery demonstrate that GSLR consistently outperforms state-of-the-art methods, particularly in capturing local high-frequency information.
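For orientation, the decomposition into a latent tensor and a transform matrix mentioned in this abstract is commonly written as a mode-3 product. The notation below (the symbols, the dimension names, and the reduced size r) is an assumption for illustration and is not taken from the paper.

```latex
% Assumed notation: image X, latent tensor Z, transform matrix T.
\mathcal{X} = \mathcal{Z} \times_3 \mathbf{T},
\qquad
\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times n_3},\;
\mathcal{Z} \in \mathbb{R}^{n_1 \times n_2 \times r},\;
\mathbf{T} \in \mathbb{R}^{n_3 \times r},
\qquad
\mathcal{X}_{ijk} = \sum_{l=1}^{r} \mathcal{Z}_{ijl}\, \mathbf{T}_{kl}.
```

Under this reading, each mode-3 fiber of the image is a linear combination of the columns of T, which is why fixed DFT or DCT atoms limit how well local high-frequency content along that mode can be captured.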
PaperID: 1657,   Poster  https://arxiv.org/pdf/2511.21251    
Authors: Shuhan Xia, Pei Pei Li, Xuannan Liu, Dongsen Zhang, Xinyu Guo, Zekun Li
Title: AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs
Abstract: The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, thus failing to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human and general subjects. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery type classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.
PaperID: 1658,   Poster  https://arxiv.org/pdf/2512.02686    
Authors: Yuxing Liu, Zheng Li, Huanhuan Liang, Ji Zhang, Zeyu Sun, Yong Liu
Title: ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data
Abstract: Anomaly segmentation seeks to detect and localize unknown or out-of-distribution (OoD) objects that fall outside predefined semantic classes, a capability essential for safe autonomous driving. However, the scarcity and limited diversity of anomaly data severely constrain model generalization in open-world environments. Existing approaches mitigate this issue through synthetic data generation, either by copy-pasting external objects into driving scenes or by leveraging text-to-image diffusion models to inpaint anomalous regions. While these methods improve anomaly diversity, they often lack contextual coherence and physical realism, resulting in domain gaps between synthetic and real data. In this paper, we present ClimaDrive, a semantics-guided image-to-image framework for synthesizing semantically coherent, weather-diverse, and physically plausible OoD driving data. ClimaDrive unifies structure-guided multi-weather generation with prompt-driven anomaly inpainting, enabling the creation of visually realistic training data. Based on this framework, we construct ClimaOoD, a large-scale benchmark spanning six representative driving scenarios under both clear and adverse weather conditions. Extensive experiments on four state-of-the-art methods show that training with ClimaOoD leads to robust improvements in anomaly segmentation. Across all methods, AUROC, AP, and FPR95 show notable gains, with FPR95 dropping from 3.97 to 3.52 for RbA on Fishyscapes LAF. These results demonstrate that ClimaOoD enhances model robustness, offering valuable training data for better generalization in open-world anomaly detection.
PaperID: 1659,   Poster  https://arxiv.org/pdf/2512.19554    
Authors: Yongxin Wang, Zhicheng Yang, Meng Cao, Mingfei Han, Haokun Lin, Yingying Zhu, Xiaojun Chang, Xiaodan Liang
Title: CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal Reasoning
Abstract: Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has: the failures. When all rollouts are wrong, gradients stall; when one happens to be correct, the update usually ignores why the others are close-but-wrong, and credit can be misassigned to spurious chains. We present CARE (Contrastive Anchored REflection), a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE combines: (i) an anchored-contrastive objective that forms a compact subgroup around the best rollout and a set of semantically proximate hard negatives, performs within-subgroup z-score normalization with negative-only scaling, and includes an all-negative rescue to prevent zero-signal batches; and (ii) Reflection-Guided Resampling (RGR), a one-shot structured self-repair that rewrites a representative failure and re-scores it with the same verifier, converting near-misses into usable positives without any test-time reflection. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures. On Qwen2.5-VL-7B, CARE lifts macro-averaged accuracy by 4.6 points over GRPO across six verifiable visual-reasoning benchmarks; with Qwen3-VL-8B it reaches competitive or state-of-the-art results on MathVista and MMMU-Pro under an identical evaluation protocol.
PaperID: 1660,   Poster  https://arxiv.org/pdf/2511.16651    
Authors: Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai, Xu Shi, Ning Gao, Hangxu Liu, Xuekun Jiang, Zherui Qiu, Feng Yuan, Yaping Li, Ping Wang, Junhao Cai, Jia Zeng, Hao Dong, Jiangmiao Pang
Title: InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy
Abstract: Recent work explores how real and synthetic data contribute to VLA model generalization. While the π-series model has shown the strong effectiveness of large-scale real-robot pre-training, synthetic data has not previously demonstrated comparable capability at scale. This paper provides the first evidence that synthetic data alone can match the performance of the strongest π-dataset in pre-training a VLA model, revealing the substantial value of large-scale simulation. The resulting model also exhibits surprisingly strong zero-shot sim-to-real transfer on several challenging tasks. Our synthetic dataset, InternData-A1, contains over 630k trajectories and 7,433 hours across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, covering rigid, articulated, deformable, and fluid-object manipulation. It is generated through a highly autonomous, fully decoupled, and compositional simulation pipeline that enables flexible task assembly, long-horizon skill composition, and heterogeneous embodiments with minimal manual tuning. Using the same architecture as π0, we pre-train a model entirely on InternData-A1 and find that it matches the official π0 across 49 simulation tasks, 5 real-world tasks, and 4 long-horizon dexterous tasks. We will open-source both the dataset and the generation pipeline to broaden access to large-scale robotic data and to lower the barrier to scalable data creation for embodied AI research.
PaperID: 1661,   Poster  https://arxiv.org/pdf/2603.01650    
Authors: Xianqi Wang, Hao Yang, Hangtian Wang, JunDa Cheng, Gangwei Xu, Min Lin, Xin Yang
Title: PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts
Abstract: Modern stereo matching methods have leveraged monocular depth foundation models to achieve superior zero-shot generalization performance. However, most existing methods primarily focus on extracting robust features for cost volume construction or disparity initialization. At the same time, the iterative refinement stage, which is also crucial for zero-shot generalization, remains underexplored. Some methods treat monocular depth priors as guidance for iteration, but conventional GRU-based architectures struggle to exploit them due to the limited representation capacity. In this paper, we propose Prompt Recurrent Unit (PRU), a novel iterative refinement module based on the decoder of monocular depth foundation models. By integrating monocular structure and stereo motion cues as prompts into the decoder, PRU enriches the latent representations of monocular depth foundation models with absolute stereo-scale information while preserving their inherent monocular depth priors. Experiments demonstrate that our PromptStereo achieves state-of-the-art zero-shot generalization performance across multiple datasets, while maintaining comparable or faster inference speed. Our findings highlight prompt-guided iterative refinement as a promising direction for zero-shot stereo matching.
PaperID: 1662,   Poster  https://arxiv.org/pdf/2511.12207    
Authors: Haozhe Liu, Ding Liu, Mingchen Zhuge, Zijian Zhou, Tian Xie, Sen He, Yukang Yang, Shuming Liu, Yuren Cong, Jiadong Guo, Hongyu Xu, Ke Xu, Kam Woh Ng, Juan Camilo Perez, Juan-Manuel Pérez-Rúa, Tao Xiang, Wei Liu, Shikun Liu, Jürgen Schmidhuber
Title: Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
Abstract: We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-k hidden states and is trained with an ε-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to 4× larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
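The sparse top-k selection with an ε-greedy training strategy mentioned in this abstract can be sketched as follows. This is one common reading of ε-greedy (explore with a uniformly random subset with probability ε, otherwise exploit the top-k scores); the abstract does not specify the exact form, and the function name and parameters are hypothetical.

```python
import random

def select_states(scores, k, epsilon=0.0, rng=None):
    """Choose k hidden-state indices for one token.

    With probability `epsilon` a uniformly random k-subset is returned
    (exploration); otherwise the k highest-scoring indices are returned
    (exploitation). Indices are returned in ascending order.
    """
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return sorted(rng.sample(range(len(scores)), k))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Greedy selection (epsilon = 0) picks the two largest router scores.
print(select_states([0.1, 0.9, 0.3, 0.7], k=2))  # [1, 3]
```

At inference time one would typically set `epsilon=0.0` so that selection is deterministic; the exploration branch only matters while the router is being trained.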
PaperID: 1663,   Poster  https://arxiv.org/pdf/2509.03498    
Authors: Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, Hongkai Xiong
Title: OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
Abstract: We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a single decoder-only transformer architecture. OneCAT uniquely eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizers during inference, leading to significant efficiency gains, especially for high-resolution image inputs and outputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) design trained with a unified autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM), using the proposed scale-aware adapter (SAA), that drastically reduces decoding latency compared to diffusion-based methods while maintaining state-of-the-art performance. Our findings demonstrate the powerful potential of pure autoregressive modeling as an elegant foundation for unified multimodal intelligence. As a result, OneCAT outperforms existing unified models across benchmarks for multimodal understanding, generation, and editing.
PaperID: 1664,   Poster  https://arxiv.org/pdf/2508.01603    
Authors: Yiheng Li, Zichang Tan, Guoqing Xu, Zhen Lei, Xu Zhou, Yang Yang
Title: Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning
Abstract: In AI-generated image detection, current cutting-edge methods typically adapt pre-trained foundation models through partial-parameter fine-tuning. However, these approaches often struggle to generalize to forgeries from unseen generators, as the fine-tuned models capture only limited patterns from training data and fail to reflect the evolving traits of new ones. To overcome this limitation, we propose Image-Adaptive Prompt Learning (IAPL), a novel paradigm that dynamically adjusts the prompts fed into the encoder according to each testing image, rather than fixing them after training. This design significantly enhances robustness and adaptability to diverse forged images. The dynamic prompts integrate conditional information with test-time adaptive tokens through a lightweight learnable scaling factor. The conditional information is produced by a Conditional Information Learner, which leverages CNN-based feature extractors to model both forgery-specific and general conditions. The test-time adaptive tokens are optimized during inference on a single sample by enforcing prediction consistency across multiple views, ensuring that the parameters align with the current image. For the final decision, the optimal input with the highest prediction confidence is selected. Extensive experiments show that IAPL achieves state-of-the-art performance, with mean accuracies of 95.61% and 96.7% on the widely used UniversalFakeDetect and GenImage datasets, respectively.
PaperID: 1665,   Poster  https://arxiv.org/pdf/2512.10554    
Authors: Xiangxuan Ren, Zhongdao Wang, Liping Hou, Pin Tang, Guoqing Wang, Chao Ma
Title: Grounding Everything in Tokens for Multimodal Large Language Models
Abstract: Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization of input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can image tokenization be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive Transformer architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised and reinforcement learning contexts.
PaperID: 1666,   Poster  https://arxiv.org/pdf/2512.00422    
Authors: Yingxuan You, Chen Zhao, Hantao Zhang, Ming Xu, Pascal Fua
Title: PhysGen: Physically Grounded 3D Shape Generation for Industrial Design
Abstract: Existing generative models for 3D shapes can synthesize high-fidelity and visually plausible shapes. For certain classes of shapes that have undergone an engineering design process, the realism of the shape is tightly coupled with the underlying physical properties, e.g., aerodynamic efficiency for automobiles. Since existing methods lack knowledge of such physics, they are unable to use this knowledge to enhance the realism of shape generation. Motivated by this, we propose a unified physics-based 3D shape generation pipeline, with a focus on industrial design applications. Specifically, we introduce a new flow matching model with explicit physical guidance, consisting of an alternating update process. We iteratively perform a velocity-based update and a physics-based refinement, progressively adjusting the latent code to align with the desired 3D shapes and physical properties. We further strengthen physical validity by incorporating a physics-aware regularization term into the velocity-based update step. To support such physics-guided updates, we build a shape-and-physics variational autoencoder (SP-VAE) that jointly encodes shape and physics information into a unified latent space. The experiments on three benchmarks show that this synergistic formulation improves shape realism beyond mere visual plausibility.
PaperID: 1667,   Poster  https://arxiv.org/pdf/2512.14336    
Authors: Jooyeol Yun, Jaegul Choo
Title: Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure
Abstract: Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision–language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mishandle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance on which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.
PaperID: 1668,   Poster  https://arxiv.org/pdf/2512.22351    
Authors: Zhengfei Kuang, Rui Lin, Long Zhao, Gordon Wetzstein, Saining Xie, Sanghyun Woo
Title: VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement
Abstract: Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in the 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, which struggle to link programmatic edits with precise 3D outcomes, we introduce an MCP-based API. This shifts the interaction from brittle raw code manipulation to more robust, function-level updates. Second, we augment the MLLM's 3D scene understanding with a suite of specialized visual tools to analyze scene state, gather spatial information, and validate action outcomes. This perceptual feedback loop is critical for closing the gap between language-based updates and precise 3D-aware manipulation. Third, to manage the iterative, error-prone updates, we propose a collaborative multi-agent framework with designated roles for planning, execution, and verification. This decomposition allows the system to robustly handle multi-step instructions and recover from intermediate errors. We demonstrate the effectiveness of our approach on a diverse set of 25 complex object arrangement tasks, where it significantly outperforms existing baselines.
PaperID: 1669,   Poster  https://arxiv.org/pdf/2603.26260    
Authors: Xujing Tao, Chuxin Wang, Yubo Ai, Zhixin Cheng, Zhuoyuan Li, Liangsheng Liu, Yujia Chen, Xinjun Li, Qiao Li, Wenfei Yang, Tianzhu Zhang
Title: GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation
Abstract: Open-vocabulary 3D semantic segmentation aims to segment arbitrary categories beyond the training set. Existing methods predominantly rely on distilling knowledge from 2D open-vocabulary models. However, aligning 3D features to the 2D representation space restricts intrinsic 3D geometric learning and inherits errors from 2D predictions. To address these limitations, we propose GeoGuide, a novel framework that leverages pretrained 3D models to integrate hierarchical geometry-semantic consistency for open-vocabulary 3D segmentation. Specifically, we introduce an Uncertainty-based Superpoint Distillation module to fuse geometric and semantic features for estimating per-point uncertainty, adaptively weighting 2D features within superpoints to suppress noise while preserving discriminative information to enhance local semantic consistency. Furthermore, our Instance-level Mask Reconstruction module leverages geometric priors to enforce semantic consistency within instances by reconstructing complete instance masks. Additionally, our Inter-Instance Relation Consistency module aligns geometric and semantic similarity matrices to calibrate cross-instance consistency for same-category objects, mitigating viewpoint-induced semantic drift. Extensive experiments on ScanNet v2, Matterport3D, and nuScenes demonstrate the superior performance of GeoGuide.
PaperID: 1670,   Poster  https://arxiv.org/pdf/2511.09829    
Authors: Jiahuan Long, Tingsong Jiang, Hanqing Liu, Chao Ma, Weien Zhou, Yang Yang, Wen Yao
Title: Thermally Activated Dual-Modal Adversarial Clothing against AI Surveillance Systems
Abstract: Adversarial patches have emerged as a popular privacy-preserving approach for resisting AI-driven surveillance systems. However, their conspicuous appearance makes them difficult to deploy in real-world scenarios. In this paper, we propose a thermally activated adversarial wearable designed to ensure adaptability and effectiveness in complex real-world environments. The system integrates thermochromic dyes with flexible heating units to induce visually dynamic adversarial patterns on clothing surfaces. In its default state, the clothing appears as an ordinary black T-shirt. Upon heating via an embedded thermal unit, hidden adversarial patterns on the fabric are activated, allowing the wearer to effectively evade detection across both visible and infrared modalities. Physical experiments demonstrate that the adversarial wearable achieves rapid texture activation within 50 seconds and maintains an adversarial success rate above 80% across diverse real-world surveillance environments. This work demonstrates a new pathway toward physically grounded, user-controllable anti-AI systems, highlighting the growing importance of proactive adversarial techniques for privacy protection in the age of ubiquitous AI surveillance.
PaperID: 1671,   Poster  https://arxiv.org/pdf/2601.16914    
Authors: Jiaxing Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh
Title: LoL: Longer than Longer, Scaling Video Generation to Hour
Abstract: Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, resulting in abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address it, we propose a lightweight, training-free approach that effectively suppresses this behavior by introducing multi-head RoPE jitter that breaks inter-head attention homogenization and mitigates long-horizon collapse. Extensive experiments show that our method successfully alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
PaperID: 1672,   Poster  https://arxiv.org/pdf/2512.03404    
Authors: Yujian Zhao, Hankun Liu, Guanglin Niu
Title: MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification
Abstract: Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery has recently emerged as a critical yet underexplored task in maritime intelligence and surveillance. However, the substantial modality gap between optical and SAR images poses a major challenge for robust identification. To address this issue, we propose MOS, a novel framework designed to Mitigate the Optical–SAR modality gap and achieve modality-consistent feature learning for optical-SAR cross-modal ship ReID. MOS consists of two core components: (1) Modality-Consistent Representation Learning (MCRL) applies SAR image denoising and a class-wise modality alignment loss to align intra-identity feature distributions across modalities. (2) Cross-modal Data Generation and Feature fusion (CDGF) leverages a Brownian bridge diffusion model to synthesize cross-modal samples, which are subsequently fused with original features during inference to enhance alignment and discriminability. Extensive experiments on the HOSS ReID dataset demonstrate that MOS significantly surpasses state-of-the-art methods across all evaluation protocols, achieving notable improvements of +3.0%, +6.2%, and +16.4% in R1 accuracy under the ALL to ALL, Optical to SAR, and SAR to Optical settings, respectively. The code and trained models will be released upon publication.
PaperID: 1673,   Poster  https://arxiv.org/pdf/2512.09056    
Authors: Liming Kuang, Yordanka Velikova, Mahdi Saleh, Jan-Nico Zaech, Danda Paudel, Benjamin Busam
Title: ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
Abstract: Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision-language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object- or dataset-specific training, our approach achieves state-of-the-art results on common zero-shot relative pose estimation benchmarks, significantly outperforming existing methods by over 62% in ADD(-S) score, including those that utilize extensive dataset-specific training. We will release our code upon acceptance.
PaperID: 1674,   Poster  https://arxiv.org/pdf/2604.07053    
Authors: Xiaoxue Zhang, Xiaoxu Zheng, Yixuan Yin, Tiao Zhao, Kaihua Tang, Michael Bi Mi, Zhan Xu, Dave Zhenyu Chen
Title: AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors
Abstract: Scene-level 3D reconstruction has attracted increasing attention, and feed-forward 3D Gaussian Splatting (3DGS) has emerged as a promising paradigm for novel view synthesis. However, most existing methods adopt a pixel-aligned formulation that maps each 2D pixel to a 3D Gaussian, making the number of Gaussians tightly coupled with the input images. This leads to several limitations: (i) reconstruction quality is sensitive to the quantity and viewpoint coverage of input images, often causing Gaussians to accumulate more densely in regions with frequent viewpoints; (ii) alignment errors become more pronounced under sparse-view conditions; and (iii) the lack of explicit geometric consistency can degrade depth estimation and downstream 3D tasks. In this paper, we propose AnchorSplat, a novel multi-view feed-forward 3DGS framework for scene-level reconstruction that departs from pixel-aligned prediction and instead represents the scene directly in 3D space. AnchorSplat introduces anchor-aligned Gaussians guided by geometric priors (e.g., sparse point clouds, voxels, or RGB-D point clouds), enabling a more geometry-aware representation that is independent of image resolution and number of views. This design substantially reduces the number of required Gaussians, improving computational efficiency while enhancing reconstruction fidelity. The framework is trained in two stages: a Gaussian decoder first predicts anchor-aligned Gaussians, and a subsequent Gaussian refiner further improves their quality and view consistency. Experiments on the ScanNet benchmark demonstrate that AnchorSplat achieves state-of-the-art performance, producing more view-consistent and plausible 3D Gaussian reconstructions. Code, videos, and pretrained models will be released on the project page.
PaperID: 1675,   Poster  https://arxiv.org/pdf/2511.10971    
Authors: Anzhe Cheng, Shukai Duan, Shixuan Li, Chenzhong Yin, Mingxi Cheng, Heng Ping, Tamoghna Chattopadhyay, Sophia Thomopoulos, Shahin Nazarian, Paul Thompson, Paul Bogdan
Title: ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization
Abstract: Mixture-of-Experts (MoE) architectures expand model capacity by sparsely activating experts, but suffer from two core challenges: misalignment between router logits and each expert’s internal structure leads to unstable routing and expert underutilization, and load imbalances create straggler bottlenecks. Standard solutions, such as auxiliary load-balancing losses, can reduce load disparities but often weaken expert specialization and hurt downstream performance. To address these issues, we propose ERMoE, a sparse MoE transformer that reparameterizes each expert in a learned orthonormal eigenbasis and replaces learned gating logits with an Eigenbasis Score, defined as the cosine similarity between input features and an expert’s basis. This content-aware routing ties token assignments directly to experts’ representation spaces, inherently stabilizing utilization and promoting interpretable specialization without sacrificing sparsity. Crucially, ERMoE eliminates the need for explicit balancing losses and avoids the interfering gradients they introduce. We demonstrate that ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks (e.g., COCO, Flickr30K), while naturally producing flatter expert load distributions. Moreover, a 3D MRI variant (ERMoE-ba) improves brain age prediction accuracy by over 7% and yields anatomically interpretable expert specializations. ERMoE thus introduces a new architectural principle for sparse expert models, directly addressing core routing instabilities and enabling improved performance with scalable, interpretable specialization.
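Editorial illustration (not from the paper): the routing idea in the ERMoE abstract can be sketched in a few lines of NumPy. The abstract does not say how the cosine similarity between a feature vector and a whole basis is computed, so this sketch assumes it is the cosine of the angle between the input and the subspace spanned by the expert's orthonormal basis, i.e. the norm of the projection divided by the norm of the input. The names `orthonormal_basis`, `eigenbasis_scores`, and `route_top_k` are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def orthonormal_basis(rng, dim, rank):
    """Random (dim, rank) matrix with orthonormal columns (stand-in for a learned basis)."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
    return q

def eigenbasis_scores(x, bases):
    """ASSUMED score: cosine of the angle between x and each expert's subspace."""
    x_norm = np.linalg.norm(x)
    return np.array([np.linalg.norm(b.T @ x) / x_norm for b in bases])

def route_top_k(x, bases, k=1):
    """Pick the k experts with the highest score; no learned gating logits involved."""
    scores = eigenbasis_scores(x, bases)
    top = np.argsort(scores)[::-1][:k]
    return top, scores

rng = np.random.default_rng(0)
bases = [orthonormal_basis(rng, dim=8, rank=2) for _ in range(4)]
x = bases[0][:, 0]          # a token lying inside expert 0's subspace
top, scores = route_top_k(x, bases, k=1)
```

Because the score depends only on the input and the expert's own basis, a token that lies in an expert's subspace is routed to that expert without any separate router parameters, which is the property the abstract emphasizes.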
PaperID: 1676,   Poster  https://arxiv.org/pdf/2511.16175    
Authors: Yi Yang, Xueqi Li, Yiyang Chen, Jin Song, Yihan Wang, Zipeng Xiao, Jiadi Su, You Qiaoben, Pengfei Liu, Zhijie Deng
Title: Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Abstract: Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervision. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.5% success rate on the LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms \pi_{0.5}, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. We also introduce the adaptive temporal ensemble (ATE) strategy to balance computational efficiency and motion stability during inference, yielding the Mantis-ATE variant, which reduces inference counts by 45% while maintaining performance. The code and weights will be open-sourced after acceptance.
PaperID: 1677,   Poster  https://arxiv.org/pdf/2409.19289    
Authors: Yucheng Xie, Fu Feng, Ruixiao Shi, Jianlu Shen, Jing Wang, Yong Rui, Xin Geng
Title: FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models
Abstract: The training of diffusion models is computationally intensive, making effective pre-training essential. However, real-world deployments often demand models of variable sizes due to diverse memory and computational constraints, posing challenges when corresponding pre-trained versions are unavailable. To address this, we propose FINE, a novel pre-training method whose resulting model can flexibly factorize its knowledge into fundamental components, termed learngenes, enabling direct initialization of models of various sizes and eliminating the need for repeated pre-training. Rather than optimizing a conventional full-parameter model, FINE represents each layer’s weights as the product of U_\star, \Sigma_\star^{(l)}, and V_\star^\top, where U_\star and V_\star serve as size-agnostic learngenes shared across layers, while \Sigma_\star^{(l)} remains layer-specific. By jointly training these components, FINE forms a decomposable and transferable knowledge structure that allows efficient initialization through flexible recombination of learngenes, requiring only light retraining of \Sigma_\star^{(l)} on limited data. Extensive experiments demonstrate the efficiency of FINE, achieving state-of-the-art performance in initializing variable-sized models across diverse resource-constrained deployments. Furthermore, models initialized by FINE effectively adapt to diverse tasks, showcasing the task-agnostic versatility of learngenes.
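Editorial illustration (not from the paper): the per-layer factorization named in the FINE abstract, with each layer's weights formed as the product of a shared U, a layer-specific Sigma, and a shared V transposed, can be written out as a minimal NumPy sketch. The shapes, the choice of a small square Sigma per layer, and the function name `compose_layer_weights` are assumptions made for illustration. The sketch does not attempt to show how the shared factors are adapted to models of different sizes, which the abstract does not specify.

```python
import numpy as np

def compose_layer_weights(U, sigmas, V):
    """Build one weight matrix per layer as U @ Sigma_l @ V.T.

    U and V are shared by every layer; each entry of `sigmas` is layer-specific.
    """
    return [U @ sigma @ V.T for sigma in sigmas]

rng = np.random.default_rng(0)
d_out, d_in, rank, num_layers = 6, 5, 2, 3       # hypothetical sizes

U = rng.standard_normal((d_out, rank))           # shared factor
V = rng.standard_normal((d_in, rank))            # shared factor
sigmas = [rng.standard_normal((rank, rank))      # one small matrix per layer
          for _ in range(num_layers)]

weights = compose_layer_weights(U, sigmas, V)

# Parameter count: shared factors once, plus one small Sigma per layer.
factored_params = U.size + V.size + sum(s.size for s in sigmas)
dense_params = num_layers * d_out * d_in
```

In this toy setting the factored form stores 34 numbers against 90 for three dense layers, and changing the number of layers only requires supplying a different number of Sigma matrices.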
PaperID: 1678,   Poster  https://arxiv.org/pdf/2602.12640    
Authors: Peijie Qiu, Hariharan Ramshankar, Arnau Ramisa, Amit C C, Rene Vidal, Vamsi Salaka, Rahul Bhagat
Title: ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models
Abstract: Diffusion models have emerged as the leading approach for text-to-image generation. However, their iterative sampling process, which gradually morphs random noise into coherent images, introduces significant latency that limits their applicability. While recent few-step diffusion models reduce the number of sampling steps to as few as one to four steps, they often compromise image quality and prompt alignment, especially in one-step generation. Additionally, these models require computationally expensive training procedures. To address these limitations, we propose ImageRAGTurbo, a novel approach to efficiently finetune few-step diffusion models via retrieval augmentation. Given a text prompt, we retrieve relevant text-image pairs from a database and use them to condition the generation process. We argue that such retrieved examples provide rich contextual information to the UNet denoiser that helps reduce the number of denoising steps without compromising image quality. Indeed, our initial investigations show that using the retrieved content to edit the denoiser's latent space (\mathcal{H}-space) without additional finetuning already improves prompt fidelity. To further improve the quality of the generated images, we augment the UNet denoiser with a trainable adapter in the \mathcal{H}-space, which efficiently blends the retrieved content with the target prompt using a cross-attention mechanism. Experimental results on fast text-to-image generation demonstrate that our approach produces high-fidelity images without compromising latency compared to existing methods.
PaperID: 1679,   Poster  https://arxiv.org/pdf/2603.13740    
Authors: Zengyan Wang, Sirshapan Mitra, Rajat Modi, Hui Xian Grace Lim, Yogesh Rawat
Title: Sky2Ground: A Benchmark for Site Modeling under Varying Altitude
Abstract: In this work, we propose the problem of localizing cameras and producing renders of a scene, given multiple images captured from ground/aerial/satellite viewpoints. We introduce a dataset called Sky2Ground, which contains synthetic/real images across all 3 viewpoints, along with camera parameters, and dense depthmaps/surface-normals. Recent works have shown that transformer-based nets like VGGT are capable of inferring scene-parameters in a single-forward pass. However, we formally reveal that simply fine-tuning such models reduces performance, and that this cannot be solved simply by brute-force scaling. We find the culprit to be satellite images, which inject too much noise during the learning process. Therefore, we propose SkyNet to enable learning using satellite-images. SkyNet is a two-stream neural-net, with one stream explicitly processing satellite images, and another processing all modalities together. We propose a restricted-attention mechanism, termed 'Masked-Satellite-Attention', which prevents ground/aerial images from interacting with satellite images. Further, SkyNet is optimized with strategies inspired by curriculum learning: sampling cameras which are far away from each other during training. Extensive experiments on our Sky2Earth dataset reveal that SkyNet outperforms existing methods by 23% in terms of absolute performance. Our dataset and code will be made publicly available on Hugging Face.
PaperID: 1680,   Poster  https://arxiv.org/pdf/2512.07558    
Authors: Shimin Zhang, Xianwei Chen, Yufan Shen, Ziyuan Ye, Jibin Wu
Title: ReLaX: Reasoning with Latent Exploration for Large Reasoning Models
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often leads to premature policy convergence, resulting in early exploitation and performance saturation. While manipulating token-level entropy has proven effective for promoting early exploration, we argue that the latent dynamics underlying token generation provide richer computational structure for guiding policy optimization. To characterize the nonlinear latent structure of LRMs and further facilitate measurement and manipulation on a tractable representation space, we leverage Koopman operator theory to linearize the hidden state dynamics. We then introduce a new metric, Dynamic Spectral Dispersion (DSD), to quantify the diversity of the model's reasoning dynamics, which also serves as a direct measure of the degree of exploration. Building upon these foundations, we introduce a latent dynamics aware training paradigm, Reasoning with Latent eXploration (ReLaX), to attain a better balance between exploration and exploitation during policy optimization. With the proposed ReLaX, we achieve state-of-the-art results across 7 multimodal benchmarks and multidisciplinary reasoning benchmarks. Furthermore, comparative analysis reveals that ReLaX's mechanism of adaptive and semantically meaningful exploration cultivates more structured and robust reasoning than methods that merely optimize for token-level entropy.
PaperID: 1681,   Poster  https://arxiv.org/pdf/2512.19692    
Authors: Pablo Ruiz-Ponce, Sergio Escalera, Jose Garcia-Rodriguez, Jiankang Deng, Rolandos Alexandros Potamias
Title: Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models
Abstract: Generating realistic human-human interactions is a challenging task that requires not only high-quality individual body and hand motions, but also coherent coordination among all interactants. Due to limitations in available data and increased learning complexity, previous methods tend to ignore hand motions, limiting the realism and expressivity of the interactions. Additionally, current diffusion-based approaches generate entire motion sequences simultaneously, limiting their ability to capture the reactive and adaptive nature of human interactions. To address these limitations, we introduce Interact2Ar, the first end-to-end text-conditioned autoregressive diffusion model for generating full-body, human-human interactions. Interact2Ar incorporates detailed hand kinematics through dedicated parallel branches, enabling high-fidelity full-body generation. Furthermore, we introduce an autoregressive pipeline coupled with a novel memory technique that facilitates adaptation to the inherent variability of human interactions using efficient large context windows. The adaptability of our model enables a series of downstream applications, including temporal motion composition, real-time adaptation to disturbances, and extension beyond dyadic to multi-person scenarios. To validate the generated motions, we introduce a set of robust evaluators and extended metrics designed specifically for assessing full-body interactions. Through quantitative and qualitative experiments, we demonstrate the state-of-the-art performance of Interact2Ar.
PaperID: 1682,   Poster  https://arxiv.org/pdf/2511.21579    
Authors: Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou, Youliang Zhang, Yuan Zhou, Qinglin Lu, Ran Yi
Title: Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
Abstract: The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
PaperID: 1683,   Poster  https://arxiv.org/pdf/2511.15613    
Authors: Jing Bi, Filippos Bellos, JunJia Guo, Yayuan Li, Chao Huang, Yunlong Tang, Luchuan Song, Susan Liang, Zhongfei Zhang, Jason Corso, Chenliang Xu
Title: When to Think and When to Look: Uncertainty-Guided Lookback
Abstract: Test-time “thinking” (i.e., generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision–language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large-scale, controlled comparison of thinking for LVLMs, evaluating 10 variants from the InternVL3.5 and Qwen3-VL families on MMMUval under generous token budgets and multi-pass decoding. We show that more thinking is not always better: long chains often yield long-wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short “lookback” phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty-guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math-focused visual reasoning datasets.
PaperID: 1684,   Poster  https://arxiv.org/pdf/2602.19668    
Authors: He Zhu, Ren Togo, Takahiro Ogawa, Kenji Hirata, Minghui Tang, Takaaki Yoshimura, Hiroyuki Sugimori, Noriko Nishioka, Yukie Shimizu, Kohsuke Kudo, Miki Haseyama
Title: Personalized Longitudinal Medical Report Generation via Temporally-Aware Federated Adaptation
Abstract: Automatic medical report generation from multimodal longitudinal imaging is crucial for clinical diagnosis but remains challenging due to privacy constraints and evolving disease dynamics. While federated learning (FL) enables decentralized model training without data sharing, its extension to longitudinal medical modeling remains underexplored. Existing FL approaches overlook temporal nonstationarity across visits and patient-specific heterogeneity, causing unstable optimization and degraded report quality. We introduce Federated Temporal Adaptation (FTA), a new FL setting for longitudinal medical report generation, and propose FedTAR, a framework combining parameter-efficient personalization and meta-learned temporal aggregation. FedTAR employs a metadata-conditioned LoRA module that generates patient-specific adapters from Gaussian-mixture embeddings and a residual temporal aggregation scheme that adaptively weights client updates via first-order MAML, ensuring stable and efficient optimization under temporal heterogeneity. Experiments on J-MID (1M exams) and MIMIC-CXR demonstrate consistent improvements in linguistic accuracy, temporal coherence, and cross-site generalization, establishing FedTAR as a robust, privacy-preserving paradigm for federated multimodal longitudinal modeling.
PaperID: 1685,   Poster  https://arxiv.org/pdf/2511.19221    
Authors: Jianhua Han, Meng Tian, Jiangtong Zhu, Fan He, Huixin Zhang, Sitong Guo, Dechang Zhu, Hao Tang, Pei Xu, Yuze Guo, Minzhe Niu, Haojie Zhu, Qichao Dong, Xuechao Yan, Siyuan Dong, Lu Hou, Qingqiu Huang, Xiaosong Jia, Hang Xu
Title: Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving
Abstract: Autonomous driving heavily relies on accurate and robust spatial perception. Many failures arise from inaccuracies and instability, especially in long-tail scenarios and complex interactions. However, current vision–language models are weak at spatial grounding and understanding, and VLA systems built on them therefore show limited perception and localization ability. To address these challenges, we introduce Percept-WAM, a perception-enhanced World–Awareness–Action Model that is the first to implicitly integrate 2D/3D scene understanding abilities within a single vision-language model (VLM). Instead of relying on QA-style spatial reasoning, Percept-WAM unifies 2D/3D perception tasks into World-PV and World-BEV tokens, which encode both spatial coordinates and confidence. We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Additionally, Percept-WAM leverages pretrained VLM parameters to retain general intelligence (e.g., logical reasoning) and can output perception results and trajectory control outputs directly. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on COCO 2D detection and nuScenes BEV 3D detection. When integrated with trajectory decoders, it further improves planning performance on nuScenes and NAVSIM, e.g., surpassing DiffusionDrive by 2.1 in PDMS on NAVSIM. Qualitative results further highlight its strong open-vocabulary and long-tail generalization.
PaperID: 1686,   Poster  https://arxiv.org/pdf/2603.16133    
Authors: Xiaoxu Meng, Zhongmin Chen, Bo Yang, Weikai Chen, Weixiao Liu, Lin Gao
Title: DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives
Abstract: We present Compact 3D Reconstruction with Positive and Negative Primitives (DualPrim), a novel approach for reconstructing compact and topologically regular 3D meshes from multiview images. Unlike traditional methods that rely on implicit representations such as signed distance functions, or explicit formats such as meshes and point clouds, our method models geometry using quadrics-based 3D primitives. Each primitive is defined by a positive-density superquadric that contributes to the shape, and a negative-density superquadric that carves out local volumes, enabling fine-grained geometric control and flexible topology. This dual-primitive representation yields compact, well-regularized, and efficiently parameterized mesh reconstructions. To infer primitive parameters from multi-view images, we design a differentiable rendering pipeline that jointly estimates positive and negative superquadrics under view-consistent supervision. Extensive experiments demonstrate that DualPrim outperforms state-of-the-art methods in reconstruction accuracy while producing more geometrically concise, interpretable, and high-fidelity 3D meshes.
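The idea of a positive superquadric that adds volume and a negative one that carves it away can be illustrated with the classical superquadric inside-outside function. The sketch below is an assumption-laden illustration, not the DualPrim density formulation: it uses axis-aligned primitives without rotation, a hard inside/outside test instead of densities, and hypothetical function names.

```python
def superquadric_f(point, center, scale, eps):
    """Classical inside-outside function of an axis-aligned superquadric.
    Values below 1 are inside, 1 is on the surface, above 1 is outside.
    eps = (e1, e2) are the shape exponents; (1, 1) gives an ellipsoid."""
    x, y, z = (p - c for p, c in zip(point, center))
    a1, a2, a3 = scale
    e1, e2 = eps
    xy = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy + abs(z / a3) ** (2.0 / e1)


def occupied(point, positive, negative):
    """Hard-threshold stand-in for a dual primitive (assumption): a point is
    occupied if it lies inside the positive primitive and outside the
    negative one. Each primitive is a (center, scale, eps) tuple."""
    inside_pos = superquadric_f(point, *positive) < 1.0
    inside_neg = superquadric_f(point, *negative) < 1.0
    return inside_pos and not inside_neg


# Unit sphere with a smaller sphere carved out of its +x side (toy values).
pos = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (1.0, 1.0))
neg = ((0.5, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0))
```

A differentiable pipeline such as the one the abstract describes would need a smooth relaxation of this test; the hard threshold is used here only to keep the example short.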
PaperID: 1687,   Poster  https://arxiv.org/pdf/2603.02123    
Authors: Jiahao Huang, Fengyan Lin, Xuechao Yang, Feng Chen, Kexin Zhu, Xu Yang, Zhide chen
Title: Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
Abstract: The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth (perception, understanding, and interaction) and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce Nano-EmoX, a small-scale multitask MLM, and P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels, achieving state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization.
PaperID: 1688,   Poster  https://arxiv.org/pdf/2511.17962    
Authors: Ziheng Jia, Linhan Cao, Jinliang Han, Zicheng Zhang, Jiaying Qian, Wang Jiarui, Zijian Chen, Guangtao Zhai, Xiongkuo Min
Title: VITAL: Vision-Encoder-centered Pretraining for LMMs in Visual Quality Assessment
Abstract: Developing a robust visual quality assessment (VQualA) large multimodal model (LMM) requires achieving versatility, powerfulness, and transferability. However, existing VQualA LMMs typically focus on a single task and rely on full-parameter fine-tuning, which makes them prone to overfitting on specific modalities or task types, thereby limiting their generalization capacity and transferability. To address this, we propose a vision-encoder-centered generative pre-training pipeline and develop the VITAL-Series LMMs. (1) We adopt a machine-executed annotation–scrutiny paradigm, constructing over 4.5M vision–language (VL) pairs, the largest VQualA training dataset to date. (2) We employ a multi-task training workflow that simultaneously enhances the model’s quantitative scoring precision and strengthens its capability for quality interpretation across both image and video modalities. (3) Building upon the vision encoder, we realize the efficient model zoo extension: the model zoo exhibits strong zero-shot performance, and each paired decoder requires only a swift warm-up using less than 1/1000 of the pre-training data to achieve performance comparable to the fully trained counterpart. Overall, our work lays a cornerstone for advancing toward the foundation LMM for VQualA.
PaperID: 1689,   Poster  https://arxiv.org/pdf/2511.18174    
Authors: Mukai Yu, Mosam Dabhi, Liuyue Xie, Sebastian Scherer, Laszlo Jeni
Title: Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Any Camera
Abstract: Modern perception increasingly relies on fisheye, panoramic, and other wide-FoV cameras, yet most pipelines still apply planar CNNs designed for pinhole imagery on 2D grids, where image-space neighborhoods misrepresent physical adjacency and models are sensitive to global rotations. Frequency-domain spherical CNNs partially address this mismatch but require costly spherical harmonic transforms that constrain resolution and efficiency. We introduce the Unified Spherical Frontend (USF), a lens-agnostic framework that lifts images from any calibrated camera to a unit-sphere representation via ray-direction correspondences and performs spherical resampling, convolution, and pooling directly in the spatial domain. USF is modular: projection, location sampling, value interpolation, and output resolution controls are decoupled. Its distance-only spherical kernels provide configurable rotation-equivariance by design (mirroring the translation-equivariance of planar CNNs), while avoiding harmonic transforms. We compare standard planar backbones with their spherical counterparts across classification, detection, and segmentation on synthetic (Spherical MNIST) and real-world datasets (PANDORA, Stanford 2D3DS), and stress-test robustness to extreme lens distortions, varying FoV, and arbitrary rotations. USF processes high-resolution spherical imagery efficiently and maintains less than 1% performance drop under random test-time rotations, even without rotational augmentation, while zero-shot generalizing from one lens type to previously unseen wide-FoV lenses with minimal degradation.
PaperID: 1690,   Poster  https://arxiv.org/pdf/2603.14644    
Authors: Hongyi Pan, Gorkem Durak, Halil Aktas, Andrea Bejar, Baver Tutun, Emre Uysal, Ezgi Bülbül, Mehmet Doğan, Berrin Erok, Berna Yildirim, Sukru M Erturk, Ulas Bagci
Title: LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol
Abstract: Publicly available full-field digital mammography (FFDM) datasets remain limited in size, clinical labels, and vendor diversity, which hinders the training of robust models. We present LUMINA, a curated, multi-vendor FFDM dataset that explicitly encodes acquisition energy and vendor metadata to expose clinically relevant appearance shifts that current benchmarks overlook. This innovative resource comprises 1824 images from 468 patients (960 benign, 864 malignant) with pathology-confirmed outcomes, BI-RADS assessments, and breast-density annotations. LUMINA spans six acquisition systems and both high- and low-energy styles, exposing vendor- and energy-driven appearance shifts. To reduce cross-vendor/energy drift while preserving lesion morphology, we introduce a foreground-only, pixel-space alignment (“energy harmonization”) that aligns each image to a low-energy reference style, leaving the zero-valued background unchanged. By benchmarking modern CNN and transformer baselines on three clinically meaningful tasks (diagnosis of benign vs. malignant, BI-RADS risk grouping, and density), we unify single-vs-two-view evaluation and show that two-view models consistently outperform single-view; in our benchmark, EfficientNet-B0 (512^2) attains AUC 93.61% for diagnosis, and Swin-T yields the best macro-AUC 89.10% for density. Harmonization improves AUC/ACC across backbones and yields more focal Grad-CAM localization around suspicious regions. Being a richly annotated resource, LUMINA thus provides (a) a vendor-diverse, energy-labeled benchmark and (b) a model-agnostic harmonization protocol that together catalyze reliable, deployable mammography AI.
PaperID: 1691,   Poster  https://arxiv.org/pdf/2603.16129    
Authors: Da Zhang, Bingyu Li, Feiyu Wang, Zhiyuan Zhao, Junyu Gao
Title: Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting
Abstract: Zero-shot object counting (ZSOC) aims to enumerate objects of arbitrary categories specified by text descriptions without requiring visual exemplars. However, existing methods often treat counting as a coarse retrieval task, suffering from a lack of fine-grained quantity awareness. Furthermore, they frequently exhibit spatial insensitivity and degraded generalization due to feature space distortion during model adaptation. To address these challenges, we present QICA, a novel framework that synergizes quantity perception with robust spatial cast aggregation. Specifically, we introduce a Synergistic Prompting Strategy (SPS) that adapts vision and language encoders through numerically conditioned prompts, bridging the gap between semantic recognition and quantitative reasoning. To mitigate feature distortion, we propose a Cost Aggregation Decoder (CAD) that operates directly on vision-text similarity maps. By refining these maps through spatial aggregation, CAD prevents overfitting while preserving zero-shot transferability. Additionally, a multi-level quantity alignment loss ($\mathcal{L}_{MQA}$) is employed to enforce numerical consistency across the entire pipeline. Extensive experiments on FSC-147 demonstrate competitive performance, while zero-shot evaluation on CARPK and ShanghaiTech-A validates superior generalization to unseen domains. Code is provided in the appendix.
PaperID: 1692,   Poster  https://arxiv.org/pdf/2604.01600    
Authors: Zitian Tang, Xu Zhang, Jianbo Yuan, Yang Zou, Varad Gunjal, Songyao Jiang, Davide Modolo
Title: MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction
Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated promising capabilities in multimodal coding tasks such as chart-to-code generation. However, existing methods primarily rely on supervised fine-tuning (SFT), which requires the model to learn code patterns through chart-code pairs but does not expose the model to a code execution environment. Moreover, while self-correction through execution feedback offers a potential route to improve coding quality, even state-of-the-art MLLMs have been shown to struggle with effective self-correction. In this work, we introduce MM-ReCoder, a chart-to-code generation model trained with reinforcement learning (RL) and equipped with self-correction ability. We propose a two-stage multi-turn self-correction RL strategy based on Group Relative Policy Optimization (GRPO). The first stage enhances the model's self-correction ability via rolling out a shared first turn, while the second stage improves the coding capability with full-trajectory optimization. MM-ReCoder learns to produce more accurate and executable code through the interaction with the environment and by iteratively correcting its own outputs. Our results on three chart-to-code benchmarks demonstrate the state-of-the-art performance of MM-ReCoder.
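As background for the GRPO-based strategy named in the abstract above, the sketch below shows only the group-relative advantage that gives GRPO its name: rewards for several rollouts of the same prompt are normalized by the group's mean and standard deviation. It is not the paper's two-stage multi-turn procedure; the function name, the epsilon term, and the choice of population standard deviation are assumptions, and implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.
    Rollouts that beat the group mean get a positive advantage, the rest a
    negative one. eps guards against division by zero when all rewards in
    the group are identical."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one chart, scored 1.0 if the code executed (toy reward).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline comes from the group itself, no separate value network is needed, which is one reason this family of methods is popular for execution-feedback rewards.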
PaperID: 1693,   Poster  https://arxiv.org/pdf/2601.06378    
Authors: Hao Zhang, Jiahao Luo, Bohui Wan, Yizhou Zhao, Zongrui Li, Michael Vasilkovsky, Chaoyang Wang, Jian Wang, Narendra Ahuja, Bing Zhou
Title: RigMo: Unifying Rig and Motion Learning for Generative Animation
Abstract: Recent progress in 4D generation has advanced the reconstruction of dynamic geometry, yet the modeling of rig and motion, the two core elements of animation, remains disconnected. Existing approaches typically treat rigging and motion generation as independent tasks: auto-rigging methods rely on human-annotated skeletons and skinning weights, while motion-generation models predict dense vertex trajectories without any explicit structure. This separation contradicts the nature of animation itself, which is the coupled outcome of both structure and motion, and it limits scalability, interpretability, and control. We present RigMo, a unified generative framework that jointly learns rig and motion directly from raw mesh sequences without any rig annotations or human priors. RigMo encodes per-vertex deformations into a compact latent space and decodes a set of implicit Gaussian bones, skinning weights, and time-varying transformations that together define an animatable mesh. This design makes the model animatable by construction: a single latent representation yields both an explicit rig structure and temporally coherent motion parameters. Unlike optimization-based auto-rigging methods that overfit to a specific sequence, RigMo generalizes across object categories and motion styles, offering feed-forward inference for arbitrary deformable objects. Experiments on DeformingThings4D, Objaverse-XL, and diverse human and animal datasets demonstrate that RigMo generates smooth, interpretable, and physically consistent rigs, achieving superior reconstruction and generalization compared to existing 4D generative baselines. RigMo establishes a new paradigm for structure-aware, controllable, and scalable 4D generation.
PaperID: 1694,   Poster  https://arxiv.org/pdf/2512.01755    
Authors: Yucheng Liao, Jiajun Liang, Kaiqian Cui, Baoquan Zhao, Haoran Xie, Wei Liu, Qing Li, Xudong Mao
Title: FreqEdit: Preserving High-Frequency Features for Robust Multi-Turn Image Editing
Abstract: Instruction-based image editing through natural language has emerged as a powerful paradigm for intuitive visual manipulation. While recent models achieve impressive results on single edits, they suffer from severe quality degradation under multi-turn editing. Through systematic analysis, we identify progressive loss of high-frequency information as the primary cause of this quality degradation. We present FreqEdit, a training-free framework that enables stable editing across 10+ consecutive iterations. Our approach comprises three synergistic components: (1) high-frequency feature injection from reference velocity fields to preserve fine-grained details, (2) an adaptive injection strategy that spatially modulates injection strength for precise region-specific control, and (3) a path compensation mechanism that periodically recalibrates the editing trajectory to prevent over-constraint. Extensive experiments demonstrate that FreqEdit achieves superior performance in both identity preservation and instruction following compared to seven state-of-the-art baselines. Code will be made publicly available.
PaperID: 1695,   Poster  https://arxiv.org/pdf/2603.20782    
Authors: Jiaxin Cheng, Yue Wu, Yicong Zhou
Title: MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction
Abstract: Learning-based edge detection models trained with cross-entropy loss often suffer from thick edge predictions, which deviate from the crisp, single-pixel annotations typically provided by humans. While previous approaches to achieving crisp edges have focused on designing specialized loss functions or modifying network architectures, we show that a carefully designed training and inference strategy alone is sufficient to achieve human-like edge quality. In this work, we introduce the Masked Edge Prediction MOdel (MEMO), which produces both accurate and crisp edges using only cross-entropy loss. We first construct a large-scale synthetic edge dataset to pre-train MEMO, enhancing its generalization ability. Subsequent fine-tuning on downstream datasets requires only a lightweight module comprising 1.2% additional parameters. During training, MEMO learns to predict edges under varying ratios of input masking. A key insight guiding our inference is that thick edge predictions typically exhibit a confidence gradient: high in the center and lower toward the boundaries. Leveraging this, we propose a novel progressive prediction strategy that sequentially finalizes edge predictions in order of prediction confidence, resulting in thinner and more precise contours. Our method achieves visually appealing, post-processing-free, human-like edge maps and outperforms prior methods on crispness-aware evaluations.
PaperID: 1696,   Poster  https://arxiv.org/pdf/2507.08422    
Authors: Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun
Title: Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers
Abstract: Diffusion transformers (DiTs) offer excellent scalability for high-fidelity generation, but their computational overhead poses a great challenge for practical deployment. Existing acceleration methods primarily exploit the temporal dimension, whereas spatial acceleration remains underexplored. In this work, we investigate spatial acceleration for DiTs via latent upsampling. We found that naïve latent upsampling for spatial acceleration introduces artifacts, primarily due to aliasing in high-frequency edge regions and mismatching from noise-timestep discrepancies. Then, based on these findings and analyses, we propose a training-free spatial acceleration framework, dubbed Region-Adaptive Latent Upsampling (RALU), to mitigate those artifacts while achieving spatial acceleration of DiTs by our mixed-resolution latent upsampling. RALU achieves artifact-free, efficient acceleration with early upsampling only on artifact-prone edge regions and noise-timestep matching for different latent resolutions, leading to up to 7.0× speedup on FLUX-1.dev and 3.0× on Stable Diffusion 3 with negligible quality degradation. Furthermore, our RALU is complementarily applicable to existing temporal acceleration methods and timestep-distilled models, leading to up to 15.9× speedup.
PaperID: 1697,   Poster  https://arxiv.org/pdf/2603.29036    
Authors: Yujin Ham, Junho Kim, Vivek Boominathan, Guha Balakrishnan
Title: Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos
Abstract: Egocentric "walking tour" videos provide a rich source of image data to develop rich and diverse visual models of environments around the world. However, the significant presence of humans in frames of these videos due to crowds and eye-level camera perspectives diminishes their usefulness in environment modeling applications. We focus on addressing this challenge by developing a generative algorithm that can realistically remove (i.e., inpaint) humans and their associated shadow effects from walking tour videos. Key to our approach is the construction of a rich semi-synthetic dataset of video clip pairs to train this generative model. Each pair in the dataset consists of an environment-only background clip, and a composite clip of walking humans with simulated shadows overlaid on the background. We randomly sourced both foreground and background components from real egocentric walking videos around the world to maintain visual diversity. We then used this dataset to fine-tune the state-of-the-art Casper video diffusion model for object and effects inpainting, and demonstrate that the resulting model performs far better than Casper both qualitatively and quantitatively at removing humans from walking tour clips with significant human presence and complex backgrounds. Finally, we show that the resulting generated clips can be used to build successful 3D Gaussian Splatting models of urban locations, which was not possible with the original clips.
PaperID: 1698,   Poster  https://arxiv.org/pdf/2512.23273    
Authors: Xu Lin, Jinlong Peng, Zhenye Gan, Jiawen Zhu, Jun Liu
Title: YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection
Abstract: Existing Real-Time Object Detection (RTOD) methods commonly adopt YOLO-like architectures for their favorable trade-off between accuracy and speed. However, these models rely on static dense computation that applies uniform processing to all inputs, misallocating representational capacity and computational resources such as over-allocating on trivial scenes while under-serving complex ones. This mismatch results in both computational redundancy and suboptimal detection performance. To overcome this limitation, we propose YOLO-Master, a novel YOLO-like framework that introduces instance-conditional adaptive computation for RTOD. This is achieved through an Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources to each input according to its scene complexity. At its core, a lightweight dynamic routing network guides expert specialization during training through a diversity-enhancing objective, encouraging complementary expertise among experts. Additionally, the routing network adaptively learns to activate only the most relevant experts, thereby improving detection performance while minimizing computational overhead during inference. Comprehensive experiments on five large-scale benchmarks demonstrate the superiority of YOLO-Master. On MS COCO, our model achieves 42.4% AP with 1.62ms latency, outperforming YOLOv13-N by +0.8% mAP and 17.8% faster inference. Notably, the gains are most pronounced on challenging dense scenes, while the model preserves efficiency on typical inputs and maintains real-time inference speed. Code will be available.
PaperID: 1699,   Poster  https://arxiv.org/pdf/2602.11448    
Authors: Nghia Nguyen, Tianjiao Ding, Rene Vidal
Title: Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification
Abstract: Interpretable-by-design models are gaining traction in computer vision because they provide faithful explanations for their predictions. In image classification, these models typically recover human-interpretable concepts from an image and use them for classification. Sparse concept recovery methods leverage the latent space of vision-language models to represent image embeddings as a sparse combination of concept embeddings. However, because such methods ignore the hierarchical structure of concepts, they can produce correct predictions with explanations that are inconsistent with the hierarchy. In this work, we propose Hierarchical Concept Embedding and Pursuit (HCEP), a framework that induces a hierarchy of concept vectors in the latent space and uses hierarchical sparse coding to recover the concepts present in an image. Given a semantic hierarchy of concepts, we construct a corresponding hierarchy of concept vectors and, assuming the correct concepts for an image form a rooted path in the hierarchy, derive desirable conditions for identifying them in the embedded space. We show that hierarchical sparse coding reliably recovers hierarchical concept vectors, whereas vanilla sparse coding fails. Our experiments demonstrate that HCEP outperforms baselines on real-world datasets in concept precision and recall while maintaining competitive classification accuracy. Moreover, when the number of samples is limited, HCEP achieves superior classification accuracy and concept recovery. These results suggest that incorporating hierarchical structures into sparse coding yields more reliable and interpretable image classification models.
PaperID: 1700,   Poster  https://arxiv.org/pdf/2503.23359    
Authors: Linfeng Tang, Yeda Wang, Meiqi Gong, Zizhuo Li, Yuxin Deng, Xunpeng Yi, Chunyu Li, Han Xu, HAO ZHANG, Jiayi Ma
Title: VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion
Abstract: Compared to images, videos better align with real-world acquisition scenarios and possess valuable temporal cues. However, existing multi-sensor fusion research predominantly integrates complementary context from multiple images rather than videos. This primarily stems from two factors: 1) the scarcity of large-scale multi-sensor video datasets, limiting research in video fusion, and 2) the inherent difficulty of jointly modeling spatial and temporal dependencies in a unified framework. This paper addresses both obstacles. First, we construct M3SVD, a benchmark dataset with 220 temporally synchronized and spatially registered infrared-visible videos comprising 153,797 frames, filling the data gap for the video fusion community. Second, we propose VideoFusion, a multi-modal video fusion model that fully exploits cross-modal complementarity and temporal dynamics to generate spatio-temporally coherent videos from multi-modal inputs. Specifically, 1) a differential reinforcement module is developed for cross-modal information interaction and enhancement, 2) a complete modality-guided fusion strategy is employed to adaptively integrate multi-modal features, and 3) a bi-temporal co-attention mechanism is devised to dynamically aggregate forward-backward temporal contexts to reinforce cross-frame feature representations. Extensive experiments reveal that VideoFusion outperforms existing image-oriented fusion paradigms in sequences, effectively mitigating temporal inconsistency and interference.
PaperID: 1701,   Poster  https://arxiv.org/pdf/2603.24984    
Authors: Dohwan Ko, Jinyoung Park, Seoung Choi, Sanghyeok Lee, Seohyun Lee, Hyunwoo J. Kim
Title: MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models
Abstract: Mixture-of-Experts (MoE) has emerged as an effective approach to reduce the computational overhead of Transformer architectures by sparsely activating a subset of parameters for each token while preserving high model capacity. This paradigm has recently been extended to Vision-Language Models (VLMs), enabling scalable multi-modal understanding with reduced computational cost. However, the widely adopted deterministic top-K routing mechanism may overlook more optimal expert combinations and lead to expert overfitting. To address this limitation and improve the diversity of expert selection, we propose MoE-GRPO, a reinforcement learning (RL)-based framework for optimizing expert routing in MoE-based VLMs. Specifically, we formulate expert selection as a sequential decision-making problem and optimize it using Group Relative Policy Optimization (GRPO), allowing the model to learn adaptive expert routing policies through exploration and reward-based feedback. Furthermore, we introduce a modality-aware router guidance that encourages the router to assign tokens to the appropriate modality-specific experts during training. Extensive experiments on multimodal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants by promoting more diverse expert selection, thereby mitigating expert overfitting and improving generalization. To the best of our knowledge, this is the first work to explicitly optimize the expert selection policy through RL.
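The deterministic top-K routing that the abstract above takes as its baseline can be sketched as follows. This shows one common variant (pick the K largest router logits, then softmax over only those); it is not the MoE-GRPO policy, and the function name and toy logits are assumptions for illustration. Some implementations instead apply softmax over all experts before selecting the top K.

```python
import math


def top_k_route(logits, k):
    """Deterministic top-K routing for one token.
    Returns {expert_index: gate_weight} for the k experts with the largest
    router logits; gate weights are a softmax over the selected logits and
    sum to 1. Ties are broken in favor of the lower expert index."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = order[:k]
    peak = max(logits[i] for i in chosen)  # subtract max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Router logits for four experts (toy values); experts 0 and 2 are selected.
gates = top_k_route([2.0, 0.5, 2.0, -1.0], k=2)
```

Since the same logits always select the same experts, the router never explores alternative combinations, which is the limitation the abstract says RL-based routing is meant to address.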
PaperID: 1702,   Poster  https://arxiv.org/pdf/2603.13912    
Authors: Yuting Tan, Xilong Cheng, Yunxiao Qin, Zhengnan Li, Jingjing Zhang
Title: Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
Abstract: Humans develop visual intelligence through perceiving and interacting with their environment, a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how artificial systems can learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations. This setting poses challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher-Filtered Temporal Consistency, which enforces identity over time. This creates a virtuous cycle where initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end-to-end on unlabeled first-person videos and exhibits robustness to geometric priors of varied origin and quality. On standard benchmarks, EgoViT achieves +8.0% CorLoc improvement in unsupervised object discovery and +4.8% mIoU improvement in semantic segmentation, with the potential to lay a foundation for robust visual abstraction in embodied intelligence.
PaperID: 1703,   Poster  https://arxiv.org/pdf/2512.06006    
Authors: Xuefei Wang, Kai A. Horstmann, Ethan Lin, Jonathan Chen, Alexander Farhang, Sophia Stiles, Atharva Sehgal, Jonathan Light, David Valen, Yisong Yue, Jennifer J. Sun
Title: Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization
Abstract: Adapting production-level computer vision tools to bespoke scientific datasets is a critical "last mile" bottleneck. Current solutions are impractical: fine-tuning requires large annotated datasets scientists often lack, while manual code adaptation costs scientists weeks to months of effort. We consider using AI agents to automate this manual coding, and focus on the open question of optimal agent design for this targeted task. We introduce a systematic evaluation framework for agentic code optimization and use it to study three production-level biomedical imaging pipelines. We demonstrate that a simple agent framework consistently generates adaptation code that outperforms human-expert solutions. Our analysis reveals that common, complex agent architectures are not universally beneficial, leading to a practical roadmap for agent design. We open source our framework and validate our approach by deploying agent-generated functions into a production pipeline, demonstrating a clear pathway for real-world impact.
PaperID: 1704,   Poster  https://arxiv.org/pdf/2512.09327    
Authors: Xuangeng Chu, Ruicong Liu, Yifei Huang, Yun Liu, YICHEN PENG, Bo Zheng
Title: UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
Abstract: Generating lifelike conversational avatars requires modeling not just isolated speakers, but the dynamic, reciprocal interaction of speaking and listening. However, modeling the listener is exceptionally challenging: direct audio-driven training fails, producing stiff, static listening motions. This failure stems from a fundamental imbalance: the speaker's motion is strongly driven by speech audio, while the listener's motion primarily follows an internal motion prior and is only loosely guided by external speech. This challenge has led most methods to focus on speak-only generation. The only prior attempt at joint generation relies on the speaker's motion as an extra input to produce the listener's motion. This design is not end-to-end, which hinders real-time applicability. To address this limitation, we present UniLS, the first end-to-end framework for generating unified speak-listen expressions, driven by only dual-track audio. Our method introduces a novel two-stage training paradigm. Stage 1 first learns the internal motion prior by training an audio-free autoregressive generator, capturing the spontaneous dynamics of natural facial motion. Stage 2 then introduces the dual-track audio, fine-tuning the generator to modulate the learned motion prior based on external speech cues. Extensive evaluations show UniLS achieves state-of-the-art speaking accuracy. More importantly, it delivers up to 44.1% improvement in listening metrics, generating significantly more diverse and natural listening expressions. This effectively mitigates the stiffness problem and provides a practical, high-fidelity audio-driven solution for interactive digital humans.
PaperID: 1705,   Poster  https://arxiv.org/pdf/2505.22499    
Authors: Aixuan Li, Mochu Xiang, Bosen Hou, Zhexiong Wan, Jing Zhang, Yuchao Dai
Title: Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors
Abstract: Adversarial robustness of BEV 3D object detectors is critical for autonomous driving (AD). Existing invasive attacks require altering the target vehicle itself (e.g., attaching patches), making them unrealistic and impractical for real-world evaluation. While non-invasive attacks that place adversarial objects in the environment are more practical, current methods still lack the multi-view and temporal consistency needed for physically plausible threats. In this paper, we present the first framework for generating universal, non-invasive, and 3D consistent adversarial objects that expose fundamental vulnerabilities for BEV 3D object detectors. Instead of modifying target vehicles, our method inserts rendered objects into scenes with an occlusion-aware module that enforces physical plausibility across views and time. To maintain attack effectiveness across views and frames, we optimize adversarial object appearance using a BEV spatial feature-guided optimization strategy that attacks the detector's internal representations. Extensive experiments demonstrate that our learned universal adversarial objects can consistently degrade multiple BEV detectors from various viewpoints and distances. More importantly, the new environment-manipulation attack paradigm exposes models' over-reliance on contextual cues and provides a practical pipeline for robustness evaluation in AD systems.
PaperID: 1706,   Poster  https://arxiv.org/pdf/2512.18897    
Authors: Dmitry Demidov, Muhammad Zaigham Zaheer, Zongyan Han, Omkar Thawakar, Rao Anwer
Title: Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs
Abstract: Vocabulary-free fine-grained image recognition aims to distinguish visually similar categories within a meta-class without a fixed, human-defined label set. Existing solutions for this problem remain limited by either the usage of a large and rigid list of vocabularies or by the dependency on complex pipelines with fragile heuristics where errors propagate across stages. Meanwhile, the ability of recent large multi-modal models (LMMs) equipped with explicit or implicit reasoning to comprehend visual-language data, decompose problems, retrieve latent knowledge, and self-correct suggests a more principled and effective alternative. Building on these capabilities, we propose FiNDR (Fine-grained Name Discovery via Reasoning), the first reasoning-augmented LMM-based framework for vocabulary-free fine-grained recognition. The system operates in three automated steps: (i) a reasoning-enabled LMM generates descriptive candidate labels for each image; (ii) a vision-language model filters and ranks these candidates to form a coherent class set; and (iii) the verified names instantiate a lightweight multi-modal classifier used at inference time. Extensive experiments on popular fine-grained classification benchmarks demonstrate state-of-the-art performance under the vocabulary-free setting, with a significant relative margin of up to 18.8% over previous approaches. Remarkably, the proposed method surpasses zero-shot baselines that exploit pre-defined ground-truth names, challenging the assumption that human-curated vocabularies define an upper bound. Ablations further confirm that advanced prompting techniques and built-in reasoning mechanisms significantly enhance naming quality. Additionally, we show that carefully engineered prompts enable open-source LMMs to match proprietary counterparts. These findings establish reasoning-augmented LMMs as an effective foundation for scalable, fully automated, open-world fine-grained visual recognition. The source code and relevant prompting guidelines will be released.
PaperID: 1707,   Poster  https://arxiv.org/pdf/2603.12989    
Authors: Zhifang Zhang, Yang Bojun, Shuo He, Weitong Chen, Wei Emma Zhang, Olaf Maennel, Lei Feng, Miao Xu
Title: Test-Time Attention Purification for Backdoored Large Vision Language Models
Abstract: Despite their strong multimodal performance, large vision–language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger does not influence prediction through low-level visual patterns, but through abnormal cross-modal attention redistribution, where trigger-bearing visual tokens steal attention away from the textual context, a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense that operates purely at test time. CleanSight (i) detects poisoned inputs based on the relative visual–text attention ratio in selected cross-modal fusion layers, and (ii) purifies the input by selectively pruning the suspicious high-attention visual tokens to neutralize the backdoor activation. Extensive experiments show that CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model’s utility on both clean and poisoned samples.
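The two steps named in this abstract (detecting by a visual-to-text attention ratio, then pruning high-attention visual tokens) can be pictured with the minimal sketch below. The threshold, the number of pruned tokens, the function names, and the toy attention values are all hypothetical; the paper's actual layer selection and scoring are not specified here.

```python
def visual_text_ratio(attn, is_visual):
    """Ratio of attention mass on visual tokens to attention mass on text tokens."""
    visual = sum(a for a, v in zip(attn, is_visual) if v)
    text = sum(a for a, v in zip(attn, is_visual) if not v)
    return visual / max(text, 1e-12)


def purify(attn, is_visual, ratio_threshold, prune_count):
    """Return the indices of tokens to keep.

    If the ratio is at or below the (assumed) threshold, the input is treated
    as clean and nothing is pruned. Otherwise the prune_count visual tokens
    with the highest attention are dropped.
    """
    keep = list(range(len(attn)))
    if visual_text_ratio(attn, is_visual) <= ratio_threshold:
        return keep
    visual_idx = [i for i in keep if is_visual[i]]
    ranked = sorted(visual_idx, key=lambda i: attn[i], reverse=True)
    suspicious = set(ranked[:prune_count])
    return [i for i in keep if i not in suspicious]


# Toy attention rows over 3 visual tokens followed by 2 text tokens.
is_visual = [True, True, True, False, False]
poisoned_attn = [0.05, 0.60, 0.05, 0.10, 0.20]  # token 1 dominates
clean_attn = [0.10, 0.10, 0.10, 0.30, 0.40]

kept_poisoned = purify(poisoned_attn, is_visual, ratio_threshold=1.5, prune_count=1)
kept_clean = purify(clean_attn, is_visual, ratio_threshold=1.5, prune_count=1)
```

In this toy case only the poisoned row exceeds the threshold, so only its dominant visual token is removed while the clean row passes through unchanged.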
PaperID: 1708,   Poster  https://arxiv.org/pdf/2603.06081    
Authors: Bozhi Luan, Gen Li, Yalan Qin, Jifeng Guo, Yun Zhou, Faguo Wu, Hongwei Zheng, wenjun wu, Zhaoxin Fan
Title: Lyapunov Probes for Hallucination Detection in Large Foundation Models
Abstract: We address hallucination detection in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) by framing the problem through the lens of dynamical systems stability theory. Rather than treating hallucination as a straightforward classification task, we conceptualize (M)LLMs as dynamical systems, where factual knowledge is represented by stable equilibrium points within the representation space. Our main insight is that hallucinations tend to arise at the boundaries of knowledge, the transition regions separating stable and unstable zones. To capture this phenomenon, we propose Lyapunov Probes: lightweight networks trained with derivative-based stability constraints that enforce a monotonic decay in confidence under input perturbations. By performing systematic perturbation analysis and applying a two-stage training process, these probes reliably distinguish between stable factual regions and unstable, hallucination-prone areas. Experiments on diverse datasets and models demonstrate consistent improvements over existing baselines.
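One simple way to picture a constraint that "enforces a monotonic decay in confidence under input perturbations" is a hinge penalty on any increase in confidence as the perturbation grows. The sketch below is a finite-difference stand-in chosen for illustration, not the derivative-based constraint used in the paper; the function name and the toy confidence values are assumptions.

```python
def monotonic_decay_penalty(confidences):
    """Sum of confidence increases along a sequence ordered by growing
    perturbation magnitude. Zero when confidence never increases."""
    return sum(max(0.0, later - earlier)
               for earlier, later in zip(confidences, confidences[1:]))


# Probe confidence at perturbation levels 0, 1, 2, 3 (hypothetical values).
well_behaved = [0.90, 0.80, 0.60, 0.40]   # strictly decaying: no penalty
violating = [0.90, 0.70, 0.75, 0.40]      # rises once, by about 0.05

penalty_ok = monotonic_decay_penalty(well_behaved)
penalty_bad = monotonic_decay_penalty(violating)
```

Adding such a term to a probe's training loss would push its confidence to fall as inputs are perturbed further, which is the qualitative behavior the abstract describes.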
PaperID: 1709,   Poster  https://arxiv.org/pdf/2510.04225    
Authors: Yikun Ji, Yan Hong, Bowen Deng, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang
Title: Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images
Abstract: The rapid growth of AI-generated imagery has blurred the boundary between real and synthetic content, raising practical concerns for digital integrity. Vision-language models (VLMs) can provide natural language explanations, but standard one-pass classifiers often miss subtle artifacts in high-quality synthetic images and offer limited grounding in the pixels. We propose Locate-Then-Examine (LTE), a two-stage VLM-based forensic framework that first localizes suspicious regions and then re-examines these crops together with the full image to refine the real vs. AI-generated verdict and its explanation. LTE explicitly links each decision to localized visual evidence through region proposals and region-aware reasoning. To support training and evaluation, we introduce TRACE, a dataset of 20,000 real and high-quality synthetic images with region-level annotations and automatically generated forensic explanations, constructed by a VLM-based pipeline with additional consistency checks and quality control. Across TRACE and multiple external benchmarks, LTE achieves competitive accuracy and improved robustness while providing human-understandable, region-grounded explanations suitable for forensic deployment.
PaperID: 1710,   Poster  https://arxiv.org/pdf/2603.22893    
Authors: ZhiCheng Qiu, Jiarui Meng, Tong-an Luo, Yican Huang, Xuan Feng, Xuanfu Li, Zhan Xu
Title: SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes
Abstract: We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. In addition, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy by 21%, reconstruction PSNR by 1.6 dB, and segmentation mIoU by 20% over existing methods.
PaperID: 1711,   Poster  https://arxiv.org/pdf/2511.17844    
Authors: Shihan Cheng, Nilesh Kulkarni, David Hyde, Dmitriy Smirnov
Title: Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
Abstract: Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.
PaperID: 1712,   Poster  https://arxiv.org/pdf/2603.29186    
Authors: Ryosuke Matsuda, Keito Kudo, Haruto Yoshida, Nobuyuki Shimizu, Jun Suzuki
Title: SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation
Abstract: We introduce SLVMEval, a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. SLVMEval focuses on assessing these systems on long videos of up to 10,486 seconds (approximately 3 hours). Our benchmark targets a fundamental requirement: whether systems can accurately judge video quality in settings that are easy for humans to assess. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video captioning datasets, we synthetically degrade source videos to create controlled "high-quality vs. low-quality" pairs across 10 distinct aspects. We then use crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing the final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Our experiments show that human evaluators identify the better long video with 84.7%–96.8% accuracy, while in 9 of the 10 aspects, the accuracy of these systems falls short of human judgment, revealing weaknesses in text-to-long video evaluation.
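The pairwise meta-evaluation described in this abstract reduces to a simple accuracy: how often an evaluation system scores the undegraded video above its degraded counterpart. The sketch below shows that computation under stated assumptions; the function name, the tie-handling rule (ties count as errors), and the scores are hypothetical.

```python
def pairwise_accuracy(score_pairs):
    """Fraction of pairs where the high-quality video outscores the degraded one.

    score_pairs: list of (score_high_quality, score_low_quality) tuples produced
    by the evaluation system under test. Ties are counted as incorrect, which
    is an assumption of this sketch.
    """
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    correct = sum(1 for high, low in score_pairs if high > low)
    return correct / len(score_pairs)


# Hypothetical scores from an automatic evaluator for one degradation aspect.
aspect_scores = [(0.9, 0.4), (0.6, 0.7), (0.8, 0.8), (0.7, 0.2)]
accuracy = pairwise_accuracy(aspect_scores)
```

Computing this per aspect and comparing against the human accuracy on the same pairs gives the kind of per-aspect gap the abstract reports.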
PaperID: 1713,   Poster  https://arxiv.org/pdf/2603.24953    
Authors: ZeBin Ji, Yang Hu, Xiuli Bi, Bo Liu, Bin Xiao
Title: Select, Hypothesize and Verify: Towards Verified Neuron Concept Interpretation
Abstract: Interpreting the functionality (also known as concepts) of neurons is essential for understanding neural network decisions. Existing approaches describe neuron concepts by generating natural language descriptions, thereby advancing the understanding of the neural network's decision-making mechanism. However, these approaches assume that each neuron has well-defined functions and provides discriminative features for neural network decision-making. In fact, some neurons may be redundant or may offer misleading concepts. Thus, the descriptions for such neurons may cause misinterpretations of the factors driving the neural network’s decisions. To address the issue, we introduce a verification of neuron functions, which checks whether the generated concept highly activates the corresponding neuron. Furthermore, we propose a Select–Hypothesize–Verify framework for interpreting neuron functionality. This framework consists of: 1) selecting activation samples that best capture a neuron’s well-defined functional behavior through activation-distribution analysis; 2) forming hypotheses about concepts for the selected neurons; and 3) verifying whether the generated concepts accurately reflect the functionality of the neuron. Extensive experiments show that our method produces more accurate neuron concepts. Our generated concepts activate the corresponding neurons with a probability approximately 1.5 times that of the current state-of-the-art method.
PaperID: 1714,   Poster  https://arxiv.org/pdf/2603.26285    
Authors: Saurabh Pathak, Elahe Arani, Mykola Pechenizkiy, Bahram Zonooz
Title: PhysVid: Physics Aware Local Conditioning for Generative Video Models
Abstract: Generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real-world settings. Prior attempts to inject physics rely on conditioning: frame-level signals are domain-specific and short-horizon, while global text prompts are coarse and noisy, missing fine-grained dynamics. We present PhysVid, a physics-aware local conditioning scheme that operates over temporally contiguous chunks of frames. Each chunk is annotated with physics-grounded descriptions of states, interactions, and constraints, which are fused with the global prompt via chunk-aware cross-attention during training. At inference, we introduce negative physics prompts (descriptions of locally relevant law violations) to steer generation away from implausible trajectories. On VideoPhy, PhysVid improves physical commonsense scores by approximately 33% over baseline video generators, and by up to approximately 8% on VideoPhy2. These results show that local, physics-aware guidance substantially increases physical plausibility in generative video and marks a step toward physics-grounded video models.
PaperID: 1715,   Poster  https://arxiv.org/pdf/2601.16210    
Authors: Onkar Susladkar, Tushar Prakash, Adheesh Juvekar, Kiet A. Nguyen, Dong-Hwan Jang, Inderjit S Dhillon, Ismini Lourentzou
Title: PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
Abstract: Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatial-temporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video instance segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.
PaperID: 1716,   Poster  https://arxiv.org/pdf/2509.13688    
Authors: James Hu, Yuxiao Wu, Youcheng Cai, Ligang Liu
Title: CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion
Abstract: Controllable, high-fidelity mesh editing remains a significant challenge in the domain of 3D content creation. Existing generative methods often struggle with complex geometries and fail to preserve fine-scale details. We propose CraftMesh, a novel framework for high-fidelity generative mesh manipulation based on Poisson Seamless Fusion. Our key insight is to decompose mesh editing into a pipeline that leverages the strengths of 2D image editing and 3D generative modeling: we first edit a 2D reference image, then generate a 3D mesh corresponding to the edited region, and fuse it seamlessly into the original mesh through a Joint Geometry and Appearance Fusion framework built on a hybrid SDF/Mesh representation to enable Poisson Geometry Blending and Poisson Texture Harmonization. Experimental results demonstrate that CraftMesh outperforms state-of-the-art methods, delivering improved structural consistency and richer local geometric and appearance details in challenging editing scenarios. The implementation will be released publicly upon acceptance.
PaperID: 1717,   Poster  https://arxiv.org/pdf/2602.19679    
Authors: Hyeongjin Nam, Daniel Jung, Kyoung Mu Lee
Title: TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures
Abstract: Joint reconstruction of 3D human and object from a single image is an active research area, with pivotal applications in robotics and digital content creation. Despite recent advances, existing approaches suffer from two fundamental limitations. First, their reconstructions rely heavily on physical contact information, which inherently cannot capture non-contact human–object interactions, such as gazing at or pointing toward an object. Second, the reconstruction process is primarily driven by local geometric proximity, neglecting the human and object appearances that provide global context crucial for understanding holistic interactions. To address these issues, we introduce TeHOR, a framework built upon two core designs. First, beyond contact information, our framework leverages text descriptions of human–object interactions to enforce semantic alignment between the 3D reconstruction and its textual cues, enabling reasoning over a wider spectrum of interactions, including non-contact cases. Second, we incorporate appearance cues of the 3D human and object into the alignment process to capture holistic contextual information, thereby ensuring visually plausible reconstructions. As a result, our framework produces accurate and semantically coherent reconstructions, achieving state-of-the-art performance.
PaperID: 1718,   Poster  https://arxiv.org/pdf/2511.10946    
Authors: Yifan Liu, Fangneng Zhan, Kaichen Zhou, Yilun Du, Paul Pu Liang, Hanspeter Pfister
Title: Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
Abstract: Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between the 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, for instance achieving an 8.3% gain on SAT Real compared with baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.
PaperID: 1719,   Poster  https://arxiv.org/pdf/2512.01061    
Authors: Haoru Xue, Tairan He, Zi Wang, Qingwei Ben, Wenli Xiao, Zhengyi Luo, Xingye Da, Fernando Castañeda, Guanya Shi, Shankar Sastry, Linxi Fan, Yuke Zhu
Title: Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer
Abstract: Recent progress in GPU-accelerated, photorealistic simulation has opened a scalable data-generation path for robot learning, where massive physics and visual randomization allow policies to generalize beyond curated environments. Building on these advances, we develop a teacher–student–bootstrap learning framework for vision-based humanoid loco-manipulation, using articulated-object interaction as a representative high-difficulty benchmark. Our approach introduces a staged-reset exploration strategy that stabilizes long-horizon privileged-policy training, and a GRPO-based fine-tuning procedure designed to mitigate partial observability and improve closed-loop consistency in sim-to-real RL. Trained entirely on synthetic simulation data, the resulting policy achieves robust zero-shot performance across diverse articulated objects, including multiple door types, and outperforms human teleoperators by up to 31.7% in task completion time under the same whole-body control stack. This represents the first humanoid sim-to-real policy capable of diverse articulated loco-manipulation from pure RGB perception.
PaperID: 1720,   Poster  https://arxiv.org/pdf/2603.25738    
Authors: Xincheng Shuai, Song Tang, Yutong Huang, Henghui Ding, Dacheng Tao
Title: PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow
Abstract: Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automated design system that can faithfully translate user intentions into editable design files remains an open challenge. Although recent studies have leveraged powerful text-to-image models and MLLMs to assist graphic design, they typically simplify professional workflows, resulting in limited flexibility and intuitiveness. To address these limitations, we propose PSDesigner, an automated graphic design system that emulates the creative workflow of human designers. Building upon multiple specialized components, PSDesigner collects theme-related assets based on user instructions, and autonomously infers and executes tool calls to manipulate design files, such as integrating new assets or refining inferior elements. To endow the system with strong tool-use capabilities, we construct a design dataset, CreativePSD, which contains a large amount of high-quality PSD design files annotated with operation traces across a wide range of design scenarios and artistic styles, enabling models to learn expert design procedures. Extensive experiments demonstrate that PSDesigner outperforms existing methods across diverse graphic design tasks, empowering non-specialists to conveniently create production-quality designs.
PaperID: 1721,   Poster  https://arxiv.org/pdf/2511.00261    
Authors: Neha Balamurugan, Sarah A Wu, Cristobal Eyzaguirre, Tobias Gerstenberg
Title: Spot The Ball: A Benchmark for Visual Social Inference
Abstract: Humans excel at visual social inference, the ability to infer hidden elements of a scene from subtle behavioral cues such as other people's gaze, pose, and orientation. This capacity drives everyday social reasoning in humans and is critical for developing more human-like AI agents. We introduce Spot The Ball, a challenging benchmark for evaluating visual social inference in vision–language models (VLMs) using sports as a test domain. The task is to localize a removed sports ball from soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines and a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) using three prompting strategies, finding that humans are consistently two to three times more accurate (20–34%) than models (≤17%) across all sports. Our analyses show that models rely on superficial spatial heuristics, such as guessing near the image center or nearby players, while humans leverage social cues like gaze direction and body pose. These findings reveal a persistent human–model gap in visual social reasoning and underscore the need for architectures that explicitly encode structured behavioral cues to achieve robust, human-like inference.
PaperID: 1722,   Poster  https://arxiv.org/pdf/2604.04563    
Authors: Hanbin Ko, Kyungmin Jeon, Doowoong Choi, Chang Min Park
Title: Temporal Inversion for Learning Interval Change in Chest X-Rays
Abstract: Recent advances in vision-language pretraining have enabled strong medical foundation models, yet most analyze radiographs in isolation, overlooking the key clinical task of comparing prior and current images to assess interval change. For chest radiographs (CXRs), capturing interval change is essential, as radiologists must evaluate not only the static appearance of findings but also how they evolve over time. We introduce TILA (Temporal Inversion-aware Learning and Alignment), a simple yet effective framework that uses temporal inversion (reversing image pairs) as a supervisory signal for temporal reasoning. TILA integrates inversion-aware objectives across pretraining, fine-tuning, and inference, complementing conventional appearance modeling with explicit learning of directional change. We also propose a unified evaluation protocol to assess order sensitivity and consistency under temporal inversion, and introduce MS-CXR-T_retrieval, a benchmark for progression-aware retrieval. Experiments on public datasets and real-world hospital cohorts demonstrate that TILA consistently improves progression classification and temporal embedding alignment across multiple architectures. Overall, temporal inversion provides a simple and general principle for building order-aware medical vision-language models and supports temporally robust reasoning.
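Temporal inversion as a supervisory signal can be pictured as follows: swapping the prior and current image should flip a directional label and leave a "no change" label untouched, and a model's predictions can be checked for that consistency. The sketch below assumes a three-way label set (improved, stable, worsened) and uses hypothetical function names and file names; the paper's actual objectives and label vocabulary may differ.

```python
# Assumed label set for interval change; reversing the pair flips the direction.
INVERTED_LABEL = {
    "improved": "worsened",
    "worsened": "improved",
    "stable": "stable",
}


def invert_pair(prior, current, label):
    """Build the temporally inverted training example for (prior, current, label)."""
    return current, prior, INVERTED_LABEL[label]


def is_order_consistent(pred_forward, pred_reversed):
    """True when the prediction on the reversed pair is the inverse of the
    prediction on the original pair."""
    return INVERTED_LABEL[pred_forward] == pred_reversed


# Placeholder identifiers stand in for image tensors.
example = ("cxr_2021.png", "cxr_2023.png", "worsened")
inverted = invert_pair(*example)

consistent = is_order_consistent("worsened", "improved")
inconsistent = is_order_consistent("worsened", "worsened")
```

A model that predicts the same direction regardless of image order fails this check, which is the kind of order insensitivity the proposed evaluation protocol is meant to expose.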
PaperID: 1723,   Poster  https://arxiv.org/pdf/2511.18173    
Authors: Enrico Pallotta, Sina Mokhtarzadeh Azar, Lars Doorenbos, Serdar Ozsoy, Umar Iqbal, Jürgen Gall
Title: EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses
Abstract: Egocentric video generation with fine-grained control through body motion is a key requirement towards embodied AI agents that can simulate, predict, and plan actions. In this work, we propose EgoControl, a pose-controllable video diffusion model trained on egocentric data. We train a video prediction model to condition future frame generation on explicit 3D body pose sequences. To achieve precise motion control, we introduce a novel pose representation that captures both global camera dynamics and articulated body movements, and integrate it through a dedicated control mechanism within the diffusion process. Given a short sequence of observed frames and a sequence of target poses, EgoControl generates temporally coherent and visually realistic future frames that align with the provided pose control. Experimental results demonstrate that EgoControl produces high-quality, pose-consistent egocentric videos, paving the way toward controllable embodied video simulation and understanding.
PaperID: 1724,   Poster  https://arxiv.org/pdf/2512.20538    
Authors: Anna Šárová Mikeštíková, Médéric Fourmy, Martin Cífka, Josef Sivic, Vladimir Petrik
Title: AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment
Abstract: Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose minimizing the feature discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously. Third, we report extensive experiments on four datasets (YCB-V, T-LESS, ITODD-MV, HouseCat6D) using the BOP benchmark evaluation and show that AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are readily available in practice.
PaperID: 1725,   Poster  https://arxiv.org/pdf/2602.24084    
Authors: Matteo Ballegeer, Dries Benoit
Title: FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting
Abstract: Learning directly from boundary representations (B-reps) has significantly advanced 3D CAD analysis. However, state-of-the-art B-rep learning methods rely on absolute coordinates and normals to encode global context, making them highly sensitive to rotations. Our experiments reveal that models achieving over 95% accuracy on aligned benchmarks can collapse to as low as 10% under arbitrary SO(3) rotations. To address this, we introduce FoV-Net, the first B-rep learning framework that captures both local surface geometry and global structural context in a rotation-invariant manner. Each face is represented by a Local Reference Frame (LRF) UV-grid that encodes its local surface geometry, and by Field-of-View (FoV) grids that capture the surrounding 3D context by casting rays and recording intersections with neighboring faces. Lightweight CNNs extract per-face features, which are propagated over the B-rep graph using a graph attention network. FoV-Net achieves state-of-the-art performance on B-rep classification and segmentation benchmarks, demonstrating robustness to arbitrary rotations while also requiring less training data to achieve strong results.
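The rotation sensitivity described in this abstract can be illustrated with a toy example. The sketch below is not FoV-Net and does not use its LRF or FoV grids; it only shows, for a single rotation about the z-axis, that absolute coordinates change while a relative descriptor (sorted pairwise distances) does not. All function and variable names are hypothetical.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def pairwise_distances(points):
    """Sorted pairwise Euclidean distances: a simple rotation-invariant descriptor."""
    return sorted(
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )

# Hypothetical sample points standing in for geometry sampled from a CAD model.
points = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
rotated = [rotate_z(p, math.pi / 3) for p in points]

# Absolute coordinates differ after rotation...
coords_changed = any(
    not math.isclose(a, b, abs_tol=1e-9)
    for p, q in zip(points, rotated)
    for a, b in zip(p, q)
)
# ...while the relative descriptor is unchanged up to floating-point error.
distances_preserved = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(pairwise_distances(points), pairwise_distances(rotated))
)
```

A model fed the raw coordinates sees a different input after rotation, whereas one fed the distance descriptor sees the same input, which is the general motivation for rotation-invariant encodings.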
PaperID: 1726,   Poster  https://arxiv.org/pdf/2511.00511    
Authors: Panwang Pan, Jingjing Zhao, Yuchen Lin, Chenguo Lin, Chenxin Li, Hengyu Liu, Tingting Shen, Yadong Mu
Title: ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation
Abstract: Significant progress has been achieved in high-fidelity video synthesis, yet current paradigms often fall short in effectively integrating identity information from multiple subjects. This leads to semantic conflicts and suboptimal performance in preserving identities and interactions, limiting controllability and applicability. To tackle this issue, we introduce ID-Crafter, a framework for multi-subject video generation that achieves superior identity preservation and semantic coherence. ID-Crafter integrates three key components: (i) a hierarchical identity-preserving attention mechanism that progressively aggregates features at intra-subject, inter-subject, and cross-modal levels; (ii) a semantic understanding module powered by a pretrained Vision-Language Model (VLM) to provide fine-grained guidance and capture complex inter-subject relationships; and (iii) an online reinforcement learning phase to further refine the model for critical concepts. Furthermore, we construct a new dataset to facilitate robust training and evaluation. Extensive experiments demonstrate that ID-Crafter establishes new state-of-the-art performance on multi-subject video generation benchmarks, excelling in identity preservation, temporal consistency, and overall video quality.
PaperID: 1727,   Poster  https://arxiv.org/pdf/2508.03173    
Authors: Jingxuan Wei, Caijun Jia, Qi Chen, Honghao He, Linzhuang Sun, Conghui He, Lijun Wu, Bihui Yu, Cheng Tan
Title: Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions
Abstract: Mathematical geometric reasoning is essential for scientific discovery and educational development, requiring precise logic and rigorous formal verification. While recent advances in Multimodal Large Language Models (MLLMs) have improved reasoning tasks, existing models typically struggle with formal geometric reasoning, particularly when dynamically constructing and verifying auxiliary geometric elements. To address these challenges, we introduce Geoint-R1, a multimodal reasoning framework designed to generate formally verifiable geometric solutions from textual descriptions and visual diagrams. Geoint-R1 uniquely integrates auxiliary elements construction, formal reasoning represented via Lean4, and interactive visualization. To systematically evaluate and advance formal geometric reasoning, we propose the Geoint benchmark, comprising 1,885 rigorously annotated geometry problems across diverse topics such as plane, spatial, and solid geometry. Each problem includes structured textual annotations, precise Lean4 code for auxiliary constructions, and detailed solution steps verified by experts. Extensive experiments demonstrate that Geoint-R1 significantly surpasses existing multimodal and math-specific reasoning models, particularly on challenging problems requiring explicit auxiliary element constructions.
PaperID: 1728,   Poster  https://arxiv.org/pdf/2511.18673    
Authors: Yiqing Shi, Yiren Song, Mike Zheng Shou
Title: Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
Abstract: Recent advances in diffusion transformers have shown remarkable generalization in visual synthesis, yet most dense perception methods still rely on text-to-image (T2I) generators designed for stochastic generation. We revisit this paradigm and show that image editing diffusion models are inherently image-to-image consistent, providing a more suitable foundation for dense perception tasks. We introduce Edit2Perceive, a unified diffusion framework that adapts editing models for depth, normal, and matting. Built upon the FLUX.1 Kontext architecture, our approach employs full-parameter fine-tuning and a pixel-space consistency loss to enforce structure-preserving refinement across intermediate denoising states. Moreover, our single-step deterministic inference yields up to faster runtime while training on relatively small datasets. Extensive experiments demonstrate comprehensive state-of-the-art results across all three tasks, revealing the strong potential of editing-oriented diffusion transformers for geometry-aware perception.
PaperID: 1729,   Poster  https://arxiv.org/pdf/2602.24020    
Authors: Xiang Feng, Xiangbo Wang, Tieshi Zhong, Chengkai Wang, Yiting Zhao, Tianxiang Xu, Zhenzhong Kuang, Feiwei Qin, Xuefei Yin, Yanming Zhu
Title: SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting
Abstract: 3D super-resolution (3DSR) aims to reconstruct high-resolution (HR) 3D scenes from low-resolution (LR) multi-view images. Existing methods rely on dense LR inputs and per-scene optimization, which restricts the high-frequency priors for constructing HR 3D Gaussian Splatting (3DGS) to those inherited from pretrained 2D super-resolution (2DSR) models. This severely limits reconstruction fidelity, cross-scene generalization, and real-time usability. We propose to reformulate 3DSR as a direct feed-forward mapping from sparse LR views to HR 3DGS representations, enabling the model to autonomously learn 3D-specific high-frequency geometry and appearance from large-scale, multi-scene data. This fundamentally changes how 3DSR acquires high-frequency knowledge and enables robust generalization to unseen scenes. Specifically, we introduce SR3R, a feed-forward framework that directly predicts HR 3DGS representations from sparse LR views via the learned mapping network. To further enhance reconstruction fidelity, we introduce Gaussian offset learning and feature refinement, which stabilize reconstruction and sharpen high-frequency details. SR3R is plug-and-play and can be paired with any feed-forward 3DGS reconstruction backbone: the backbone provides an LR 3DGS scaffold, and SR3R upscales it to an HR 3DGS. Extensive experiments across three 3D benchmarks demonstrate that SR3R surpasses state-of-the-art (SOTA) 3DSR methods and achieves strong zero-shot generalization, even outperforming SOTA per-scene optimization methods on unseen scenes. Codes will be released upon publication.
PaperID: 1730,   Poster  https://arxiv.org/pdf/2512.04660    
Authors: Juntong Wang, Wang Jiarui, Huiyu Duan, Jiaxiang Kang, Guangtao Zhai, Xiongkuo Min
Title: I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models
Abstract: Image editing models are advancing rapidly, yet comprehensive evaluation remains a significant challenge. Existing image editing benchmarks generally suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations, which significantly constrain their scalability and practical applicability. To address this, we propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features (i) diverse tasks, encompassing 10 task categories across both single-image and multi-image editing tasks, (ii) comprehensive evaluation dimensions, including 30 decoupled and fine-grained evaluation dimensions with automated hybrid evaluation methods that integrate specialized tools and large multimodal models (LMMs), and (iii) rigorous alignment validation, justifying the consistency between our benchmark evaluations and human preferences. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions. We will open-source all components of I2I-Bench to facilitate future research.
PaperID: 1731,   Poster  https://arxiv.org/pdf/2603.02802    
Authors: Tianlin Pan, Jiayi Dai, Chenpu Yuan, Zhengyao Lv, Binxin Yang, Hubery Yin, Chen Li, Jing LYU, Caifeng Shan, Chenyang Si
Title: NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing
Abstract: Recent video editing models have achieved impressive results, but most still require large-scale paired datasets. Collecting such naturally aligned pairs at scale remains highly challenging and constitutes a critical bottleneck, especially for local video editing data. Existing workarounds transfer image editing to video through global motion control for pair-free video editing, but such designs struggle with background and temporal consistency. In this paper, we propose NOVA: Sparse Control & Dense Synthesis, a new framework for video editing. Specifically, the sparse branch provides semantic guidance through user-edited keyframes distributed across the video, and the dense branch continuously incorporates motion and texture information from the original video to maintain high fidelity and coherence. Moreover, we introduce a degradation-simulation training strategy that enables the model to learn motion reconstruction and temporal consistency by training on artificially degraded videos, thus eliminating the need for paired data. Our extensive experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.
PaperID: 1732,   Poster  https://arxiv.org/pdf/2511.17583    
Authors: Chenrui Ma, Xi Xiao, Tianyang Wang, Xiao Wang, Yanning Shen
Title: Learning Straight Flows: Variational Flow Matching for Efficient Generation
Abstract: Flow Matching has limited ability in achieving one-step generation due to its reliance on learned curved trajectories. Previous studies have attempted to address this limitation by either modifying the coupling distribution to prevent interpolant intersections or introducing consistency and mean-velocity modeling to promote straight trajectory learning. However, these approaches often suffer from discrete approximation errors, training instability, and convergence difficulties. To tackle these issues, in the present work, we propose Straight Variational Flow Matching (S-VFM), which integrates a variational latent code representing the "generation overview" into the Flow Matching framework. S-VFM explicitly enforces trajectory straightness, ideally producing linear generation paths. The proposed method achieves competitive performance across three challenging benchmarks and demonstrates advantages in both training and inference efficiency compared with existing methods.
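The link between straight trajectories and one-step generation can be seen in a small numerical sketch. This is not S-VFM; it only compares Euler integration of a constant velocity field (a straight, constant-speed path from `x0` to `x1`) with a hypothetical time-varying field that reaches the same endpoint. The fields, points, and function names are illustrative assumptions.

```python
def euler_integrate(x_start, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with explicit Euler steps."""
    x = list(x_start)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical source sample and target sample.
x0 = [0.0, 0.0]
x1 = [2.0, -1.0]

def straight_velocity(x, t):
    """Constant velocity x1 - x0: the path is x_t = (1 - t) * x0 + t * x1."""
    return [b - a for a, b in zip(x0, x1)]

def varying_velocity(x, t):
    """Time-varying velocity 2t * (x1 - x0); its exact flow also ends at x1."""
    return [2.0 * t * (b - a) for a, b in zip(x0, x1)]

one_step_straight = euler_integrate(x0, straight_velocity, num_steps=1)
one_step_varying = euler_integrate(x0, varying_velocity, num_steps=1)
many_step_varying = euler_integrate(x0, varying_velocity, num_steps=1000)
```

With a constant velocity, a single Euler step lands exactly on `x1`. With the time-varying field, one step uses the velocity at `t = 0` and does not move at all, and many small steps are needed to approach `x1`.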
PaperID: 1733,   Poster  https://arxiv.org/pdf/2511.16825    
Authors: Dilin Wang, Hyunyoung Jung, Tom Monnier, Kihyuk Sohn, Chuhang Zou, Xiaoyu Xiang, Yu-Ying Yeh, Di Liu, Zixuan Huang, Thu Nguyen-Phuoc, Yuchen Fan, Sergiu Oprea, Ziyan Wang, Roman Shapovalov, Nikolaos Sarafianos, Thibault Groueix, Antoine Toisoul, Prithviraj Dhar, Xiao Chu, Minghao Chen, Geon Yeong Park, Rakesh Ranjan, Andrea Vedaldi
Title: WorldGen: From Text to Traversable and Interactive 3D Worlds
Abstract: We introduce WorldGen, a method for generating large, fully formed, navigable 3D worlds from a single text prompt. Existing approaches to 3D scene generation often trade off scene diversity, completeness, and correctness in different ways. We push this envelope by producing large scenes explicitly decomposed into individual, high-quality 3D meshes, making them compatible with standard game engines. Our approach first uses a language-driven procedural generator to lay out the scene's basic volumes and navigable regions. An image generator then establishes the scene's theme, style, and details. Next, we obtain a high-quality, compositional 3D reconstruction of the planned scene. This step first uses an image-to-3D model to perform a holistic reconstruction that implicitly determines the shape and location of all scene objects, accounting for context and navigability. The reconstruction is then decomposed into individual entities, which are regenerated at higher resolution, synthesizing additional details with guidance from the image generator. We ablate key design choices and compare qualitatively against existing scene generators, showing that our design addresses many of their common challenges.
PaperID: 1734,   Poster  https://arxiv.org/pdf/2603.07898    
Authors: Chen-Chen Zong, YuQi Chi, Xie-Yang Wang, Yan Cui, Sheng-Jun Huang
Title: Revisiting Unknowns: Towards Effective and Efficient Open-Set Active Learning
Abstract: Open-set active learning (OSAL) aims to identify informative samples for annotation when unlabeled data may contain previously unseen classes, a common challenge in safety-critical and open-world scenarios. Existing approaches typically rely on separately trained open-set detectors, introducing substantial training overhead and overlooking the supervisory value of labeled unknowns for improving known-class learning. In this paper, we propose E^2OAL (Effective and Efficient Open-set Active Learning), a unified and detector-free framework that fully exploits labeled unknowns for both stronger supervision and more reliable querying. E^2OAL first uncovers the latent class structure of unknowns through label-guided clustering in a frozen contrastively pre-trained feature space, optimized by a structure-aware F1-product objective. To leverage labeled unknowns, it employs a Dirichlet-calibrated auxiliary head that jointly models known and unknown categories, improving both confidence calibration and known-class discrimination. Building on this, a logit-margin purity score estimates the likelihood of known classes to construct a high-purity candidate pool, while an OSAL-specific informativeness metric prioritizes partially ambiguous yet reliable samples. These components together form a flexible two-stage query strategy with adaptive precision control and minimal hyperparameter sensitivity. Extensive experiments across multiple OSAL benchmarks demonstrate that E^2OAL consistently surpasses state-of-the-art methods in accuracy, efficiency, and query precision, highlighting its effectiveness and practicality for real-world applications.
PaperID: 1735,   Poster  https://arxiv.org/pdf/2603.25985    
Authors: Qirui Wu, Mohd Yawar Nihal Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Richard Newcombe, Angel Xuan Chang, Jakob Engel, Henry Howard-Jenkins
Title: JRM: Joint Reconstruction Model for Multiple Objects without Alignment
Abstract: Object-centric reconstruction seeks to recover the 3D structure of a scene through composition of independent objects. While this independence can simplify modeling, it discards strong signals that could improve reconstruction, notably repetition where the same object model is seen multiple times in a scene, or across scans. We propose the Joint Reconstruction Model (JRM) to leverage repetition by framing object reconstruction as one of personalized generation: multiple observations share a common subject that should be consistent for all observations, while still adhering to the specific pose and state from each. Prior methods in this direction rely on explicit matching and rigid alignment across observations, making them sensitive to errors and difficult to extend to non-rigid transformations. In contrast, JRM is a 3D flow-matching generative model that implicitly aggregates unaligned observations in its latent space, learning to produce consistent and faithful reconstructions in a data-driven manner without explicit constraints. Evaluations on synthetic and real-world data show that JRM’s implicit aggregation removes the need for explicit alignment, improves robustness to incorrect associations, and naturally handles non-rigid changes such as articulation. Overall, JRM outperforms both independent and alignment-based baselines in reconstruction quality.
PaperID: 1736,   Poster  https://arxiv.org/pdf/2604.10950    
Authors: Jihun Kim, Hoyong Kwon, Hyeokjun Kweon, Kuk-Jin Yoon
Title: Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation
Abstract: Fully supervised Video Semantic Segmentation (VSS) relies heavily on densely annotated video data, limiting practical applicability. Alternatively, applying pretrained Image Semantic Segmentation (ISS) models frame-by-frame avoids annotation costs but ignores crucial temporal coherence. Recent foundation models such as SAM2 enable high-quality mask propagation yet remain impractical for direct VSS due to limited semantic understanding and computational overhead. In this paper, we propose DiTTA (Distillation-assisted Test-Time Adaptation), a novel framework that converts an ISS model into a temporally-aware VSS model through efficient test-time adaptation (TTA), without annotated videos. DiTTA distills SAM2's temporal segmentation knowledge into the ISS model during a brief, single-pass initialization phase, complemented by a lightweight temporal fusion module to aggregate cross-frame context. Crucially, DiTTA achieves robust generalization even when adapting with highly limited partial video snippets (e.g., initial 10%), significantly outperforming zero-shot refinement approaches that repeatedly invoke SAM2 during inference. Extensive experiments on VSPW and Cityscapes demonstrate DiTTA’s effectiveness, achieving competitive or superior performance relative to fully-supervised VSS methods, thus providing a practical and annotation-free solution for real-world VSS tasks.
PaperID: 1737,   Poster  https://arxiv.org/pdf/2504.00952    
Authors: Kumar Kshitij Patel, Bingqing Jiang, A F M Mahfuzul Kabir, Weitong Zhang, Difan Zou, Lingxiao Wang
Title: Personalized Federated Training of Diffusion Models with Privacy Guarantees
Abstract: We propose a federated framework for training diffusion models on decentralized and private datasets. The method learns a shared generative model together with personalized client models, which allows clients to benefit from cross-client structure while ensuring that the shared model cannot reproduce any client’s data on its own. We provide formal differential privacy guarantees for each client and establish utility bounds for conditional generation under a Gaussian mixture model, showing that collaboration improves sample quality relative to private non-collaborative training. Experiments on CIFAR-10, Colorized MNIST, and CelebA support these results: the method generates high-fidelity samples, improves performance on minority and underrepresented classes, and maintains strong protection against membership inference, memorization, and reconstruction attacks.
PaperID: 1738,   Poster  https://arxiv.org/pdf/2512.13465    
Authors: Ruiyan Wang, Teng Hu, Kaihui Huang, Zihan Su, Ran Yi, Lizhuang Ma
Title: PoseAnything: Universal Pose-guided Video Generation with Part-aware Temporal Coherence
Abstract: Pose-guided video generation refers to controlling the motion of subjects in generated video through a sequence of poses. It enables precise control over subject motion and has important applications in animation. However, current pose-guided video generation methods are limited to accepting only human poses as input, thus generalizing poorly to poses of other subjects. To address this issue, we propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters, supporting arbitrary skeletal inputs. To enhance consistency preservation during motion, we introduce a Part-aware Temporal Coherence Module, which divides the subject into different parts, establishes part correspondences, and computes cross-attention between corresponding parts across frames to achieve fine-grained part-level consistency. Additionally, we propose Subject and Camera Motion Decoupled CFG, a novel guidance strategy that, for the first time, enables independent camera movement control in pose-guided video generation, by separately injecting subject and camera motion control information into the positive and negative anchors of CFG. Furthermore, we present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering. Extensive experiments demonstrate that PoseAnything significantly outperforms state-of-the-art methods in both effectiveness and generalization.
PaperID: 1739,   Poster  https://arxiv.org/pdf/2603.28162    
Authors: Bingchen Li, Zhixin Wang, Fan Li, Jiaqi Xu, Jiaming Guo, Renjing Pei, Xin Li, Zhibo Chen
Title: ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization
Abstract: Old photos preserve invaluable historical memories, making their restoration and colorization highly desirable. While existing restoration models can address some degradation issues like denoising and scratch removal, they often struggle with accurate colorization. This limitation arises from the unique degradation inherent in old photos, such as faded brightness and altered color hues, which are different from modern photo distributions, creating a substantial domain gap during colorization. In this paper, we propose a novel old photo colorization framework based on the generative diffusion model FLUX. Our approach introduces a structure-color decoupling strategy that separates structure preservation from color restoration, enabling accurate colorization of old photos while maintaining structural consistency. We further enhance the model with a progressive Direct Preference Optimization (Pro-DPO) strategy, which allows the model to learn subtle color preferences through coarse-to-fine transitions in color augmentation. Additionally, we address the limitations of text-based prompts by introducing visual semantic prompts, which extract fine-grained semantic information directly from old photos, helping to eliminate the color bias inherent in old photos. Experimental results on both synthetic and real datasets demonstrate that our approach outperforms existing state-of-the-art colorization methods, including closed-source commercial models, producing high-quality and vivid colorization.
PaperID: 1740,   Poster  https://arxiv.org/pdf/2604.02603    
Authors: Kunzhe Song, Geo Zhou, Xiaoming Liu, Huacheng Zeng
Title: Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals
Abstract: Robust 3D environmental perception is critical for applications like autonomous navigation and robotics, yet existing optical sensors like cameras and LiDAR fail in adverse conditions such as smoke, fog, and non-ideal lighting. While specialized radar systems can operate in these conditions, their reliance on bespoke, ultra-wideband hardware and licensed spectrum limits their scalability and cost-effectiveness. This paper introduces Rascene, a novel framework that enables high-fidelity 3D imaging by repurposing ubiquitous mmWave OFDM communication signals. Recognizing that a single-frame RF signal is inherently sparse, noisy, and highly ambiguous, the key innovation of Rascene is a multi-frame 3D imaging framework designed to fuse information from signals captured across multiple, arbitrary poses. This framework leverages a spatially adaptive fusion mechanism to find geometric consensus across the multiple views, effectively suppressing multipath artifacts while preserving sparse geometric details. Experiments demonstrate that our method reconstructs 3D scenes with high precision, providing a new pathway for low-cost, scalable, and robust 3D perception.
PaperID: 1741,   Poster  https://arxiv.org/pdf/2512.02425    
Authors: Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang
Title: WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Abstract: Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.
PaperID: 1742,   Poster  https://arxiv.org/pdf/2602.19180    
Authors: Wenhao Shen, Hao Wang, Wanqi Yin, Fayao Liu, Xulei Yang, Chao Liang, Zhongang Cai, Guosheng Lin
Title: VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery
Abstract: Human mesh recovery (HMR) from a single RGB image is inherently ambiguous, as multiple 3D poses can correspond to the same 2D observation. Recent probabilistic and diffusion-based methods tackle this ambiguity by generating various hypotheses, but often sacrifice accuracy. They yield predictions that are either physically implausible or drift from the input image, especially under occlusion or in cluttered, in-the-wild scenes. To address this issue, we introduce a dual-memory augmented HMR critique agent with self-reflection to produce context-aware quality scores for predicted meshes. These scores distill fine-grained cues about 3D human motion structure, physical feasibility, and alignment with the input image. We use these scores to build a group-wise HMR preference dataset. Building upon this dataset, we propose a group preference alignment framework for finetuning diffusion-based HMR models. This process injects the rich preference signals into the model, guiding it to generate more physically plausible and image-consistent human meshes. Extensive experiments demonstrate that our method achieves superior performance compared to state-of-the-art probabilistic HMR approaches.
PaperID: 1743,   Poster  https://arxiv.org/pdf/2501.05264    
Authors: Mengshi Qi, Jiaxuan Peng, Xianlin Zhang, Huadong Ma
Title: Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation
Abstract: 3D human pose estimation (3D HPE) has emerged as a prominent research topic, particularly in the realm of RGB-based methods. However, the use of RGB images is often limited by issues such as occlusion and privacy constraints. Consequently, multi-modal sensing, which leverages non-intrusive sensors, is gaining increasing attention. Nevertheless, multi-modal 3D HPE still faces challenges, including modality imbalance. In this work, we introduce a novel balanced multi-modal learning method for 3D HPE, which harnesses the power of RGB, LiDAR, mmWave, and WiFi. Specifically, we propose a Shapley value-based contribution algorithm to assess the contribution of each modality and detect modality imbalance. To address this imbalance, we design a modality learning regulation strategy that decelerates the learning process during the early stages of training. We conduct extensive experiments on the widely adopted multi-modal dataset, MM-Fi, demonstrating the superiority of our approach in enhancing 3D pose estimation under complex conditions. We will release our codes soon.
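To make the idea of a Shapley value-based modality contribution concrete, the sketch below computes exact Shapley values over the four modalities named in the abstract. It is a generic illustration, not the paper's algorithm: the coalition value function `toy_value` and the scores in `STANDALONE` are invented placeholders, where a real system would presumably evaluate model quality on each modality subset.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley value of each player for a coalition value function."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):
            # Weight of a coalition of size r that does not contain p.
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result

MODALITIES = ["rgb", "lidar", "mmwave", "wifi"]

# Hypothetical standalone scores, invented for illustration only.
STANDALONE = {"rgb": 0.6, "lidar": 0.5, "mmwave": 0.3, "wifi": 0.2}

def toy_value(subset):
    """Hypothetical coalition value: best standalone score plus a small fusion bonus."""
    if not subset:
        return 0.0
    return max(STANDALONE[m] for m in subset) + 0.05 * (len(subset) - 1)

contributions = shapley_values(MODALITIES, toy_value)
```

By the efficiency property, the contributions sum to the value of using all modalities together, so a modality whose share is much larger than the others signals the kind of imbalance the abstract refers to.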
PaperID: 1744,   Poster  https://arxiv.org/pdf/2603.29842    
Authors: Minyoung Kim, Dae Hee Yun, Aditi Patel, Madeline Hon, Webster Guan, Taegeon Lee, Brian Nguyen
Title: Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data
Abstract: Unprecedented visual details of biological structures are being revealed by subcellular-resolution whole-brain 3D microscopy data, enabled by recent advances in intact tissue processing and light-sheet fluorescence microscopy (LSFM). These volumetric data offer rich morphological and spatial cellular information; however, the lack of scalable data processing and analysis methods tailored to these petabyte-scale data poses a substantial challenge for accurate interpretation. Further, existing models for visual tasks such as object detection and classification struggle to generalize to this type of data. To accelerate the development of suitable methods and foundational models, we present CANVAS, a comprehensive set of high-resolution whole mouse brain LSFM benchmark data, encompassing six neuronal and immune cell-type markers, along with cell annotations and a leaderboard. We also demonstrate challenges in generalization of baseline models built on existing architectures, especially due to the heterogeneity in cellular morphology across phenotypes and anatomical locations in the brain. To the best of our knowledge, CANVAS is the first and largest LSFM benchmark that captures intact mouse brain tissue at subcellular level, and includes extensive annotations of cells throughout the brain.
PaperID: 1745,   Poster  https://arxiv.org/pdf/2506.05328    
Authors: Lidong Lu, Guo Chen, Zhu Wei, Zhiqi Li, Yicheng Liu, Tong Lu
Title: AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
Abstract: Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually-annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves SOTA results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments reveal that on out-of-domain benchmarks, reasoning in the language space offers limited performance gains, suggesting the need for more robust cross-domain reasoning mechanisms.
PaperID: 1746,   Poster  https://arxiv.org/pdf/2511.19945    
Authors: Junsung Lee, Hyunsoo Lee, Yong Jae Lee, Bohyung Han
Title: Low-Resolution Editing is All You Need for High-Resolution Editing
Abstract: High-resolution content creation is rapidly emerging as a central challenge in both the vision and graphics communities. While images serve as the most fundamental modality for visual expression, content generation that aligns with user intent requires effective, controllable high-resolution image manipulation mechanisms. However, existing approaches remain limited to low-resolution settings, typically supporting only up to 1K resolution. In this work, we introduce the task of high-resolution image editing and propose a test-time optimization framework to address it. Our method performs patch-wise optimization on high-resolution source images, followed by a fine-grained detail transfer module and a novel synchronization strategy to maintain consistency across patches. Extensive experiments show that our method produces high-quality edits, paving the way toward high-resolution content creation.
PaperID: 1747,   Poster  https://arxiv.org/pdf/2603.17583    
Authors: SeongRae Noh, SeungWon Seo, Gyeong-Moon Park, HyeongYeop Kang
Title: Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing
Abstract: Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task. We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space. Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations. By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility, three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.
PaperID: 1748,   Poster  https://arxiv.org/pdf/2603.21077    
Authors: Nan Zhou, Huiqun Wang, Yaoyan Zheng, Di Huang
Title: CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in cross-modal perception and reasoning, yet a fundamental question remains unresolved: should the vision encoder be fine-tuned or frozen? Despite the success of models such as LLaVA and Qwen-VL, inconsistent design choices and heterogeneous training setups hinder a unified understanding of visual fine-tuning (VFT) in MLLMs. Through a configuration-aligned benchmark, we find that existing VFT methods fail to consistently outperform the frozen baseline across multimodal tasks. Our analysis suggests that this instability arises from visual preference conflicts, where the context-agnostic nature of vision encoders induces divergent parameter updates under diverse multimodal contexts. To address this issue, we propose the Context-aware Visual Fine-tuning (CoVFT) framework, which explicitly incorporates multimodal context into visual adaptation. By integrating a Context Vector Extraction (CVE) module and a Contextual Mixture-of-Experts (CoMoE) module, CoVFT decomposes conflicting optimization signals and enables stable, context-sensitive visual updates. Extensive experiments on 12 multimodal benchmarks demonstrate that CoVFT achieves state-of-the-art performance with superior stability. Notably, fine-tuning a 7B MLLM with CoVFT surpasses the average performance of its 13B counterpart, revealing substantial untapped potential in visual encoder optimization within MLLMs. Code will be released.
PaperID: 1749,   Poster  https://arxiv.org/pdf/2507.13861    
Authors: Junjie Hu, Tianyang Han, Kai Ma, Jialin Gao, Yang Song, Xianhua He, Junfeng Luo, Xiaoming Wei, Wenqiang Zhang
Title: PositionIC: Unified Position and Identity Consistency for Image Customization
Abstract: Recent subject-driven image customization excels in fidelity, yet fine-grained instance-level spatial control remains an elusive challenge, hindering real-world applications. This limitation stems from two factors: a scarcity of scalable, position-annotated datasets, and the entanglement of identity and layout by global attention mechanisms. To this end, we introduce PositionIC, a unified framework for high-fidelity, spatially controllable multi-subject customization. First, we present BMPDS, the first automatic data-synthesis pipeline for position-annotated multi-subject datasets, effectively providing crucial spatial supervision. Second, we design a lightweight, layout-aware diffusion framework that integrates a novel visibility-aware attention mechanism. This mechanism explicitly models spatial relationships via a NeRF-inspired volumetric weight regulation to effectively decouple instance-level spatial embeddings from semantic identity features, enabling precise, occlusion-aware placement of multiple subjects. Extensive experiments demonstrate that PositionIC achieves state-of-the-art performance on public benchmarks, setting new records for spatial precision and identity consistency. Our work represents a significant step towards truly controllable, high-fidelity image customization in multi-entity scenarios. Code and data will be publicly released.
PaperID: 1750,   Poster  https://arxiv.org/pdf/2603.16284    
Authors: Tiantian Dang, Chao Bi, Shufan Shen, Jinzhe Liu, Qingming Huang, Shuhui Wang
Title: Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation
Abstract: Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among the hallucination mitigation methods, feature steering emerges as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose a plug-and-play framework called Locate-Then-Sparsify for Feature Steering (LTS-FS), which controls the steering intensity according to the hallucination relevance of each layer. We first construct a synthetic dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers. Extensive experiments across multiple LVLMs and benchmarks demonstrate that our LTS-FS framework effectively mitigates hallucination while preserving strong performance on general LVLM benchmarks. Code is provided in the supplementary material.
PaperID: 1751,   Poster  https://arxiv.org/pdf/2603.23906    
Authors: Yang yuhuan, Xianwei Zhuang, Yuxuan Cai, Chaofan Ma, Shuai Bai, Jiangchao Yao, Ya Zhang, Junyang Lin, Yanfeng Wang
Title: GenMask: Adapting DiT for Segmentation via Direct Mask Generation
Abstract: Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow and limit adaptation. In this paper, we argue that instead of indirect adaptation, segmentation tasks should be trained directly in a generative manner. We identify a key obstacle to this unified formulation: VAE latents of binary masks are sharply distributed, noise-robust, and linearly separable, distinct from natural image latents. To bridge this gap, we introduce a timestep sampling strategy for binary masks that emphasizes extreme noise levels for segmentation and moderate noise for image generation, enabling harmonious joint training. We present GenMask, a DiT trained to generate black-and-white segmentation masks as well as colorful images in RGB space under the original generative objective. GenMask preserves the original DiT architecture while removing the need for feature extraction pipelines tailored for segmentation tasks. Empirically, GenMask attains state-of-the-art performance on referring and reasoning segmentation benchmarks, and ablations quantify the contribution of each component.
PaperID: 1752,   Poster  https://arxiv.org/pdf/2603.07071    
Authors: Xueqing Yu, Bohan Li, Yan Li, Zhenheng Yang
Title: VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding
Abstract: Recent Vision-Language Models (VLMs) have made remarkable progress in multimodal understanding tasks, yet their evaluation on long video understanding remains unreliable. Due to limited frame inputs, key frames necessary for answering the question may be missing from the model’s input. However, models that truthfully refuse to answer under such uncertainty are marked as incorrect, while those that guess may coincidentally produce the correct answer and thus obtain deceptively higher accuracy, leading to misleading evaluation results and encouraging models to guess rather than respond honestly. To address this issue, we introduce VirtueBench, a benchmark explicitly designed to assess model trustworthiness under uncertainty. VirtueBench constructs multiple frame-sampling levels for each video and provides ground truths that distinguish between answerable and unanswerable cases. Evaluations on 25 open-source and commercial VLMs reveal distinct refusal behaviors across different model families, with refusal accuracy ranging from over 70% in the best models to nearly 0% in the worst. Moreover, most models exhibit a substantial drop in refusal when the prompt does not explicitly require them to do so. These findings highlight the need for developing trustworthy VLMs for multimodal understanding, guided by benchmarks and leaderboards that emphasize reliability and trustworthiness.
PaperID: 1753,   Poster  https://arxiv.org/pdf/2603.18510    
Authors: Hongjia Zhai, Qi Zhang, Xiaokun Pan, Xiyu Zhang, Yitong Dong, Huaqi Zhang, Dan Xu, Guofeng Zhang
Title: OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting
Abstract: Open-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build a locally consistent map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within the sliding window into complete instances. Subsequently, to update the global map, we construct explicit spatial attribute grids for the local 3D Gaussian map and fuse them into the global map via robust bidirectional bipartite 3D Gaussian instance matching. Finally, we utilize the fused VLM features inside the 3D spatial attribute grids to achieve open-vocabulary scene understanding. Extensive experiments on widely used datasets demonstrate that our method performs better than existing online approaches while maintaining real-time efficiency.
PaperID: 1754,   Poster  https://arxiv.org/pdf/2510.18362    
Authors: Duoxun Tang, Xi Xiao, Guangwu Hu, Kangkang Sun, Xiao Yang, Dongyang Chen, Qing Li, Yong-jie Yin, Jiyao Wang
Title: FeatureFool: Zero-Query Fooling of Video Models via Feature Map
Abstract: The vulnerability of deep neural networks (DNNs) has been preliminarily verified. Existing black-box adversarial attacks usually require multi-round interaction with the model and consume numerous queries, which is impractical in the real world and hard to scale to recently emerged Video-LLMs. Moreover, no attack in the video domain directly leverages feature maps to shift the clean-video feature space. We therefore propose FeatureFool, a stealthy, video-domain, zero-query black-box attack that utilizes information extracted from a DNN to alter the feature space of clean videos. Unlike query-based methods that rely on iterative interaction, FeatureFool performs a zero-query attack by directly exploiting DNN-extracted information. This efficient approach is unprecedented in the video domain. Experiments show that FeatureFool achieves an attack success rate above 70% against traditional video classifiers without any queries. Benefiting from the transferability of the feature map, it can also craft harmful content and bypass Video-LLM recognition. Additionally, adversarial videos generated by FeatureFool exhibit high quality in terms of SSIM, PSNR, and Temporal-Inconsistency, making the attack barely perceptible. This paper may contain violent or explicit content.
PaperID: 1755,   Poster  https://arxiv.org/pdf/2511.21732    
Authors: Jiajun Zhang, Shijia Luo, Ruikang Zhang, Qi Su
Title: HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation
Abstract: Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation. Although producing humor requires complex cognitive reasoning and social understanding, theories of humor suggest that it follows learnable patterns and structures, making it theoretically possible for generative models to acquire them implicitly. In recent years, multimodal humor has become a prevalent form of online communication, especially among Gen Z, highlighting the need for AI systems capable of integrating visual understanding with humorous language generation. However, existing data-driven approaches lack explicit modeling or theoretical grounding of humor, often producing literal descriptions that fail to capture its underlying cognitive mechanisms, resulting in generated image descriptions that are fluent but lack genuine humor or cognitive depth. To address this limitation, we propose HUMORCHAIN (HUmor-guided Multi-step Orchestrated Reasoning Chain for Image Captioning), a theory-guided multi-stage reasoning framework. It integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation, forming an interpretable and controllable cognitive reasoning chain. To the best of our knowledge, this is the first work to explicitly embed cognitive structures from humor theories into multimodal humor generation, enabling a structured reasoning process from visual understanding to humor creation. Experiments on Meme-Image-No-Text, Oogiri-GO, and OxfordTVG-HIC datasets show that HUMORCHAIN outperforms state-of-the-art baselines in human humor preference, Elo/BT scores, and semantic diversity, demonstrating that theory-driven structured reasoning enables large language models to generate humor aligned with human perception.
PaperID: 1756,   Poster  https://arxiv.org/pdf/2509.18151    
Authors: Jindi Lv, Yuhao Zhou, Yuxin Tian, Qing Ye, Wentao Feng, Jiancheng Lv
Title: HyperNAS: Enhancing Architecture Representation for NAS Predictor via Hypernetwork
Abstract: Time-intensive performance evaluations significantly impede progress in Neural Architecture Search (NAS). To address this, neural predictors leverage surrogate models trained on proxy datasets, allowing for direct performance predictions for new architectures. However, these predictors often exhibit poor generalization due to their limited ability to capture intricate relationships among various architectures. In this paper, we propose HyperNAS, a novel neural predictor paradigm for enhancing architecture representation learning. HyperNAS consists of two primary components: a global encoding scheme and a shared hypernetwork. The global encoding scheme is devised to capture the comprehensive macro-structure information, while the shared hypernetwork serves as an auxiliary task to enhance the investigation of inter-architecture patterns. To ensure training stability, we further develop a dynamic adaptive multi-task loss to facilitate personalized exploration on the Pareto front. Extensive experiments across five representative search spaces, including ViTs, demonstrate the advantages of HyperNAS, particularly in few-shot scenarios. For instance, HyperNAS achieves new state-of-the-art results, with 97.60% top-1 accuracy on CIFAR-10 and 82.4% top-1 accuracy on ImageNet, using at least 5.0× fewer samples.
PaperID: 1757,   Poster  https://arxiv.org/pdf/2603.09377    
Authors: CHEN Yang, Xieyuanli Chen, Junxiang Li, Jie Tang, Tao Wu
Title: SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization
Abstract: Robust cross-view geo-localization (CVGL) remains challenging despite the surge in recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across diverse conditions, implicitly assuming all FoVs are equally difficult. To address this gap, we present SinGeo, a simple yet powerful framework that enables a single model to realize robust cross-view geo-localization without additional modules or explicit transformations. SinGeo employs a dual discriminative learning architecture that enhances intra-view discriminability within both ground and satellite branches, and is the first to introduce a curriculum learning strategy to achieve robust CVGL. Extensive evaluations on four benchmark datasets reveal that SinGeo sets state-of-the-art (SOTA) results under diverse conditions, and notably outperforms methods specifically trained for extreme FoVs. Beyond superior performance, SinGeo also exhibits cross-architecture transferability. Furthermore, we propose a consistency evaluation method to quantitatively assess model stability under varying views, providing an explainable perspective for understanding and advancing robustness in future CVGL research. Code will be available upon acceptance.
PaperID: 1758,   Poster  https://arxiv.org/pdf/2603.13133    
Authors: zihao xin, Wentong Li, Yixuan Jiang, Bin Wang, Runmin Cong, Jie Qin, Sheng-Jun Huang
Title: DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation
Abstract: Vision-and-Language Navigation (VLN) requires agents to follow long-horizon instructions and navigate complex 3D environments. However, existing approaches face two major challenges: constructing an effective long-term memory bank and overcoming the compounding errors problem. To address these issues, we propose DecoVLN, an effective framework designed for robust streaming perception and closed-loop control in long-horizon navigation. First, we formulate long-term memory construction as an optimization problem and introduce an adaptive refinement mechanism that selects frames from a historical candidate pool by iteratively optimizing a unified scoring function. This function jointly balances three key criteria: semantic relevance to the instruction, visual diversity from the selected memory, and temporal coverage of the historical trajectory. Second, to alleviate compounding errors, we introduce a state-action pair-level corrective finetuning strategy. By leveraging geodesic distance between states to precisely quantify deviation from the expert trajectory, the agent selectively collects high-quality state-action pairs in the trusted region while filtering out the polluted data with low relevance. This improves both the efficiency and stability of error correction. Extensive experiments demonstrate the effectiveness of our DecoVLN, and we have deployed it in real-world environments. Code and models will be released publicly.
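The memory-selection idea in the abstract above (one score balancing instruction relevance, diversity from the already selected memory, and temporal coverage) can be illustrated with a small greedy sketch. Everything concrete here is an assumption made for illustration: the function names, the cosine-similarity features, the weights `w_rel`, `w_div`, `w_cov`, and the greedy loop itself, which stands in for whatever iterative optimization DecoVLN actually uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_memory(frames, instruction, k, w_rel=1.0, w_div=0.5, w_cov=0.5):
    """Greedily pick k frames; frames is a list of (timestamp, feature_vector)."""
    times = [t for t, _ in frames]
    span = (max(times) - min(times)) or 1.0
    chosen = []
    while len(chosen) < min(k, len(frames)):
        best, best_score = None, -math.inf
        for i, (t, feat) in enumerate(frames):
            if i in chosen:
                continue
            relevance = cosine(feat, instruction)
            # Diversity: penalize similarity to the closest frame already kept.
            diversity = 1.0 - max(
                (cosine(feat, frames[j][1]) for j in chosen), default=0.0
            )
            # Coverage: reward temporal distance from the nearest kept frame.
            coverage = min((abs(t - frames[j][0]) for j in chosen), default=span) / span
            score = w_rel * relevance + w_div * diversity + w_cov * coverage
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return sorted(chosen)


# Toy candidate pool: three near-duplicate early frames and one distinct late frame.
frames = [(0, [1.0, 0.0]), (1, [1.0, 0.0]), (2, [0.9, 0.1]), (9, [0.6, 0.8])]
memory = select_memory(frames, instruction=[1.0, 0.0], k=2)
```

In this toy pool the second pick is the late, visually distinct frame rather than another near-duplicate of the first, which is the behavior the diversity and coverage terms are meant to encourage.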
PaperID: 1759,   Poster  https://arxiv.org/pdf/2507.12336    
Authors: Subin Jeon, In Cho, Junyoung Hong, Woong Cho, Seon Joo Kim
Title: Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors
Abstract: Most existing 3D keypoint estimation methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect. This paper introduces KeyDiff3D, a framework that can accurately predict 3D keypoints from a single image, thus eliminating the need for such expensive data acquisitions. To achieve this, we leverage powerful geometric priors embedded in a pretrained multi-view diffusion model. In our framework, the diffusion model generates multi-view images from a single image, serving as supervision signals to provide 3D geometric cues to our model. We also introduce a 3D feature extractor that transforms implicit 3D priors embedded in the diffusion features into explicit 3D feature volumes. Beyond accurate keypoint estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model. Experimental results on diverse datasets, including Human3.6M, CUB-200-2011, Stanford Dogs, and several in-the-wild and out-of-domain inputs, highlight the effectiveness of our method in terms of accuracy, generalization, and its ability to enable manipulation of 3D objects generated by the diffusion model from a single image.
PaperID: 1760,   Poster  https://arxiv.org/pdf/2603.17655    
Authors: Yaze Zhao, Yixiong Zou, Yuhua Li, Ruixuan Li
Title: Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment
Abstract: Cross-Domain Few-Shot Learning (CDFSL) adapts models trained with large-scale general data (source domain) to downstream target domains with only scarce training data, where the research on vision-language models (e.g., CLIP) is still in the early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, although they can roughly focus on important regions in source domains. Although current works have demonstrated CLIP's shortcomings in capturing local subtle patterns, in this paper, we find that the domain gap and scarce training data further exacerbate such shortcomings, much more than that of holistic patterns, which we call the local misalignment problem in CLIP-based CDFSL. To address this problem, due to the lack of supervision in aligning local visual features and text semantics, we turn to self-supervision information. Inspired by the translation task, we propose the CC-CDFSL method with cycle consistency, which translates local visual features into text features and then translates them back into visual features (and vice versa), and constrains the original features to be close to the translated-back features. To reduce the noise imported by richer information in the visual modality, we further propose a Semantic Anchor mechanism, which first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mapping. Extensive experiments on various benchmarks, backbones, and fine-tuning methods show we can (1) effectively improve the local vision-language alignment, (2) enhance the interpretability of learned patterns and model decisions by visualizing patches, and (3) achieve state-of-the-art performance. Our code will be released.
PaperID: 1761,   Poster  https://arxiv.org/pdf/2603.23186    
Authors: Yeonkyung Lee, Dayun Ju, Youngmin Kim, seil kang, Seong Jae Hwang
Title: ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting
Abstract: Recent advancements in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at minimizing redundancy, these methods often cause notable performance drops on tasks requiring temporal reasoning. Unlike humans, who can infer event progression from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis reveals that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports frame-level referencing and mitigates positional ambiguity within a sparsely sampled sequence. Building on these insights, we introduce ViKey, a training-free framework that combines VP with a lightweight Keyword–Frame Mapping (KFM) module. KFM leverages frame indices as dictionary-like keys to link textual cues to the most relevant frames, providing explicit temporal anchors during inference. Despite its simplicity, our approach substantially improves temporal reasoning and preserves dense-frame baseline performance with as few as 20% of frames.
PaperID: 1762,   Poster  https://arxiv.org/pdf/2604.08031    
Authors: Jiawei Liu, Xun Gong, Fen Fang, Muli Yang, Bohao Qu, Yunfeng hu, Hong Chen, Xulei Yang, Qing Guo
Title: Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles
Abstract: Most Human-Machine Interaction (HMI) research overlooks the maneuvering needs of passengers in autonomous driving (AD). Natural language offers an intuitive interface, yet translating passengers' open-ended instructions into control signals, without sacrificing interpretability and traceability, remains a challenge. This study proposes an instruction-realization framework that leverages a large language model (LLM) to interpret instructions, generates executable scripts that schedule multiple model predictive control (MPC)-based motion planners based on real-time feedback, and converts planned trajectories into control signals. This scheduling-centric design decouples semantic reasoning from vehicle control at different timescales, establishing a transparent, traceable decision-making chain from high-level instructions to low-level actions. Due to the absence of high-fidelity evaluation tools, this study introduces a benchmark for open-ended instruction realization in a closed-loop setting. Comprehensive experiments reveal that the framework significantly improves task-completion rates over instruction-realization baselines, reduces LLM query costs, achieves safety and compliance on par with specialized AD approaches, and exhibits considerable tolerance to LLM inference latency. For more qualitative illustrations and a clearer understanding, please refer to the videos in the Supplementary Material.
PaperID: 1763,   Poster  https://arxiv.org/pdf/2604.02780    
Authors: RUIZE GAO, Kaiwen Zhou, Yongqiang Chen, Feng Liu
Title: A Unified Perspective on Adversarial Membership Manipulation in Vision Models
Abstract: Membership inference attacks (MIAs) aim to determine whether a specific data point was part of a model’s training set, serving as effective tools for evaluating privacy leakage of vision models. However, existing MIAs implicitly assume honest query inputs, and their adversarial robustness remains unexplored. We show that MIAs for vision models expose a previously overlooked adversarial surface: adversarial membership manipulation, where imperceptible perturbations can reliably push non-member images into the “member” region of state-of-the-art MIAs. In this paper, we provide the first unified perspective on this phenomenon by analyzing its mechanism and implications. We begin by demonstrating that adversarial membership fabrication is consistently effective across diverse architectures and datasets. We then reveal a distinctive geometric signature (a characteristic gradient-norm collapse trajectory) that reliably separates fabricated from true members despite their nearly identical semantic representations. Building on this insight, we introduce a principled detection strategy grounded in gradient-geometry signals and develop a robust inference framework that substantially mitigates adversarial manipulation. Extensive experiments show that fabrication is broadly effective, while our detection and robust inference strategies significantly enhance resilience. This work establishes the first comprehensive framework for adversarial membership manipulation in vision models.
PaperID: 1764,   Poster  https://arxiv.org/pdf/2603.13741    
Authors: Jae Yong Lee, Daniel Scharstein, Akash Bapat, Hao Hu, Andrew Fu, Haoru Zhao, Paul Sammut, Xiang Li, Stephen Jeapes, Anik Gupta, Lior David, Saketh Madhuvarasu, JAY JOSHI, Jason Wither
Title: Ego-1K – A Large-Scale Multiview Video Dataset for Egocentric Vision
Abstract: We present Ego-1K, a large-scale, time-synchronized collection of egocentric multiview videos designed to advance neural 3D video synthesis, dynamic scene understanding, and embodied perception. The dataset contains nearly 1,000 short egocentric videos taken with a custom rig with 12 synchronous cameras surrounding a VR headset worn by the user. Scene content focuses on hand motions and hand-object interactions in different settings. We describe rig design, data processing, and calibration. Our dataset enables new ways to benchmark egocentric scene reconstruction methods. We believe this is an important area of research as smart glasses with multiple cameras become omnipresent. Our experiments demonstrate that our dataset presents unique challenges for existing 3D and 4D novel view synthesis methods due to high disparities and image motion caused by close dynamic objects and rig egomotion. Our dataset supports future research in this challenging domain, enabling 4D world creation and sharing.
PaperID: 1765,   Poster  https://arxiv.org/pdf/2511.23334    
Authors: Yu Zhang, Jingyi Liu, Yiwei Shi, Qi Zhang, Duoqian Miao, Changwei Wang, Longbing Cao
Title: Markovian Scale Prediction: A New Era of Visual Autoregressive Generation
Abstract: Visual AutoRegressive modeling (VAR) based on next-scale prediction has revitalized autoregressive visual generation. Although its full-context dependency, i.e., modeling all previous scales for next-scale prediction, facilitates more stable and comprehensive representation learning by leveraging complete information flow, the resulting computational inefficiency and substantial overhead severely hinder VAR's practicality and scalability. This motivates us to develop a new VAR model with better performance and efficiency without full-context dependency. To address this, we reformulate VAR as a non-full-context Markov process, proposing Markov-VAR. It is achieved via Markovian Scale Prediction: we treat each scale as a Markov state and introduce a sliding window that compresses certain previous scales into a compact history vector to compensate for historical information loss owing to non-full-context dependency. Integrating the history vector with the Markov state yields a representative dynamic state that evolves under a Markov process. Extensive experiments demonstrate that Markov-VAR is extremely simple yet highly effective: compared to VAR on ImageNet, Markov-VAR reduces FID by 10.5% (256×256) and decreases peak memory consumption by 83.8% (1024×1024). We believe that Markov-VAR can serve as a foundation for future research on visual autoregressive generation and other downstream tasks.
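To make the sliding-window idea in the abstract above concrete, here is a toy sketch of a state that depends only on the current scale plus a compact summary of a few previous ones. It is heavily simplified and rests on assumptions not stated in the abstract: each scale is reduced to a fixed-length summary vector, the window is compressed by mean pooling, and the history vector is integrated by concatenation. The name `markov_states` is hypothetical.

```python
from collections import deque


def markov_states(scale_summaries, window=2):
    """Build one fixed-size state per scale from the current scale and a history vector."""
    history = deque(maxlen=window)  # only the last `window` scales are retained
    states = []
    for current in scale_summaries:
        if history:
            # Assumption: compress the window by averaging it element-wise.
            hist_vec = [sum(col) / len(history) for col in zip(*history)]
        else:
            hist_vec = [0.0] * len(current)
        # Assumption: integrate history and current scale by concatenation.
        states.append(list(current) + hist_vec)
        history.append(current)
    return states


summaries = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0]]
states = markov_states(summaries, window=2)
```

Every state has the same length no matter how many scales came before it, whereas a full-context model would condition on a prefix that grows with each scale.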
PaperID: 1766,   Poster  https://arxiv.org/pdf/2602.17807    
Authors: Narges Norouzi, Idil Esen Zulfikar, Niccolò Cavagnero, Tommie Kerssies, Bastian Leibe, Gijs Dubbelman, Daan de Geus
Title: VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
Abstract: Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large-scale pre-training, can conduct accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder-only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally-agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5x-10x faster, running at up to 160 FPS with a ViT-L backbone. Code will be made publicly available.
PaperID: 1767,   Poster  https://arxiv.org/pdf/2511.14109    
Authors: Zhenyu Li, Tianyi Shang
Title: $A^2$GC: $A$symmetric $A$ggregation with Geometric Constraints for Locally Aggregated Descriptors
Abstract: Visual Place Recognition (VPR) aims to match query images against a database using visual cues. State-of-the-art methods aggregate features from deep backbones to form global descriptors. Optimal transport-based aggregation methods reformulate feature-to-cluster assignment as a transport problem, but the standard Sinkhorn algorithm symmetrically treats source and target marginals, limiting effectiveness when image features and cluster centers exhibit substantially different distributions. We propose an asymmetric aggregation VPR method with geometric constraints for locally aggregated descriptors, called A^2GC-VPR. Our method employs row-column normalization averaging with separate marginal calibration, enabling asymmetric matching that adapts to distributional discrepancies in visual place recognition. Geometric constraints are incorporated through learnable coordinate embeddings, computing compatibility scores fused with feature similarities, thereby promoting spatially proximal features to the same cluster and enhancing spatial awareness. Experimental results on MSLS, NordLand, and Pittsburgh datasets demonstrate superior performance, validating the effectiveness of our approach in improving matching accuracy and robustness.
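For readers unfamiliar with the baseline this abstract contrasts against, the following is a minimal sketch of the standard (symmetric) Sinkhorn iteration for entropic optimal transport. It is not the authors' asymmetric A^2GC-VPR aggregation, whose details are not given in the abstract. The function name `sinkhorn`, the parameters `eps` and `n_iters`, and the toy 6-feature, 3-cluster setup are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn(scores, row_marginal, col_marginal, eps=1.0, n_iters=100):
    """Standard Sinkhorn iterations (illustrative, not the paper's method).

    scores:       (n, m) feature-to-cluster similarity matrix
    row_marginal: (n,) target mass per feature
    col_marginal: (m,) target mass per cluster
    Returns an (n, m) soft assignment whose row and column sums
    approximately match the two marginals.
    """
    K = np.exp(scores / eps)          # Gibbs kernel
    u = np.ones_like(row_marginal)
    v = np.ones_like(col_marginal)
    for _ in range(n_iters):
        u = row_marginal / (K @ v)    # rescale rows toward row_marginal
        v = col_marginal / (K.T @ u)  # rescale columns toward col_marginal
    return u[:, None] * K * v[None, :]

# Toy usage: 6 local features, 3 clusters, uniform marginals (assumed values).
rng = np.random.default_rng(0)
scores = rng.random((6, 3))
rows = np.full(6, 1.0 / 6.0)
cols = np.full(3, 1.0 / 3.0)
plan = sinkhorn(scores, rows, cols)
```

Both marginals are enforced by the same alternating rescaling, which is the symmetric treatment the abstract identifies as a limitation when feature and cluster distributions differ.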
PaperID: 1768,   Poster  https://arxiv.org/pdf/2511.18729    
Authors: Lin Liu, Caiyan Jia, Guanyi Yu, Ziying Song, Junqiao Li, Feiyang Jia, Peiliang Wu, Xiaoshuai Hao, Yadan Luo
Title: GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving
Abstract: Driving planning is a critical component of end-to-end (E2E) autonomous driving. However, prevailing Imitative E2E Planners often suffer from multimodal trajectory mode collapse, failing to produce diverse trajectory proposals. Meanwhile, Generative E2E Planners struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. In this paper, we propose GuideFlow, a novel planning framework that leverages Constrained Flow Matching. Concretely, GuideFlow explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our core contribution lies in directly enforcing explicit constraints within the flow matching generation process, rather than relying on implicit constraint encoding. Crucially, GuideFlow unifies the training of the flow matching with the Energy-Based Model (EBM) to enhance the model's autonomous optimization capability to robustly satisfy physical constraints. Additionally, GuideFlow parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Extensive evaluations on major driving benchmarks (Bench2Drive, NuScenes, NavSim and ADV-NuScenes) validate the effectiveness of GuideFlow. Notably, on the NavSim test hard split (Navhard), GuideFlow achieved SOTA with an EPDMS score of 43.0. The code will be released.
PaperID: 1769,   Poster  https://arxiv.org/pdf/2603.27101    
Authors: Gedeon Muhawenayo, Caleb Robinson, Subash Khanal, Zhanpei Fang, Isaac Corley, Alexander Wollam, Tianyi Gao, Leonard Strnad, Ryan Avery, Lyndon Estes, Ana Tárano, Nathan Jacobs, Hannah Kerner
Title: PRUE: A Practical Recipe for Field Boundary Segmentation at Scale
Abstract: Large-scale maps of field boundaries are essential for agricultural monitoring tasks. Existing deep learning approaches for satellite-based field mapping have undesirable properties for large-scale inference, including sensitivity to illumination, spatial scale, and geographic location changes. We conduct the first systematic evaluation of segmentation and geospatial foundation models (GFM) for global field boundary delineation using the Fields of The World (FTW) benchmark. We evaluate 18 models under unified experimental settings, showing that a U-Net semantic segmentation model outperforms instance-based and GFM alternatives on a suite of performance and deployment metrics. We propose a new segmentation approach that combines a U-Net backbone, composite loss functions, and targeted data augmentations to enhance performance and robustness under real-world conditions. Our model achieves a 76% IoU and 47% object-F1 on the FTW benchmark, an increase of 6% and 9% over the previous baseline. Our approach provides a practical framework for reliable, scalable, and reproducible field boundary delineation across model design, training, and inference. We release trained models and model-derived field boundary datasets for 5 countries outside of the FTW dataset to support future research and deployment.
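As a reminder of what the IoU figure measures, here is a minimal sketch of pixel-wise intersection-over-union for binary masks. The exact evaluation protocol of the FTW benchmark (class handling, averaging, the object-F1 matching rule) is not described in the abstract, so this covers only the generic definition. The function name `binary_iou` and the toy masks are assumptions.

```python
import numpy as np

def binary_iou(pred, target):
    """Pixel-wise IoU between two binary masks (generic definition)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks empty: treated as a perfect match here (a convention).
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)

# Toy 2x3 masks (hypothetical): 2 overlapping pixels, 4 pixels in the union.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
iou = binary_iou(pred, target)  # 2 / 4 = 0.5
```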
PaperID: 1770,   Poster  https://arxiv.org/pdf/2603.02280    
Authors: Jinge Ma, Fengqing Zhu
Title: Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning
Abstract: With the widespread adoption of deep learning in visual tasks, Class-Incremental Learning (CIL) has become an important paradigm for handling dynamically evolving data distributions. However, CIL faces the core challenge of catastrophic forgetting, often manifested as a prediction bias toward new classes. Existing methods mainly attribute this bias to intra-task class imbalance and focus on corrections at the classifier head. In this paper, we highlight an overlooked factor, temporal imbalance, as a key cause of this bias. Earlier classes receive stronger negative supervision toward the end of training, leading to asymmetric precision and recall. We establish a temporal supervision model, formally define temporal imbalance, and propose the Temporal-Adjusted Loss (TAL), which uses a temporal decay kernel to construct a supervision strength vector and dynamically reweight the negative supervision in cross-entropy loss. Theoretical analysis shows that TAL degenerates to standard cross-entropy under balanced conditions and effectively mitigates prediction bias under imbalance. Extensive experiments demonstrate that TAL significantly reduces forgetting and improves performance on multiple CIL benchmarks, underscoring the importance of temporal modeling for stable long-term learning.
PaperID: 1771,   Poster  https://arxiv.org/pdf/2510.14431    
Authors: Hui Xiang, Yifan Bian, Li Li, Jingran Wu, Xianguo Zhang, Dong Liu
Title: Real-Time Neural Video Compression with Unified Intra and Inter Coding
Abstract: Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, and interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow an idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy in both the forward and backward directions. Experimental results show that our scheme outperforms DCVC-RT by an average of 12.1% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performance. Code and models will be released.
PaperID: 1772,   Poster  https://arxiv.org/pdf/2506.14697    
Authors: Zonghao Ying, Le Wang, Yisong Xiao, Jiakai Wang, Yuqing Ma, Jinyang Guo, Zhenfei Yin, Mingchuan Zhang, Aishan Liu, Xianglong Liu
Title: AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions
Abstract: The integration of vision-language models (VLMs) is driving a new generation of embodied agents capable of operating in human-centered environments. However, as deployment expands, these systems face growing safety risks, particularly when executing hazardous instructions. Current safety evaluation benchmarks remain limited: they cover only narrow scopes of hazards and focus primarily on final outcomes, neglecting the agent's full perception-planning-execution process and thereby obscuring critical failure modes. Therefore, we present SAFE, a benchmark for systematically assessing the safety of embodied VLM agents on hazardous instructions. SAFE comprises three components: SAFE-THOR, an extensible adversarial simulation sandbox with a universal adapter that maps high-level VLM outputs to low-level embodied controls, supporting diverse agent workflow integration; SAFE-VERSE, a risk-aware task suite inspired by Asimov's Three Laws of Robotics, comprising 45 adversarial scenarios, 1,350 hazardous tasks, and 9,900 instructions that span risks to humans, environments, and agents; and SAFE-DIAGNOSE, a multi-level and fine-grained evaluation protocol measuring agent performance across perception, planning, and execution. Applying SAFE to nine state-of-the-art VLMs and two embodied agent workflows, we uncover systematic failures in translating hazard recognition into safe planning and execution. Our findings reveal fundamental limitations in current safety alignment and demonstrate the necessity of a comprehensive, multi-stage evaluation for developing safer embodied intelligence.
PaperID: 1773,   Poster  https://arxiv.org/pdf/2603.06688    
Authors: Zhengjian Yao, Yongzhi Li, Xinyuan Gao, Quan Chen, Peng Jiang, Yanye Lu
Title: Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning
Abstract: We present Narrative Weaver, a novel framework that addresses a fundamental challenge in generative AI: achieving controllable, long-range, and consistent visual content generation. While existing models excel at generating high-fidelity short-form visual content, they struggle to maintain narrative coherence and visual consistency across extended sequences, a critical limitation for real-world applications such as filmmaking and e-commerce advertising. Narrative Weaver introduces the first holistic solution that seamlessly integrates three essential capabilities: fine-grained control, automatic narrative planning, and long-range coherence. Our architecture combines a Multimodal Large Language Model (MLLM) for high-level narrative planning with a novel fine-grained control module featuring a dynamic Memory Bank that prevents visual drift. To enable practical deployment, we develop a progressive, multi-stage training strategy that efficiently leverages existing pre-trained models, achieving state-of-the-art performance even with limited training data. Recognizing the absence of suitable evaluation benchmarks, we construct and release the E-commerce Advertising Video Storyboard Dataset (EAVSD), the first comprehensive dataset for this task, containing over 330K high-quality images with rich narrative annotations. Through extensive experiments across three distinct scenarios (controllable multi-scene generation, autonomous storytelling, and e-commerce advertising), we demonstrate our method’s superiority while opening new possibilities for AI-driven content creation.
PaperID: 1774,   Poster  https://arxiv.org/pdf/2603.09094    
Authors: Zixuan Wang, Yixin Hu, Haolan Wang, Feng Chen, Yan Liu, Wen Li, Yinjie Lei
Title: Chain of Event-Centric Causal Thought for Physically Plausible Video Generation
Abstract: Physically Plausible Video Generation (PPVG) has emerged as a promising avenue for modeling real-world physical phenomena. PPVG requires an understanding of commonsense knowledge, which remains a challenge for video diffusion models. Current approaches leverage the commonsense reasoning capability of large language models to embed physical concepts into prompts. However, generation models often render physical phenomena as a single moment defined by prompts, due to the lack of conditioning mechanisms for modeling causal progression. In this paper, we view PPVG as generating a sequence of causally connected and dynamically evolving events. To realize this paradigm, we design two key modules: (1) Physics-driven Event Chain Reasoning. This module decomposes the physical phenomena described in prompts into multiple elementary event units, leveraging chain-of-thought reasoning. To mitigate causal ambiguity, we embed physical formulas as constraints to impose deterministic causal dependencies during reasoning. (2) Transition-aware Cross-modal Prompting (TCP). To maintain continuity between events, this module transforms causal event units into temporally aligned vision-language prompts. It summarizes discrete event descriptions to obtain causally consistent narratives, while progressively synthesizing visual keyframes of individual events by interactive editing. Comprehensive experiments on PhyGenBench and VideoPhy benchmarks demonstrate that our framework achieves superior performance in generating physically plausible videos across diverse physical domains. Our code will be released soon.
PaperID: 1775,   Poster  https://arxiv.org/pdf/2601.07700    
Authors: Jakob Zimmermann, Georg Loho
Title: Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition
Abstract: It has been demonstrated in various contexts that monotonicity leads to better explainability in neural networks. However, not every function can be well approximated by a monotone neural network. We demonstrate that monotonicity can still be used in two ways to boost explainability. First, we use an adaptation of the decomposition of a trained ReLU network into two monotone and convex parts, thereby overcoming numerical obstacles from an inherent blowup of the weights in this procedure. Our proposed saliency methods, SplitCAM and SplitLRP, improve on state-of-the-art results on both VGG16 and ResNet18 networks on ImageNet-S across all Quantus saliency metric categories. Second, we exhibit that training a model as the difference between two monotone neural networks results in a system with strong self-explainability properties.
PaperID: 1776,   Poster  https://arxiv.org/pdf/2409.17385    
Authors: Ruining Yang, Yi Xu, Yun Fu, Lili Su
Title: Den-TP: Density-Balanced Data Curation and Evaluation Framework for Trajectory Prediction
Abstract: Trajectory prediction in autonomous driving has traditionally been studied from a model-centric perspective. However, existing datasets exhibit a strong long-tail distribution in scenario density, the number of agents per scenario, where common low-density cases dominate and safety-critical high-density cases are severely underrepresented. This imbalance limits model robustness and hides failure modes when standard evaluations average errors across all scenarios. We revisit trajectory prediction from a data-centric angle and present Den-TP, a framework for density-aware dataset curation and evaluation. Den-TP first partitions data into density-conditioned regions using agent count as a lightweight, dataset-agnostic proxy for interaction complexity. It then applies gradient-based utilities with a submodular selection objective to choose representative samples within each region while explicitly rebalancing across densities. The resulting subset reduces dataset size by 50% yet preserves overall performance and significantly improves robustness in high-density scenarios. We further introduce density-conditioned evaluation protocols that reveal long-tail failure modes overlooked by conventional metrics. Experiments on Argoverse 1 and 2 with state-of-the-art models show that robust trajectory prediction hinges not only on data scale, but also on balancing scenario density.
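The partition-then-rebalance idea described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: scenarios are binned by agent count using hypothetical thresholds (`bin_edges`), each bin gets an equal quota, and a random pick stands in for the paper's gradient-based submodular selection, which the abstract does not specify. The function name `density_balanced_subset` and all values are invented for the example.

```python
import random
from collections import defaultdict

def density_balanced_subset(agent_counts, bin_edges, budget, seed=0):
    """Select at most `budget` scenario indices, spread evenly over density bins.

    agent_counts: number of agents in each scenario
    bin_edges:    ascending thresholds separating density regions (assumed)
    NOTE: random selection is a placeholder for the paper's
    gradient-based submodular selection.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for idx, n_agents in enumerate(agent_counts):
        # Bin index = number of thresholds the agent count exceeds.
        region = sum(n_agents > edge for edge in bin_edges)
        bins[region].append(idx)
    per_bin = budget // len(bins)  # equal quota per non-empty region
    chosen = []
    for region in sorted(bins):
        members = bins[region]
        rng.shuffle(members)
        chosen.extend(members[:per_bin])
    return sorted(chosen)

# Toy long-tailed data: 10 sparse, 4 medium, 2 dense scenarios (hypothetical).
counts = [2] * 10 + [15] * 4 + [40] * 2
subset = density_balanced_subset(counts, bin_edges=[10, 30], budget=6)
```

In this toy run the sparse region makes up 10 of 16 scenarios in the full set but only 2 of 6 in the subset, which is the kind of rebalancing toward high-density cases the abstract describes.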
PaperID: 1777,   Poster  https://arxiv.org/pdf/2512.06158    
Authors: Su Sun, Cheng Zhao, Himangi Mittal, Gaurav Mittal, Rohith Kukkala, Yingjie Chen, Mei Chen
Title: Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation
Abstract: Generating dynamic 4D objects from sparse inputs is difficult because it demands joint preservation of appearance and motion coherence across views and time while suppressing artifacts and temporal drift. We hypothesize that the view discrepancy arises from supervision limited to pixel- or latent-space video-diffusion losses, which lack explicitly temporally aware, feature-level tracking guidance. We present Track4DGen, a two-stage framework that couples a multi-view video diffusion model with a foundation point tracker and a hybrid 4D Gaussian Splatting (4D-GS) reconstructor. The central idea is to explicitly inject tracker-derived motion priors into intermediate feature representations for both multi-view video generation and 4D-GS. In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator, producing temporally consistent features that curb appearance drift and enhance cross-view coherence. In Stage Two, we reconstruct a dynamic 4D-GS using a hybrid motion encoding that concatenates co-located diffusion features (carrying Stage-One tracking priors) with Hex-plane features, and augment them with 4D Spherical Harmonics for higher-fidelity dynamics modeling. Track4DGen surpasses baselines on both multi-view video generation and 4D generation benchmarks, yielding temporally stable, text-editable 4D assets. Lastly, we curate Sketchfab28, a high-quality dataset for benchmarking object-centric 4D generation and fostering future research.
PaperID: 1778,   Poster  https://arxiv.org/pdf/2603.24749    
Authors: David Shatwell, Swetha Sirnam, Mubarak Shah
Title: TIGER: A Unified Framework for Time, Images and Geo-location Retrieval
Abstract: Many real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, geolocation, and time. Beyond standard geo-localization and time-of-capture prediction, these applications increasingly demand more complex capabilities, such as retrieving an image captured at the same location as a query image but at a specified target time. We formalize this problem as Geo-Time Aware Image Retrieval and curate a diverse benchmark of 4.5M paired image–location–time triplets for training and 85k high-quality triplets for evaluation. We then propose TIGeR, a multi-modal-transformer–based model that maps image, geolocation, and time into a unified geo-temporal embedding space. TIGeR supports flexible input configurations (single-modality and multi-modality queries) and uses the same representation to perform (i) geo-localization, (ii) time-of-capture prediction, and (iii) geo-time–aware retrieval. By better preserving underlying location identity under large appearance changes, TIGeR enables retrieval based on where and when a scene is, rather than purely on visual similarity. Extensive experiments show that TIGeR consistently outperforms strong baselines and state-of-the-art methods by up to 16% on time-of-year prediction, 8% on time-of-day prediction, and 14% in geo-time aware retrieval recall, highlighting the benefits of unified geo-temporal modeling.
PaperID: 1779,   Poster  https://arxiv.org/pdf/2603.24942    
Authors: Yasong Dai, Zeeshan Hayder, David Ahmedt-Aristizabal, Hongdong Li
Title: BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation
Abstract: Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures. To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both "image → noise" and "noise → image" directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones. Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability. Our code and models will be released upon acceptance.
PaperID: 1780,   Poster  https://arxiv.org/pdf/2511.17282    
Authors: Chuancheng Shi, Shangze Li, Shiming Guo, Simiao Xie, Wenhua Wu, Jingtong Dou, Chao Wu, Canran Xiao, Cong Wang, Zifeng Cheng, Fei Shen, Tat-seng Chua
Title: Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
Abstract: Multilingual text-to-image (T2I) models have advanced rapidly in terms of visual realism and semantic alignment, and are now widely utilised. Yet outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. We conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts. Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation that amplifies the identified neurons without fine-tuning the backbone; and (2) layer-targeted cultural enhancement that updates only culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.
PaperID: 1781,   Poster  https://arxiv.org/pdf/2603.25042    
Authors: Wonjoon Lee, Sungmin Woo, Donghyeong Kim, Jungho Lee, Sangheon Park, Sangyoun Lee
Title: MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes
Abstract: Online reconstruction of dynamic scenes aims to learn from streaming multiview inputs under low-latency constraints. The fast training and real-time rendering capabilities of 3D Gaussian Splatting have made on-the-fly reconstruction practically feasible, enabling online 4D reconstruction. However, existing online approaches, despite their efficiency and visual quality, fail to learn per-Gaussian motion that reflects true scene dynamics. Without explicit motion cues, appearance and motion are optimized solely under photometric loss, causing per-Gaussian motion to chase pixel residuals rather than true 3D motion. To address this, we propose MoRGS, an efficient online per-Gaussian motion reasoning framework that treats Gaussian movement as a core modeling object. Specifically, we efficiently leverage optical flow on a sparse set of key views as a lightweight motion cue to guide per-Gaussian motion toward the scene’s true dynamics. To compensate for the sparsity and view-dependence of flow, we learn a per-Gaussian motion offset field that reconciles discrepancies between projected 3D motion and observed flow across views and time. In addition, we introduce a per-Gaussian motion confidence that separates dynamic from static Gaussians and weights Gaussian attribute residual updates, thereby suppressing redundant motion in static regions for better temporal consistency and accelerating the modeling of large motions. Extensive experiments demonstrate that MoRGS achieves state-of-the-art reconstruction quality and motion fidelity among online methods, while maintaining streamable performance.
PaperID: 1782,   Poster  https://arxiv.org/pdf/2512.07821    
Authors: Shaoheng Fang, Hanwen Jiang, Yunpeng Bai, Niloy J. Mitra, Qixing Huang
Title: WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling
Abstract: Recent video generators achieve striking photorealism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatiotemporally consistent. WorldReel jointly produces RGB frames together with 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent geometry and appearance modeling over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, yielding videos that remain consistent even under large non-rigid motion and significant camera movement. We train WorldReel by carefully combining synthetic and real data: synthetic data provides precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend allows WorldReel to generalize to in-the-wild footage while preserving strong geometric fidelity. Extensive experiments demonstrate that WorldReel sets a new state-of-the-art for consistent video generation with dynamic scenes and moving cameras, improving metrics of geometric consistency, motion coherence, and reducing view-time artifacts over competing methods. We believe that WorldReel brings video generation closer to 4D-consistent world modeling, where agents can render, interact, and reason about scenes through a single and stable spatiotemporal representation.
PaperID: 1783,   Poster  https://arxiv.org/pdf/2510.01448    
Authors: Angel Daruna, Nicholas Meegan, Han-Pang Chiu, Supun Samarasekera, Rakesh “Teddy” Kumar
Title: GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings
Abstract: Worldwide visual geo-localization seeks to determine the geographic location of an image anywhere on Earth using only its visual content. Learned representations of geography for visual geo-localization remain an active research topic despite much progress. We formulate geo-localization as aligning the visual representation of the query image with a learned geographic representation. Unlike prior work, our geographic representation explicitly models the world as a hierarchy of learned geographic embeddings. Additionally, we introduce an approach to efficiently fuse the appearance features of the query image with its semantic segmentation map, forming a robust visual representation. Our main experiments demonstrate improved all-time bests in 22 out of 25 metrics measured across five benchmark datasets compared to prior state-of-the-art (SOTA) methods. Additional ablation studies support the claim that these gains are primarily driven by the combination of geographic and visual representations.
PaperID: 1784,   Poster  https://arxiv.org/pdf/2512.12887    
Authors: Han Liu, Bogdan Georgescu, Yanbo Zhang, Youngjin Yoo, Michael Baumgartner, Riqiang Gao, Jianing Wang, Gengyan Zhao, Eli Gibson, Dorin Comaniciu, Sasa Grbic
Title: Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification
Abstract: 3D medical image classification is essential to modern clinical workflows. Medical foundation models (FMs) have emerged as a promising approach for scaling to new tasks, yet current research suffers from three critical pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. In this paper, we address these pitfalls and introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs. It allows efficient scaling to new tasks by adding only lightweight plugins (~1M parameters per task) to a single frozen backbone. In addition, this versatile framework supports multi-view inputs, auxiliary pixel-level supervision, and interpretable heatmap generation. We establish a comprehensive benchmark of 12 tasks covering diverse pathologies, anatomies, and modalities and systematically evaluate state-of-the-art 3D classification techniques. Our analysis reveals several key insights: (1) effective adaptation is critical to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification. For the first time, we demonstrate the feasibility of achieving state-of-the-art performance across diverse applications using a single scalable framework (e.g., 1st place in the challenge), eliminating the need for separate task-specific 3D models.
PaperID: 1785,   Poster  https://arxiv.org/pdf/2603.06917    
Authors: Zhengjian Kang, Jun Zhuang, Kangtong Mo, Qi Chen, Rui Liu, Ye Zhang
Title: PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection
Abstract: Detection Transformer (DETR) has redefined object detection by casting it as a set prediction task within an end-to-end framework. Despite its elegance, DETR and its variants still rely on fixed learnable queries and suffer from severe query utilization imbalance, which limits adaptability and leaves the model capacity underused. We propose PaQ-DETR (Pattern and Quality-Aware DETR), a unified framework that enhances both query adaptivity and supervision balance. It learns a compact set of shared latent patterns capturing global semantics and dynamically generates image-specific queries through content-conditioned weighting. In parallel, a quality-aware one-to-many assignment strategy adaptively selects positive samples based on localization–classification consistency, enriching supervision and promoting balanced query optimization. Experiments on COCO, CityScapes, and other benchmarks show consistent gains of 1.5%–4.2% mAP across DETR backbones, including ResNet and Swin-Transformer. Beyond accuracy improvement, our method provides interpretable insights into how dynamic patterns cluster semantically across object categories.
PaperID: 1786,   Poster  https://arxiv.org/pdf/2603.27033    
Authors: Logan Lawrence, Oindrila Saha, Rangel Daroya, Mustafa Chasmai, Wuao Liu, Max Hamilton, Aaron Sun, Seoyun Jeong, Fabien Delattre, Subhransu Maji, Grant Horn
Title: RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs
Abstract: Fine-grained bird species identification in the wild is frequently unanswerable from a single image: key cues may be non-visual (vocalization, range, season), or obscured due to occlusion, camera angle, or low resolution. Yet today’s multimodal systems are typically judged on answerable, in-schema cases, encouraging confident guesses rather than principled abstention. We propose the RealBirdID benchmark: given an image of a bird, a system should either answer with a species or abstain with a concrete, evidence-based rationale (e.g., “requires vocalization,” “out of range,” “view obstructed”). For each genus, the dataset includes a validation split composed of curated unanswerable examples with labeled rationales, paired with a companion set of clearly answerable instances. We find that (1) species identification on the answerable set is challenging for a variety of open-source and proprietary models (≤ 17% accuracy, including GPT-5 and Gemini-2.5 Pro), (2) models with greater classification ability are not necessarily better calibrated to abstain from unanswerable examples, and (3) MLLMs generally fail to provide correct reasons even when they do abstain. RealBirdID establishes a focused target for abstention-aware fine-grained recognition and a recipe for measuring progress.
PaperID: 1787,   Poster  https://arxiv.org/pdf/2511.14099    
Authors: Jingren Liu, Shuning Xu, Qirui Yang, WANG Yun, Xiangyu Chen, Zhong Ji
Title: FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration
Abstract: All-in-One Image Restoration (AIO-IR) aims to develop a unified model that can handle multiple degradations under complex conditions. However, existing methods often rely on task-specific designs or latent routing strategies, making it hard to adapt to real-world scenarios with various degradations. We propose FAPE-IR, a Frequency-Aware Planning and Execution framework for image restoration. It uses a frozen Multimodal Large Language Model (MLLM) as a planner to analyze degraded images and generate concise, frequency-aware restoration plans. These plans guide a LoRA-based Mixture-of-Experts (LoRA-MoE) module within a diffusion-based executor, which dynamically selects high- or low-frequency experts, complemented by frequency features of the input image. To further improve restoration quality and reduce artifacts, we introduce adversarial training and a frequency regularization loss. By coupling semantic planning with frequency-based restoration, FAPE-IR offers a unified and interpretable solution for all-in-one image restoration. Extensive experiments show that FAPE-IR achieves state-of-the-art performance across seven restoration tasks and exhibits strong zero-shot generalization under mixed degradations.
PaperID: 1788,   Poster  https://arxiv.org/pdf/2512.04733    
Authors: Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu, Kehua Chen, Wei Ma, Zhenning Li, Lijun Sun, Cheng-Zhong Xu
Title: E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving
Abstract: End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they ignore the passenger’s emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) AD, where an autonomous vehicle must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and feedback.
PaperID: 1789,   Poster  https://arxiv.org/pdf/2512.15160    
Authors: Jiaxu Wan, Xu Wang, Mengwei Xie, Hang Zhang, Mu Xu, Yang Han, Ding Yuan, Hong Zhang, Yifan Yang
Title: EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence
Abstract: Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for “thinking with images” (e.g., ChatGPT–o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present EagleVision, a dual-stage framework for progressive spatial cognition through macro perception and micro verification. In the macro perception stage, EagleVision employs a semantics–perspective-fusion determinantal point process (SPF-DPP) to select a compact set of geometry- and semantics-aware keyframes from long videos under a fixed token budget. In the micro verification stage, we formalize spatial CoT as BEV-grounded pose querying: the agent iteratively predicts poses on a BEV plane, retrieves the nearest real frames, and is trained purely by reinforcement learning with a spatial grounding reward that scores the consistency between predicted poses and observed views. On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision–language models, demonstrating strong and generalizable spatial understanding.
PaperID: 1790,   Poster  https://arxiv.org/pdf/2512.21194    
Authors: Brigitta Malagurski Törtei, Yasser Dahou, Ngoc Huynh, Wamiq Reyaz Para, Phúc H. Lê Khắc, Ankit Singh, Sofian Chaybouti, Sanath Narayan
Title: VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs
Abstract: Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near randomly under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.
PaperID: 1791,   Poster  https://arxiv.org/pdf/2601.08834    
Authors: Yufeng Zhong, Lei Chen, Zhixiong Zeng, Xuanle Zhao, Deyang Jiang, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Siqi Yang, Lin Ma
Title: Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR
Abstract: Reading text from images or scanned documents via OCR models has been a longstanding focus of researchers. Intuitively, text reading is perceived as a straightforward perceptual task, and existing work primarily focuses on enriched data engineering to enhance SFT capabilities. In this work, we observe that even advanced OCR models exhibit significantly higher entropy in formatted text (e.g., formulas and tables) compared to plain text, often by an order of magnitude. These statistical patterns reveal that advanced OCR models struggle with high output uncertainty when dealing with format-sensitive documents, suggesting that reasoning over diverse reading pathways may improve OCR performance. To address this, we propose format decoupled reinforcement learning (FD-RL), which leverages high-entropy patterns for targeted optimization. Our approach employs an entropy-based data filtration strategy to identify format-intensive instances, and adopts format decoupled rewards tailored to different format types, enabling format-level validation rather than token-level memorization. FD-RL achieves an average score of 90.41 on OmniDocBench, setting a new record for end-to-end models on this highly popular benchmark. More importantly, we conduct comprehensive ablation studies over data, training, filtering, and rewarding strategies, thoroughly validating their effectiveness.
PaperID: 1792,   Poster  https://arxiv.org/pdf/2511.12978    
Authors: Aishwarya Agarwal, Srikrishna Karanam, Vineet Gandhi
Title: Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach
Abstract: Contrastive vision–language models (VLMs) such as CLIP achieve strong zero-shot recognition yet remain vulnerable to spurious correlations, particularly background over-reliance. We introduce Cluster-based Concept Importance (CCI), a novel interpretability method that uses CLIP’s own patch embeddings to group spatial patches into semantically coherent clusters, mask them, and evaluate relative changes in model predictions. CCI sets a new state of the art on faithfulness benchmarks, surpassing prior methods by large margins; for example, it yields more than a twofold improvement on the deletion-AUC metric for MS COCO retrieval. We further propose that CCI, when combined with GroundedSAM, automatically categorizes predictions as foreground- or background-driven, providing a crucial diagnostic ability. Existing benchmarks such as CounterAnimals, however, rely solely on accuracy and implicitly attribute all performance degradation to background correlations. Our analysis shows this assumption to be incomplete, since many errors arise from viewpoint variation, scale shifts, and fine-grained object confusions. To disentangle these effects, we introduce COVAR, a benchmark that systematically varies object foregrounds and backgrounds. Leveraging CCI with COVAR, we conduct a comprehensive evaluation of eighteen CLIP variants, providing both methodological advances and empirical evidence that chart a path toward more robust vision–language models.
PaperID: 1793,   Poster  https://arxiv.org/pdf/2601.07462    
Authors: Shikang Zheng, Guantao Chen, Landis He, Jiacheng Liu, Yuqi Lin, Chang Zou, Linfeng Zhang
Title: From Sketch to Fresco: Efficient Diffusion Transformer with Progressive Resolution
Abstract: Diffusion Transformers achieve impressive generative quality but remain computationally expensive due to iterative sampling. Recently, dynamic resolution sampling has emerged as a promising acceleration technique by reducing the resolution of early sampling steps. However, existing methods rely on heuristic renoising at every resolution transition, injecting noise that breaks cross-stage consistency and forces the model to relearn global structure. In addition, these methods indiscriminately upsample the entire latent space at once without checking which regions have actually converged, causing accumulated errors and visible artifacts. Therefore, we propose Fresco, a dynamic resolution framework that unifies re-noise and global structure across stages with progressive upsampling, preserving both the efficiency of low-resolution drafting and the fidelity of high-resolution refinement, with all stages aligned toward the same final target. Fresco achieves near-lossless acceleration across diverse domains and models, including 10× speedup on FLUX and 5× on HunyuanVideo, while remaining orthogonal to distillation, quantization, and feature caching, reaching 22× speedup when combined with distilled models. Our code is in the supplementary material and will be released on GitHub.
PaperID: 1794,   Poster  https://arxiv.org/pdf/2512.04485    
Authors: Aaron Sun, Oindrila Saha, Subhransu Maji
Title: Not All Birds Look The Same: Identity-Preserving Generation For Birds
Abstract: Since the advent of controllable image generation, increasingly rich modes of control have enabled greater customization and accessibility for everyday users. Zero-shot, identity-preserving models such as Insert Anything and OminiControl now support applications like virtual try-on without requiring additional fine-tuning. While these models may be fitting for humans and rigid everyday objects, they still have limitations for non-rigid or fine-grained categories. These domains often lack accessible, high-quality data (especially videos or multi-view observations of the same subject), making them difficult both to evaluate and to improve upon. Yet, such domains are essential for moving beyond content creation toward applications that demand accuracy and fine detail. Birds are an excellent domain for this task: they exhibit high diversity, require fine-grained cues for identification, and come in a wide variety of poses. We introduce the NABirds Look-Alikes (NABLA) dataset, consisting of 4,759 expert-curated image pairs. Together with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, this forms a benchmark for evaluating identity-preserving generation of birds. We show that state-of-the-art baselines fail to maintain identity on this dataset, and we demonstrate that training on images grouped by species, age, and sex (used as a proxy for identity) substantially improves performance on both seen and unseen species.
PaperID: 1795,   Poster  https://arxiv.org/pdf/2506.21011    
Authors: Qizhi Xie, Kun Yuan, Yunpeng Qu, Jiachao Gong, Mingda Wu, Ming Sun, Chao Zhou, Jihong Zhu
Title: Score2Instruct: Scaling Up Video Quality-Centric Instructions via Automated Dimension Scoring
Abstract: Classical video quality assessment (VQA) methods generate a numerical score to judge a video's perceived visual fidelity and clarity. Yet, a score fails to describe the video's complex quality dimensions (e.g., noise), restricting its applicability. Benefiting from the human-friendly linguistic output, adapting video large multimodal models (LMMs) to VQA via instruction tuning has the potential to address this issue. The core of the approach lies in the video quality-centric instruction data. Previous explorations mainly focus on the image domain, and their data generation processes heavily rely on human quality annotations and proprietary systems (e.g., GPT-4), limiting data scalability and effectiveness. To address these challenges, we propose the Score-based Instruction Generation (SIG) pipeline. Specifically, SIG first scores multiple quality dimensions of an unlabeled video and maps scores to text-defined levels. It then explicitly incorporates a hierarchical Chain-of-Thought (CoT) to model the correlation between specific dimensions and overall quality, mimicking the human visual system's (HVS) reasoning process. The automated pipeline eliminates the reliance on expert-written quality descriptions and proprietary systems, ensuring data scalability and generation efficiency. To this end, the resulting Score2Instruct (S2I) dataset contains over 320K diverse instruction-response pairs, laying the basis for instruction tuning. Moreover, to advance video LMMs' quality scoring and justification abilities simultaneously, we devise a progressive tuning strategy to fully unleash the power of S2I. Built upon SIG, we further curate a benchmark termed S2I-Bench with 400 open-ended questions to better evaluate the quality justification capacity of video LMMs. Experimental results on the S2I-Bench and existing benchmarks indicate that our method consistently improves quality scoring and justification capabilities across multiple video LMMs. The code and dataset will be publicly available for future research.
PaperID: 1796,   Poster  https://arxiv.org/pdf/2511.22169    
Authors: Inha Kang, Eunki Kim, Wonjeong Ryu, Jaeyo Shin, Seungjun Yu, Yoon-Hee Kang, Seongeun Jeong, Eunhye Kim, Soontae Kim, Hyunjung Shim
Title: Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization
Abstract: Accurate long-horizon forecasting of particulate matter (PM) concentration fields is essential for operational public health decisions. However, achieving reliable forecasts remains challenging in regions with complex terrain and strong atmospheric dynamics such as East Asia. While foundation models such as Aurora offer global generality, they often miss region-specific dynamics and rely on non-real-time inputs, limiting their practical utility for localized warning systems. To address this gap, we construct and release the real-world observations and high-resolution CMAQ-OBS dataset for East Asia, reducing regional error by 59.5% and enabling real-time 48-120 hour forecasts critical for public health alerts. However, standard point-wise objectives cannot reflect asymmetric operational costs, where false alarms deteriorate public trust while missed severe events endanger populations. This cost mismatch causes SFT models to over-predict and yield high False Alarm Rates. We introduce Group-Relative Policy Optimization (GRPO) with class-wise rewards and curriculum rollout to align predictions with operational priorities. Experimental results demonstrate that our framework significantly improves forecast reliability. Compared to the SFT-only baseline, our model reduces the False Alarm Rate by 47.3% while achieving a competitive F1-score, demonstrating its effectiveness for practical, real-world air quality forecasting systems in long-lead-time scenarios.
PaperID: 1797,   Poster  https://arxiv.org/pdf/2602.19565    
Authors: Li Zhang, Mingyu Mei, Ailing Wang, Xianhui Meng, Yan Zhong, Xinyuan Song, Liu Liu, Rujing Wang, Zaixing He, Cewu Lu
Title: DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces
Abstract: Articulated object pose estimation is a core task in embodied AI and computer vision. Existing methods typically regress poses in a continuous space, but often struggle with 1) navigating a large, complex search space and 2) incorporating intrinsic kinematic constraints. In this paper, we introduce DICArt (DIsCrete Diffusion for Articulated Object Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt progressively denoises a noisy pose representation through a learned reverse diffusion procedure to recover the ground-truth pose. To improve modeling fidelity, we propose a flexible flow decider that dynamically determines whether each token should be denoised or reset, effectively balancing the real and noise distributions during diffusion. Additionally, we incorporate a hierarchical kinematic coupling strategy, estimating the pose of each rigid part hierarchically to respect the object's kinematic structure. We validate DICArt on both synthetic and real-world datasets with multi-hinged articulated objects. Experimental results demonstrate its superior performance and robustness over state-of-the-art baselines. By integrating discrete generative modeling with structural priors, DICArt offers a new paradigm for reliable category-level 6D pose estimation in complex environments. Code will be publicly available upon acceptance.
PaperID: 1798,   Poster  https://arxiv.org/pdf/2512.16250    
Authors: Sanjoy Chowdhury, Karren Dai Yang, Xudong Liu, Fartash Faghri, Pavan Kumar Anasosalu Vasu, Oncel Tuzel, Dinesh Manocha, Chun-Liang Li, Raviteja Vemulapalli
Title: AMusE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding
Abstract: Recent multimodal large language models (MLLMs) such as GPT-4o and Qwen3-Omni show strong perception but struggle in multi-speaker, dialogue-centric settings that demand agentic reasoning: tracking who speaks, maintaining roles, and grounding events across time. These scenarios are central to multimodal audio-video understanding, where models must jointly reason over audio and visual streams in applications such as conversational video assistants and meeting analytics. We introduce AMusE, a benchmark designed around tasks that are inherently agentic, requiring models to decompose complex audio-visual interactions into planning, grounding, and reflection steps. It evaluates MLLMs across three modes (zero-shot, guided, and agentic) and six task families, including spatio-temporal speaker grounding and multimodal dialogue summarization. Across all modes, current models exhibit weak multi-speaker reasoning and inconsistent behavior under both non-agentic and agentic evaluation. Motivated by the inherently agentic nature of these tasks and recent advances in LLM agents, we propose RAFT, a data-efficient agentic alignment framework that integrates reward optimization with intrinsic multimodal self-evaluation as reward and selective parameter adaptation for data- and parameter-efficient updates. Using RAFT, we achieve up to 39.52% relative improvement in accuracy on our benchmark. Together, AMusE and RAFT provide a practical platform for examining agentic reasoning in multimodal models and improving their capabilities. To facilitate further research, we will publicly release our code and benchmark.
PaperID: 1799,   Poster  https://arxiv.org/pdf/2509.24850    
Authors: Bo Zhao, Dan Guo, Junzhe Cao, Yong Xu, Bochao Zou, Tao Tan, Yue Sun, Zitong Yu
Title: PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement
Abstract: Remote photoplethysmography (rPPG) measurement enables non-contact physiological monitoring but suffers from accuracy degradation under head motion and illumination changes. Existing deep learning methods are mostly heuristic and lack theoretical grounding, limiting robustness and interpretability. In this work, we propose a physics-informed rPPG paradigm derived from the Navier–Stokes equations of hemodynamics, showing that the pulse signal follows a second-order dynamical system whose discrete solution naturally leads to a causal convolution, justifying the use of a Temporal Convolutional Network (TCN). Based on this principle, we design PHASE-Net, a lightweight model with three key components: 1) a Zero-FLOPs Axial Swapper module that swaps or transposes a few spatial channels to mix distant facial regions, boosting cross-region feature interaction without changing temporal order; 2) an Adaptive Spatial Filter that learns a soft spatial mask per frame to highlight signal-rich areas and suppress noise for cleaner feature maps; and 3) a Gated TCN, a causal dilated TCN with gating that models long-range temporal dynamics for accurate pulse recovery. Extensive experiments demonstrate that PHASE-Net achieves state-of-the-art performance and strong efficiency, offering a theoretically grounded and deployment-ready rPPG solution.
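The link between a discretized second-order system and a causal convolution can be checked numerically. The sketch below is illustrative only and is not the paper's derivation: it assumes a generic damped oscillator x'' + 2ζωx' + ω²x = u discretized with backward differences, and the parameter values (a 1.2 Hz oscillation sampled at 30 frames per second) are hypothetical. With zero initial conditions, running the recurrence on an input gives the same output as convolving that input causally with the recurrence's impulse response.

```python
import math


def second_order_coeffs(omega, zeta, dt):
    """Backward-difference discretization of x'' + 2*zeta*omega*x' + omega^2*x = u.

    Returns (a1, a2, b) such that x[n] = a1*x[n-1] + a2*x[n-2] + b*u[n].
    """
    c = 1.0 / dt**2 + 2.0 * zeta * omega / dt + omega**2
    a1 = (2.0 / dt**2 + 2.0 * zeta * omega / dt) / c
    a2 = -(1.0 / dt**2) / c
    b = 1.0 / c
    return a1, a2, b


def simulate(u, a1, a2, b):
    """Run the recurrence with zero initial conditions."""
    x = []
    for n, u_n in enumerate(u):
        prev1 = x[n - 1] if n >= 1 else 0.0
        prev2 = x[n - 2] if n >= 2 else 0.0
        x.append(a1 * prev1 + a2 * prev2 + b * u_n)
    return x


def causal_conv(u, h):
    """Causal convolution: y[n] depends only on u[0..n]."""
    return [sum(h[k] * u[n - k] for k in range(n + 1)) for n in range(len(u))]


# Hypothetical parameters, chosen only for illustration.
N = 64
a1, a2, b = second_order_coeffs(omega=2.0 * math.pi * 1.2, zeta=0.3, dt=1.0 / 30.0)

u = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(N)]
impulse = [1.0] + [0.0] * (N - 1)

h = simulate(impulse, a1, a2, b)   # impulse response acts as the convolution kernel
x_rec = simulate(u, a1, a2, b)     # recurrence output
x_conv = causal_conv(u, h)         # causal convolution output

max_diff = max(abs(p - q) for p, q in zip(x_rec, x_conv))
print(f"max |recurrence - convolution| = {max_diff:.2e}")
```

The kernel `h` here is fixed by the physical parameters; a temporal convolutional network would instead learn its kernels from data, which this sketch does not attempt.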
PaperID: 1800,   Poster  https://arxiv.org/pdf/2602.02989    
Authors: Zhanfeng Liao, Jiajun Zhang, Hanzhang Tu, Zhixi Wang, Yunqi Gao, Hongwen Zhang, Yebin Liu
Title: SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation
Abstract: Novel view synthesis of dynamic scenes is fundamental to achieving photorealistic 4D reconstruction and immersive visual experiences. Recent progress in Gaussian-based representations has significantly improved real-time rendering quality, yet existing methods still struggle to maintain a balance between long-term static and short-term dynamic regions in both representation and optimization. To address this, we present SharpTimeGS, a lifespan-aware 4D Gaussian framework that achieves temporally adaptive modeling of both static and dynamic regions under a unified representation. Specifically, we introduce a learnable lifespan parameter that reformulates temporal visibility from a Gaussian-shaped decay into a flat-top profile, allowing primitives to remain consistently active over their intended duration and avoiding redundant densification. In addition, the learned lifespan modulates each primitive’s motion, reducing drift in long-lived static points while retaining unrestricted motion for short-lived dynamic ones. This effectively decouples motion magnitude from temporal duration, improving long-term stability without compromising dynamic fidelity. Moreover, we design a lifespan–velocity–aware densification strategy that mitigates optimization imbalance between static and dynamic regions by allocating more capacity to regions with pronounced motion while keeping static areas compact and stable. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance while supporting real-time rendering up to 4K resolution at 100 FPS on one RTX 4090.
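The difference between a Gaussian-shaped temporal decay and a flat-top profile can be illustrated with a small sketch. The abstract does not give the exact parameterization, so the form below is an assumption: visibility stays at 1 inside a window of width `lifespan` around the primitive's temporal center and falls off with a Gaussian tail outside it. The function names and parameter values are hypothetical.

```python
import math


def gaussian_visibility(t, center, sigma):
    """Gaussian-shaped temporal decay: peaks at `center`, decays immediately."""
    return math.exp(-0.5 * ((t - center) / sigma) ** 2)


def flat_top_visibility(t, center, lifespan, sigma):
    """Assumed flat-top profile: fully visible within `lifespan`, Gaussian tail outside."""
    excess = max(0.0, abs(t - center) - 0.5 * lifespan)
    return math.exp(-0.5 * (excess / sigma) ** 2)


# Hypothetical primitive: centered at t = 0.5, lifespan 0.4, tail width 0.05.
center, lifespan, sigma = 0.5, 0.4, 0.05
for t in (0.5, 0.4, 0.3, 0.2):
    g = gaussian_visibility(t, center, sigma)
    f = flat_top_visibility(t, center, lifespan, sigma)
    print(f"t={t:.1f}  gaussian={g:.3f}  flat_top={f:.3f}")
```

Under this assumed form, a single primitive stays fully active across its lifespan, whereas the purely Gaussian profile has already decayed noticeably a short distance from its center, which is the behavior the abstract associates with redundant densification.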
PaperID: 1801,   Poster  https://arxiv.org/pdf/2603.20176    
Authors: Stanislaw Szymanowicz, Minghao Chen, Jianyuan Wang, Christian Rupprecht, Andrea Vedaldi
Title: LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis
Abstract: Novel View Synthesis has often relied on explicit 3D representations, which inject a strong 3D bias in the process; however, recent work has shown that network-based rendering can work better despite lacking 3D inductive biases. In this paper, we show that much better quality can be obtained by leveraging a strong 3D bias without a 3D representation. To do so, we introduce LagerNVS, an encoder-decoder network that uses 3D-aware features as a latent scene encoding. The encoder is initialized from a 3D reconstruction network, paired with a lightweight decoder, and trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis results (including 31.1 PSNR on Re10k), with and without known cameras, renders in real-time, generalizes to in-the-wild data without known cameras, and can be paired with a diffusion decoder for generative completions.
PaperID: 1802,   Poster  https://arxiv.org/pdf/2603.24692    
Authors: Wuque Cai, Hongze Sun, Quan Tang, Shifeng Mao, Zhenxing Wang, Jiayi He, Duo Chen, Dezhong Yao, Daqing Guo
Title: Reconstructing Spiking Neural Networks Using a Single Neuron with Autapses
Abstract: Spiking neural networks (SNNs) are well known for their high energy efficiency and strong temporal processing capabilities. However, the multilayer architectures of SNNs often incur substantial costs of communication, computation, and storage capacity. Inspired by biological autapses, we develop a simple yet effective framework for reconstructing spiking neural networks using a single neuron with time-delayed autapses (TDA-SNN), integrating a dedicated prototype learning-based optimization method. This design allows a single spiking neuron to dynamically reconfigure its internal temporal states, effectively emulating large-scale architectures such as reservoirs, multilayer perceptrons, and convolutional layers while maintaining efficient learning. Extensive experiments on sequential, event-stream, and image datasets demonstrate that TDA-SNN achieves performance comparable to deep SNNs, while significantly reducing computational overhead and enhancing internal information storage capacity. These results highlight the potential of single-neuron models as compact and efficient computational units, offering new insights into the development of biologically inspired neuromorphic systems.
PaperID: 1803,   Poster  https://arxiv.org/pdf/2512.23650    
Authors: Zhe Li, Cheng Chi, Yangyang Wei, Boan Zhu, Tao Huang, Zhenguo Sun, Yibo Peng, Pengwei Wang, Zhongyuan Wang, Fangzhou Liu, Chang Xu, Shanghang Zhang
Title: Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control
Abstract: Humans intuitively move to sound, but current humanoid robots lack expressive improvisational capabilities, confined to predefined motions or sparse commands. Generating motion from audio and then retargeting it to robots relies on explicit motion reconstruction, leading to cascaded errors, high latency, and disjointed acoustic-actuation mapping. We propose RoboPerform, the first unified audio-to-locomotion framework that can directly generate music-driven dance and speech-driven co-speech gestures from audio. Guided by the core principle of "motion = content + style", the framework treats audio as implicit style signals and eliminates the need for explicit motion reconstruction. RoboPerform integrates a ResMoE teacher policy for adapting to diverse motion patterns and a diffusion-based student policy for audio style injection. This retargeting-free design ensures low latency and high fidelity. Experimental validation shows that RoboPerform achieves promising results in physical plausibility and audio alignment, successfully transforming robots into responsive freestyle performers capable of reacting to audio.
PaperID: 1804,   Poster  https://arxiv.org/pdf/2503.22174    
Authors: Jialun Pei, Zhangjun Zhou, Diandian Guo, Zhixi Li, Jing Qin, Bo Du, Pheng-Ann Heng
Title: Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos
Abstract: Intraoperative bleeding in laparoscopic surgery rapidly obscures the operative field, hindering the surgical process and increasing the risk of postoperative complications. Intelligent detection of bleeding areas can quantify the blood loss to assist decision-making, while locating bleeding points helps surgeons quickly identify the source of bleeding and achieve hemostasis in time to improve surgical success rates. To fill the benchmark gap, we first construct a real-world laparoscopic surgical bleeding detection dataset, named SurgBlood, comprising 5,330 frames from 95 surgical video clips with bleeding region and point annotations. Accordingly, we develop a dual-task synergistic online detector called BlooDet, enabling simultaneous detection of bleeding regions and points in laparoscopic surgery. The baseline embraces a dual-branch bidirectional guidance design based on Segment Anything Model 2. The mask branch detects bleeding regions through adaptive edge and point prompt embeddings, while the point branch leverages mask memory to induce bleeding point memory modeling and captures point motion direction via inter-frame optical flow. Through coupled bidirectional guidance, our framework explores spatial-temporal correlations while exploiting memory modeling to infer current bleeding status. Extensive experiments indicate that our method outperforms 13 counterparts in bleeding detection. Code and data are available.
PaperID: 1805,   Poster  https://arxiv.org/pdf/2603.01040    
Authors: Heewon Park, Mugon Joe, Miru Kim, Kyungjin Im, Minhae Kwon
Title: Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift
Abstract: Federated learning (FL) in post-deployment settings must adapt to non-stationary data streams across heterogeneous clients without access to ground-truth labels. A major challenge is learning rate selection under client-specific, time-varying distribution shifts, where fixed learning rates often lead to underfitting or divergence. We propose Fed-ADE (Federated Adaptation with Distribution Shift Estimation), an unsupervised federated adaptation framework that leverages lightweight estimators of distribution dynamics. Specifically, Fed-ADE employs uncertainty dynamics estimation to capture changes in predictive uncertainty and representation dynamics estimation to detect covariate-level feature drift, combining them into a per-client, per-timestep adaptive learning rate. We provide theoretical analyses showing that our dynamics estimation approximates the underlying distribution shift and yields dynamic regret and convergence guarantees. Experiments on image and text benchmarks under diverse distribution shifts (label, covariate, and concept) demonstrate consistent improvements over strong baselines. These results highlight that distribution shift-aware adaptation enables effective and robust federated post-adaptation under real-world non-stationarity.
PaperID: 1806,   Poster  https://arxiv.org/pdf/2512.15940    
Authors: Tin Sohn Sohn, Maximilian Dillitzer, Jason Corso, Eric Sax
Title: R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space
Abstract: Humans perceive and reason about their surroundings in four dimensions (three spatial axes and one temporal axis) by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning through an iterative retrieval-reasoning loop. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in structured 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
PaperID: 1807,   Poster  https://arxiv.org/pdf/2504.11101    
Authors: Yulong Zhang, Tianyi Liang, Erfei Cui, Guoqing Wang, Xu Guo, Chenhui Li, Gongshen Liu
Title: Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
Abstract: Optical Character Recognition (OCR) is fundamental to Vision-Language Models (VLMs) and high-quality data generation for LLM training. Yet, despite progress in average OCR accuracy, state-of-the-art VLMs still struggle with detecting sample-level errors and lack effective unsupervised quality control. We introduce Consensus Entropy (CE), a training-free, model-agnostic metric that estimates output reliability by measuring inter-model agreement entropy. The core insight is that correct predictions converge in output space, while errors diverge. Based on CE, we develop CE-OCR, a lightweight multi-model framework that verifies outputs by ensemble agreement, selects the best outputs, and further improves efficiency through adaptive routing. Experiments demonstrate that CE is robust for quality verification, improving F1 scores by 42.1% over VLM-as-Judge. CE-OCR achieves consistent OCR gains, outperforming self-consistency and single-model baselines at the same cost. Notably, CE requires no training or supervision, enabling plug-and-play integration.
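Illustration (not from the paper): a minimal sketch of the "correct predictions converge, errors diverge" idea. It treats the OCR strings returned by several models as samples and computes the Shannon entropy of the distribution over distinct outputs. Exact string matching after whitespace stripping is an assumption of this example; the abstract does not give the actual CE formula, which may well use a softer similarity measure.

```python
import math
from collections import Counter


def agreement_entropy(outputs):
    """Entropy (bits) of the empirical distribution over distinct outputs.

    0 means every model returned the same string; larger values mean the
    models disagree more, which this sketch reads as lower reliability.
    """
    counts = Counter(text.strip() for text in outputs)
    total = len(outputs)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical outputs from three OCR models for two images.
consistent = ["Invoice 2024", "Invoice 2024", "Invoice 2024"]
divergent = ["Invoice 2024", "Invo1ce 2024", "lnvoice 2O24"]
flag_for_review = agreement_entropy(divergent) > agreement_entropy(consistent)
```

A sample whose entropy exceeds a chosen threshold could then be routed to a stronger model or to manual review; the threshold itself would have to be tuned and is not specified here.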
PaperID: 1808,   Poster  https://arxiv.org/pdf/2512.06835    
Authors: Tingyu Li, Zheng Sun, Jingxuan Wei, Conghui He, Lijun Wu, Cheng Tan
Title: Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning
Abstract: Recent vision-language models (VLMs) achieve remarkable reasoning through reinforcement learning (RL), which provides a feasible solution for realizing continuous self-evolving large vision-language models (LVLMs) in the era of experience. However, RL for VLMs requires abundant high-quality multimodal data, which is especially challenging to obtain in specialized domains like chemistry, earth sciences, and multimodal mathematics. Existing strategies such as synthetic data and self-rewarding mechanisms suffer from limited distributions and alignment difficulties, ultimately causing reward hacking: models exploit high-reward patterns, collapsing policy entropy and destabilizing training. We propose DoGe (Decouple to Generalize), a dual-decoupling framework that guides models to first learn from context rather than problem solving by refocusing on the problem context scenarios overlooked by synthetic data methods. First, by decoupling the learning process into dual components (Thinker and Solver), we reasonably quantify the reward signals of this process and propose a two-stage RL post-training approach from freely exploring context to practically solving tasks. Second, to increase the diversity of training data, DoGe constructs an evolving curriculum learning pipeline: an expanded native domain knowledge corpus and an iteratively evolving seed problems pool. Experiments show that our method consistently outperforms the baseline across various benchmarks, providing a scalable pathway for realizing self-evolving LVLMs.
PaperID: 1809,   Poster  https://arxiv.org/pdf/2603.01305    
Authors: Zhen Qu, Xian Tao, Xiaoyi Bao, Dingrong Wang, ShiChen Qu, Zhengtao Zhang, Xingang Wang
Title: AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models
Abstract: Large multimodal models (LMMs) exhibit strong task-generalization capabilities, offering new opportunities for zero-shot visual anomaly segmentation (ZSAS). However, existing LMM-based segmentation still faces fundamental limitations: anomaly semantics are scarce and unstructured, and the weak alignment between textual prompts and visual features makes accurate anomaly localization difficult. To address these challenges, we present AG-VAS (Anchor-Guided Visual Anomaly Segmentation), a new framework that expands the LMM vocabulary with three learnable semantic anchors ([SEG], [NOR], and [ANO]) and introduces a unified anchor-guided segmentation paradigm. Specifically, [SEG] functions as an absolute semantic anchor that injects pixel-level structural priors into LMMs, while [NOR] and [ANO] serve as relative semantic anchors that encode the contrastive semantics between normality and abnormality across categories. To further enhance alignment, we introduce a Semantic-Pixel Alignment Module (SPAM) that bridges the gap between the LMM semantic space and high-resolution visual features, and design an Anchor-Guided Mask Decoder (AGMD) that performs anchor-consistent querying for precise anomaly localization. In addition, we construct Anomaly-Instruct20K, a large-scale instruction dataset that provides structured anomaly knowledge (including appearance, shape, and spatial attributes) to help LMMs effectively learn and integrate the proposed semantic anchors. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting. Code will be released upon acceptance.
PaperID: 1810,   Poster  https://arxiv.org/pdf/2601.03824    
Authors: Wei Long, Haifeng Wu, SHIYIN JIANG, Jinhua Zhang, Xinchun Ji, Shuhang Gu
Title: IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting
Abstract: Generalizable 3D Gaussian Splatting aims to directly predict Gaussian parameters using a feedforward network for scene reconstruction. Among these parameters, Gaussian means are particularly difficult to predict, so depth is usually estimated first and then unprojected to obtain the Gaussian sphere centers. Existing methods typically rely solely on a single warp to estimate depth probability, which hinders their ability to fully leverage cross-view geometric cues, resulting in unstable and coarse depth maps. To address this limitation, we propose IDESplat, which iteratively applies warp operations to boost depth probability estimation for accurate Gaussian mean prediction. First, to eliminate the inherent instability of a single warp, we introduce a Depth Probability Boosting Unit (DPBU) that integrates multi-level epipolar attention maps in a multiplicative manner. Next, we construct an iterative depth estimation process by stacking multiple DPBUs, progressively identifying potential depth candidates with high likelihood. As IDESplat iteratively updates the depth candidates and boosts probability estimation, the depth map is refined, resulting in accurate Gaussian means. Finally, for the other Gaussian parameters, we design a Gaussian Focused Module (GFM) to determine the most relevant Gaussian tokens for feature interaction. We conduct experiments on RealEstate10K and ACID. IDESplat achieves outstanding reconstruction quality and state-of-the-art performance with real-time efficiency. On RE10K, it outperforms DepthSplat by 0.33 dB in PSNR, using only 10.7% of the parameters and 70% of the memory. Additionally, IDESplat improves PSNR by 2.95 dB over DepthSplat on the DTU dataset in cross-dataset experiments, demonstrating its strong generalization ability.
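Illustration (not from the paper): a minimal sketch of why multiplicative fusion sharpens a depth estimate. The abstract describes combining multi-level epipolar attention maps multiplicatively; this example substitutes a simpler stand-in, multiplying per-level probability vectors over the same depth candidates for one pixel and renormalizing. The function name and the toy numbers are assumptions.

```python
def fuse_depth_probabilities(levels):
    """Multiply per-level probabilities over shared depth candidates.

    `levels` is a list of equal-length probability vectors, one per level.
    A candidate keeps a high fused score only if every level supports it.
    """
    fused = [1.0] * len(levels[0])
    for probs in levels:
        fused = [f * p for f, p in zip(fused, probs)]
    total = sum(fused)
    if total == 0:
        raise ValueError("levels assign zero joint probability to every candidate")
    return [f / total for f in fused]


# Two levels over three depth candidates: both favor the middle candidate,
# so the fused distribution is more peaked there than either input.
coarse = [0.3, 0.4, 0.3]
fine = [0.2, 0.6, 0.2]
fused = fuse_depth_probabilities([coarse, fine])
best_candidate = max(range(len(fused)), key=fused.__getitem__)
```

Stacking several such fusion steps, each re-centered on the current best candidates, is one way to read the iterative refinement the abstract describes, though the paper's DPBU design is not reproduced here.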
PaperID: 1811,   Poster  https://arxiv.org/pdf/2510.22319    
Authors: Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, Xiaodan Liang
Title: GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
Abstract: Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution: its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage: while the proxy reward continues to increase, essential metrics such as image quality and text–prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality. We provide detailed demonstrations of the over-optimization process and corresponding visualizations in Supplementary Materials.
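Illustration (not from the paper): a minimal sketch of why a left-shifted importance ratio weakens PPO-style clipping, and how re-centering helps. The clipped surrogate below is the standard PPO form. The re-centering step (subtracting the per-timestep mean log-ratio) is an assumed, simplified stand-in for the ratio normalization the abstract mentions; the variance equalization and gradient reweighting are omitted, and the numbers are made up.

```python
import math


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped objective for a single sample."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def center_log_ratios(log_ratios):
    """Assumed normalization: shift log-ratios at one timestep to mean zero,
    so the corresponding ratios are centered around 1."""
    mean = sum(log_ratios) / len(log_ratios)
    return [x - mean for x in log_ratios]


# Hypothetical log-ratios at one denoising timestep, shifted to the left.
log_ratios = [-0.35, -0.25, -0.15, 0.05]
eps = 0.2

raw = [math.exp(x) for x in log_ratios]
centered = [math.exp(x) for x in center_log_ratios(log_ratios)]

# With a positive advantage, clipping only limits the update when
# ratio > 1 + eps. Count how many samples reach that bound.
raw_clipped = sum(r > 1.0 + eps for r in raw)
centered_clipped = sum(r > 1.0 + eps for r in centered)
```

In this toy batch no raw ratio reaches the upper clip bound, while one re-centered ratio does, which matches the abstract's observation that a left-shifted distribution keeps positive-advantage samples out of the clipped region.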
PaperID: 1812,   Poster  https://arxiv.org/pdf/2602.21333    
Authors: Yifan Wang, Francesco Pittaluga, Zaid Tasneem, Chenyu You, Manmohan Chandraker, Ziyu Jiang
Title: HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles
Abstract: Controllable driving scene generation is critical for realistic and scalable autonomous driving simulation, yet existing approaches struggle to jointly achieve photorealism and precise control. We introduce HorizonForge, a unified framework that reconstructs scenes as editable Gaussian Splats and Meshes, enabling fine-grained 3D manipulation and language-driven vehicle insertion. Edits are rendered through a noise-aware video diffusion process that enforces spatial and temporal consistency, producing diverse scene variations in a single feed-forward pass without per-trajectory optimization. To standardize evaluation, we further propose HorizonSuite, a comprehensive benchmark spanning ego- and agent-level editing tasks such as trajectory modifications and object manipulation. Extensive experiments show that Gaussian Splatting delivers substantially higher fidelity than alternative 3D representations, and that temporal priors from video diffusion are essential for coherent synthesis. Combining these findings, HorizonSuite establishes a simple yet powerful paradigm for photorealistic, controllable driving simulation, achieving an 83.4% user-preference gain and a 25.19% FID improvement over the second-best state-of-the-art method.
PaperID: 1813,   Poster  https://arxiv.org/pdf/2505.20938    
Authors: Chongjie Si, Yidan Cui, Fuchao Yang, Wei Shen
Title: Revisiting Sparsity Constraint Under High-Rank Property in Partial Multi-Label Learning
Abstract: Partial Multi-Label Learning (PML) extends the multi-label learning paradigm to scenarios where each sample is associated with a candidate label set containing both ground-truth labels and noisy labels. Existing PML methods commonly rely on two assumptions: sparsity of the noise label matrix and low-rankness of the ground-truth label matrix. However, these assumptions are inherently conflicting and impractical for real-world scenarios, where the true label matrix is typically full-rank or close to full-rank. To address these limitations, we demonstrate that the sparsity constraint contributes to the high-rank property of the predicted label matrix. Based on this, we propose a novel method Schirn, which introduces a sparsity constraint on the noise label matrix while enforcing a high-rank property on the predicted label matrix. Extensive experiments demonstrate the superior performance of Schirn compared to state-of-the-art methods, validating its effectiveness in tackling real-world PML challenges.
PaperID: 1814,   Poster  https://arxiv.org/pdf/2603.22882    
Authors: Chunxiao Li, Lijun Li, Jing Shao
Title: TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration
Abstract: The rapid advancement of Vision-Language Models (VLMs) has brought their safety vulnerabilities into sharp focus. However, existing red teaming methods are fundamentally constrained by an inherent linear exploration paradigm, confining them to optimizing within a predefined strategy set and preventing the discovery of novel, diverse exploits. To transcend this limitation, we introduce TreeTeaming, an automated red teaming framework that reframes strategy exploration from static testing to a dynamic, evolutionary discovery process. At its core lies a strategic Orchestrator, powered by a Large Language Model (LLM), which autonomously decides whether to evolve promising attack paths or explore diverse strategic branches, thereby dynamically constructing and expanding a strategy tree. A multimodal actuator is then tasked with executing these complex strategies. In the experiments across 12 prominent VLMs, TreeTeaming achieves state-of-the-art attack success rates on 11 models, outperforming existing methods and reaching up to 87.60% on GPT-4o. The framework also demonstrates superior strategic diversity over the union of previously public jailbreak strategies. Furthermore, the generated attacks exhibit an average toxicity reduction of 23.09%, showcasing their stealth and subtlety. Our work introduces a new paradigm for automated vulnerability discovery, underscoring the necessity of proactive exploration beyond static heuristics to secure frontier AI models. Warning: This paper contains examples of harmful texts and images, and reader discretion is recommended.
PaperID: 1815,   Poster  https://arxiv.org/pdf/2511.11434    
Authors: Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song, Liyu Jia, Saining Zhang, Siliang Tang, Juncheng Li, Fengda Zhang, Weijia Wu, Hanwang Zhang, Tat-seng Chua
Title: WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
Abstract: Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM judger evaluation framework based on both the reference image and the combination of the original image with editing instructions that assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k enables vision comprehension, image editing, and comprehension-generation collaboration capabilities. Furthermore, it helps UMMs develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a view and foundation for studying in-context interleaved comprehension and generation for the multi-modal community.
PaperID: 1816,   Poster  https://arxiv.org/pdf/2602.21849    
Authors: Yuheng Li, Weitong Chen, chengcheng zhu, Jiale Zhang, Chunpeng Ge, Di Wu, Guodong Long
Title: Meta-FC: Meta-Learning with Feature Consistency for Robust and Generalizable Watermarking
Abstract: Deep learning-based watermarking has made remarkable progress in recent years. To achieve robustness against various distortions, current methods commonly adopt a training strategy where a single random distortion (SRD) is chosen as the noise layer in each training batch. However, the SRD strategy treats distortions independently within each batch, neglecting the inherent relationships among different types of distortions and causing optimization conflicts across batches. As a result, the robustness and generalizability of the watermarking model are limited. To address this issue, we propose a novel training strategy that enhances robustness and generalization via meta-learning with feature consistency (Meta-FC). Specifically, we randomly sample multiple distortions from the noise pool to construct a meta-training task, while holding out one distortion as a simulated "unknown" distortion for the meta-testing phase. Through meta-learning, the model is encouraged to identify and utilize neurons that exhibit stable activations across different types of distortions, mitigating the optimization conflicts caused by the random sampling of diverse distortions in each batch. To further promote the transformation of stable activations into distortion-invariant representations, we introduce a feature consistency loss that constrains the decoded features of the same image subjected to different distortions to remain consistent. Extensive experiments demonstrate that, compared to the SRD training strategy, Meta-FC improves the robustness and generalization of various watermarking models by an average of 1.59%, 4.71%, and 2.38% under high-intensity, combined, and unknown distortions, respectively.
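Illustration (not from the paper): a minimal sketch of the task construction described above, in which several distortions are drawn from the noise pool for meta-training and one more is held out as the simulated unknown distortion for meta-testing. The distortion names, the pool contents, and the function name are hypothetical; the meta-learning update and the feature consistency loss are not shown.

```python
import random


def sample_meta_task(noise_pool, num_train, rng):
    """Draw `num_train` meta-training distortions plus one held-out distortion.

    Sampling without replacement guarantees the held-out distortion never
    appears in the meta-training set of the same task.
    """
    if num_train + 1 > len(noise_pool):
        raise ValueError("noise pool too small for the requested task size")
    chosen = rng.sample(noise_pool, num_train + 1)
    return chosen[:-1], chosen[-1]


# Hypothetical noise pool; a real pool would hold distortion layers, not names.
pool = ["jpeg", "gaussian_noise", "gaussian_blur", "crop", "resize", "dropout"]
rng = random.Random(0)  # seeded only to make the example repeatable
meta_train, meta_test = sample_meta_task(pool, num_train=3, rng=rng)
```

Each training batch would call sample_meta_task again, so the held-out distortion changes from batch to batch.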
PaperID: 1817,   Poster  https://arxiv.org/pdf/2512.01495    
Authors: Joanne Lin, Ruirui Lin, Yini Li, David Bull, Nantheera Anantrasirichai
Title: ELVIS: Enhance Low-light for Video Instance Segmentation in the Dark
Abstract: Video instance segmentation (VIS) for low-light content remains highly challenging for both humans and machines, due to adverse imaging conditions including noise, blur and low contrast. The lack of large-scale annotated datasets and the limitations of current synthetic pipelines, particularly in modeling temporal degradations, further hinder progress. Moreover, existing VIS methods are not robust to the degradations found in low-light videos and, as a result, perform poorly even when finetuned on low-light data. In this paper, we introduce ELVIS (Enhance Low-light for Video Instance Segmentation), a novel framework that enables effective domain adaptation of state-of-the-art VIS models to low-light scenarios. ELVIS comprises an unsupervised synthetic low-light video pipeline that models both spatial and temporal degradations, a calibration-free degradation profile synthesis network (VDP-Net) and an enhancement decoder head that disentangles degradations from content features. ELVIS improves performance by up to +3.7 AP on the synthetic low-light YouTube-VIS 2019 dataset. Code will be released upon acceptance.
PaperID: 1818,   Poster  https://arxiv.org/pdf/2603.00141    
Authors: Xiangyan Qu, Zhenlong Yuan, Jing Tang, Rui Chen, Datao Tang, Meng Yu, Lei Sun, Yancheng Bai, Xiangxiang Chu, Gaopeng Gou, Gang Xiong, Yujun Cai
Title: From Scale to Speed: Adaptive Test-Time Scaling for Image Editing
Abstract: Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image and instruction. This mismatch causes three challenges when applying Image-CoT to editing: inefficient resource allocation with fixed sampling budgets, unreliable early-stage verification using general MLLM scores, and redundant edited results from large-scale sampling. To address this, we propose ADaptive Edit-CoT (ADE-CoT), an on-demand test-time scaling framework to enhance editing efficiency and performance. It incorporates three key strategies: (1) a difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty; (2) edit-specific verification in early pruning that uses region localization and caption consistency to select promising candidates; and (3) depth-first opportunistic stopping, guided by an instance-specific verifier, that terminates when intent-aligned results are found. Extensive experiments on three SOTA editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show that ADE-CoT achieves superior performance-efficiency trade-offs. With comparable sampling budgets, ADE-CoT obtains better performance with more than 2× speedup over Best-of-N.
PaperID: 1819,   Poster  https://arxiv.org/pdf/2511.19202    
Authors: Brent Zoomers, Florian Hahlbohm, Joni Vanherck, Lode Jorissen, Marcus Magnor, Nick Michiels
Title: NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting
Abstract: 3D Gaussian Splatting can exploit frustum culling and level-of-detail strategies to accelerate rendering of scenes containing a large number of primitives. However, the semi-transparent nature of Gaussians prevents the application of another highly effective technique: occlusion culling. We address this limitation by proposing a novel method to learn the viewpoint-dependent visibility function of all Gaussians in a trained model using a small, shared MLP across instances of an asset in a scene. By querying it for Gaussians within the viewing frustum prior to rasterization, our method can discard occluded primitives during rendering. Leveraging tensor cores for efficient computation, we integrate these neural queries directly into a novel instanced software rasterizer. Our approach outperforms the current state of the art for composed scenes in terms of VRAM usage and image quality, utilizing a combination of our instanced rasterizer and occlusion culling MLP, and exhibits complementary properties to existing LoD techniques.
PaperID: 1820,   Poster  https://arxiv.org/pdf/2604.01884    
Authors: Xianben Yang, Tao Wang, Yuxuan Li, Yi Jin, Haibin Ling
Title: GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting
Abstract: 3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering. Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points. Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS^2), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing low-opacity Gaussian points. Furthermore, we propose a graph-based feature encoding module to adjust the spatial distribution via feature-guided point shifting. Extensive experiments validate that GS^2 achieves a compact Gaussian representation while delivering superior rendering quality. Compared with 3DGS, it achieves higher PSNR with only about 12.5% Gaussian points. Furthermore, it outperforms all compared baselines in both rendering quality and memory efficiency.
PaperID: 1821,   Poster  https://arxiv.org/pdf/2511.11944    
Authors: Ling Wang, Yunfan Lu, Wenzong Ma, Huizai Yao, Pengteng Li, Hui Xiong
Title: From Events to Clarity: The Event-Guided Diffusion Framework for Dehazing
Abstract: Clear imaging under hazy conditions is a critical task. Prior-based and neural methods have improved results. However, they operate on RGB frames, which suffer from limited dynamic range. Therefore, dehazing remains ill-posed and can erase structure and illumination details. To address this, we use event cameras for dehazing for the first time. Event cameras offer a much higher dynamic range (120 dB vs. 60 dB) and microsecond latency, so they suit hazy scenes. In practice, transferring HDR cues from events to frames is hard because real paired data are scarce. To tackle this, we propose an event-guided diffusion model that utilizes the strong generative priors of diffusion models to reconstruct clear images from hazy inputs by effectively transferring HDR information from events. Specifically, we design an event-guided module that maps sparse HDR event features, e.g., edges and corners, into the diffusion latent space. This clear conditioning provides precise structural guidance during generation, improves visual realism, and reduces semantic drift. For real-world evaluation, we collect a drone dataset in heavy haze (AQI = 341) with synchronized RGB and event sensors. Experiments on two benchmarks and our dataset achieve state-of-the-art results.
PaperID: 1822,   Poster  https://arxiv.org/pdf/2604.10095    
Authors: Yu Jiang, Hanwen Jiang, Ahmed Abdelkader, Wen-Sheng Chu, Brandon Y. Feng, Zhangyang Wang, Qixing Huang
Title: Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
Abstract: With the emergence of 3D foundation models, such as DUSt3R, VGGT, and their variants, there is a growing interest in fine-tuning them for various downstream tasks, where using LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in geometry, texture, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA sub-spaces associated with each type of variation? 2) Are these sub-spaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA sub-space associated with each type of variation. We show that these sub-spaces are approximately disentangled. Integrating them leads to a reduced LoRA sub-space that enables efficient LoRA fine-tuning with improved prediction accuracy for downstream tasks. In particular, we show that such a reduced LoRA sub-space, despite being derived entirely from synthetic data, generalizes to real datasets. An ablation study validates the effectiveness of the choices in our approach.
PaperID: 1823,   Poster  https://arxiv.org/pdf/2602.24222    
Authors: Albert Dominguez Mantes, Gioele Manno, Martin Weigert
Title: MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy
Abstract: Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader tissue organization. Many analysis tasks require combining these scales, yet most vision models operate at a single resolution or derive multiscale features from one view, limiting their ability to exploit the inherently multi-resolution nature of microscopy data. We introduce MuViT, a transformer architecture built to fuse true multi-resolution observations from the same underlying image. MuViT embeds all patches into a shared world-coordinate system and extends rotary positional embeddings to these coordinates, enabling attention to integrate wide-field context with high-resolution detail within a single encoder. Across synthetic benchmarks, kidney histopathology, and high-resolution mouse-brain microscopy, MuViT delivers consistent improvements over strong ViT and CNN baselines. Multi-resolution MAE pretraining further produces scale-consistent representations that enhance downstream tasks. These results demonstrate that explicit world-coordinate modeling provides a simple yet powerful mechanism for leveraging multi-resolution information in large-scale microscopy analysis.
PaperID: 1824,   Poster  https://arxiv.org/pdf/2511.15190    
Authors: Yuxuan Gu, Weimin Bai, Yifei Wang, Weijian Luo, He Sun
Title: Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning
Abstract: Masked autoregressive diffusion models (MAR) benefit from the expressive modeling ability of diffusion models and the flexibility of masked auto-regressive ordering. However, vanilla MAR suffers from slow inference due to its hierarchical inference mechanism: an outer AR unmasking loop and an inner diffusion denoising chain. Such a decoupled structure not only harms the generation efficiency but also hinders the practical use of MAR for reinforcement learning (RL), an increasingly critical paradigm for generative model post-training. To address this fundamental issue, we introduce MARVAL (Masked Auto-regressive Variational Acceleration), a distillation-based framework that compresses the diffusion chain into a single AR generation step while preserving the flexible auto-regressive unmasking order. Such a distillation with MARVAL not only yields substantial inference acceleration but, crucially, makes RL post-training with verifiable rewards practical, resulting in scalable yet human-preferred fast generative models. Our contributions are twofold: (1) a novel score-based variational objective for distilling masked auto-regressive diffusion models into a single generation step without sacrificing sample quality; and (2) an efficient RL framework for masked auto-regressive models via MARVAL-RL. On ImageNet 256×256, MARVAL-Huge achieves an FID of 2.00 with more than 30 times speedup compared with MAR-diffusion, and MARVAL-RL yields consistent improvements in CLIP and image-reward scores on ImageNet datasets with entity names. In conclusion, MARVAL demonstrates the first practical path to distillation and RL of masked auto-regressive diffusion models, enabling fast sampling and better preference alignment.
PaperID: 1825,   Poster  https://arxiv.org/pdf/2604.16562    
Authors: Yanming Peng, Shijing Wang, Yaping Huang, Yi Tian
Title: See Through the Noise: Improving Domain Generalization in Gaze Estimation
Abstract: Generalizable gaze estimation methods have garnered increasing attention due to their critical importance in real-world applications and have achieved significant progress. However, they often overlook the effect of label noise, arising from the inherent difficulty of acquiring precise gaze annotations, on model generalization performance. In this paper, we are the first to comprehensively investigate the negative effects of label noise on the generalization of gaze estimation. Further, we propose a novel solution, called the See-Through-Noise (SeeTN) framework, which improves generalization from the perspective of mitigating label noise. Specifically, we propose to construct a semantic embedding space via a prototype-based transformation to preserve a consistent topological structure between gaze features and continuous labels, mitigating the effects of label noise. We then measure feature-label affinity consistency to distinguish noisy from clean samples, and introduce a novel affinity regularization in the semantic manifold to transfer gaze-related information from clean to noisy samples. Our proposed SeeTN promotes semantic structure alignment and enforces domain-invariant gaze relationships, thereby enhancing robustness against both label noise and domain shifts. Extensive experiments demonstrate that SeeTN effectively mitigates the adverse impact of source-domain noise, leading to superior cross-domain generalization without compromising source-domain accuracy, and highlighting the importance of explicitly handling noise in generalized gaze estimation.
PaperID: 1826,   Poster  https://arxiv.org/pdf/2510.20776    
Authors: Binbin Huang, Haobin Duan, Yiqun Zhao, Zibo Zhao, Yi Ma, Shenghua Gao
Title: Cupid: Generative 3D Reconstruction via Joint Object and Pose Modeling
Abstract: We introduce Cupid, a generative 3D reconstruction framework that jointly models the full distribution over both canonical objects and camera poses. Our two-stage flow-based model first generates a coarse 3D structure and 2D-3D correspondences to estimate the camera pose robustly. Conditioned on this pose, a refinement stage injects pixel-aligned image features directly into the generative process, marrying the rich prior of a generative model with the geometric fidelity of reconstruction. This strategy achieves exceptional faithfulness, outperforming state-of-the-art reconstruction methods by over 3 dB PSNR and 10% in Chamfer Distance. As a unified generative model that decouples the object and camera pose, Cupid naturally extends to multi-view and scene-level reconstruction tasks without requiring post-hoc optimization or fine-tuning.
PaperID: 1827,   Poster  https://arxiv.org/pdf/2603.07506    
Authors: Jianlu Shen, Fu Feng, Jiaze Xu, Yucheng Xie, Jiaqi Lv, Xin Geng
Title: A Unified Framework for Knowledge Transfer in Bidirectional Model Scaling
Abstract: Transferring pretrained knowledge from a source model to a target model of a different architectural size is a key challenge for flexible and efficient model scaling. However, current parameter-space methods treat Small-to-Large (S2L) and Large-to-Small (L2S) scaling as separate, incompatible problems, focusing on parameter synthesis and selection, respectively. This fragmented perspective has resulted in specialized tools, hindering a unified, bidirectional framework. In this paper, we propose BoT (Bidirectional knowledge Transfer), the first size-agnostic framework to unify S2L and L2S scaling. Our core insight is to treat model weights as continuous signals, where models of different sizes represent distinct discretizations of the transferable knowledge. This multi-resolution perspective directly casts S2L and L2S scaling as the signal processing operations of upsampling and downsampling, naturally leading to the adoption of the Discrete Wavelet Transform (DWT) and its Inverse (IDWT). BoT leverages the recursive nature of wavelets, using the decomposition level as a dynamic scaling factor to bridge disparate model sizes in a parameter-free and computationally efficient manner. Extensive experiments on DeiT, BERT, and GPT demonstrate significant pre-training FLOPs savings (up to 67.1% for S2L, 52.8% for L2S) and state-of-the-art performance on benchmarks like GLUE and SQuAD.
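Illustrative note on PaperID 1827: the abstract casts L2S and S2L scaling as downsampling and upsampling via the DWT and IDWT. The sketch below shows that idea on a 1-D list of weights with a one-level Haar wavelet. The Haar choice, the function names (`shrink`, `grow`), and zero-filling the detail band when growing are assumptions made for illustration; the abstract does not specify BoT's wavelet or how it handles detail coefficients.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_dwt(signal):
    """One-level Haar DWT of an even-length list: returns (approx, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds each pair of samples from (approx, detail)."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out

def shrink(weights):
    """Hypothetical large-to-small step: keep only the approximation band."""
    approx, _ = haar_dwt(weights)
    return approx

def grow(weights):
    """Hypothetical small-to-large step: use weights as the approximation
    band and assume an all-zero detail band."""
    return haar_idwt(weights, [0.0] * len(weights))

large = [1.0, 3.0, 2.0, 2.0, -1.0, 1.0, 4.0, 0.0]
small = shrink(large)      # half the length of `large`
regrown = grow(small)      # same length as `large`, with each pair averaged
```

Applying `shrink` or `grow` repeatedly changes the size by a factor of two per level, which is one way to read the abstract's remark that the decomposition level acts as a scaling factor.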
PaperID: 1828,   Poster  https://arxiv.org/pdf/2603.17531    
Authors: Pengzhen Chen, Yanwei Liu, Xiaoyan Gu, Xiaojun Chen, Wu Liu, Weiping Wang
Title: Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing
Abstract: Recent advancements in diffusion-based image editing pose a significant threat to the authenticity of digital visual content. Traditional embedding-based watermarking methods often introduce perceptible perturbations to maintain robustness, inevitably compromising visual fidelity. Meanwhile, existing zero-watermarking approaches, typically relying on global image features, struggle to withstand sophisticated manipulations. In this work, we reveal a critical insight: while individual image patches undergo substantial alterations during AI-based editing, the relational distance between patch pairs remains largely invariant. Leveraging this property, we propose Relational Zero-Watermarking (Rel-Zero), a novel framework that requires no modification to the original image but derives a unique zero-watermark from these editing-invariant patch relations. By grounding the watermark in intrinsic structural consistency rather than absolute appearance, Rel-Zero provides a non-invasive yet resilient mechanism for content authentication. Extensive experiments demonstrate that Rel-Zero achieves substantially improved robustness across diverse editing models and manipulations compared to prior zero-watermarking approaches.
PaperID: 1829,   Poster  https://arxiv.org/pdf/2511.23386    
Authors: SiNan Du, JiaHao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, Chun Yuan
Title: VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
Abstract: Unifying representations for multimodal understanding, generation and reconstruction in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this in a dual-encoder paradigm, e.g., utilizing separate encoders for understanding and generation, or balancing semantic representations and low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers the exploration of a unified representation that produces Continuous semantic features for image understanding and Discrete tokens for visual generation within a unified tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: the first stage freezes the encoder and learns a high-dimensional semantic VQ codebook with a pixel reconstruction objective; the second jointly optimizes the encoder with self-distillation constraints. This design incurs negligible loss of semantic information, maintaining the ability of multimodal understanding, while yielding discrete tokens that are compatible with generation and fine-grained reconstruction. Besides, we identify an intriguing property in quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the previous common practice of low-dimensional codebooks in image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction, with a promising scaling property in the autoregressive paradigm owing to its discrete nature.
PaperID: 1830,   Poster  https://arxiv.org/pdf/2512.00300    
Authors: Rui Qian, Haozhi Cao, Tianchen Deng, Tianxin Hu, Weixiang Guo, Shenghai Yuan, Lihua Xie
Title: TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion
Abstract: Embodied 3D Semantic Scene Completion (SSC) infers dense geometry and semantics from continuous egocentric observations. Most existing Gaussian-based methods rely on random initialization of many primitives within predefined spatial bounds, resulting in redundancy and poor scalability to unbounded scenes. A recent depth-guided approach alleviates this issue but remains local, suffering from latency and memory overhead as scale increases. To overcome these challenges, we propose TGSFormer, a scalable Temporal Gaussian Splatting framework for embodied SSC. It maintains a persistent Gaussian memory for temporal prediction, without relying on image coherence or frame caches. For temporal fusion, a Dual Temporal Encoder jointly processes current and historical Gaussian features through confidence-aware cross-attention. Subsequently, a Confidence-aware Voxel Fusion module merges overlapping primitives into voxel-aligned representations, regulating density and maintaining compactness. Extensive experiments demonstrate that TGSFormer achieves state-of-the-art results on both local and embodied SSC benchmarks, offering superior accuracy and scalability with significantly fewer primitives while maintaining consistent long-term scene integrity. The code will be released upon acceptance.
PaperID: 1831,   Poster  https://arxiv.org/pdf/2506.07847    
Authors: Hengzhi Chen, Liqian Feng, Wenhua Wu, Xiaogang Zhu, Qiuxia Wu, Lianlei Shan, Kun Hu
Title: F2Net: A Frequency-Fused Network for Ultra-High Resolution Remote Sensing Segmentation
Abstract: Semantic segmentation of ultra-high-resolution (UHR) remote sensing imagery is critical for applications like environmental monitoring and urban planning but faces computational and optimization challenges. Conventional methods either lose fine details through downsampling or fragment global context via patch processing. While multi-branch networks address this trade-off, they suffer from computational inefficiency and conflicting gradient dynamics during training. We propose F2Net, a frequency-aware framework that decomposes UHR images into high- and low-frequency components for specialized processing. The high-frequency branch preserves full-resolution structural details, while the low-frequency branch processes downsampled inputs through dual sub-branches capturing short- and long-range dependencies. A Hybrid-Frequency Fusion module integrates these observations, guided by two novel objectives: Cross-Frequency Alignment Loss ensures semantic consistency between frequency components, and Cross-Frequency Balance Loss regulates gradient magnitudes across branches to stabilize training. Evaluated on DeepGlobe and Inria Aerial benchmarks, F2Net achieves state-of-the-art performance with mIoU of 80.22 and 83.39, respectively.
PaperID: 1832,   Poster  https://arxiv.org/pdf/2510.04673    
Authors: Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, Tomas Pfister
Title: Watch and Learn: Learning to Use Computers from Online Videos
Abstract: Computer-using agents (CUAs) must plan task workflows across diverse and evolving applications, yet progress is limited by the lack of large-scale, high-quality training data. Existing datasets are narrow, static, and costly to annotate, while synthetic data often yields oversimplified or misaligned behaviors. We present Watch & Learn (W&L), a framework that converts readily available Internet videos of human computer use into executable UI trajectories at scale. Instead of directly generating actions or relying on handcrafted heuristics, we cast trajectory annotation as an inverse dynamics problem that predicts user actions from consecutive screen states, which simplifies learning and generalizes across domains. Through a task-aware retrieval and labeling pipeline, W&L yields over 53K high-quality trajectories that enhance CUAs both as in-context exemplars and as supervised training data. On OSWorld, it consistently improves general-purpose and specialized CUAs, while on WindowsAgentArena it achieves state-of-the-art performance among 7B-scale models under the 15-step limit. These results show that web-scale human demonstration videos can serve as a practical and scalable foundation for advancing real-world CUAs.
PaperID: 1833,   Poster  https://arxiv.org/pdf/2603.17520    
Authors: Jianjian Yin, Tao Chen, Yi Chen, Gensheng Pei, Xiangbo Shu, Yazhou Yao, Fumin Shen
Title: PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation
Abstract: Recent advances in vision-language models (VLMs) have garnered substantial attention in open-vocabulary semantic and part segmentation (OSPS). However, existing methods extract image-text alignment cues from cost volumes through a serial structure of spatial and class aggregations, leading to knowledge interference between class-level semantics and spatial context. Therefore, this paper proposes a simple yet effective parallel cost aggregation (PCA-Seg) paradigm to alleviate the above challenge, enabling the model to capture richer vision-language alignment information from cost volumes. Specifically, we design an expert-driven perceptual learning (EPL) module that efficiently integrates semantic and contextual streams. It incorporates a multi-expert parser to extract complementary features from multiple perspectives. In addition, a coefficient mapper is designed to adaptively learn pixel-specific weights for each feature, enabling the integration of complementary knowledge into a unified and robust feature embedding. Furthermore, we propose a feature orthogonalization decoupling (FOD) strategy to mitigate redundancy between the semantic and contextual streams, which allows the EPL module to learn diverse knowledge from orthogonalized features. Extensive experiments on eight benchmarks show that each parallel block in PCA-Seg adds merely 0.35M parameters while achieving state-of-the-art OSPS performance. Our source code is available in the supplementary material.
PaperID: 1834,   Poster  https://arxiv.org/pdf/2512.22120    
Authors: Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang
Title: See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
Abstract: Large vision–language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
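Illustrative note on PaperID 1834: the abstract describes two KL terms, one pulling the prediction on an evidence-preserving view toward the prediction on the original image, and one pushing the prediction on an evidence-ablated view away from it. The sketch below shows one way such a pair could be combined over answer distributions. The KL direction, the hinge with a `margin`, the unweighted sum, and the name `shaping_loss` are all assumptions for illustration; the abstract does not give the actual objective.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def shaping_loss(logits_orig, logits_preserved, logits_ablated, margin=1.0):
    """Hypothetical combination of the two constraints:
    - consistency: small KL between original and evidence-preserving views
    - separation: KL between original and evidence-ablated views should
      exceed `margin`, enforced here with a hinge (an assumed design)."""
    p = softmax(logits_orig)
    consistency = kl_divergence(p, softmax(logits_preserved))
    separation = kl_divergence(p, softmax(logits_ablated))
    return consistency + max(0.0, margin - separation)

# The preserved view agrees with the original; the ablated view flips the answer.
good = shaping_loss([4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0])
# The ablated view still yields the original answer (a text-only shortcut).
shortcut = shaping_loss([4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [4.0, 0.0, 0.0])
```

In this toy setup the shortcut case is penalized by the full margin, while the case where masking the evidence changes the prediction incurs no penalty.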
PaperID: 1835,   Poster  https://arxiv.org/pdf/2511.23075    
Authors: Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, Zizhuang Wei
Title: SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
Abstract: Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation. Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D, and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins and achieving state-of-the-art performance on SQA3D. These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence. We will release code and model checkpoints to support future research.
PaperID: 1836,   Poster  https://arxiv.org/pdf/2511.16937    
Authors: Hong Gao, Jingyu Wu, Xiangkai Xu, Kangni Xie, Yunchen Zhang, Bin Zhong, Xurui Gao, Min-Ling Zhang
Title: OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios
Abstract: Spatio-Temporal Video Grounding (STVG) aims to localize target objects in videos based on natural language descriptions. While Multimodal Large Language Models have shown promise, a significant gap remains between current models and real-world demands involving diverse objects and complex queries. We attribute this to limited benchmark scope, causing models to exhibit category bias, oversimplified reasoning, and poor linguistic robustness. To address these limitations, we introduce OmniGround, a comprehensive benchmark with 3,475 videos spanning 81 categories and complex real-world queries. We propose the Forward-Backward-Refinement (FBR) annotation pipeline for high-quality labels and DeepSTG, a systematic evaluation framework quantifying dataset quality beyond superficial statistics. Evaluations reveal average performance drops of 10.4% on complex real-world scenes, particularly with small/occluded objects and intricate spatial relations. Motivated by these findings, we propose PG-TAF, a training-free two-stage framework decomposing STVG into high-level temporal grounding and fine-grained spatio-temporal propagation. Experiments demonstrate PG-TAF achieves 25.6% and 35.6% improvements in m_tIoU and m_vIoU on OmniGround with consistent gains across four benchmarks.
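Illustrative note on PaperID 1836: the abstract reports gains in m_tIoU without defining it. The sketch below assumes the common reading, the mean over samples of the intersection-over-union between the predicted and ground-truth time intervals. The function names and the `(start, end)` interval format are assumptions; the benchmark's exact definition may differ.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end), in seconds or frames."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_temporal_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equally long")
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

# Two videos: one partial overlap, one exact match.
score = mean_temporal_iou([(2.0, 6.0), (0.0, 4.0)], [(4.0, 8.0), (0.0, 4.0)])
```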
PaperID: 1837,   Poster  https://arxiv.org/pdf/2603.01010    
Authors: Xuqin Wang, Tao Wu, Yanfeng Zhang, Lu Liu, Mingwei Sun, Yongliang Wang, Niclas Zeller, Daniel Cremers
Title: GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis
Abstract: Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data transitions, which obscure deterministic structures and yield inconsistent view predictions. We propose a Data-to-Data Flow Matching framework that learns deterministic transformations directly between paired views, enhancing view-consistent synthesis through explicit data coupling. To further enhance geometric coherence, we introduce Probability Density Geodesic Flow Matching (PDG-FM), which constrains flow trajectories using geodesic interpolants derived from probability density metrics of pretrained diffusion models. Such alignment with high-density regions of the data manifold promotes more realistic interpolants between samples. Empirically, our method surpasses diffusion-based NVS baselines, demonstrating improved structural coherence and smoother transitions across views. These results highlight the advantages of incorporating data-dependent geometric regularization into deterministic flow matching for consistent novel view generation.
PaperID: 1838,   Poster  https://arxiv.org/pdf/2511.21331    
Authors: Stefanos Koutoupis, Michaela Areti Zervou, Konstantinos Kontras, Maarten De Vos, Panagiotis Tsakalides, Grigorios Tsagkatakis
Title: The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
Abstract: Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
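Illustrative note on PaperID 1838: the abstract's claim that XOR-like relationships cannot be recovered pairwise can be checked on a toy case. The sketch below takes three binary "modalities" with z = x XOR y over uniformly distributed inputs and computes mutual information in bits. The toy variables and the `mutual_information` helper are assumptions for illustration and are not part of ConFu.

```python
import math
from collections import Counter
from itertools import product

def mutual_information(pairs):
    """Mutual information in bits between the two coordinates of a list of
    equally likely (a, b) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    )

# All four equally likely settings of (x, y), with z = x XOR y.
samples = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]

mi_x_z = mutual_information([(x, z) for x, y, z in samples])        # x alone vs z
mi_y_z = mutual_information([(y, z) for x, y, z in samples])        # y alone vs z
mi_xy_z = mutual_information([((x, y), z) for x, y, z in samples])  # fused (x, y) vs z
```

Here each single modality carries zero bits about z, while the fused pair carries one full bit, which is the kind of dependency a purely pairwise objective has no signal to align.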
PaperID: 1839,   Poster  https://arxiv.org/pdf/2506.21076    
Authors: Jiawei Zhou, Kunming Luo, Weiyu Li, Kaiyi Zhang, Yixun Liang, Jingwei Huang, Chunchao Guo, Ping Tan
Title: PoseMaster: A Unified 3D Native Framework for Stylized Pose Generation
Abstract: Pose stylization is a fundamental task across the 2D, 3D, and video fields, which aims to output a stylized image or 3D mesh with the expected pose. In the 3D domain, existing pose stylization methods typically rely on 2D foundational models to modify the pose of an image before generating the corresponding 3D assets, which limits the ability of these methods to achieve rich and precise 3D pose stylization. To address this challenge, we propose a novel paradigm for 3D pose stylization that unifies pose stylization and 3D generation within a cohesive framework. This integration minimizes the risk of cumulative errors and enhances the model's efficiency and effectiveness. In addition, instead of the 2D skeleton used in previous works, we directly utilize the 3D skeleton because it can provide a more accurate representation of 3D spatial and topological relationships, which significantly enhances the model's capacity to achieve richer and more precise pose stylization. Additionally, we establish a comprehensive data engine to create a large-scale dataset that includes pairs of image-body misalignment and skeleton-body alignment. This dataset encourages 3D generative models to concurrently learn both the style of images and the pose-related 3D structures. Building on these innovations, we present PoseMaster, a unified 3D native method for stylized pose generation. Extensive experimental evaluations demonstrate that PoseMaster significantly outperforms current state-of-the-art techniques in both qualitative and quantitative assessments.
PaperID: 1840,   Poster  https://arxiv.org/pdf/2511.14208    
Authors: Weimin Bai, Suzhe Xu, Yiwei Ren, Jinhua Hao, Ming Sun, Wenzheng Chen, He Sun
Title: InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
Abstract: Video inverse problems such as inpainting, deblurring and super-resolution are fundamental to streaming, telepresence, and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers, leading to temporal artifacts, or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use. We introduce InstantViR, an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher’s strong temporal modeling while completely removing iterative test-time optimization. The distillation is prior-driven: it only requires the teacher diffusion model and known degradation operators, and does not rely on externally paired clean/noisy video data. To further boost throughput, we replace the standard VAE in the video diffusion backbone with a highly efficient LeanVAE, enabling low-latency latent-space processing. Across streaming random inpainting, Gaussian deblurring and super-resolution, InstantViR matches or surpasses the reconstruction quality of diffusion-based baselines while running at over 35 FPS on NVIDIA A100 GPUs, achieving up to 100x speedups over iterative video diffusion priors. These results show that diffusion-based video reconstruction is compatible with real-time, interactive, editable, streaming scenarios, turning high-quality video restoration into a practical component of modern vision systems.
PaperID: 1841,   Poster  https://arxiv.org/pdf/2507.14811    
Authors: Jiaji Zhang, Ruichao Sun, Hailiang Zhao, Jiaju Wu, Peng Chen, Hao Li, Yuying Liu, Kingsum Chow, Gang Xiong, Shuiguang Deng
Title: SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models
Abstract: Diffusion models have demonstrated exceptional generative capabilities but are computationally intensive, posing significant challenges for deployment in resource-constrained or latency-sensitive environments. Quantization offers an effective means to reduce model size and computational cost, with post-training quantization (PTQ) being particularly appealing due to its compatibility with pre-trained models without requiring retraining or training data. However, existing PTQ methods for diffusion models often rely on manual, architecture-specific heuristics that limit their generalizability and hinder integration with industrial deployment pipelines. To address these limitations, we propose SegQuant, a deployment-aware quantization framework that adaptively combines complementary techniques to enhance cross-model versatility. SegQuant consists of a segment-aware, graph-based quantization strategy (SegLinear) that captures structural semantics and spatial heterogeneity, along with a dual-scale quantization scheme (DualScale) that preserves polarity-asymmetric activations using a hardware-native dual-path computation, avoiding performance penalties from custom implementations, which is crucial for maintaining visual fidelity in generated outputs. SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.
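Illustrative note on PaperID 1841: the abstract says DualScale preserves polarity-asymmetric activations, i.e. activations whose negative range is much narrower than their positive range. The sketch below compares a single symmetric scale with separate scales for the positive and negative sides, using simulated ("fake") quantization on a plain list. The function names, the 4-bit setting, and the per-sign scale rule are assumptions for illustration; the abstract does not describe SegQuant's actual dual-path kernel.

```python
def quantize_symmetric(values, num_bits=8):
    """Fake-quantize with one scale shared by both signs."""
    qmax = 2 ** (num_bits - 1) - 1
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / qmax if peak > 0 else 1.0
    return [round(v / scale) * scale for v in values]

def quantize_dual_scale(values, num_bits=8):
    """Fake-quantize with separate scales for positive and negative values
    (a hypothetical reading of a dual-scale scheme)."""
    qmax = 2 ** (num_bits - 1) - 1
    pos_peak = max((v for v in values if v > 0), default=0.0)
    neg_peak = max((-v for v in values if v < 0), default=0.0)
    pos_scale = pos_peak / qmax if pos_peak > 0 else 1.0
    neg_scale = neg_peak / qmax if neg_peak > 0 else 1.0
    out = []
    for v in values:
        scale = pos_scale if v >= 0 else neg_scale
        out.append(round(v / scale) * scale)
    return out

# Made-up activations with a narrow negative range and a wide positive range.
acts = [-0.27, -0.1, -0.02, 0.5, 3.0, 12.0]
single = quantize_symmetric(acts, num_bits=4)
dual = quantize_dual_scale(acts, num_bits=4)

# Worst-case reconstruction error on the negative entries only.
neg_err_single = max(abs(a - q) for a, q in zip(acts, single) if a < 0)
neg_err_dual = max(abs(a - q) for a, q in zip(acts, dual) if a < 0)
```

With a shared scale the small negative values all collapse to zero, whereas the per-sign scale keeps them distinguishable; the positive side is quantized identically in both cases.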
PaperID: 1842,   Poster  https://arxiv.org/pdf/2512.19213    
Authors: Zihao Luo, Shaohao Rui, Zhenyu Tang, Guotai Wang, Xiaosong Wang
Title: InvCoSS: Inversion-driven Continual Self-supervised Learning in Medical Multi-modal Image Pre-training
Abstract: Continual self-supervised learning (CSSL) in medical imaging trains a foundation model sequentially, alleviating the need for collecting multi-modal images for joint training and offering promising improvements in downstream performance while preserving data privacy. However, most existing methods still rely on replaying data from previous stages to prevent catastrophic forgetting, which compromises privacy and limits their applicability in real-world scenarios where data transfer across sites is often restricted. In this work, we propose InvCoSS, an inversion-driven continual self-supervised learning framework for medical multi-modal image pre-training. Specifically, after training on a previous task, InvCoSS inverts the pre-trained self-supervised model to generate synthetic images that approximate the original training distribution. These synthetic images are then combined with data from the new task for joint optimization, which effectively mitigates catastrophic forgetting while strictly adhering to the constraint of no access to previous real data. Furthermore, to improve the fidelity of synthetic images, we introduce a novel InvUNet with a multi-scale fusion architecture to restore both high- and low-frequency components of the inverted images. To enhance diversity and prevent mode collapse, we design a repulsive representation-learning mechanism that encourages a diverse feature space for synthetic images without class guidance. Extensive experiments across nine downstream tasks validate the effectiveness of InvCoSS, achieving performance comparable to or even superior to prior data-replay methods while significantly reducing storage requirements and eliminating data privacy constraints.
PaperID: 1843,   Poster  https://arxiv.org/pdf/2602.20985    
Authors: Munish Monga, Vishal Chudasama, Pankaj Wasnik, C.V. Jawahar
Title: EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer
Abstract: Real-world object detection must operate in evolving environments where new classes emerge, domains shift, and unseen objects must be identified as unknown, all without accessing prior data. We introduce Evolving World Object Detection (EWOD), a paradigm coupling incremental learning, domain adaptation, and unknown detection under exemplar-free constraints. To tackle EWOD, we propose the EW-DETR framework, which augments DETR-based detectors with three synergistic modules: Incremental LoRA Adapters for exemplar-free incremental learning under evolving domains; a Query-Norm Objectness Adapter that decouples objectness-aware features from DETR decoder queries; and Entropy-Aware Unknown Mixing for calibrated unknown detection. This framework generalises across DETR-based detectors, enabling state-of-the-art RF-DETR to operate effectively in evolving-world settings. We also introduce FOGS (Forgetting, Openness, Generalisation Score) to holistically evaluate performance across these dimensions. Extensive experiments on Pascal Series and Diverse Weather benchmarks show EW-DETR outperforms other methods, improving FOGS by 57.24%.
PaperID: 1844,   Poster  https://arxiv.org/pdf/2512.09282    
Authors: Xiang Chen, Jinshan Pan, Jiangxin Dong, Jian Yang, Jinhui Tang
Title: FoundIR-v2: Optimizing Pre-Training Data Mixtures for Image Restoration Foundation Model
Abstract: Recent studies have witnessed significant advances in image restoration foundation models driven by improvements in the scale and quality of pre-training data. In this work, we find that the data mixture proportions from different restoration tasks are also a critical factor directly determining the overall performance of all-in-one image restoration models. To this end, we propose a high-capacity diffusion-based image restoration foundation model, FoundIR-v2, which adopts a data equilibrium scheduling paradigm to dynamically optimize the proportions of mixed training datasets from different tasks. By leveraging the data mixing law, our method ensures a balanced dataset composition, enabling the model to achieve consistent generalization and comprehensive performance across diverse tasks. Furthermore, we introduce an effective Mixture-of-Experts (MoE)-driven scheduler into generative pre-training to flexibly allocate task-adaptive diffusion priors for each restoration task, accounting for the distinct degradation forms and levels exhibited by different tasks. Extensive experiments demonstrate that our method can address over 50 sub-tasks across a broader scope of real-world scenarios and achieves favorable performance against state-of-the-art approaches.
PaperID: 1845,   Poster  https://arxiv.org/pdf/2512.11715    
Authors: Wei Chow, Linfeng Li, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, Songhua Liu
Title: EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
Abstract: Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than performing holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct Crisp-2M, a high-resolution (>1024) dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves state-of-the-art image similarity performance while enabling faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.
PaperID: 1846,   Poster  https://arxiv.org/pdf/2512.02012    
Authors: ZHENGYANG GENG, Yiyang Lu, Zongze Wu, Eli Shechtman, Zico Kolter, Kaiming He
Title: Improved Mean Flows: On the Challenges of Fastforward Generative Models
Abstract: MeanFlow provides a principled framework for fastforward generative modeling. However, the original MeanFlow has key limitations in both the training objective and the guidance. First, the original MeanFlow prediction depends not only on the noisy state but also explicitly on the noise and data, causing the training target to drift with the network. We reformulate it as velocity prediction, predicting the instantaneous velocity solely from the noisy state and reducing it to a regression problem. Second, on the guidance side, the original MeanFlow fixes the guidance scale during training by directly learning a guided field, achieving 1-NFE sampling but losing the flexibility to adjust the guidance at inference. Instead, we condition the model on the guidance scale and train it on a range of guidance scales, enabling flexible guidance at inference, as in diffusion/flow models, while preserving one-step sampling. On ImageNet 256×256, our improved MeanFlow (iMF) achieves a 1-step FID of 2.74 with a model of 118M parameters, and our largest model further pushes the 1-step FID to 1.72, establishing a new state of the art for one-step generative modeling.
PaperID: 1847,   Poster  https://arxiv.org/pdf/2603.20284    
Authors: Runze Wang, Yuxuan Song, Youcheng Cai, Ligang Liu
Title: STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
Abstract: Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. While causal VGGT transformers address this challenge through a key-value (KV) cache mechanism, the linear growth of the cache introduces a significant memory bottleneck. When memory constraints trigger early eviction, reconstruction quality and temporal consistency deteriorate markedly. In this work, we observe that attention patterns in causal transformers for 3D reconstruction exhibit intrinsic spatio-temporal sparsity. Leveraging this insight, we propose STAC, a Spatio-Temporally Aware Cache compression framework specifically designed for streaming 3D reconstruction using large causal transformers. STAC incorporates three key components: a Working Temporal Token Caching mechanism that preserves long-term informative tokens based on decayed cumulative attention scores; a Long-term Spatial Token Caching scheme that consolidates spatially redundant tokens into voxel-aligned representations for memory-efficient storage; and a Chunk-based Multi-frame Optimization strategy that jointly optimizes consecutive frames to enhance temporal coherence and leverage GPU parallelism. Extensive experiments demonstrate that STAC achieves state-of-the-art reconstruction quality while reducing memory consumption by 8.5× and accelerating inference by 3.5×, enabling scalable and real-time 3D reconstruction in streaming settings. The code will be made publicly available upon acceptance.
PaperID: 1848,   Poster  https://arxiv.org/pdf/2603.27064    
Authors: Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben wiesel, Shafiq Abedin, Amit Alfassy, Eli Schwartz, Daniel Caraballo, Yagmur Gizem Cinar, Florian Scheidegger, Steven I Ross, Daniel Weidele, Hang Hua, Ekaterina Arutyunova, Roei Herzig, Zihan Wang, Xinyue Yu, Yunfei Zhao, Sicong Jiang, Minghao Liu, Qunshu Lin, Aude Oliva, Rogerio Feris
Title: ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
Abstract: Understanding charts requires models to jointly reason over geometric visual patterns, structured numerical data, and natural language, a capability where current vision-language models (VLMs) remain limited. We introduce ChartNet, a high-quality, million-scale multimodal dataset designed to advance chart interpretation and reasoning. ChartNet leverages a novel code-guided synthesis pipeline to generate 1.5 million diverse chart samples spanning 24 chart types and 6 plotting libraries. Each sample consists of five aligned components: plotting code, rendered chart image, data table, natural language summary, and question-answering with reasoning, providing fine-grained cross-modal alignment. To capture the full spectrum of chart comprehension, ChartNet additionally includes specialized subsets encompassing human-annotated data, real-world data, safety, and grounding. A rigorous quality-filtering pipeline ensures visual fidelity, semantic accuracy, and diversity across chart representations. Fine-tuning on ChartNet consistently improves results across our benchmarks, demonstrating its utility as large-scale supervision for multimodal models. As the largest open-source dataset of its kind, ChartNet aims to support the development of foundation models with robust and generalizable capabilities for data visualization understanding. We will make both ChartNet data and models publicly available.
PaperID: 1849,   Poster  https://arxiv.org/pdf/2511.17392    
Authors: Runxun Zhang, Yizhou Liu, Li Dongrui, Bo XU, Jingwei Wei
Title: MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration
Abstract: Deformable image registration (DIR) remains a fundamental yet challenging problem in medical image analysis, largely due to the prohibitively high-dimensional deformation space of dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks often project this space into coarse, low-dimensional representations, limiting their ability to capture spatially variant deformations. We propose MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. MorphSeek introduces a stochastic Gaussian policy head atop the encoder to model a distribution over latent features, facilitating efficient exploration and coarse-to-fine refinement. The framework integrates unsupervised warm-up with weakly supervised fine-tuning through Group Relative Policy Optimization, where multi-trajectory sampling stabilizes training and improves label efficiency. Across three 3D registration benchmarks (OASIS brain MRI, LiTS liver CT, and Abdomen MR–CT), MorphSeek achieves consistent Dice improvements over competitive baselines while maintaining high label efficiency with minimal parameter cost and low step-level latency overhead. Beyond optimizer specifics, MorphSeek advances a representation-level policy learning paradigm that achieves spatially coherent and data-efficient deformation optimization, offering a principled, backbone-agnostic, and optimizer-agnostic solution for scalable visual alignment in high-dimensional settings.
PaperID: 1850,   Poster  https://arxiv.org/pdf/2604.20358    
Authors: Zixu Li, Yupeng Hu, Zhiwei Chen, Mingyu Zhang, Zhiheng Fu, Liqiang Nie
Title: ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval
Abstract: The Composed Image Retrieval (CIR) task provides a flexible retrieval paradigm via a reference image and modification text, but it heavily relies on expensive and error-prone triplet annotations. This paper systematically investigates the Noisy Triplet Correspondence (NTC) problem introduced by annotations. We find that NTC noise, particularly "hard noise" (i.e., the reference and target images are highly similar but the modification text is incorrect), poses a unique challenge to existing Noise Correspondence Learning (NCL) methods because it breaks the traditional "small loss hypothesis". We identify and elucidate three key, yet overlooked, challenges in the NTC task, namely (C1) Modality Suppression, (C2) Negative Anchor Deficiency, and (C3) Unlearning Backlash. To address these challenges, we propose a Cone-based robuSt noisE-unlearning comPositional network (ConeSep). Specifically, we first propose Geometric Fidelity Quantization, theoretically establishing and practically estimating a noise boundary to precisely locate noisy correspondence. Next, we introduce Negative Boundary Learning, which learns a "diagonal negative combination" for each query as its explicit semantic opposite-anchor in the embedding space. Finally, we design Boundary-based Targeted Unlearning, which models the noise correction process as an optimal transport problem, avoiding Unlearning Backlash. Extensive experiments on benchmark datasets (FashionIQ and CIRR) show that ConeSep significantly outperforms current state-of-the-art methods, demonstrating the effectiveness and robustness of our method.
PaperID: 1851,   Poster  https://arxiv.org/pdf/2511.21016    
Authors: Liangzu Peng, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Wei Xia, Stefano Soatto
Title: Gated KalmaNet: A fading memory layer through test-time ridge regression
Abstract: As efficient alternatives to softmax Attention, linear state-space models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance in recall-oriented settings. We propose Gated KalmaNet, a layer that reduces this gap by accounting for the full past when predicting the next token, while maintaining SSM-style efficiency. Gated KalmaNet achieves this by solving an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. Drawing inspiration from the Kalman Filter, we iteratively solve the online ridge regression problem. However, a critical insight is that standard Kalman filter equations are numerically unstable in low-precision environments (like bfloat16) and difficult to parallelize on modern hardware. We address both challenges through two key innovations: (1) an adaptive regularization strategy with input-dependent gating that controls the condition number of the ridge regression, ensuring numerical stability while balancing memory retention; and (2) the use of Chebyshev Iteration instead of other conventional iterative solvers, which we demonstrate to be more stable in low-precision settings. To further improve scalability, we develop a hardware-aware chunk-wise implementation of Chebyshev Iteration along with custom kernels for backpropagating through our adaptive regularization and gating mechanisms. Empirically, Gated KalmaNet shows strong language understanding capabilities on short-context tasks, outperforming existing SSM layers (like Mamba2, GLA and Gated DeltaNet). On long-context tasks, Gated KalmaNet excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than 10% relative improvement over other fading memory baselines.
PaperID: 1852,   Poster  https://arxiv.org/pdf/2511.18814    
Authors: Jiawei Hou, Shenghao Zhang, Can Wang, Zheng Gu, Yonggen Ling, Taiping Zeng, Xiangyang Xue, Jingbo Zhang
Title: DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video
Abstract: Reliable 4D object detection, which refers to 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically make predictions on a frame-by-frame basis without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has been hindered by the lack of large-scale datasets that capture continuous reliable 3D bounding box (b-box) annotations. To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-trained foundational models and designs a geometry-aware spatiotemporal decoder to effectively capture both spatial and temporal dynamics. Furthermore, it adopts a multi-task learning architecture coupled with a dedicated training strategy to maintain global consistency across sequences of varying lengths. Extensive experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, effectively addressing long-standing issues of jitter and inconsistency in 4D object detection. Data and code will be released upon acceptance.
PaperID: 1853,   Poster  https://arxiv.org/pdf/2603.22054    
Authors: Wuyang Luo, Chengkaitan Chengkaitan, Chang Ge, Binye Hong, Su Yang, Yongjiu Ma
Title: FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation
Abstract: Artistic font generation aims to synthesize stylized glyphs based on a reference style. However, existing approaches suffer from limited style diversity and coarse control. In this work, we explore the potential of element-driven artistic font generation. Elements are the fundamental visual units of a font, serving as reference images for the desired style. Conceptually, we categorize elements into object elements (e.g., flowers or stones) with distinct structures and amorphous elements (e.g., flames or clouds) with unstructured textures. We introduce FontCrafter, an element-driven framework for font creation, and construct a large-scale dataset, ElementFont, which comprises a diverse set of element types and high-quality glyph images. However, achieving high-fidelity reconstruction of both the texture and structure of reference elements remains challenging. To address this, we propose an in-context generation strategy that treats element images as visual context and uses an inpainting model to transfer element styles into glyph regions at the pixel level. To further control glyph shapes, we design a lightweight Context-aware Mask Adapter (CMA) that injects shape information while maintaining style consistency. Moreover, a training-free attention redirection mechanism enables region-aware style control and suppresses stroke hallucination. Extensive experiments demonstrate that FontCrafter achieves strong zero-shot generation performance, especially in preserving structural and textural fidelity, while supporting flexible controls, such as style mixture. The model and dataset will be made publicly available.
PaperID: 1854,   Poster  https://arxiv.org/pdf/2511.15396    
Authors: Simon Boeder, Fabian Gigengack, Simon Roesler, Holger Caesar, Benjamin Risse
Title: ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation
Abstract: Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding. We thus introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR. ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without any additional sensors or manual 3D annotations. While recent vision-based 3D geometry foundation models provide a promising source of prior knowledge, they do not work out of the box as a prediction due to sparse or noisy and inconsistent geometry, especially in dynamic driving scenes. Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames, handling dynamic content and propagating semantic information into a stable voxel representation. This data-centric shift in supervision for weakly/shelf-supervised occupancy estimation allows the use of essentially any SOTA occupancy model architecture without relying on LiDAR data. We argue that such high-quality supervision is essential for robust occupancy learning and constitutes an important complementary avenue to architectural innovation. On the Occ3D-nuScenes benchmark, ShelfOcc substantially outperforms all previous weakly/shelf-supervised methods (up to a 34% relative improvement), establishing a new data-driven direction for LiDAR-free 3D scene understanding.
PaperID: 1855,   Poster  https://arxiv.org/pdf/2511.18519    
Authors: Xinlin Zhuang, Yichen Li, Xiwei Liu, Haolin Yang, Yifan Lu, Ziyun Zou, Yulong Li, Huifa Li, Dongliang Chen, Qinglei Wang, Weiyang Liu, Ying Qian, Jiangming Shi, Imran Razzak
Title: CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection
Abstract: Adapting CLIP to vertical domains is typically approached by novel fine-tuning strategies or by scaling up domain-specific datasets. We revisit this task from a data-centric perspective: Can effective data selection substitute for large-scale datasets in continual pre-training (CPT)? We introduce CHIPS (Curvature-aware Hybrid Influence in Projection Subspace), which assigns each image–text pair a utility that integrates three complementary factors aligned with three goals: faithfulness via a curvature-aware, Newton-style alignment computed in CLIP's end-point subspace; scalability via an InfoNCE-aware curvature estimator with Johnson–Lindenstrauss (JL) sketching; and retention via a selection-aware relevance weight combined with learnability to balance target adaptation against general-domain preservation. We justify this design theoretically by proving a lower-bound guarantee on the proxy's correlation with full-parameter alignment and by characterizing the bias–variance trade-offs introduced by curvature mixing and JL sketching. We evaluate CHIPS empirically across various settings: 1) CHIPS attains state-of-the-art performance among selection baselines on 17 medical benchmarks, matches full-dataset CPT with 30% of the data, and outperforms half-dataset CPT using only 10%; 2) on 31 general-domain benchmarks, CHIPS yields the smallest performance drop under 10–30% data-retention budgets. Code, data, and model checkpoints will be released.
PaperID: 1856,   Poster  https://arxiv.org/pdf/2511.02384    
Authors: Jiahe Song, Chuang Wang, Bowen Jiang, Yinfan Wang, Hao Zheng, Xingjian Wei, Chengjin Liu, Rui Nie, Junyuan Gao, Jiaxing Sun, Yubin Wang, Lijun Wu, Zhenhua Huang, Jiang Wu, Qian Yu, Conghui He
Title: RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
Abstract: Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate-prediction-driven parsing process into an image captioning problem, which Large Vision Language Models (LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-15k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.
PaperID: 1857,   Poster  https://arxiv.org/pdf/2604.03696    
Authors: Zhengyu Fu, René Zurbrügg, Kaixian Qu, Marc Pollefeys, Marco Hutter, Hermann Blum, Zuria Bauer
Title: FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning
Abstract: Recent work in 3D scene understanding has begun to shift from purely spatial analysis to the more complex challenge of functional scene understanding. However, existing methods often consider functional relationships between object pairs in isolation, failing to capture the scene-wide interdependencies that humans use to resolve ambiguity. We introduce FunFact, a framework for constructing probabilistic open-vocabulary functional 3D scene graphs from posed RGB-D images. FunFact first builds an object- and part-centric 3D map and uses foundation models to propose semantically plausible functional relations. These candidates are converted into factor graph variables and constrained by both LLM-derived common-sense priors and geometric priors. This formulation enables joint probabilistic inference over all functional edges and their uncertainties, yielding substantially better-calibrated confidence scores. To benchmark this setting, we also introduce FunThor, a synthetic dataset based on AI2THOR with part-level geometry and systematically defined rule-based functional annotations. Experiments on SceneFun3D, FunGraph3D, and FunThor show that FunFact improves node and relation discovery recall and significantly reduces calibration error for ambiguous relations, highlighting the benefits of holistic probabilistic modeling for functional scene understanding. We will release the code and dataset to facilitate future research.
PaperID: 1858,   Poster  https://arxiv.org/pdf/2602.22059    
Authors: Dengdi Sun, Xiaoya Zhou, Xiao Wang, Hao Si, Wanli Lyu, Jin Tang, Bin Luo
Title: NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training
Abstract: Neural operators have emerged as an efficient paradigm for solving PDEs, overcoming the limitations of traditional numerical methods and significantly improving computational efficiency. However, due to the diversity and complexity of PDE systems, existing neural operators typically rely on a single network architecture, which limits their capacity to fully capture heterogeneous features and complex system dependencies. This constraint poses a bottleneck for large-scale PDE pre-training based on neural operators. To address these challenges, we propose a large-scale PDE pre-trained neural operator based on a nested Mixture-of-Experts (MoE) framework. In particular, the image-level MoE is designed to capture global dependencies, while the token-level Sub-MoE focuses on local dependencies. Our model can selectively activate the most suitable expert networks for a given input, thereby enhancing generalization and transferability. We conduct large-scale pre-training on twelve PDE datasets from diverse sources and successfully transfer the model to downstream tasks. Extensive experiments demonstrate the effectiveness of our approach.
PaperID: 1859,   Poster  https://arxiv.org/pdf/2603.27403    
Authors: Kai Ye, Qingtao Pan, Shuo Li
Title: Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling
Abstract: Large language models (LLMs) need reliable test-time control of hallucinations. Existing conformal methods for LLMs typically provide only marginal guarantees and rely on a single global threshold, which can under-cover hard prompts, over-cover easy ones, and produce oversized prediction sets. We propose Conditional Factuality Control (CFC), a black-box conformal framework that returns set-valued outputs with conditional coverage guarantees. CFC learns a continuous, feature-conditional acceptance threshold via augmented quantile regression on a latent "success" score (the best score among correct candidates), and uses it to filter samples at inference time. Theoretically, we show that CFC satisfies a conditional coverage guarantee under exchangeability and analyze its efficiency, proving that, under mild assumptions on the score distributions, the conditional rule is strictly more sample-efficient than marginal conformal prediction at the same target coverage. We further derive a PAC-style variant, CFC-PAC, which shrinks the nominal risk level based on a stability bound, yielding a finite-sample certificate that the conditional miscoverage deviates from the target by at most $O(\sqrt{\log(1/\delta)/N})$. Empirically, on synthetic data and real-world reasoning and QA benchmarks, CFC and CFC-PAC consistently attain near-target coverage across difficulty groups while using smaller prediction sets than CP and non-CP baselines.
PaperID: 1860,   Poster  https://arxiv.org/pdf/2604.02905    
Authors: Geonuk Kim, Minhoi Kim, Kangil Lee, Minsu Kim, Hyeonseong Jeon, JEONGHOON HAN, Hyoungjoon Lim, Junho Yim
Title: UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting
Abstract: Even though industrial inspection systems should be capable of recognizing unprecedented defects, most existing approaches operate under a closed-set assumption, which prevents them from detecting novel anomalies. While the visual prompting approach provides a scalable alternative, it struggles in industrial settings where subtle inter-class differences and high intra-class variance make prompt-to-region matching ambiguous and cause prompt embeddings to collapse, limiting the effectiveness of existing methods. To address these challenges, we introduce UniSpector, a Universal Inspector for open-set defect detection and segmentation. To empower defect prompt embeddings for robust recognition of novel defects, it comprises two key components: the Spatial–Spectral Prompt Encoder (SSPE) and the Contrastive Prompt Encoder (CPE). SSPE extracts orientation-invariant frequency cues and fuses them with spatial features to distinguish subtle defects. CPE encodes the prompt into an angular space to facilitate semantically meaningful embedding of unseen defect prompts. In addition, to improve adaptability to novel defect types, we introduce Prompt-guided Query Selection (PQS) to generate adaptive object queries aligned with the prompt. To standardize evaluation, we introduce Inspect Anything (InsA), the first benchmark for visual-prompt-based open-set defect localization. Experiments demonstrate that UniSpector significantly surpasses prior baselines by at least 19.7% and 15.8% in AP50b and AP50m, respectively. These results show that our method enables a scalable, retraining-free inspection paradigm for continuously evolving industrial environments.
PaperID: 1861,   Poster  https://arxiv.org/pdf/2603.23914    
Authors: Fatih Ilhan, Gaowen Liu, Ramana Kompella, Selim Tekin, Tiansheng Huang, Zachary Yahn, Yichang Xu, Ling Liu
Title: Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding
Abstract: Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference-time efficiency remains a significant challenge due to the memory overhead during decoding, especially when the query and answer of VLMs consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive and attention-aware optimization framework tailored for large vision-language models that improves memory efficiency during decoding, focusing on the challenges posed by the large number of visual inputs and interactions, particularly in long-context tasks with multiple high-resolution images or videos. AttentionPack is novel in two aspects: (i) we introduce a multi-head attention compaction method for economically storing key and value matrices by exploiting the implicit low-rank structure, and (ii) we develop a token-specific attention-aware decompression mechanism to reduce latency overhead. Experimental results on multiple benchmarks demonstrate that AttentionPack improves memory efficiency by up to 8x compared to existing representative decoding optimization methods, enabling higher batch sizes and faster batch inference while preserving the model output quality, or longer context lengths for superior retrieval performance. We also report the effectiveness of AttentionPack with eviction, quantization and kernel fusion, showing further memory efficiency gains in resource-limited environments.
PaperID: 1862,   Poster  https://arxiv.org/pdf/2511.20886    
Authors: Jiancheng Pan, Runze Wang, Tianwen Qian, Mohammad Mahdi, Yanwei Fu, Xiangyang Xue, Xiaomeng Huang, Luc Van Gool, Danda Paudel, Yuqian Fu
Title: V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
Abstract: Cross-view object correspondence, exemplified by the representative task of ego–exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, non-trivial to apply directly. To address this, we present V^2-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V^2-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios, while the Cross-View Visual Prompt Generator (V^2-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego–exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V^2-SAM, achieving new state-of-the-art performance on Ego-Exo4D (ego–exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence). Code will be released upon acceptance.
PaperID: 1863,   Poster  https://arxiv.org/pdf/2603.04896    
Authors: Lianyu Wang, Meng Wang, Huazhu Fu, Daoqiang Zhang
Title: Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs
Abstract: The rapid adoption of vision-language models (VLMs) in visual recognition and multimodal reasoning has heightened the demand for robust intellectual property (IP) protection of these high-value pretrained models. Effective IP protection should proactively confine model deployment within authorized domains and prevent unauthorized transfers. However, existing methods rely on predefined and static authorized domains during training, limiting flexibility in dynamic real-world environments. In addition, they often produce opaque and unsafe responses to unauthorized inputs, lacking explicit alerts for illegal usage. To address these limitations, we propose a novel dynamic authorization with legality-aware intellectual property protection (AoD-IP) for VLMs, a framework that supports authorize-on-demand and legality-aware assessment. AoD-IP introduces a lightweight dynamic authorization module that enables flexible, user-controlled authorization, allowing users to actively specify or switch authorized domains on demand at deployment time. This enables the model to adapt seamlessly as application scenarios evolve and provides substantially greater extensibility than existing static-domain approaches. In addition, AoD-IP incorporates a dual-path inference mechanism that jointly predicts legality-aware and task-specific outputs. Comprehensive experimental results on multiple cross-domain benchmarks demonstrate that AoD-IP maintains strong authorized-domain performance and reliable unauthorized detection, while supporting user-controlled authorization for adaptive deployment in dynamic environments.
PaperID: 1864,   Poster  https://arxiv.org/pdf/2603.02919    
Authors: Youngjun Jun, Seil Kang, Woojung Han, Seong Jae Hwang
Title: Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
Abstract: Video Diffusion Transformers (DiTs) synthesize high-quality video with high fidelity to text descriptions involving motion. However, the understanding of how Video DiTs convert motion words into video lags behind. Furthermore, prior studies on interpretable saliency maps primarily target objects, leaving the behavior of Video DiTs with respect to motion largely unexamined. In this paper, we inquire into concrete motion features that specify which object moves and at what time for a given motion concept. First, to spatially localize, we introduce GramCol, which adaptively renders per-frame saliency maps for any text concept, including both motion and non-motion. Second, we propose an automatic motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motions spatially and temporally. Our methods discover concept saliency maps without the need for any gradient-based training or parameters. Experimentally, our methods show standout localization capability in the motion localization task and zero-shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non-motion concepts.
PaperID: 1865,   Poster  https://arxiv.org/pdf/2603.06732    
Authors: Tingting Han, Xinsong Tao, Yufei Yin, Min Tan, Sicheng Zhao, Zhou Yu
Title: HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos
Abstract: Temporal Sentence Grounding in Videos (TSGV) aims to temporally localize segments of a video that correspond to a given natural language query. Despite recent progress, most existing TSGV approaches operate under closed-vocabulary settings, limiting their ability to generalize to real-world queries involving novel or diverse linguistic expressions. To bridge this critical gap, we introduce the Open-Vocabulary TSGV (OV-TSGV) task and construct the first dedicated benchmarks, Charades-OV and ActivityNet-OV, that simulate realistic vocabulary shifts and paraphrastic variations. These benchmarks facilitate systematic evaluation of model generalization beyond seen training concepts. To tackle OV-TSGV, we propose HERO (Hierarchical Embedding-Refinement for Open-Vocabulary grounding), a unified framework that leverages hierarchical linguistic embeddings and performs parallel cross-modal refinement. HERO jointly models multi-level semantics and enhances video-language alignment via semantic-guided visual filtering and contrastive masked text refinement. Extensive experiments on both standard and open-vocabulary benchmarks demonstrate that HERO consistently surpasses state-of-the-art methods, particularly under open-vocabulary scenarios, validating its strong generalization capability and underscoring the significance of OV-TSGV as a new research direction.
PaperID: 1866,   Poster  https://arxiv.org/pdf/2511.22586    
Authors: Yifan Du, Kun Zhou, Yingqian Min, Yue Ling, Xin Zhao, Youbin Wu, Ji-Rong Wen
Title: Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization
Abstract: We study how different Chain-of-Thought (CoT) designs affect the acquisition of generalizable visual reasoning ability in vision-language models (VLMs). While CoT data, especially long or visual CoT such as "think with image", has been widely used to supervise intermediate reasoning, it remains unclear why specific CoT designs help and which ones truly support generalizable reasoning. However, it is costly to construct or synthesize and may contain a complicated format that increases the risk of incorrect intermediate steps, hurting downstream reinforcement learning (RL). To systematically evaluate this, we focus on a controlled maze-solving benchmark where reasoning rules are fully visual, difficulty can be tuned by grid size, and all the intermediate steps can be automatically generated. Using Qwen2.5-VL-7B under a standard SFT-then-RL pipeline, we compare three representative CoT formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). Our experiments reveal that visual and longer CoT mainly accelerate convergence but do not lift the final performance ceiling; concise CoT containing only essential grounding steps outperforms longer traces; and, strikingly, CoT retaining only the minimal grounding results generalizes best across different maze sizes. We further validate these insights on other vision-centric tasks. These findings highlight a "short is long" effect and provide practical guidance for constructing more generalizable SFT datasets for visual reasoning. Our code and data will be publicly released.
PaperID: 1867,   Poster  https://arxiv.org/pdf/2510.15440    
Authors: Xuchen Li, Xuzhao Li, Shiyu Hu, Kaiqi Huang
Title: Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning
Abstract: Long-form video reasoning remains a major challenge for Video Large Language Models (Video LLMs), as static uniform frame sampling leads to information dilution and obscures critical evidence. Furthermore, existing pixel-space video reasoning agents, which are designed to actively interact with the video to acquire new visual information, remain suboptimal due to their lack of rigorous reward mechanisms to enforce evidence purity and their inability to perform temporal information supplementation beyond pre-sampled frames. To address this critical gap, we propose a novel evidence-prioritized adaptive framework built upon our core philosophy: “Select Less, Reason More.” Our core contribution is the evidence-aware reinforcement learning (EARL) framework, which transforms the model into an active interrogator of evidence. EARL is precisely engineered to dynamically select the most relevant frames and, crucially, to perform localized re-sampling around the selected key frames to access fine-grained temporal detail. Extensive experiments on five demanding video reasoning benchmarks demonstrate that our EARL-trained model achieves new state-of-the-art among open-source Video LLMs, simultaneously learning an effective and high-purity visual evidence selection policy. Impressively, our 7B model achieves 59.8% on LongVideoBench, 69.0% on MVBench and 64.9% on VideoMME. These results highlight the importance of prioritizing evidence purity and the effectiveness of our framework.
PaperID: 1868,   Poster  https://arxiv.org/pdf/2603.27666    
Authors: Yuhe Liu, Zhenxiong Tan, Yujia Hu, Songhua Liu, Xinchao Wang
Title: Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers
Abstract: Recent advances in diffusion-based controllable visual generation have led to remarkable improvements in image quality. However, these powerful models are typically deployed on cloud servers due to their large computational demands, raising serious concerns about user data privacy. To enable secure and efficient on-device generation, we explore in this paper controllable diffusion models built upon linear attention architectures, which offer superior scalability and efficiency, even on edge devices. Yet, our experiments reveal that existing controllable generation frameworks, such as ControlNet and OminiControl, either lack the flexibility to support multiple heterogeneous condition types or suffer from slow convergence on such linear-attention models. To address these limitations, we propose a novel controllable diffusion framework tailored for linear attention backbones like SANA. The core of our method lies in a unified gated conditioning module working in a dual-path pipeline, which effectively harmonizes multi-type conditional inputs, such as image, semantic, and spatial cues, while maintaining training stability. Extensive experiments on multiple tasks and benchmarks demonstrate that our approach achieves state-of-the-art controllable generation performance based on linear-attention models, surpassing existing methods in terms of fidelity and controllability. Code will be available.
PaperID: 1869,   Poster  https://arxiv.org/pdf/2503.09242    
Authors: Yuhang Ma, Bo Cheng, Shanyuan Liu, Hongyi Zhou, Liebucha Wu, Dawei Leng, Yuhui Yin
Title: NAMI: Efficient Image Generation via Bridged Progressive Rectified Flow Transformers
Abstract: Flow-based Transformer models have achieved state-of-the-art image generation performance, but often suffer from high inference latency and computational cost due to their large parameter sizes. To improve inference efficiency without compromising quality, we propose Bridged Progressive Rectified Flow Transformers (NAMI), which decompose the generation process across temporal, spatial, and architectural dimensions. We divide the rectified flow into different stages according to resolution, and use a BridgeFlow module to connect them. Fewer Transformer layers are used at low-resolution stages to generate image layouts and concept contours, and more layers are progressively added as the resolution increases. Experiments demonstrate that our approach achieves fast convergence and reduces inference time while ensuring generation quality. The main contributions of this paper are summarized as follows: (1) We introduce Bridged Progressive Rectified Flow Transformers that enable multi-resolution training, accelerating model convergence; (2) NAMI leverages piecewise flow and spatial cascading of Diffusion Transformer (DiT) to rapidly generate images, reducing inference time by 64% for generating 1024×1024 resolution images; (3) We propose a BridgeFlow module to align flows between different stages; (4) We propose the NAMI-1K benchmark to evaluate human preference performance, aiming to mitigate distributional bias and comprehensively assess model effectiveness. The results show that our model is competitive with state-of-the-art models.
PaperID: 1870,   Poster  https://arxiv.org/pdf/2512.24160    
Authors: TsaiChing Ni, ZhenQi Chen, YuanFu Yang
Title: Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset
Abstract: We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.
PaperID: 1871,   Poster  https://arxiv.org/pdf/2602.20981    
Authors: Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhong, Shuyang Cui, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji
Title: Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
Abstract: Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To tackle this challenge, we present a multimodal hierarchical network, called MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. Our proposed method significantly improves long audio generation, up to more than 5 minutes. We also show that training short and testing long is possible in video-to-audio generation tasks without training on the longer durations. We show in our experiments that our proposed method achieves remarkable results on long-video-to-audio benchmarks, beating prior works in video-to-audio tasks. Moreover, we showcase our model's capability to generate more than 5 minutes of audio, while prior video-to-audio methods fall short when generating long durations.
PaperID: 1872,   Poster  https://arxiv.org/pdf/2512.19150    
Authors: Ruikai Li, Xinrun Li, Mengwei Xie, Hao Shan, Shoumeng Qiu, Xinyuan Chang, Yizhe Fan, Feng Xiong, Han Jiang, Yilong Ren, Haiyang Yu, Mu Xu, Yang Long, Varun Ojha, Zhiyong Cui
Title: AMap: Distilling Future Priors for Ahead-Aware Online HD Map Construction
Abstract: Online High-Definition (HD) map construction is pivotal for autonomous driving. While recent approaches leverage historical temporal fusion to improve performance, we identify a critical safety flaw in this paradigm: it is inherently "spatially backward-looking." These methods predominantly enhance map reconstruction in traversed areas, offering minimal improvement for the unseen road ahead. Crucially, our analysis of downstream planning tasks reveals a severe asymmetry: while rearward perception errors are often tolerable, inaccuracies in the forward region directly precipitate hazardous driving maneuvers. To bridge this safety gap, we propose AMap, a novel framework for Ahead-aware online HD Mapping. We pioneer a "distill-from-future" paradigm, where a teacher model with privileged access to future temporal contexts guides a lightweight student model restricted to the current frame. This process implicitly compresses prospective knowledge into the student model, endowing it with "look-ahead" capabilities at zero inference-time cost. Technically, we introduce a Multi-Level BEV Distillation strategy with spatial masking and an Asymmetric Query Adaptation module to effectively transfer future-aware representations to the student's static queries. Extensive experiments on the nuScenes and Argoverse 2 benchmarks demonstrate that AMap significantly enhances current-frame perception. Most notably, it outperforms state-of-the-art temporal models in critical forward regions while maintaining the efficiency of current-frame inference.
PaperID: 1873,   Poster  https://arxiv.org/pdf/2512.22522    
Authors: Jihang Wang, Dongcheng Zhao, Ruolin Chen, Qian Zhang, Yi Zeng
Title: Towards Reliable Evaluation of Adversarial Robustness for Spiking Neural Networks
Abstract: Spiking Neural Networks (SNNs) utilize spike-based activations to mimic the brain's energy-efficient information processing. However, the binary and discontinuous nature of spike activations causes vanishing gradients, making adversarial robustness evaluation via gradient descent unreliable. While improved surrogate gradient methods have been proposed, their effectiveness under strong adversarial attacks remains unclear. We propose a more reliable framework for evaluating SNN adversarial robustness. We theoretically analyze the degree of gradient vanishing in surrogate gradients and introduce the Adaptive Sharpness Surrogate Gradient (ASSG), which adaptively evolves the shape of the surrogate function according to the input distribution during attack iterations, thereby enhancing gradient accuracy while mitigating gradient vanishing. In addition, we design an adversarial attack with adaptive step size under the $L_\infty$ constraint, Stable Adaptive Projected Gradient Descent (SA-PGD), which achieves faster and more stable convergence under imprecise gradients. Extensive experiments show that our approach substantially increases attack success rates across diverse adversarial training schemes, SNN architectures and neuron models, providing a more generalized and reliable evaluation of SNN adversarial robustness. The experimental results further reveal that the robustness of current SNNs has been significantly overestimated and highlight the need for more dependable adversarial training methods.
PaperID: 1874,   Poster  https://arxiv.org/pdf/2512.05103    
Authors: Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan
Title: TV2TV: A Unified Framework for Interleaved Language and Video Generation
Abstract: Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality (preferred 92% of the time in human evaluations vs. a comparable text-to-video model) and controllability (19-point improvement in fine-grained instruction following accuracy vs. a "think-then-act" approach). TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
PaperID: 1875,   Poster  https://arxiv.org/pdf/2603.14243    
Authors: Haoxuan Xu, Guanglin Niu
Title: BIT: Matching-based Bi-directional Interaction Transformation Network for Visible-Infrared Person Re-Identification
Abstract: Visible-Infrared Person Re-Identification (VI-ReID) is a challenging retrieval task due to the substantial modality gap between visible and infrared images. While existing methods attempt to bridge this gap by learning modality-invariant features within a shared embedding space, they often overlook the complex and implicit correlations between modalities. This limitation becomes more severe under distribution shifts, where infrared samples are often far fewer than visible ones. To address these challenges, we propose a novel network termed Bi-directional Interaction Transformation (BIT). Instead of relying on rigid feature alignment, BIT adopts a matching-based strategy that explicitly models the interaction between visible and infrared image pairs. Specifically, BIT employs an encoder-decoder architecture where the encoder extracts preliminary feature representations, and the decoder performs bi-directional feature integration and query-aware scoring to enhance cross-modality correspondence. To the best of our knowledge, BIT is the first to introduce such pairwise matching-driven interaction in VI-ReID. Extensive experiments on several benchmarks demonstrate that BIT achieves state-of-the-art performance, highlighting its effectiveness in the VI-ReID task.
PaperID: 1876,   Poster  https://arxiv.org/pdf/2511.10150    
Authors: Feng Ding, Wenhui Yi, Yunpeng Zhou, Xinan He, Hong Rao, Shu Hu
Title: Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection
Abstract: Fairness is a core element in the trustworthy deployment of deepfake detection models, especially in the field of digital identity security. Biases in detection models toward different demographic groups, such as gender and race, may lead to systemic misjudgments, exacerbating the digital divide and social inequities. However, current fairness-enhanced detectors often improve fairness at the cost of detection accuracy. To address this challenge, we propose a dual-mechanism collaborative optimization framework. Our proposed method innovatively integrates structural fairness decoupling and global distribution alignment: decoupling channels sensitive to demographic groups at the model architectural level, and subsequently reducing the distance between the overall sample distribution and the distributions corresponding to each demographic group at the feature level. Experimental results demonstrate that, compared with other methods, our framework improves both inter-group and intra-group fairness while maintaining overall detection accuracy across domains.
PaperID: 1877,   Poster  https://arxiv.org/pdf/2603.25188    
Authors: Jiahao Wang, Hualian Sheng, Sijia Cai, Yuxiao Yang, Weizhan Zhang, Caixia Yan, Bing Deng, Jieping Ye
Title: AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References
Abstract: Identity-preserving video generation offers powerful tools for creative expression, allowing users to customize videos featuring their beloved characters. However, prevailing methods are typically designed and optimized for a single identity reference. This underlying assumption introduces two significant limitations: it curtails creative flexibility by poorly accommodating diverse, real-world input formats, and more critically, it compromises identity fidelity. Relying on a single source is an ill-posed setting, and provides an inherently ambiguous foundation, making it difficult for the model to faithfully reproduce an identity across novel contexts. In response, we present AnyID, an ultra-fidelity identity-preservation video generation framework. Our approach makes two core contributions. First, we introduce a scalable omni-referenced architecture that effectively unifies heterogeneous identity inputs (e.g., faces, portraits, and videos) into a cohesive representation. Second, we propose a primary-referenced generation paradigm, which designates one reference as a canonical anchor and uses a novel differential prompt to enable precise, attribute-level controllability. The model is trained on a large-scale, meticulously curated dataset to ensure robustness and high fidelity. In addition, we perform a final fine-tuning stage using reinforcement learning. This process leverages a preference dataset constructed from human evaluations, where annotators performed pairwise comparisons of videos based on two key criteria: identity fidelity and prompt controllability. Extensive evaluations validate that AnyID achieves ultra-high identity fidelity as well as superior attribute-level controllability across different task settings. All code, data and models will be publicly released.
PaperID: 1878,   Poster  https://arxiv.org/pdf/2512.03245    
Authors: Liying Lu, Raphael Achddou, Sabine Süsstrunk
Title: 2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition
Abstract: Raw images taken in low-light conditions are very noisy due to low photon count and sensor noise. Learning-based denoisers have the potential to reconstruct high-quality images. For training, however, these denoisers require large paired datasets of clean and noisy images, which are difficult to collect. Noise synthesis is an alternative to large-scale data acquisition: given a clean image, we can synthesize a realistic noisy counterpart. In this work, we propose a general and practical noise synthesis method that requires only a single noisy image and a single dark frame per ISO setting. We represent signal-dependent noise with a Poisson distribution and introduce a Fourier-domain spectral sampling algorithm to accurately model signal-independent noise. The latter generates diverse noise realizations that maintain the spatial and statistical properties of real sensor noise. As opposed to concurrent approaches, our method relies neither on simplified parametric models nor on large sets of clean-noisy image pairs. It is accurate and practical. Moreover, our synthesis method leads to state-of-the-art performance on multiple low-light denoising benchmarks.
PaperID: 1879,   Poster  https://arxiv.org/pdf/2603.29692    
Authors: Ning Wang, Tieyue Wu, Naeha Sharif, Farid Boussaid, Guangming Zhu, Lin Mei, Mohammed Bennamoun, Liang Zhang
Title: SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition
Abstract: Zero-shot skeleton-based action recognition aims to recognize unseen actions by transferring knowledge from seen categories through semantic descriptions. Most existing methods typically align skeleton features with textual embeddings within a shared latent space. However, the absence of contextual cues, such as objects involved in the action, introduces an inherent gap between skeleton and semantic representations, making it difficult to distinguish visually similar actions. To address this, we propose SkeletonContext, a prompt-based framework that enriches skeletal motion representations with language-driven contextual semantics. Specifically, we introduce a Cross-Modal Context Prompt Module, which leverages a pretrained language model to reconstruct masked contextual prompts under guidance derived from LLMs. This design effectively transfers linguistic context to the skeleton encoder for instance-level semantic grounding and improved cross-modal alignment. In addition, a Key-Part Decoupling Module is incorporated to decouple motion-relevant joint features, ensuring robust action understanding even in the absence of explicit object interactions. Extensive experiments on multiple benchmarks demonstrate that SkeletonContext achieves state-of-the-art performance under both conventional and generalized zero-shot settings, validating its effectiveness in reasoning about context and distinguishing fine-grained, visually similar actions. Our project is available at:
PaperID: 1880,   Poster  https://arxiv.org/pdf/2512.24731    
Authors: Bingxuan Li, Yiming Cui, Yicheng He, Yiwei Wang, Shu Zhang, Longyin Wen, Yulei Niu
Title: EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation
Abstract: Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video–text–to–audio (VT2A), the current formulation faces three key limitations: (1) an imbalance between visual and textual conditioning that leads to visual dominance; (2) the absence of a concrete definition for fine-grained controllable generation; (3) weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley (Event-Centric Hierarchical cOntrol Foley), a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video–instruction–annotation triplets and 42,000 fine-grained sounding event annotations. Building upon this foundation, we propose EchoVidia, a sounding-event-centric agentic generation framework with a slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.
PaperID: 1881,   Poster  https://arxiv.org/pdf/2510.05613    
Authors: Ziqiao Meng, Qichao Wang, Zhiyang Dou, Zixing Song, Zhipeng Zhou, Irwin King, Peilin Zhao
Title: PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction
Abstract: Autoregressive point cloud generation has long lagged behind diffusion-based approaches in quality. The performance gap stems from the fact that autoregressive models impose an artificial ordering on inherently unordered point sets, forcing shape generation to proceed as a sequence of local predictions. This sequential bias reinforces short-range continuity but limits the model’s ability to capture long-range dependencies, thereby weakening its capacity to enforce global structural properties such as symmetry, geometric consistency, and large-scale spatial regularities. Inspired by the level-of-detail (LOD) principle in shape modeling, we propose PointNSP, a coarse-to-fine generative framework that preserves global shape structure at low resolutions and progressively refines fine-grained geometry at higher scales through a next-scale prediction paradigm. This multi-scale factorization aligns the autoregressive objective with the permutation-invariant nature of point sets, enabling rich intra-scale interactions while avoiding brittle fixed orderings. Strictly following the baseline experimental setups, empirical results on the ShapeNet benchmark demonstrate that PointNSP achieves state-of-the-art (SOTA) generation quality for the first time within the autoregressive paradigm. Moreover, it surpasses strong diffusion-based baselines in parameter, training, and inference efficiency. Finally, under dense generation with 8,192 points, PointNSP's advantages become even more pronounced, highlighting its strong scalability potential.
PaperID: 1882,   Poster  https://arxiv.org/pdf/2602.21736    
Authors: Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan, Yicheng Feng, Haiweng Xu, Sipeng Zheng, Zongqing Lu
Title: Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild
Abstract: Despite progress, Vision-Language-Action models (VLAs) are limited by a scarcity of large-scale, diverse robot data. While human manipulation videos offer a rich alternative, existing methods are forced to choose between small, precisely-labeled datasets and vast in-the-wild footage with unreliable hand tracking labels. We present JALA, a pretraining framework that learns Jointly-Aligned Latent Actions. JALA bypasses full visual dynamics reconstruction and instead learns a predictive action embedding aligned with both inverse dynamics and real actions. This yields a transition-aware, behavior-centric latent space for learning from heterogeneous human data. We scale this approach with UniHand-Mix, a 7.5M video corpus (>2,000 hours) blending laboratory and in-the-wild footage. Experiments demonstrate that JALA generates more realistic hand motions in both controlled and unconstrained scenarios, and significantly improves downstream robot manipulation performance in both simulation and real-world tasks. These results indicate that jointly-aligned latent actions offer a scalable pathway for VLA pretraining from human data.
PaperID: 1883,   Poster  https://arxiv.org/pdf/2602.05449    
Authors: Chang Zou, Changlin Li, Songtao Liu, Zhao Zhong, Kailin Huang, Linfeng Zhang
Title: DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
Abstract: While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden. Among the existing acceleration methods, Feature Caching is popular due to its training-free property and considerable speedup performance, but it inevitably faces semantic and detail drop with further compression. Another widely adopted method, training-aware step-distillation, though successful in image generation, also faces drastic degradation in video generation with a few steps. Furthermore, the quality loss becomes more severe when simply applying training-free feature caching to the step-distilled models, due to the sparser sampling steps. This paper introduces, for the first time, a distillation-compatible learnable feature caching mechanism. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. By undertaking these initiatives, we further push the acceleration boundaries to 11.8× while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code is in the supplementary materials and will be publicly available.
PaperID: 1884,   Poster  https://arxiv.org/pdf/2512.12622    
Authors: Zihan Wang, Seungjun Lee, Guangzhao Dai, Gim Hee Lee
Title: D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation
Abstract: Embodied agents face a critical dilemma: end-to-end models lack interpretability and explicit 3D reasoning, while modular systems ignore cross-component interdependencies and synergies. To bridge this gap, we propose the Dynamic 3D Vision-Language-Planning Model (D3D-VLP). Our model introduces two key innovations: 1) A Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) A Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive and partially-annotated hybrid data. This allows different CoT components to mutually reinforce and implicitly supervise each other. To this end, we construct a large-scale dataset with 10M hybrid samples from 5K real scans and 20K synthetic scenes that are compatible with online learning methods such as RL and DAgger. Our D3D-VLP achieves state-of-the-art results on multiple benchmarks, including Vision-and-Language Navigation (R2R-CE, REVERIE-CE, NavRAG-CE), Object-goal Navigation (HM3D-OVON), and Task-oriented Sequential Grounding and Navigation (SG3D). Real-world mobile manipulation experiments further validate the effectiveness.
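The masked autoregressive loss mentioned in the D3D-VLP abstract can be illustrated with a minimal sketch. The abstract does not give the formulation, so the following assumes a plain next-token cross-entropy in which positions without annotation are simply excluded from the average; the function name `masked_autoregressive_loss` and its arguments are hypothetical.

```python
import numpy as np

def masked_autoregressive_loss(logits, targets, supervised_mask):
    """Mean token cross-entropy over supervised positions only (illustrative).

    logits: (T, V) unnormalized scores for each of T positions.
    targets: (T,) integer token ids.
    supervised_mask: (T,) booleans; False marks positions with no annotation.
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    mask = np.asarray(supervised_mask, dtype=bool)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the target token at every position.
    token_nll = -log_probs[np.arange(len(targets)), targets]
    if not mask.any():
        return 0.0  # nothing is annotated, so there is nothing to learn from
    # Unannotated positions contribute neither to the sum nor to the count.
    return float(token_nll[mask].mean())
```

Under this assumption, a sample that carries only grounding labels and a sample that carries only navigation labels can share one sequence format, with each contributing loss solely on the tokens it actually annotates.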
PaperID: 1885,   Poster  https://arxiv.org/pdf/2512.23042    
Authors: Ryousuke Yamada, Kohsuke Ide, Yoshihiro Fukuhara, Hirokatsu Kataoka, Gilles Puy, Andrei Bursuc, Yuki M Asano
Title: 3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds
Abstract: Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from point clouds generated from unlabeled videos. We first introduce \data, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D scans, LAM3C achieves higher performance than the previous self-supervised methods on indoor semantic and instance segmentation. These results suggest that unlabeled videos represent an abundant source of data for 3D self-supervised learning.
PaperID: 1886,   Poster  https://arxiv.org/pdf/2604.15239    
Authors: Jiawei Ren, Michal Tyszkiewicz, Jiahui Huang, Žan Gojčič
Title: TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
Abstract: In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby unbinding the number of predicted primitives from input image resolution and number of views. Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.
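The contrast drawn in the TokenGS abstract between depth-along-ray regression and direct 3D mean regression can be made concrete with a small sketch. Both helper functions below are hypothetical illustrations, not the paper's code: the first ties each mean to one camera ray, the second treats each token output as a free 3D coordinate.

```python
import numpy as np

def means_from_ray_depth(origins, directions, depths):
    """Pixel-aligned parameterization: one Gaussian mean per camera ray.

    origins, directions: (N, 3) ray origins and (not necessarily unit) directions.
    depths: (N,) predicted distance along each ray.
    """
    origins = np.asarray(origins, dtype=np.float64)
    directions = np.asarray(directions, dtype=np.float64)
    depths = np.asarray(depths, dtype=np.float64)
    # Normalize so that depth is measured in world units along the ray.
    unit = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    return origins + depths[..., None] * unit

def means_from_tokens(token_outputs):
    """Token parameterization: each token emits an unconstrained 3D mean."""
    return np.asarray(token_outputs, dtype=np.float64).reshape(-1, 3)
```

In the first form the number of means equals the number of rays, hence pixels times views, and each mean can only slide along its ray, so an error in the camera pose moves the ray itself. In the second form the count is set by the number of tokens, which is the decoupling from image resolution and view count that the abstract describes.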
PaperID: 1887,   Poster  https://arxiv.org/pdf/2512.01773    
Authors: Chenghao Gu, Haolan Kang, Junchao Lin, Jinghe Wang, Duo Wu, Shuzhao Xie, Fanding Huang, Junchen Ge, Ziyang Gong, Letian Li, Hongying Zheng, Changwei Lv, Zhi Wang
Title: IGen: Scalable Data Generation for Robot Learning from Open-World Images
Abstract: The rise of generalist robotic policies has created an exponential demand for large-scale training data. However, on-robot data collection is labor-intensive and often limited to specific environments. In contrast, open-world images capture a vast diversity of real-world scenes that naturally align with robotic manipulation tasks, offering a promising avenue for low-cost, large-scale robot data acquisition. Despite this potential, the lack of associated robot actions hinders the practical use of open-world images for robot learning, leaving this rich visual resource largely unexploited. To bridge this gap, we propose IGen, a framework that scalably generates realistic visual observations and executable actions from open-world images. IGen first converts unstructured 2D pixels into structured 3D scene representations suitable for scene understanding and manipulation. It then leverages the reasoning capabilities of vision-language models to transform scene-specific task instructions into high-level plans and generate low-level actions as SE(3) end-effector pose sequences. From these poses, it synthesizes dynamic scene evolution and renders temporally coherent visual observations. Experiments validate the high quality of visuomotor data generated by IGen, and show that policies trained solely on IGen-synthesized data achieve performance comparable to those trained on real-world data. This highlights the potential of IGen to support scalable data generation from open-world images for generalist robotic policy training. Code for IGen will be made publicly available.
PaperID: 1888,   Poster  https://arxiv.org/pdf/2603.16671    
Authors: Ruishan Guo, Ciyu Ruan, Haoyang Wang, Zihang GONG, Jingao Xu, Xinlei Chen
Title: $x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space
Abstract: Estimating dense 2D optical flow and 3D scene flow is essential for dynamic scene understanding. Recent work combines images, LiDAR, and event data to jointly predict 2D and 3D motion, yet most approaches operate in separate heterogeneous feature spaces. Without a shared latent space that all modalities can align to, these systems rely on multiple modality-specific blocks, leaving cross-sensor mismatches unresolved and making fusion unnecessarily complex. Event cameras naturally provide a spatiotemporal edge signal, which we can treat as an intrinsic edge field to anchor a unified latent representation, termed the Event Edge Space. Building on this idea, we introduce x^2-Fusion, which reframes multimodal fusion as representation unification: event-derived spatiotemporal edges define an edge-centric homogeneous space, and image and LiDAR features are explicitly aligned in this shared representation. Within this space, we perform reliability-aware adaptive fusion to estimate modality reliability and emphasize stable cues under degradation. We further employ cross-dimension contrast learning to tightly couple 2D optical flow with 3D scene flow. Extensive experiments on both synthetic and real benchmarks show that x^2-Fusion achieves state-of-the-art accuracy under standard conditions and delivers substantial improvements in challenging scenarios.
PaperID: 1889,   Poster  https://arxiv.org/pdf/2602.18863    
Authors: Abdullah All Tanvir, Agnibh Dasgupta, Xin Zhong
Title: TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking
Abstract: Camera recapture introduces complex optical degradations, such as perspective warping, illumination shifts, and Moiré interference, that remain challenging for deep watermarking systems. We present TIACam, a text-anchored invariant feature learning framework with auto-augmentation for camera-robust zero-watermarking. The method integrates three key innovations: (1) a learnable auto-augmentor that discovers camera-like distortions through differentiable geometric, photometric, and Moiré operators; (2) a text-anchored invariant feature learner that enforces semantic consistency via cross-modal adversarial alignment between image and text; and (3) a zero-watermarking head that binds binary messages in the invariant feature space without modifying image pixels. This unified formulation jointly optimizes invariance, semantic alignment, and watermark recoverability. Extensive experiments on both synthetic and real-world camera captures demonstrate that TIACam achieves state-of-the-art feature stability and watermark extraction accuracy, establishing a principled bridge between multimodal invariance learning and physically robust zero-watermarking.
PaperID: 1890,   Poster  https://arxiv.org/pdf/2603.19623    
Authors: Chunlei Zhang, Jiahao Xia, Yun Xiao, Bo Jiang, Jian Zhang
Title: Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement
Abstract: Multimodal image registration is a fundamental task for multimodal imagery and a prerequisite for downstream cross-modal analysis. Despite recent progress with shared feature extraction and multi-scale architectures, two key limitations remain. First, some methods use disentanglement to learn shared features but mainly regularize the shared part, so modality-private cues can still leak into the shared space. Second, most multi-scale frameworks support only one transformation type, which limits their applicability in real-world scenarios where global misalignment and local deformation coexist. To address these issues, we view hybrid multimodal registration as jointly constructing a stable shared feature space and a unified hybrid transformation within that space. Building on this perspective, we introduce HRNet, a Hybrid Registration Network that couples representation disentanglement with hybrid parameter prediction. A shared backbone with Modality-Specific Batch Normalization (MSBN) produces multi-scale features, while a Cross-scale Disentanglement and Adaptive Projection (CDAP) module suppresses modality-private cues across scales and projects the shared component into a stable subspace suited for correspondence. On top of this shared space, a Hybrid Parameter Prediction Module (HPPM) performs non-iterative, coarse-to-fine estimation of both global rigid parameters and multi-scale fine-grained deformation fields, which are fused into a single coherent deformation field. Extensive experiments on four multimodal datasets demonstrate state-of-the-art performance on both rigid and non-rigid registration tasks. Code will be made publicly available.
PaperID: 1891,   Poster  https://arxiv.org/pdf/2601.03252    
Authors: Hao Yu, Haotong Lin, Jiawei Wang, Jiaxin Li, Yida Wang, Xueyang Zhang, Yue Wang, Xiaowei Zhou, Ruizhen Hu, Sida Peng
Title: InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields
Abstract: Existing depth estimation methods are fundamentally limited to predicting depth on discrete image grids. Such representations restrict their scalability to arbitrary output resolutions and hinder geometric detail recovery. This paper introduces InfiniDepth, which represents depth as neural implicit fields. Through a simple yet effective local implicit decoder, we can query depth at continuous 2D coordinates, enabling arbitrary-resolution and fine-grained depth estimation. To better assess our method's capabilities, we curate a high-quality 4K synthetic benchmark from five different games, spanning diverse scenes with rich geometric and appearance details. Extensive experiments demonstrate that InfiniDepth achieves state-of-the-art performance on both synthetic and real-world benchmarks across relative and metric depth estimation tasks, particularly excelling in fine-detail regions. It also benefits the task of novel view synthesis under large viewpoint shifts, producing high-quality results with fewer holes and artifacts. Code and data will be made publicly available.
PaperID: 1892,   Poster  https://arxiv.org/pdf/2507.05914    
Authors: Rui Huang, Shitong Shao, zikai zhou, Pukun Zhao, Hangyu Guo, Tian Ye, Lichen Bai, Shuo Yang, Zeke Xie
Title: Accelerating Diffusion Model Training under Minimal Budgets: A Condensation-Based Perspective
Abstract: Diffusion models have achieved remarkable performance on a wide range of generative tasks, yet training them from scratch is notoriously resource-intensive, typically requiring millions of training images and many GPU days. Motivated by a data-centric view of this bottleneck, we adopt a condensation-based perspective: given a large training set, the goal is to construct a much smaller condensed dataset that still supports training strong diffusion models under minimal data and compute budgets. To operationalize this perspective, we introduce Diffusion Dataset Condensation (D^2C), a two-phase framework comprising Select and Attach. In the Select phase, a diffusion difficulty score combined with interval sampling is used to identify a compact, informative training subset from the original data. Building on this subset, the Attach phase further strengthens the conditional signals by augmenting each selected image with rich semantic and visual representations. To our knowledge, D^2C is the first framework that systematically investigates dataset condensation for diffusion models, whereas prior condensation methods have mainly targeted discriminative architectures. Extensive experiments across data budgets (0.8%–8% of ImageNet), model architectures, and image resolutions demonstrate that D^2C dramatically accelerates diffusion model training while preserving high generative quality. On ImageNet 256^2 with SiT-XL/2, D^2C attains an FID of 4.3 in just 40k steps using only 0.8% of the training images, corresponding to about 233x and 100x faster training than vanilla SiT-XL/2 and SiT-XL/2 + REPA, respectively.
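The Select phase of D^2C pairs a difficulty score with interval sampling, but the abstract does not define either. One plausible reading, sketched below purely as an assumption, is to rank samples by score and then take evenly spaced entries from that ranking so the subset spans the whole difficulty range. The function `interval_select` and the toy scores are hypothetical.

```python
import numpy as np

def interval_select(scores, budget):
    """Pick `budget` sample indices at a fixed stride through the score ranking.

    scores: (N,) per-sample difficulty values (how they are computed is
            not specified here; any scalar score works for the sketch).
    budget: number of samples to keep, 1 <= budget <= N.
    """
    scores = np.asarray(scores, dtype=np.float64)
    n = len(scores)
    if not 1 <= budget <= n:
        raise ValueError("budget must be between 1 and the number of samples")
    # Rank samples from lowest to highest score.
    order = np.argsort(scores, kind="stable")
    # Evenly spaced rank positions; distinct because n / budget >= 1.
    positions = (np.arange(budget) * n) // budget
    return order[positions]

# Toy usage: keep 3 of 6 samples.
subset = interval_select([0.9, 0.1, 0.5, 0.3, 0.7, 0.2], budget=3)
```

Compared with keeping only the top-scoring or bottom-scoring samples, a strided pick like this retains easy, medium, and hard examples in roughly equal proportion, which is one reason an interval scheme might be preferred for a very small budget.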
PaperID: 1893,   Poster  https://arxiv.org/pdf/2508.04332    
Authors: Xinkui Zhao, Yifan Zhang, Sai Liu, Naibo Wang, Guanjie Cheng, Yueshen Xu, Chang Liu, Shuiguang Deng, Jianwei Yin
Title: DRAMA: Next-Gen Dynamic Orchestration for Resilient Multi-Agent Ecosystems in Flux
Abstract: Embodied Multi-Agent Systems have proven highly effective in addressing complex tasks through coordinated collaboration among heterogeneous agents. However, real-world environments and task specifications are inherently dynamic, exhibiting frequent changes, uncertainty, and variability. Despite these characteristics, most existing frameworks employ static architectures with fixed agent capabilities and rigid task allocation strategies, which substantially constrain their adaptability to evolving conditions. This inflexibility presents significant challenges to maintaining robust and efficient multi-agent cooperation in dynamic and unpredictable settings. To address these limitations, we propose DRAMA, short for Dynamic Orchestration for Resilient Multi-Agent Ecosystems, tailored for rapidly changing environments. DRAMA adopts a multilayer architecture that incorporates three principal mechanisms: adaptive scheduling through an affinity-driven mechanism, fault-tolerant continuity via hierarchical trust-chain task takeover, and collective spatial intelligence that consolidates distributed observations for predictive reasoning. Together, these components enable event-triggered rescheduling and decentralized fault recovery, ensuring uninterrupted task execution amid agent arrivals, dropouts, or recoveries. Extensive experiments in the embodied VirtualHome-Social environment demonstrate that DRAMA achieves a 7% improvement in runtime efficiency and a 10% increase in throughput compared with state-of-the-art baselines, while maintaining superior stability and robustness under dynamic agent populations.
PaperID: 1894,   Poster  https://arxiv.org/pdf/2601.09823    
Authors: Subhajit Sanyal, Srinivas Soumitri Miriyala, Akshay Bankar, Manjunath Arveti, Sowmya Vajrala, Shreyas Pandith, Sravanth Kodavanti, Abhishek Ameta, Harshit Harshit, Amit Unde
Title: NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration
Abstract: Latent diffusion models such as Stable Diffusion 1.5 offer strong generative priors that are highly valuable for image restoration, yet their full pipelines remain too computationally heavy for deployment on edge devices. Existing lightweight variants predominantly compress the denoising UNet or reduce the diffusion trajectory, which disrupts the underlying latent manifold and limits generalization beyond a single task. We introduce NanoSD, a family of Pareto-optimal diffusion foundation models distilled from Stable Diffusion 1.5 through network surgery, feature-wise generative distillation, and structured architectural scaling jointly applied to the U-Net and the VAE encoder–decoder. This full-pipeline co-design preserves the generative prior while producing models that occupy distinct operating points along the accuracy–latency–size frontier (e.g., 130M–315M parameters, achieving real-time inference down to 20ms on mobile-class NPUs). We show that parameter reduction alone does not correlate with hardware efficiency, and we provide an analysis revealing how architectural balance, feature routing, and latent-space preservation jointly shape true on-device latency. When used as a drop-in backbone, NanoSD enables state-of-the-art performance across image super-resolution, image deblurring, face restoration, and monocular depth estimation, outperforming prior lightweight diffusion models in both perceptual quality and practical deployability. NanoSD establishes a general-purpose diffusion foundation model family suitable for real-time visual generation and restoration on edge devices.
PaperID: 1895,   Poster  https://arxiv.org/pdf/2511.21760    
Authors: Yuxiang Wei, Yanteng Zhang, Xi Xiao, Chengxuan Qian, Tianyang Wang, Vince Calhoun
Title: fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
Abstract: Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI–text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.
PaperID: 1896,   Poster  https://arxiv.org/pdf/2603.05582    
Authors: Ivan Luiz De Moura Matos, Abdel Djalil SAD SAOUD, Ekaterina Iakovleva, Vito Paolo Pastore, Enzo Tartaglione
Title: Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models
Abstract: The issue of algorithmic biases in deep learning has led to the development of various debiasing techniques, many of which perform complex training procedures or dataset manipulation. However, an intriguing question arises: is it possible to extract fair and bias-agnostic subnetworks from standard vanilla-trained models without relying on additional data, such as an unbiased training set? In this work, we introduce Bias-Invariant Subnetwork Extraction (BISE), a learning strategy that identifies and isolates "bias-free" subnetworks that already exist within conventionally trained models, without retraining or finetuning the original parameters. Our approach demonstrates that such subnetworks can be extracted via pruning and can operate "as is", without modification, effectively relying less on biased features and maintaining robust performance. Our findings contribute towards efficient bias mitigation through structural adaptation of pre-trained neural networks via parameter removal, as opposed to costly strategies that are either data-centric or involve (re)training all model parameters. Extensive experiments on common benchmarks show the advantages of our approach in terms of the performance and computational efficiency of the resulting debiased model. The code will be published upon acceptance.
PaperID: 1897,   Poster  https://arxiv.org/pdf/2603.17390    
Authors: QINGRAN LIN, Fengwei Yang, Chaolun Zhu
Title: Harnessing the Power of Foundation Models for Accurate Material Classification
Abstract: Material classification has emerged as a critical task in computer vision and graphics, supporting the assignment of accurate material properties to a wide range of digital and real-world applications. While traditionally framed as an image classification task, this domain faces significant challenges due to the scarcity of annotated data, limiting the accuracy and generalizability of trained models. Recent advances in vision-language foundation models (VLMs) offer promising avenues to address these issues, yet existing solutions leveraging these models still exhibit unsatisfying results in material recognition tasks. In this work, we propose a novel framework that effectively harnesses foundation models to overcome data limitations and enhance classification accuracy. Our method integrates two key innovations: (a) a robust image generation and auto-labeling pipeline that creates a diverse and high-quality training dataset with material-centric images, and automatically assigns labels by fusing object semantics and material attributes in text prompts; (b) a prior incorporation strategy to distill information from VLMs, combined with a joint fine-tuning method that optimizes a pre-trained vision foundation model alongside VLM-derived priors, preserving broad generalizability while adapting to material-specific features. Extensive experiments demonstrate significant improvements on multiple datasets. We show that our synthetic dataset effectively captures the characteristics of real-world materials, and the integration of priors from vision-language models significantly enhances the final performance.
PaperID: 1898,   Poster  https://arxiv.org/pdf/2503.07853    
Authors: Depanshu Sani, Saket Anand
Title: Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces
Abstract: Traditional classifiers treat all class labels as mutually independent, thereby considering all negative classes to be equally incorrect. This approach fails severely in many real-world scenarios, where a known semantic hierarchy defines a partial order of preferences over negative classes. While hierarchy-aware feature representations have shown promise in mitigating this problem, their performance is typically assessed using metrics like Mistake Severity (MS) and Average Hierarchical Distance (AHD). In this paper, we highlight important shortcomings in existing hierarchical evaluation metrics, demonstrating that they are often incapable of measuring true hierarchical performance. Our analysis reveals that existing methods learn sub-optimal hierarchical representations, despite competitive MS and AHD scores. To counter these issues, we introduce Hierarchical Composition of Orthogonal Subspaces (Hier-COS), a novel framework for unified 'hierarchy-aware fine-grained' and 'hierarchical multi-label' classification. We show that Hier-COS is theoretically guaranteed to be consistent with the given hierarchy tree. Furthermore, our framework implicitly adapts the learning capacity for different classes based on their position within the hierarchy tree, a vital property absent in existing methods. Finally, to address the limitations of evaluation metrics, we propose Hierarchically Ordered Preference Score (HOPS), a ranking-based metric that demonstrably overcomes the deficiencies of current evaluation standards. We benchmark Hier-COS on four challenging datasets, including the deep and imbalanced tieredImageNet-H (12-level) and iNaturalist-19 (7-level). Through extensive experiments, we demonstrate that Hier-COS achieves state-of-the-art performance across all hierarchical metrics for every dataset, while simultaneously beating the top-1 accuracy in all but one case. Lastly, we show that Hier-COS can effectively learn to transform the frozen features extracted from a pretrained backbone (ViT) to be hierarchy-aware, yielding substantial benefits for hierarchical classification performance.
PaperID: 1899,   Poster  https://arxiv.org/pdf/2603.12918    
Authors: Juhye Park, Wooju Lee, Dasol Hong, Changki Sung, Youngwoo Seo, DongWan Kang, Hyun Myung
Title: VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation
Abstract: Accurate global localization is crucial for autonomous driving and robotics, especially in dense urban environments where GNSS is often unreliable due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose corresponding to a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views mainly due to limited spatial correspondences. To address this challenge, we propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to establish horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to resolve vertical misalignment, explicitly mitigating the viewpoint gap. A view-reconstruction loss is introduced to strengthen the view invariance further, encouraging the derived representations to reconstruct the original and cross-view images. Experiments on the KITTI and VIGOR datasets demonstrate that VIRD outperforms the state-of-the-art methods, reducing median position and orientation errors by 50.7% and 76.5% on KITTI, and 18.0% and 46.8% on VIGOR, respectively.
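The polar transformation that VIRD applies to the satellite view can be sketched generically as follows. This is a standard nearest-neighbor polar resampling, not the paper's implementation; the conventions chosen here (north at the top of the image, azimuth increasing clockwise, top output row farthest from the center) and the name `polar_transform` are assumptions made for illustration.

```python
import numpy as np

def polar_transform(sat, out_h, out_w):
    """Resample a square overhead image into a (radius, azimuth) layout.

    sat: (S, S) or (S, S, C) overhead image centered on the query location.
    out_h: number of radius bins (row 0 is the farthest ring).
    out_w: number of azimuth bins covering [0, 2*pi).
    """
    sat = np.asarray(sat)
    size = sat.shape[0]
    center = (size - 1) / 2.0
    # Radius shrinks from the image border (top row) to zero (bottom row).
    radius = center * (out_h - 1 - np.arange(out_h)) / max(out_h - 1, 1)
    theta = 2.0 * np.pi * np.arange(out_w) / out_w
    r, t = np.meshgrid(radius, theta, indexing="ij")
    # North is up (smaller row index); azimuth grows clockwise toward east.
    src_row = center - r * np.cos(t)
    src_col = center + r * np.sin(t)
    # Nearest-neighbor lookup, clipped to stay inside the image.
    src_row = np.clip(np.rint(src_row).astype(int), 0, size - 1)
    src_col = np.clip(np.rint(src_col).astype(int), 0, size - 1)
    return sat[src_row, src_col]
```

After this resampling, each output column corresponds to one viewing direction from the center, so a change in camera heading becomes a horizontal shift of the columns, which is how the transform lines the overhead view up with a ground panorama along the horizontal axis. The vertical axis still differs between the two views, which is the residual misalignment the abstract assigns to the attention step.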
PaperID: 1900,   Poster  https://arxiv.org/pdf/2602.17558    
Authors: Qiucheng Wu, Jing Shi, Simon Jenni, Kushal Kafle, Tianyu Wang, Shiyu Chang, Handong Zhao
Title: RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
Abstract: Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable Lightroom adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.
PaperID: 1901,   Poster  https://arxiv.org/pdf/2512.24461    
Authors: Seohui Bae, Jeonghye Kim, Youngchul Sung, Woohyung Lim
Title: Align While Search: Belief-Guided Exploratory Inference for World-Grounded Embodied Agents
Abstract: In this paper, we propose a test-time adaptive agent that performs exploratory inference through posterior-guided belief refinement, without relying on gradient-based updates or additional training, for LLM agents operating under partial observability. Our agent maintains an external structured belief over the environment state, iteratively updates it via action-conditioned observations, and selects actions by maximizing predicted information gain over the belief space. We estimate information gain using a lightweight LLM-based surrogate and assess world alignment through a novel reward that quantifies the consistency between posterior belief and ground-truth environment configuration. Experiments show that our method outperforms inference-time scaling baselines, such as prompt-augmented or retrieval-enhanced LLMs, in aligning with latent world states with significantly lower integration overhead.
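Illustrative note (not from the paper): the belief-update and information-gain selection described in this abstract can be sketched on a toy discrete problem. The sketch below computes the exact Bayesian expected entropy reduction, whereas the abstract states that the authors estimate information gain with an LLM-based surrogate. The function names, the two-state "key location" world, and the likelihood model are all hypothetical assumptions made for illustration.

```python
import math

def entropy(belief):
    """Shannon entropy (in nats) of a discrete belief {state: probability}."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0)

def posterior(belief, likelihood, action, obs):
    """Bayes update: b'(s) is proportional to P(obs | s, action) * b(s)."""
    unnorm = {s: likelihood(obs, s, action) * p for s, p in belief.items()}
    z = sum(unnorm.values())
    # If the observation is impossible under the belief, keep the prior.
    return {s: v / z for s, v in unnorm.items()} if z > 0 else dict(belief)

def expected_info_gain(belief, likelihood, action, observations):
    """Expected entropy reduction of the belief after taking `action`."""
    h0 = entropy(belief)
    gain = 0.0
    for obs in observations:
        p_obs = sum(likelihood(obs, s, action) * p for s, p in belief.items())
        if p_obs > 0:
            h1 = entropy(posterior(belief, likelihood, action, obs))
            gain += p_obs * (h0 - h1)
    return gain

def select_action(belief, likelihood, actions, observations):
    """Pick the action with the largest expected information gain."""
    return max(actions,
               key=lambda a: expected_info_gain(belief, likelihood, a, observations))

# Hypothetical toy world: a key is either in the drawer or on the shelf.
def toy_likelihood(obs, state, action):
    if action == "open_drawer":
        # Opening the drawer reveals the key's location with certainty.
        return 1.0 if (obs == "found") == (state == "drawer") else 0.0
    return 0.5  # "wait" yields an observation unrelated to the state

belief = {"drawer": 0.5, "shelf": 0.5}
best = select_action(belief, toy_likelihood,
                     ["open_drawer", "wait"], ["found", "empty"])
```

In this toy setting the informative action is preferred because waiting leaves the belief unchanged, while opening the drawer resolves all uncertainty.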
PaperID: 1902,   Poster  https://arxiv.org/pdf/2602.21655    
Authors: Zhijiang Tang, Linhua Wang, JIAXIN QI, Weihao Jiang, Peng Hou, Anxiang Zeng, Jianqiang Huang
Title: CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Abstract: Image captioning remains a fundamental task for vision–language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models. We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?). To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate Complete and Correct Captions. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries, which are derived from the caption decomposition. Our symmetric dual-reward optimization jointly maximizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria. Extensive experiments across standard captioning benchmarks show consistent improvements, offering a principled path to training caption models beyond human-annotation imitation.
PaperID: 1903,   Poster  https://arxiv.org/pdf/2602.20618    
Authors: Haonan An, Xiaohui Ye, Guang Hua, Yihang Tao, Hangcheng Cao, Xiangyu Yu, Yuguang Fang
Title: RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces
Abstract: The proliferation of AI-generated content (AIGC) has facilitated sophisticated face manipulation, severely undermining visual integrity and posing unprecedented challenges to intellectual property (IP). In response, a common proactive defense leverages fragile watermarks to detect, localize, or even recover manipulated regions. However, these methods always assume an adversary unaware of the embedded watermark, overlooking their inherent vulnerability to watermark removal attacks. Furthermore, this fragility is exacerbated in the commonly used dual-watermark strategy that adds a robust watermark for image ownership verification, where mutual interference and limited embedding capacity reduce the fragile watermark's effectiveness. To address this gap, we propose RecoverMark, a watermarking framework that achieves robust manipulation localization, content recovery, and ownership verification simultaneously. Our key insight is twofold. First, we exploit a critical real-world constraint: an adversary must preserve the background's semantic consistency to avoid visual detection, even if they apply global, imperceptible watermark removal attacks. Second, using the image's own content (face, in this paper) as the watermark enhances extraction robustness. Based on these insights, RecoverMark treats the protected face content itself as the watermark and embeds it into the surrounding background. By designing a robust two-stage training paradigm with carefully crafted distortion layers that simulate comprehensive potential attacks and a progressive training strategy, RecoverMark achieves robust, non-fragile watermark embedding for image manipulation localization, recovery, and image IP protection simultaneously. Extensive experiments demonstrate the proposed RecoverMark's robustness against both seen and unseen attacks and its generalizability to in-distribution (ID) and out-of-distribution (OOD) data. Code will be released upon acceptance.
PaperID: 1904,   Poster  https://arxiv.org/pdf/2512.10652    
Authors: Jian-Yu Jiang-Lin, Kang-Yang Huang, LING ZOU, Ling Lo, Sheng-Ping Yang, Yu-Wen Tseng, Kun-Hsiang Lin, Chia-Ling Chen, Yu-Ting Ta, Yan-Tsung Wang, Po-Ching Chen, Hongxia Xie, Hong-Han Shuai, Wen-Huang Cheng
Title: TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection
Abstract: Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.
PaperID: 1905,   Poster  https://arxiv.org/pdf/2507.10355    
Authors: Bo Jiang, Xueyang Ze, Beibei Wang, Xixi Wang, Xixi Wan, Bin Luo
Title: Beyond Graph Model: Reliable VLM Fine-Tuning via Random Graph Adapter
Abstract: Textual adapter-based tuning methods have shown significant potential in transferring knowledge from pre-trained Vision-Language Models (VLMs) to downstream tasks. Existing works generally employ a deterministic textual feature adapter to refine each category's textual representation. However, due to inherent factors such as different attributes and contexts, there exists significant diversity in textual descriptions for each category. Such description diversity offers rich discriminative semantic knowledge that can benefit downstream visual learning tasks. A traditional deterministic adapter model cannot adequately capture this varied semantic information. It is also desirable to exploit the inter-class relationships in the VLM adapter. To address these issues, we propose to incorporate a random graph model into the VLM adapter and develop a novel Vertex Random Graph Adapter (VRGAdapter). VRGAdapter first models the inherent diverse descriptions of each category and inter-class relationships of different categories simultaneously by leveraging a Vertex Random Knowledge Graph (VRKG) model. Then, it employs probabilistic message propagation on VRKG to learn a context-aware distribution representation for each class node. Finally, it adopts a reparameterized sampling function to achieve textual adapter learning. Note that VRGAdapter provides a more general adapter solution that encompasses the traditional graph-based adapter as a special case. In addition, to enable more robust performance for downstream tasks, we also introduce a new Uncertainty-guided Multi-branch Fusion (UMF) scheme that dynamically integrates multiple pre-trained models for ensemble prediction. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our approach.
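Illustrative note (not from the paper): the "reparameterized sampling function" mentioned in this abstract is commonly realized as the Gaussian reparameterization trick, where a sample is written as mean plus scaled standard-normal noise so that it stays differentiable with respect to the distribution parameters. The sketch below assumes a diagonal Gaussian per class node parameterized by a mean and a log-variance; that parameterization, and the function name, are assumptions for illustration and may differ from what VRGAdapter actually uses.

```python
import math
import random

def reparameterized_sample(mean, log_var, rng=random):
    """Draw z = mean + sigma * eps per dimension, with eps ~ N(0, 1).

    `mean` and `log_var` are equal-length lists describing a diagonal
    Gaussian; sigma = exp(0.5 * log_var). All randomness is confined to
    eps, which is what lets gradients flow to mean and log_var in an
    autodiff framework.
    """
    sample = []
    for m, lv in zip(mean, log_var):
        eps = rng.gauss(0.0, 1.0)
        sigma = math.exp(0.5 * lv)
        sample.append(m + sigma * eps)
    return sample

# Hypothetical 2-dimensional class-node distribution.
rng = random.Random(0)
class_embedding = reparameterized_sample([0.2, -1.0], [-2.0, -2.0], rng)
```

Shrinking the variance toward zero makes the sample collapse onto the mean, which is consistent with the abstract's remark that a deterministic adapter arises as a special case.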
PaperID: 1906,   Poster  https://arxiv.org/pdf/2511.22184    
Authors: Daniel Jung, Kyoung Mu Lee
Title: Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation
Abstract: Foot contact plays a critical role in human interaction with the world, and thus exploring foot contact can advance our understanding of human movement and physical interaction. Despite its importance, existing methods often approximate foot contact using a zero-velocity constraint and focus on joint-level contact, failing to capture the detailed interaction between the foot and the world. Dense estimation of foot contact is crucial for accurately modeling this interaction, yet predicting dense foot contact from a single RGB image remains largely underexplored. There are two main challenges for learning dense foot contact estimation. First, shoes exhibit highly diverse appearances, making it difficult for models to generalize across different styles. Second, the ground often has a monotonous appearance, making it difficult to extract informative features. To tackle these issues, we present a FEet COntact estimation (FECO) framework that learns dense foot contact with shoe style-invariant and ground-aware learning. To overcome the challenge of shoe appearance diversity, our approach incorporates shoe style adversarial training that enforces shoe style-invariant features for contact estimation. To effectively utilize ground information, we introduce a ground feature extractor that captures ground properties based on spatial context. As a result, our proposed method achieves robust foot contact estimation regardless of shoe appearance and effectively leverages ground information. Code will be released.
PaperID: 1907,   Poster  https://arxiv.org/pdf/2512.02014    
Authors: Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, Viktar Atliha, Tony Ng, Xiao Han, Chuyan Zhu, Chenyang Zhang, Ding Liu, Juan-Manuel Pérez-Rúa, Sen He, Jürgen Schmidhuber, Wenhu Chen, Ping Luo, Wei Liu, Tao Xiang, Jonas Schult, Yuren Cong
Title: TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
Abstract: Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.
PaperID: 1908,   Poster  https://arxiv.org/pdf/2603.29773    
Authors: Fengyang Xiao, Peng Hu, Lei Xu, XingE Guo, Guanyi Qin, Yuqi Shen, Chengyu Fang, Rihan Zhang, Chunming He, Sina Farsiu
Title: Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration
Abstract: Real-world image restoration aims to restore high-quality (HQ) images from degraded low-quality (LQ) inputs captured under uncontrolled conditions. Existing methods typically depend on ground-truth (GT) supervision, assuming that GT provides perfect reference quality. However, GT can still contain images with inconsistent perceptual fidelity, causing models to converge to the average quality level of the training data rather than achieving the highest perceptual quality attainable. To address these problems, we propose a novel framework, termed IQPIR, that introduces an Image Quality Prior (IQP), extracted from pre-trained No-Reference Image Quality Assessment (NR-IQA) models, to explicitly guide the restoration process toward perceptually optimal outputs. Our approach synergistically integrates IQP with a learned codebook prior through three key mechanisms: (1) a quality-conditioned Transformer, where NR-IQA-derived scores serve as conditioning signals to steer the predicted representation toward maximal perceptual quality, a design that provides a plug-and-play enhancement compatible with existing restoration architectures without structural modification; (2) a dual-branch codebook structure, which disentangles common and HQ-specific features, ensuring a comprehensive representation of both generic structural information and quality-sensitive attributes; and (3) a discrete representation-based quality optimization strategy, which mitigates over-optimization effects commonly observed in continuous latent spaces. Extensive experiments on real-world image restoration demonstrate that our method not only surpasses cutting-edge methods but also serves as a generalizable quality-guided enhancement strategy for existing methods. The code will be released.
PaperID: 1909,   Poster  https://arxiv.org/pdf/2603.12997    
Authors: Chen Feng, Zhuo ZHI, Zhao Huang, Jiawei Ge, Ling Xiao, Nicu Sebe, Georgios Tzimiropoulos, Ioannis Patras
Title: Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis
Abstract: Statistically consistent methods based on the noise transition matrix (T) offer a theoretically grounded solution to Learning with Noisy Labels (LNL), with guarantees of convergence to the optimal clean-data classifier. In practice, however, these methods are often outperformed by empirical approaches such as sample selection, and this gap is usually attributed to the difficulty of accurately estimating T. The common assumption is that, given a perfect T, noise-correction methods would recover their theoretical advantage. In this work, we put this longstanding hypothesis to a decisive test. We conduct experiments under idealized conditions, providing correction methods with a perfect, oracle transition matrix. Even under these ideal conditions, we observe that these methods still suffer from performance collapse during training. This compellingly demonstrates that the failure is not fundamentally a T-estimation problem, but stems from a more deeply rooted flaw. To explain this behaviour, we provide a unified analysis that links three levels: macroscopic convergence states, microscopic optimisation dynamics, and information-theoretic limits on what can be learned from noisy labels. Together, these results give a formal account of why ideal noise correction fails and offer concrete guidance for designing more reliable methods for learning with noisy labels.
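Illustrative note (not from the paper): one widely used T-based approach is forward loss correction, in which the model's clean-class probabilities are pushed through the transition matrix before the loss is taken against the noisy label. The sketch below assumes the convention T[i][j] = P(noisy label = j | clean label = i); whether this is among the specific correction methods evaluated in the paper is not stated in the abstract, and the matrix values are made up for the example.

```python
import math

def forward_corrected_probs(clean_probs, T):
    """Map clean-class probabilities to noisy-label probabilities.

    q[j] = sum_i clean_probs[i] * T[i][j], where T[i][j] is the
    probability that a sample of clean class i is observed with label j.
    """
    k = len(T)
    return [sum(clean_probs[i] * T[i][j] for i in range(k)) for j in range(k)]

def forward_corrected_loss(clean_probs, noisy_label, T):
    """Cross-entropy of the observed noisy label under the corrected probabilities."""
    q = forward_corrected_probs(clean_probs, T)
    return -math.log(max(q[noisy_label], 1e-12))  # clamp to avoid log(0)

# Hypothetical two-class oracle matrix: class 0 flips to 1 with prob 0.2,
# class 1 flips to 0 with prob 0.3.
T = [[0.8, 0.2],
     [0.3, 0.7]]
clean_probs = [0.9, 0.1]                       # model's clean-class prediction
noisy_probs = forward_corrected_probs(clean_probs, T)
loss = forward_corrected_loss(clean_probs, 1, T)
```

With an identity matrix the correction reduces to ordinary cross-entropy, which makes the role of T in the loss easy to see.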
PaperID: 1910,   Poster  https://arxiv.org/pdf/2505.04594    
Authors: Zhihao Zhang, Abhinav Kumar, Girish Ganesan Ganesan, Xiaoming Liu
Title: Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection
Abstract: Monocular 3D detection (Mono3D) aims to infer 3D bounding boxes from a single RGB image. Without auxiliary sensors such as LiDAR, this task is inherently ill-posed since the 3D-to-2D projection introduces depth ambiguity. Previous works often predict 3D attributes (e.g., depth, size, and orientation) in parallel, overlooking that these attributes are inherently correlated through the 3D-to-2D projection. However, simply enforcing such correlations through sequential prediction can propagate errors across attributes, especially when objects are occluded or truncated, where inaccurate size or orientation predictions can further amplify depth errors. Therefore, neither parallel nor sequential prediction is optimal. In this paper, we propose MonoCoP, an adaptive framework that learns when and how to leverage inter-attribute correlations with two complementary designs. A Chain-of-Prediction (CoP) explores inter-attribute correlations through feature-level learning, propagation, and aggregation, while an Uncertainty-Guided Selector (GS) dynamically switches between CoP and parallel paradigms for each object based on the predicted uncertainty. By combining their strengths, MonoCoP achieves state-of-the-art (SOTA) performance on KITTI, nuScenes, and Waymo, significantly improving depth accuracy, particularly for distant objects.
PaperID: 1911,   Poster  https://arxiv.org/pdf/2601.09111    
Authors: Li Yang, Aming Wu, Zihao Zhang, Yahong Han
Title: Towards Open Environments: General Vision-Language Navigation via Fast-Slow Interactive Reasoning
Abstract: Vision-Language Navigation (VLN) aims to enable agents to navigate to a target location based on language instructions. Traditional VLN often follows a closed-set assumption, i.e., training and test data share the same style of input images and instructions. However, the real world is open and filled with various unseen environments, posing enormous difficulties for closed-set methods. To this end, we focus on the General Scene Adaptation (GSA-VLN) task, aiming to learn generalized navigation ability by introducing diverse environments and inconsistent instructions. For this task, when facing unseen environments and instructions, the challenge mainly lies in how to enable the agent to dynamically produce generalized strategies during the navigation process. Recent research indicates that, by means of fast and slow cognition systems, human beings can generate stable policies, which strengthens their adaptation to the open world. Inspired by this idea, we propose slow4fast-VLN, establishing a dynamic interactive fast-slow reasoning framework. The fast-reasoning module, an end-to-end strategy network, outputs actions via real-time input. It accumulates execution records in a history repository to build memory. The slow-reasoning module analyzes the memories generated by the fast-reasoning module. Through deep reflection, it extracts experiences that enhance the generalization ability of decision-making. These experiences are structurally stored and used to continuously optimize the fast-reasoning module. Unlike traditional methods that treat fast-slow reasoning as independent mechanisms, our framework enables fast-slow interaction. By leveraging the experiences from slow reasoning, it continually improves the accuracy and generalization ability of fast decisions. This interaction allows the system to continuously adapt and efficiently execute navigation tasks when facing unseen scenarios. Extensive experiments demonstrate the superiority of our method.
PaperID: 1912,   Poster  https://arxiv.org/pdf/2604.15670    
Authors: shuyan ke, Yifan Mei, Changli Wu, yonghan zheng, Jiayi Ji, Liujuan Cao, Rongrong Ji
Title: PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
Abstract: Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data introduces fundamentally different challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these UAV-specific conditions, we formally define the UAV Reasoning Segmentation task and organize its semantic demands into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, the first large-scale UAV reasoning segmentation benchmark, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision covering all three reasoning types. We further propose PixDLM, a pixel-level multimodal language model equipped with a Dual-Path Vision Encoder that preserves fine-grained high-resolution cues while maintaining strong global semantic alignment. Extensive experiments on DRSeg demonstrate that PixDLM achieves superior semantic consistency and spatial localization accuracy compared with existing multimodal models, offering a unified and efficient baseline for UAV reasoning segmentation. All datasets, models, and code will be released.
PaperID: 1913,   Poster  https://arxiv.org/pdf/2510.00507    
Authors: Yurun Chen, Xueyu Hu, Yuhan Liu, Ziqi Wang, Zeyi Liao, Lin Chen, Feng Wei, Yuxi qian, Bo Zheng, Keting Yin, Shengyu Zhang
Title: Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs
Abstract: As multimodal LLM-driven agents advance in autonomy and generalization, traditional static datasets face inherent scalability limitations and are insufficient for fully assessing their capabilities in increasingly complex and diverse tasks. Existing studies have attempted to generate agent tasks using LLMs, but due to the inherent hallucinations of LLMs and the lack of internal data relationship modeling, these tasks often exhibit semantic inconsistencies and solvability issues. To address these challenges, we introduce Graph2Eval, a knowledge-graph-driven framework for automated, scalable, and semantically grounded agent task generation. At its core, Graph2Eval leverages a knowledge graph built from heterogeneous external data sources as a structured task space, generating multimodal agent tasks through subgraph sampling and task construction guided by task templates and meta-path strategies. To further ensure task reliability, a multi-stage filtering pipeline based on node reachability analysis, LLM scoring, and similarity analysis ensures the diversity and solvability of the generated tasks. By unifying both RAG Agent and Web Agent scenarios, Graph2Eval enables efficient generation of multimodal document understanding tasks and multi-step web interaction tasks. We instantiate the framework with Graph2Eval-Bench, a curated dataset of 1,319 tasks spanning document understanding and web interaction scenarios. Extensive experiments show that, on average, Graph2Eval improves task semantic consistency by 20% and solvability by 17% over baselines, while Graph2Eval-Bench effectively distinguishes agent performance, offering a new perspective on automated agent evaluation.
PaperID: 1914,   Poster  https://arxiv.org/pdf/2603.17693    
Authors: Sontao Jiang, Sibo Song, Chenyi Zhou, Yuan Wang, Ruizhe Chen, Tongkun Guan, Ruilin Luo, Yan Zhang, Zhihang Tang, Yuchong Sun, Hang Zhang, Zhibo Yang, Shuai Bai, Junyang Lin, Zuozhu Liu
Title: Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos
Abstract: The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives (state tracking, retrodictive inference), constructing 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Remarkably, our 7.7K synthetic CoT samples outperform Video-R1's 165K real-world samples. We attribute this to fundamental temporal skills, such as tracking frame-by-frame changes and comparing velocity, that transfer effectively from abstract synthetic patterns to complex real-world scenarios. This establishes a new paradigm for video post-training: systematic video temporal learning through carefully designed synthetic data provides a more cost-efficient scaling path.
PaperID: 1915,   Poster  https://arxiv.org/pdf/2512.14870    
Authors: Dan Ben Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin
Title: HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
Abstract: Video Large Language Models (Video-LLMs) are rapidly improving, yet current Video Question Answering (VideoQA) benchmarks often allow questions to be answered from a single salient cue, under-testing reasoning that must aggregate multiple temporally separated pieces of visual evidence. In this direction, we present HERBench, a VideoQA benchmark purpose-built to assess multi-evidence integration across time. Each question is constructed to require aggregating at least three non-overlapping evidential cues across distinct video segments (so neither language priors nor a single snapshot can suffice). HERBench comprises 26K five-way multiple-choice questions organized into twelve compositional tasks that probe identity binding, cross-entity relations, temporal ordering, co-occurrence verification, and counting. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes substantially higher demand than prior datasets (mean MRFS 5.5 vs. 2.6-4.2). Evaluating 13 state-of-the-art Video-LLMs on HERBench reveals pervasive failures: accuracies of 31–42% are only slightly above the 20% random-guess rate. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. By making cross-time evidence both unavoidable and quantifiable, HERBench establishes a principled target for advancing robust, compositional video understanding.
PaperID: 1916,   Poster  https://arxiv.org/pdf/2604.01634    
Authors: Junyoung Sung, Seungwoo Lyu, Minjun Kim, Sumin An, Arsha Nagrani, Paul Hongsuck Seo
Title: CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning
Abstract: Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet, most multimodal benchmarks fail to capture this ability: they typically rely on single images or sets of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image–text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT covers diverse domains, ranging from natural images and videos to text-rich sources, and includes a manually verified test set for reliable evaluation. Experiments on this benchmark reveal that even state-of-the-art models struggle on such reasoning tasks. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other multi-image benchmarks.
PaperID: 1917,   Poster  https://arxiv.org/pdf/2506.17212    
Authors: Tianjiao Yu, Vedant Shah, Muntasir Wahed, Ying Shen, Kiet A. Nguyen, Ismini Lourentzou
Title: Part$^{2}$GS: Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting
Abstract: Articulated objects are common in the real world, yet modeling their structure and motion remains a challenging task for 3D reconstruction methods. In this work, we introduce Part^2GS, a novel framework for modeling articulated digital twins of multi-part objects with high-fidelity geometry and physically consistent articulation. Part^2GS leverages a part-aware 3D Gaussian representation that encodes articulated components with learnable attributes, enabling structured, disentangled transformations that preserve high-fidelity geometry. To ensure physically consistent motion, we propose a motion-aware canonical representation guided by physics-based constraints, including contact enforcement, velocity consistency, and vector-field alignment. Furthermore, we introduce a field of repel points to prevent part collisions and maintain stable articulation paths, significantly improving motion coherence over baselines. Extensive evaluations on both synthetic and real-world datasets show that Part^2GS consistently outperforms state-of-the-art methods by up to 10× in Chamfer Distance for movable parts.
PaperID: 1918,   Poster  https://arxiv.org/pdf/2503.14295    
Authors: baiqin wang, Xiangyu Zhu, Fan Shen, HAO XU, Zhen Lei
Title: PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation
Abstract: Recent advancements in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over the talking face, such as speaking style and emotional expression, resulting in uniform facial motion. In this paper, we focus on improving two key factors: lip-audio alignment control (LAC) and emotion control (EMC), to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control ensures accurate lip-sync across varied speaking styles to simulate different talking habits, whereas emotion control aims to generate realistic emotional expressions with varying intensities and mixed emotional states. To achieve precise facial animation control, we propose a novel and efficient framework, PC-Talk, which enables lip-audio alignment control and emotion control through implicit keypoint deformations. First, our LAC module generates lip-synced talking faces with a specific speaking style, derived from either a video reference or preset options. It also supports lip movement scale adjustment and fine-grained editing of speaking styles for specific articulations. Second, our EMC module produces vivid emotional facial expressions through pure emotional deformation. It further enables precise control over emotion intensity and compound emotions across different facial regions. Our method demonstrates outstanding control capabilities and achieves state-of-the-art performance on the HDTF and MEAD datasets in experiments. The code will be publicly available.
PaperID: 1919,   Poster  https://arxiv.org/pdf/2601.01593    
Authors: Haonan Cai, Yuxuan Luo, Zhouhui Lian
Title: Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
Abstract: Manual font design is an intricate process that transforms a stylistic visual concept into a coherent glyph set. This challenge persists in automated Few-shot Font Generation (FFG), where models often struggle to preserve both the structural integrity and stylistic fidelity from limited references. While autoregressive (AR) models have demonstrated impressive generative capabilities, their application to FFG is constrained by conventional patch-level tokenization, which neglects global dependencies crucial for coherent font synthesis. Moreover, existing FFG methods remain within the image-to-image paradigm, relying solely on visual references and overlooking the role of language in conveying stylistic intent during font design. To address these limitations, we propose GAR-Font, a novel AR framework for multimodal few-shot font generation. GAR-Font introduces a global-aware tokenizer that effectively captures both local structures and global stylistic patterns, a multimodal style encoder offering flexible style control through a lightweight language-style adapter without requiring intensive multimodal pretraining, and a post-refinement pipeline that further enhances structural fidelity and style coherence. Extensive experiments show that GAR-Font outperforms existing FFG methods, excelling in maintaining global style faithfulness and achieving higher-quality results with textual stylistic guidance.
PaperID: 1920,   Poster  https://arxiv.org/pdf/2511.22863    
Authors: Fengyi Fang, Sicheng Yang, Wenming Yang
Title: CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation
Abstract: Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty in achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with unified cross-dataset motion representation and a hierarchically controlled denoiser to achieve highly controlled, coordinated gesture generation. CoordSpeaker is the first to explore gesture understanding and captioning to tackle the semantic gap in gesture generation, while offering a novel perspective of bidirectional gesture-text mapping. Extensive experiments demonstrate that our method produces high-quality gestures that are both rhythmically synchronized with speech and semantically coherent with arbitrary captions, achieving superior performance with higher efficiency compared to existing approaches. Code and demo video are in the supplementary material and will be released upon paper acceptance.
PaperID: 1921,   Poster  https://arxiv.org/pdf/2602.00267    
Authors: Gemma Canet Tarrés, Manel Baradad, Francesc Moreno-Noguer, Yumeng Li
Title: PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories
Abstract: Recent advances in generative AI have dramatically improved photorealistic image synthesis, yet they fall short for studio-level multi-object compositing. This task demands simultaneous (i) near-perfect preservation of each item’s identity, (ii) precise background and color fidelity, (iii) control over layout and design elements, and (iv) complete, appealing displays showcasing all objects. However, current state-of-the-art models often alter object details, omit or duplicate objects, and produce layouts with incorrect relative sizing or inconsistent item presentations. To bridge this gap, we introduce PLACID, a framework that transforms a collection of object images into an appealing multi-object composite. Our approach makes two main contributions. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve object consistency, identities, and background details by exploiting temporal priors from videos. Second, we propose a novel data curation strategy that generates synthetic sequences where randomly placed objects smoothly move to their target positions. This synthetic data aligns with the video model’s temporal priors during training. At inference, objects initialized at random positions consistently converge into coherent layouts guided by text, with the final frame serving as the composite image. Extensive quantitative evaluations and user studies demonstrate that PLACID surpasses state-of-the-art methods in multi-object compositing, achieving superior identity, background, and color preservation, with fewer omitted objects and visually appealing results.
PaperID: 1922,   Poster  https://arxiv.org/pdf/2511.18719    
Authors: Ziqi Ni, Yuanzhi Liang, Rui Li, Yi Zhou, Haibin Huang, Chi Zhang, Xuelong Li
Title: Seeing What Matters: Visual Preference Policy Optimization for Visual Generation
Abstract: Reinforcement learning (RL) has become a powerful tool for post-training visual generative models, with Group Relative Policy Optimization (GRPO) increasingly used to align generators with human preferences. However, existing GRPO pipelines rely on a single scalar reward per sample, treating each image or video as a holistic entity and ignoring the rich spatial and temporal structure of visual content. This coarse supervision hinders the correction of localized artifacts and the modeling of fine-grained perceptual cues. We introduce Visual Preference Policy Optimization (ViPO), a GRPO variant that lifts scalar feedback into structured, pixel-level advantages. ViPO employs a Perceptual Structuring Module that uses pretrained vision backbones to construct spatially and temporally aware advantage maps, redistributing optimization pressure toward perceptually important regions while preserving the stability of standard GRPO. Across both image and video benchmarks, ViPO consistently outperforms vanilla GRPO, improving in-domain alignment with human-preference rewards and enhancing generalization on out-of-domain evaluations. The method is architecture-agnostic, lightweight, and fully compatible with existing GRPO training pipelines, providing a more expressive and informative learning signal for visual generation.
PaperID: 1923,   Poster  https://arxiv.org/pdf/2511.20889    
Authors: Taehoon Kim, Henry Gouk, Timothy Hospedales
Title: Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation
Abstract: Test-time alignment (TTA) aims to adapt models to specific rewards during inference. However, existing methods tend to either under-optimise or over-optimise (reward hack) the target reward function. We propose Null-Text Test-Time Alignment (Null-TTA), which aligns diffusion models by optimising the unconditional embedding in classifier-free guidance, rather than manipulating latent or noise variables. Due to the structured semantic nature of the text embedding space, this ensures alignment occurs on a semantically coherent manifold and prevents reward hacking (exploiting non-semantic noise patterns to improve the reward). Since the unconditional embedding in classifier-free guidance serves as the anchor for the model's generative distribution, Null-TTA directly steers the model's generative distribution towards the target reward rather than just adjusting the samples, even without updating model parameters. Thanks to these desirable properties, we show that Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalisation. This establishes semantic-space optimisation as an effective and principled novel paradigm for TTA.
PaperID: 1924,   Poster  https://arxiv.org/pdf/2603.05295    
Authors: Sicheng Fan, Rui Wan, Yifei Leng, Gaoning Liang, Li Ling, Yanyi Shang, Dehan Kong
Title: WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces
Abstract: We introduce WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, designed to accelerate reproducible research in web agents. It contains 31,725 trajectories and 318k steps, featuring a core Triple Alignment of visual, structural, and action data to provide rich, multi-modal supervision. The data is collected via a scalable pipeline that ensures coverage of complex, high-value tasks often missed by synthetic methods. Leveraging this dataset, we propose a Dual Mid-Training recipe that decouples spatial grounding from planning, achieving state-of-the-art performance on our proposed WebChainBench and other public GUI benchmarks. Our work provides the data and insights necessary to build and rigorously evaluate the next generation of scalable web agents.
PaperID: 1925,   Poster  https://arxiv.org/pdf/2603.26299    
Authors: Wooseong Jeong, Wonyoung Lee, Kuk-Jin Yoon
Title: Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy
Abstract: Merging multiple Low-Rank Adaptation (LoRA) modules into a single model is a promising approach for constructing general-purpose systems, but it remains challenging because low-rank update directions introduced by LoRA adapters often span different subspaces and contribute unevenly across directions. When merged naively, such mismatches can weaken the directions most critical to certain task losses while overemphasizing relatively less important ones, ultimately reducing the model’s ability to represent all tasks faithfully. We revisit this problem through two perspectives: subspace coverage, which captures how broadly LoRA directions cover diverse representational directions, and anisotropy, which reflects the imbalance of influence across those directions. We then propose TARA-Merging, short for Task-Rank Anisotropy Alignment. It explicitly incorporates task preferences by aligning the merging weights with a preference-weighted cross-entropy pseudo-loss while preserving LoRA directions that encode task-relevant subspaces. This alignment ensures that the merged model maintains broad subspace coverage and accounts for anisotropy via direction-wise reweighting. Across eight vision and six NLI benchmarks, TARA-Merging consistently outperforms vanilla and LoRA-aware baselines, demonstrating strong robustness and generalization, and highlighting the importance of addressing both subspace coverage and anisotropy in LoRA merging.
PaperID: 1926,   Poster  https://arxiv.org/pdf/2604.07723    
Authors: Jiahao Li, Yang Lu, Yachao Zhang, Fangyong Wang, Yuan Xie, Yanyun Qu
Title: Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation
Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment arbitrary category regions in images using open-vocabulary prompts, necessitating that existing methods possess pixel-level vision-language alignment capability. Typically, this capability involves computing the cosine similarity, i.e., logits, between visual and linguistic features, and minimizing the distribution discrepancy between the logits and the ground truth (GT) to generate optimal logits that are subsequently used to construct segmentation maps, yet it depends on time-consuming iterative training or model-specific attention modulation. In this work, we propose a more direct approach that eschews the logits-optimization process by directly deriving an analytic solution for the segmentation map. We posit a key hypothesis: the distribution discrepancy encodes semantic information; specifically, this discrepancy exhibits consistency across patches belonging to the same category but inconsistency across different categories. Based on this hypothesis, we directly utilize the analytic solution of this distribution discrepancy as the semantic maps. In other words, we reformulate the optimization of the distribution discrepancy as deriving its analytic solution, thereby eliminating time-consuming iterative training, freeing us from model-specific attention modulation, and achieving state-of-the-art performance on eight benchmark datasets.
PaperID: 1927,   Poster  https://arxiv.org/pdf/2505.22337    
Authors: Samara Ghrer, Christophe Godin, Stefanie Wuhrer
Title: Learning to Infer Parameterized Representations of Plants from 3D Scans
Abstract: Plants frequently contain numerous organs, organized in 3D branching systems defining the plant's architecture. Reconstructing the architecture of plants from unstructured observations is challenging because of self-occlusion and spatial proximity between organs, which are often thin structures. To address this task, we propose an approach that infers a parameterized representation of the plant's architecture from a given 3D scan of a plant. In addition to the plant's branching structure, this representation contains parametric information for each plant organ, and can therefore be used directly in a variety of tasks. In this data-driven approach, we train a recursive neural network with virtual plants generated using a procedural model. After training, the network infers a parametric tree-like representation from an input 3D point cloud. Our method is applicable to any plant that can be represented as a binary axial tree. We quantitatively evaluate our approach on Chenopodium album plants on reconstruction, segmentation and skeletonization, which are important problems in plant phenotyping. In addition to carrying out several tasks at once, our method achieves results on par with strong baselines for each task. We apply our method, trained exclusively on synthetic data, to 3D scans and show that it generalizes well.
PaperID: 1928,   Poster  https://arxiv.org/pdf/2511.15618    
Authors: Tingrui Shen, Yiheng Zhang, Chen Tang, Chuan Ping, Zixing Zhao, Le Wan, Yuwang Wang, Ronggang Wang, Shengfeng He
Title: FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation
Abstract: Autoregressive models can generate high-quality 3D meshes by sequentially producing vertices and faces, but their token-by-token decoding results in slow inference, limiting practical use in interactive and large-scale applications. We present FlashMesh, a fast and high-fidelity mesh generation framework that rethinks autoregressive decoding through a predict-correct-verify paradigm. The key insight is that mesh tokens exhibit strong structural and geometric correlations that enable confident multi-token speculation. FlashMesh leverages this by introducing a speculative decoding scheme tailored to the commonly used hourglass transformer architecture, enabling parallel prediction across face, point, and coordinate levels. Extensive experiments show that FlashMesh achieves up to a 2× speedup over standard autoregressive models while also improving generation fidelity. Our results demonstrate that structural priors in mesh data can be systematically harnessed to accelerate and enhance autoregressive generation.
PaperID: 1929,   Poster  https://arxiv.org/pdf/2512.23333    
Authors: Ke Niu, Haiyang Yu, Zhuofan Chen, Zhengtao Yao, Weitao Jia, Xiaodong Ge, Jingqun Tang, Benlei Cui, Bin Li, Xiangyang Xue
Title: CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation
Abstract: Computer-Aided Design (CAD) is essential in industrial design, but the complexity of traditional CAD modeling and workflows presents significant challenges for automating the generation of high-precision, editable CAD models. Existing methods, such as 3D reconstruction from sketches, often produce non-editable, approximate models that fall short of meeting the stringent requirements for precision and editability in industrial design. Moreover, the reliance on text or image-based inputs often requires significant manual annotation, limiting their scalability and applicability in industrial settings. To overcome these challenges, we propose the Heterogeneous Collaborative Multi-Expert Reinforcement Learning (CME-CAD) paradigm, a novel training paradigm for CAD code generation. Our approach integrates the complementary strengths of these models, facilitating collaborative learning and improving the model’s ability to generate accurate, constraint-compatible, and fully editable CAD models. We introduce a two-stage training process: Multi-Expert Fine-Tuning (MEFT), and Multi-Expert Reinforcement Learning (MERL). Additionally, we present CADExpert, an open-source benchmark consisting of 17,299 instances, including orthographic projections with precise dimension annotations, expert-generated Chain-of-Thought (CoT) processes, executable CADQuery code, and rendered 3D models.
PaperID: 1930,   Poster  https://arxiv.org/pdf/2511.03334    
Authors: Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, Limin Wang
Title: UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
Abstract: Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for human-centric joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation (FAM) module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance (MA-CFG), a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables the seamless unification of pivotal audio-visual tasks within a single model. Furthermore, we demonstrate that joint multi-task training can further boost the performance of joint generation. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.
PaperID: 1931,   Poster  https://arxiv.org/pdf/2603.25159    
Authors: SuYeon Kim, Wongyu Lee, MyeongAh Cho
Title: A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection
Abstract: 3D anomaly detection targets the detection and localization of defects in 3D point clouds, with models trained solely on normal data. While a unified model improves scalability by learning across multiple categories, it often suffers from Inter-Category Entanglement (ICE), where latent features from different categories overlap, causing the model to adopt incorrect semantic priors during reconstruction and ultimately yielding unreliable anomaly scores. To address this issue, we propose the Semantically Disentangled Unified Model for 3D Anomaly Detection, which reconstructs features conditioned on disentangled semantic representations. Our framework consists of three key components: (i) Coarse-to-Fine Global Tokenization for forming instance-level semantic identity, (ii) Category-Conditioned Contrastive Learning for disentangling category semantics, and (iii) a Geometry-Guided Decoder for semantically consistent reconstruction. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that our method achieves state-of-the-art results for both unified and category-specific models, improving object-level AUROC by 2.8% and 9.1%, respectively, while enhancing the reliability of unified 3D anomaly detection.
PaperID: 1932,   Poster  https://arxiv.org/pdf/2603.22758    
Authors: WonJun Moon, Hyun Seok Seong, Jae-Pil Heo
Title: Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning
Abstract: Video Object-Centric Learning seeks to decompose raw videos into a small set of object slots, but existing slot-attention models often suffer from severe over-fragmentation. This is because the model is implicitly encouraged to occupy all slots to minimize the reconstruction objective, thereby representing a single object with multiple redundant slots. We tackle this limitation with a reconstruction-guided slot curriculum (SlotCurri). Training starts with only a few coarse slots and progressively allocates new slots where reconstruction error remains high, thus expanding capacity only where it is needed and preventing fragmentation from the outset. Yet, during slot expansion, meaningful sub-parts can emerge only if coarse-level semantics are already well separated; however, with a small initial slot budget and an MSE objective, semantic boundaries remain blurry. Therefore, we augment MSE with a structure-aware loss that preserves local contrast and edge information to encourage each slot to sharpen its semantic boundaries. Lastly, we propose a cyclic inference that rolls slots forward and then backward through the frame sequence, producing temporally consistent object representations even in the earliest frames. All combined, SlotCurri addresses object over-fragmentation by allocating representational capacity where reconstruction fails, further enhanced by structural cues and cyclic inference. Notable FG-ARI gains of +6.8 on YouTube-VIS and +8.3 on MOVi-C validate the effectiveness of SlotCurri.
PaperID: 1933,   Poster  https://arxiv.org/pdf/2604.10573    
Authors: Bo Zhou, Qiuxia Lai, Zeren Sun, Xiangbo Shu, Yazhou Yao, Wenguan Wang
Title: Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images
Abstract: Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce UniSplat, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural information from incomplete visual cues, yielding geometry-aware representations even under unposed inputs. Second, we develop a coarse-to-fine Gaussian splatting strategy that reduces appearance-semantics inconsistencies by progressively refining the radiance field. Finally, to enforce geometric–semantic consistency, we introduce a pose-conditioned recalibration mechanism that interrelates the outputs of multiple heads by reprojecting predicted 3D point and semantic maps into the image plane using estimated camera parameters, and aligning them with corresponding RGB and semantic predictions to ensure cross-task consistency and resolve geometry–semantic mismatches. Together, these components yield unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.
PaperID: 1934,   Poster  https://arxiv.org/pdf/2604.15663    
Authors: Jiahui Geng, Qing Li, Fengyu Cai, Fakhri Karray
Title: CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval
Abstract: Code search, framed as information retrieval (IR), underpins modern software engineering and increasingly powers retrieval-augmented generation (RAG), improving code discovery, reuse, and the reliability of LLM-based coding. Yet existing code IR models remain largely text-centric and often overlook the visual and structural aspects inherent in programming artifacts such as web interfaces, data visualizations, SVGs, schematic diagrams, and UML. To bridge this gap, we introduce MMCoIR, the first comprehensive benchmark for evaluating multimodal code IR across five visual domains, and show through extensive evaluation that the task is challenging. We then propose CodeMMR, a unified retrieval model that jointly embeds natural language, code, and images into a shared semantic space through instruction-based multimodal alignment. CodeMMR achieves strong generalization across modalities and languages, outperforming competitive baselines (e.g., UniIR, GME, VLM2Vec) by an average of 10 points on nDCG@10. Moreover, integrating CodeMMR into RAG enhances code generation fidelity and visual grounding on unseen code generation tasks, underscoring the potential of multimodal retrieval as a core enabler for next-generation intelligent programming systems.
PaperID: 1935,   Poster  https://arxiv.org/pdf/2509.25210    
Authors: Hao Chen, Tao Han, Jie Zhang, Song Guo, Lei Bai
Title: STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting
Abstract: To gain finer regional forecasts, many works have explored the regional integration from the global atmosphere, e.g., by solving boundary equations in physics-based methods or cropping regions from global forecasts in data-driven methods. However, the effectiveness of these methods is often constrained by static and imprecise regional boundaries, resulting in poor generalization ability. To address this issue, we propose Spatial-Temporal Weather Forecasting (STCast), a novel AI-driven framework for adaptive regional boundary optimization and dynamic monthly forecast allocation. Specifically, our approach employs a Spatial-Aligned Attention (SAA) mechanism, which aligns global and regional spatial distributions to initialize boundaries and adaptively refines them based on attention-derived alignment patterns. Furthermore, we design a Temporal Mixture-of-Experts (TMoE) module, where atmospheric variables from distinct months are dynamically routed to specialized experts using a discrete Gaussian distribution, enhancing the model’s ability to capture temporal patterns. Beyond global and regional forecasting, STCast is evaluated on extreme event prediction and ensemble forecasting. Experimental results demonstrate consistent superiority over state-of-the-art methods across all four tasks.
PaperID: 1936,   Poster  https://arxiv.org/pdf/2602.05202    
Authors: Shivanshu Shekhar, Uttaran Bhattacharya, Raghavendra Addanki, Mehrab Tanjim, Somdeb Sarkhel, Tong Zhang
Title: GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling
Abstract: Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations: 6× to 65× fewer than existing VLM-based approaches.
PaperID: 1937,   Poster  https://arxiv.org/pdf/2603.26385    
Authors: I-Hsiang (Aaron) Chen, Isma Hadji, Enrique Sanchez, Adrian Bulat, Sy-Yen Kuo, Radu Timofte, Georgios Tzimiropoulos, Brais Martinez
Title: Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration
Abstract: Image restoration aims to recover high-quality images from inputs degraded by various factors, such as adverse weather, blur, or low light. While recent studies have shown remarkable progress across individual or unified restoration tasks, they still suffer from limited generalization and inefficiency when handling unknown or composite degradations. To address these limitations, we propose RAR, a Restore, Assess and Repeat process that integrates Image Quality Assessment (IQA) and Image Restoration (IR) into a unified framework to iteratively and efficiently achieve high-quality image restoration. Specifically, we introduce a restoration process that operates entirely in the latent domain to jointly perform degradation identification, image restoration, and quality verification. The resulting model is fully trainable end to end and allows for an all-in-one assess-and-restore approach that dynamically adapts the restoration process. Also, the tight integration of IQA and IR into a unified model minimizes the latency and information loss that typically arise from keeping the two modules disjoint (e.g., during image and/or text decoding). Extensive experiments show that our approach yields consistent improvements under single, unknown and composite degradations, thereby establishing a new state-of-the-art.
PaperID: 1938,   Poster  https://arxiv.org/pdf/2505.17476    
Authors: Yuchen Zhang, Yaxiong Wang, Yujiao Wu, Lianwei Wu, Li Zhu, Zhedong Zheng
Title: The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts
Abstract: The detection and grounding of multimedia manipulation has emerged as a critical challenge in combating AI-generated disinformation. While existing methods have made progress in recent years, we identify two fundamental limitations in current approaches: (1) Underestimation of MLLM-driven deception risk: prevailing techniques primarily address rule-based text manipulations, yet fail to account for sophisticated misinformation synthesized by multimodal large language models (MLLMs) that can dynamically generate semantically coherent, contextually plausible yet deceptive narratives conditioned on manipulated images; (2) Unrealistic misalignment artifacts: currently focused scenarios rely on artificially misaligned content that lacks semantic coherence, rendering them easily detectable. To address these gaps holistically, we propose a new adversarial pipeline that leverages MLLMs to generate high-risk disinformation. Our approach begins with constructing the MLLM-Driven Synthetic Multimodal (MDSM) dataset, where images are first altered using state-of-the-art editing techniques and then paired with MLLM-generated deceptive texts that maintain semantic consistency with the visual manipulations. Building upon this foundation, we present the Artifact-aware Manipulation Diagnosis via MLLM (AMD) framework featuring two key innovations: Artifact Pre-perception Encoding strategy and Manipulation-Oriented Reasoning, to tame MLLMs for the MDSM problem. Comprehensive experiments validate our framework's superior generalization capabilities as a unified architecture for detecting MLLM-powered multimodal deceptions. In cross-domain testing on the MDSM dataset, AMD achieves the best average performance, with ACC, mAP, and mIoU scores of 88.18, 60.25, and 61.02, respectively.
PaperID: 1939,   Poster  https://arxiv.org/pdf/2512.03345    
Authors: Seunghoi Kim, Henry Tregidgo, Chen Jin, Matteo Figini, Daniel C. Alexander
Title: HalluGen: Synthesizing Realistic and Controllable Hallucinations for Evaluating Image Restoration
Abstract: Generative models are prone to hallucinations: plausible but incorrect structures absent in the ground truth. This issue is problematic in image restoration for safety-critical domains such as medical imaging, industrial inspection, and remote sensing, where such errors undermine reliability and trust. For example, in low-field MRI, widely used in resource-limited settings, restoration models are essential for enhancing low-quality scans, yet hallucinations can lead to serious diagnostic errors. Progress has been hindered by a circular dependency: evaluating hallucinations requires labeled data, yet such labels are costly and subjective. We introduce HalluGen, a diffusion-based framework that synthesizes realistic hallucinations with controllable type, location, and severity, producing perceptually realistic but semantically incorrect outputs (segmentation IoU drops from 0.86 to 0.36). Using HalluGen, we construct the first large-scale hallucination dataset comprising 4,350 annotated images derived from 1,450 brain MR images for low-field enhancement, enabling systematic evaluation of hallucination detection and mitigation. We demonstrate its utility in two applications: (1) benchmarking image quality metrics and developing Semantic Hallucination Assessment via Feature Evaluation (SHAFE), a feature-based metric with soft-attention pooling that improves hallucination sensitivity over traditional metrics; and (2) training reference-free hallucination detectors that generalize to real restoration failures. Together, HalluGen and its open dataset establish the first scalable foundation for evaluating hallucinations in safety-critical image restoration.
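Note: the segmentation IoU quoted in the HalluGen abstract above is the standard intersection-over-union between a predicted mask and a reference mask. The following is a minimal sketch of that metric on toy data, not HalluGen's evaluation code; the function name `binary_iou` and the example masks are hypothetical.

```python
def binary_iou(pred, target):
    """Intersection-over-union of two equal-length binary masks (flat 0/1 sequences)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels set in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Pixels set in at least one mask.
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else intersection / union


# Toy 8-pixel masks (hypothetical data, for illustration only).
reference = [1, 1, 1, 1, 0, 0, 0, 0]
faithful = [1, 1, 1, 1, 0, 0, 0, 0]       # restoration that preserves the structure
hallucinated = [1, 1, 0, 0, 1, 1, 0, 0]   # restoration that moves part of the structure

iou_faithful = binary_iou(reference, faithful)
iou_hallucinated = binary_iou(reference, hallucinated)
print(iou_faithful, iou_hallucinated)
```

A restoration that looks plausible but relocates or invents structure lowers the overlap with the reference segmentation, which is the kind of drop the abstract reports.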
PaperID: 1940,   Poster  https://arxiv.org/pdf/2603.06014    
Authors: Shiyuan Yang, Ruihuang Li, Jiale Tao, Shuai Shao, qinglin lu, Jing Liao
Title: EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation
Abstract: Visual effects (VFX) are essential for enhancing the expressiveness and creativity of video content, yet producing high-quality effects typically requires expert knowledge and costly production pipelines. Existing AIGC systems face significant challenges in VFX generation due to the scarcity of effect-specific data and the inherent difficulty of modeling supernatural or stylized effects. Moreover, these approaches often require per-effect fine-tuning, which severely limits their scalability and generalization to novel VFX. In this work, we present EffectMaker, a unified reasoning–generation framework that enables reference-based VFX customization. EffectMaker employs a multimodal large language model to interpret high-level effect semantics and reason about their adaptation to a target subject, while a diffusion transformer leverages in-context learning to capture fine-grained visual cues from reference videos. These two components form a semantic–visual dual-path guidance mechanism that enables accurate, controllable, and effect-consistent synthesis without per-effect fine-tuning. Furthermore, we construct EffectData, a large-scale, high-quality synthetic dataset containing 100K videos across 2K VFX categories, to enhance generalization and scalability. Experiments show that EffectMaker achieves superior visual quality and effect consistency over state-of-the-art baselines, offering a scalable and flexible paradigm for customized VFX generation. Code and data will be released upon acceptance.
PaperID: 1941,   Poster  https://arxiv.org/pdf/2602.19945    
Authors: Jin Liu, Ning Xi, Yinbin Miao, Junkang Liu
Title: DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models
Abstract: Balancing convergence efficiency and robustness under Differential Privacy (DP) is a central challenge in Federated Learning (FL). While AdamW accelerates training and fine-tuning in large-scale models, we find that directly applying it to Differentially Private FL (DPFL) suffers from three major issues: (i) data heterogeneity and privacy noise jointly amplify the variance of the second-moment estimator, (ii) DP perturbations bias the second-moment estimator, and (iii) DP amplifies AdamW’s sensitivity to local overfitting, worsening client drift. We propose DP-FedAdamW, the first AdamW-based optimizer for DPFL. It restores AdamW under DP by stabilizing second-moment variance, removing DP-induced bias, and aligning local updates to the global descent to curb client drift. Theoretically, we establish an unbiased second-moment estimator and prove a linearly accelerated convergence rate without any heterogeneity assumption, while providing tighter (\varepsilon,\delta)-DP guarantees. Our empirical results demonstrate the effectiveness of DP-FedAdamW across language and vision Transformers and ResNet-18. On Tiny-ImageNet (Swin-Base, \varepsilon=1), DP-FedAdamW outperforms the state-of-the-art (SOTA) by 5.83%. The code is available in the Appendix.
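Note: one way to see why additive privacy noise can bias a second-moment estimate, as point (ii) of the DP-FedAdamW abstract above states, is the short calculation below. It is a sketch under assumptions that are not taken from the paper: the gradient coordinate g is treated as fixed, the perturbation n is zero-mean Gaussian with variance \sigma^2 and independent of g, and the symbols \tilde{g}, n, and \sigma are illustrative. The paper's own analysis may differ.

```latex
\tilde{g} = g + n, \qquad n \sim \mathcal{N}(0, \sigma^2)

\mathbb{E}\!\left[\tilde{g}^{2}\right]
  = g^{2} + 2\,g\,\mathbb{E}[n] + \mathbb{E}\!\left[n^{2}\right]
  = g^{2} + \sigma^{2}
```

Under these assumptions, an Adam-style accumulator of squared noised gradients overestimates g^2 by \sigma^2 in expectation, so the adaptive step size shrinks unless the extra term is corrected for.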
PaperID: 1942,   Poster  https://arxiv.org/pdf/2506.17218    
Authors: Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, Chuang Gan
Title: Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
Abstract: Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders the reasoning ability. Inspired by the way humans reason with mental imagery (the internal construction and manipulation of visual cues), we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed “Mirage”, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to “think visually”, it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We first supervise the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision to make the latent trajectory align tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
PaperID: 1943,   Poster  https://arxiv.org/pdf/2508.03100    
Authors: Yogesh Kulkarni, Pooyan Fazli
Title: AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video
Abstract: Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps. We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a novel credit assignment strategy that upweights key reasoning phases during learning. AVATAR achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes, while demonstrating 5× sample efficiency, requiring 80% fewer generated completions to reach target performance.
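Note: the vanishing advantage problem named in the AVATAR abstract above can be illustrated with the commonly used group-normalized advantage, where each reward is centered by the group mean and scaled by the group standard deviation. The sketch below assumes that formulation; it is not AVATAR's code, and the function name, the `eps` constant, and the toy rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Center each reward by the group mean and scale by the group std (assumed GRPO-style form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with diverse rewards yields non-zero advantages, so there is a learning signal.
diverse = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A group where every completion gets the same reward yields all-zero advantages.
identical = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(diverse)
print(identical)
```

When every completion in a group receives the same reward, each centered reward is zero, so the group contributes no gradient signal. The abstract's pairing of 5× sample efficiency with 80% fewer completions is consistent with 1 - 1/5 = 0.8.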
PaperID: 1944,   Poster  https://arxiv.org/pdf/2505.11192    
Authors: Myunsoo Kim, Seong-Woong Shim, Byung-Jun Lee
Title: FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment
Abstract: False negatives pose a critical challenge in vision-language pretraining (VLP) due to the many-to-many correspondence between images and texts in large-scale datasets. These false negatives introduce conflicting supervision signals that degrade the learned embedding space and diminish the effectiveness of hard negative sampling. In this paper, we propose FALCON (False-negative Aware Learning of COntrastive Negatives), a learning-based mini-batch construction strategy that adaptively balances the trade-off between hard and false negatives during VLP. Rather than relying on fixed heuristics, FALCON employs a negative mining scheduler that dynamically selects negative samples of appropriate hardness for each anchor instance during mini-batch construction, guided by a proxy for cross-modal alignment improvement. Experimental results demonstrate that FALCON significantly improves performance across three vision-language learning frameworks (ALBEF, BLIP-2, SigLIP-2) and a broad range of downstream tasks and evaluation settings, underscoring its effectiveness and robustness in mitigating the impact of false negatives.
PaperID: 1945,   Poster  https://arxiv.org/pdf/2511.17943    
Authors: Zhiyu Xu, Weilong Yan, YUFEI SHI, Xin Meng, Tao He, Huiping Zhuang, Ming Li, Hehe Fan
Title: SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System
Abstract: Recent advancements in multimodal large language models (MLLMs) and video agent systems have significantly improved general video understanding. However, when applied to scientific video understanding and educating—a domain that demands external professional knowledge integration and rigorous stepwise reasoning—existing approaches often struggle. To bridge this gap, we propose SciEducator, an iterative self-evolving multi-agent system for scientific video comprehension and education. Rooted in the classical Deming Cycle from management science, our design reformulates its Plan–Do–Study–Act philosophy into a self-evolving reasoning and feedback mechanism, which facilitates the interpretation of intricate scientific activities in videos. Moreover, SciEducator can produce multimodal educational content tailored to specific scientific processes, including textual instructions, visual guides, audio narrations, and interactive references. To support evaluation, we construct SciVBench, a benchmark consisting of 500 expert-verified and literature-grounded science QA pairs across five categories, covering physical, chemical, and everyday phenomena. Extensive experiments demonstrate that SciEducator substantially outperforms leading closed-source MLLMs (e.g., Gemini, GPT-4o) and state-of-the-art video agents on the benchmark, establishing a new paradigm for the community.
PaperID: 1946,   Poster  https://arxiv.org/pdf/2603.25199    
Authors: Peng Wen, Yuting Wang, Qiurui Wang
Title: TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation
Abstract: Current football imitation research primarily aims to optimize reward-based objectives, such as goals scored or win rate proxies, paying less attention to accurately replicating real-world team tactical behaviors. We introduce TacSIm, a large-scale dataset and benchmark for Tactical Style Imitation in football. TacSIm imitates the actions of all 11 players on one team in given broadcast footage of Premier League matches under a single broadcast view. Given offensive or defensive broadcast footage, TacSIm projects the beginning positions and actions of all 22 players from both sides onto a standard pitch coordinate system. TacSIm offers an explicit style imitation task and evaluation protocols. Tactical style imitation is measured using spatial occupancy similarity and movement vector similarity over a defined time window, supporting the evaluation of spatial and temporal similarities for one team. We run multiple baseline methods in a unified virtual environment to generate full-team behaviors, enabling both quantitative and visual assessment of tactical coordination. By using unified data and metrics from broadcast to simulation, TacSIm establishes a rigorous benchmark for measuring and modeling style-aligned tactical imitation in football. The benchmark will be made public soon.
PaperID: 1947,   Poster  https://arxiv.org/pdf/2511.18919    
Authors: Ruiying Liu, Yuanzhi Liang, Haibin Huang, Tianshu Yu, Chi Zhang
Title: Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation
Abstract: Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of textual–visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many-to-many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit to noisy feedback. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: inter-group Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores. Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.
PaperID: 1948,   Poster  https://arxiv.org/pdf/2508.07901    
Authors: Bowen Xue, Zheng-Peng Duan, Qixin Yan, Wenjing Wang, Hao Liu, Chun-Le Guo, Chongyi Li, Chen Li, Jing LYU
Title: Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
Abstract: Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attention with conditional position mapping. Thanks to these designs, which largely preserve the pretrained prior of the video generation model, our approach outperforms full-parameter training methods in video quality and identity preservation, even with just ~1% additional parameters and only 2000 training pairs. Moreover, our framework can be seamlessly integrated into other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping. Code and dataset will be available to the community.
PaperID: 1949,   Poster  https://arxiv.org/pdf/2512.02993    
Authors: Yifei Zeng, Bao Yajie, Jiachen Qian, Shuang Wu, Youtian Lin, Hao Zhu, Buyu Li, Feihu Zhang, Xun Cao, Yao Yao
Title: TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond
Abstract: Prevailing 3D texture generation methods, which often rely on multi-view fusion, are frequently hindered by inter-view inconsistencies and incomplete coverage of complex surfaces, limiting the fidelity and completeness of the generated content. To overcome these challenges, we introduce TEXTRIX, a native 3D attribute generation framework for high-fidelity texture synthesis and downstream applications such as precise 3D part segmentation. Our approach constructs a latent 3D attribute grid and leverages a Diffusion Transformer equipped with sparse attention, enabling direct coloring of 3D models in volumetric space and fundamentally avoiding the limitations of multi-view fusion. Built upon this native representation, the framework naturally extends to high-precision 3D segmentation by training the same architecture to predict semantic attributes on the grid. Extensive experiments demonstrate state-of-the-art performance on both tasks, producing seamless, high-fidelity textures and accurate 3D part segmentation with precise boundaries.
PaperID: 1950,   Poster  https://arxiv.org/pdf/2602.19170    
Authors: Kanglei Zhou, Chang Li, Qingyi Pan, Liyuan Wang
Title: BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment
Abstract: Action Quality Assessment (AQA) aims to score how well an action is performed and is widely used in sports analysis, rehabilitation assessment, and human skill evaluation. Multimodal AQA has recently achieved strong progress by leveraging complementary visual and kinematic cues, yet real-world deployments often suffer from non-stationary modality imbalance, where certain modalities become missing or intermittently available due to sensor failures or annotation gaps. Existing continual AQA methods overlook this issue and assume that all modalities remain complete and stable throughout training, which restricts their practicality. To address this challenge, we introduce Bridged Modality Adaptation (BriMA), an innovative approach to multi-modal continual AQA under modality-missing conditions. BriMA consists of a memory-guided bridging imputation module that reconstructs missing modalities using both task-agnostic and task-specific representations, and a modality-aware replay mechanism that prioritizes informative samples based on modality distortion and distribution drift. Experiments on three representative multi-modal AQA datasets (RG, Fis-V, and FS1000) show that BriMA consistently improves performance under different modality-missing conditions, achieving 6–8% higher correlation and 12–15% lower error on average. These results demonstrate a step toward robust multi-modal AQA systems under real-world deployment constraints.
PaperID: 1951,   Poster  https://arxiv.org/pdf/2512.12360    
Authors: Yufei Yin, Qianke Meng, Minghao Chen, Jiajun Ding, Zhenwei Shao, Zhou Yu
Title: VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
Abstract: Long-form video understanding remains challenging due to the extended temporal structure and dense multimodal cues. Despite recent progress, many existing approaches still rely on hand-crafted reasoning pipelines or employ token-consuming video preprocessing to guide MLLMs in autonomous reasoning. To overcome these limitations, we introduce VideoARM, an agentic reasoning-over-hierarchical-memory paradigm for long-form video understanding. Instead of static, exhaustive preprocessing, VideoARM performs adaptive, on-the-fly agentic reasoning and memory construction. Specifically, VideoARM performs an adaptive and continuous loop of observing, thinking, acting, and memorizing, where a Controller autonomously invokes tools to interpret the video in a coarse-to-fine manner, thereby substantially reducing token consumption. In parallel, a hierarchical multimodal memory continuously captures and updates multi-level clues throughout the operation of the agent, providing precise contextual information to support the Controller in decision-making. Experiments on prevalent benchmarks demonstrate that VideoARM outperforms the state-of-the-art method, DVD, while significantly reducing token consumption for long-form videos.
PaperID: 1952,   Poster  https://arxiv.org/pdf/2604.03819    
Authors: Peijun Bao, Luo Anwei, Gang Pan, Alex C. Kot, Xudong Jiang
Title: ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos
Abstract: Temporal forgery localization aims to temporally identify manipulated segments in untrimmed videos. Most existing benchmarks focus on appearance-level forgeries, such as face swapping and object removal. However, recent advances in video generation have driven the emergence of activity-level forgeries that modify human actions to distort event semantics, resulting in highly deceptive forgeries that critically undermine media authenticity and public trust. To address this issue, we introduce ActivityForensics, the first large-scale benchmark for localizing manipulated activity in untrimmed videos. It contains over 6K forgery video segments that are seamlessly blended into the video context, rendering high visual consistency that makes them almost indistinguishable from authentic content to the human eye. We further propose Temporal Artifact Diffuser (TADiff), a simple yet effective baseline that enhances artifact cues through a diffusion-based feature regularizer. Based on ActivityForensics, we introduce comprehensive evaluation protocols covering intra-domain, cross-domain, and open-world settings, and benchmark a wide range of state-of-the-art forgery localizers to facilitate future research. The dataset, code, and pretrained models will be made publicly available.
PaperID: 1953,   Poster  https://arxiv.org/pdf/2511.18370    
Authors: Zenghao Chai, Chen Tang, Yongkang Wong, Xulei Yang, Mohan Kankanhalli
Title: MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer
Abstract: 3D pose transfer aims to transfer the pose style of a source mesh to a target character while preserving both the target's geometry and the source's pose characteristics. Existing methods are largely restricted to characters with similar structures and fail to generalize to category-free settings (e.g., transferring a humanoid's pose to a quadruped). The key challenge lies in the structural and transformation diversity inherent in distinct character types, which often leads to mismatched regions and poor transfer quality. To address these issues, we first construct a million-scale pose dataset across hundreds of distinct characters. We further propose MimiCAT, a cascade-transformer model designed for category-free 3D pose transfer. Instead of relying on strict one-to-one correspondence mappings, MimiCAT leverages semantic keypoint labels to learn a novel soft correspondence that enables flexible many-to-many matching across characters. The pose transfer is then formulated as a conditional generation process, in which the source transformations are first projected onto the target through soft correspondence matching and subsequently refined using shape-conditioned representations. Extensive qualitative and quantitative experiments demonstrate that MimiCAT transfers plausible poses across different characters, significantly outperforming prior methods that are limited to narrow category transfer (e.g., humanoid-to-humanoid).
PaperID: 1954,   Poster  https://arxiv.org/pdf/2601.20511    
Authors: Zelong Sun, Jiahui Wu, Ying Ba, Dong Jing, Zhiwu Lu
Title: Say Cheese! Detail-Preserving Portrait Collection Generation via Natural Language Edits
Abstract: As social media platforms proliferate, users increasingly demand intuitive ways to create diverse, high-quality portrait collections. In this work, we introduce Portrait Collection Generation (PCG), a novel task that generates coherent portrait collections by editing a reference portrait image through natural language instructions. This task poses two unique challenges to existing methods: (1) complex multi-attribute modifications such as pose, spatial layout, and camera viewpoint; and (2) high-fidelity detail preservation including identity, clothing, and accessories. To address these challenges, we propose CHEESE, the first large-scale PCG dataset containing 24K portrait collections and 573K samples with high-quality modification text annotations, constructed through a Large Vision-Language Model-based pipeline with inversion-based verification. We further propose SCheese, a framework that combines text-guided generation with hierarchical identity and detail preservation. SCheese employs an adaptive feature fusion mechanism to maintain identity consistency, and ConsistencyNet to inject fine-grained features for detail consistency. Comprehensive experiments validate the effectiveness of CHEESE in advancing PCG, with SCheese achieving state-of-the-art performance in handling complex edits with identity and fine-grained detail consistency.
PaperID: 1955,   Poster  https://arxiv.org/pdf/2604.16201    
Authors: Nikhil Behari, Diego Rivero, Luke Apostolides, Suman Ghosh, Paul Pu Liang, Ramesh Raskar
Title: DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs
Abstract: Consumer LiDARs in mobile devices and robots typically output a single depth value per pixel. Yet internally, they record full time-resolved histograms containing direct and multi-bounce light returns; these multi-bounce returns encode rich non-line-of-sight (NLOS) cues that can enable perception of hidden objects in a scene. However, severe hardware limitations of consumer LiDARs make NLOS reconstruction with conventional methods difficult. In this work, we motivate a complementary direction: enabling NLOS perception with low-cost LiDARs through data-driven inference. We present DENALI, the first large-scale real-world dataset of space–time histograms from low-cost LiDARs capturing hidden objects. We capture time-resolved LiDAR histograms for 72,000 hidden-object scenes across diverse object shapes, positions, lighting conditions, and spatial resolutions. Using our dataset, we show that consumer LiDARs can enable accurate, data-driven NLOS perception. We further identify key scene and modeling factors that limit performance, as well as simulation-fidelity gaps that hinder current sim-to-real transfer, motivating future work toward scalable NLOS vision with consumer LiDARs.
PaperID: 1956,   Poster  https://arxiv.org/pdf/2510.20331    
Authors: Kangli Wang, Qianxi Yi, Yuqi Ye, Shihao Li, Wei Gao
Title: AnyPcc: Compressing Any Point Cloud with a Single Universal Model
Abstract: Generalization remains a critical challenge in deep learning-based point cloud geometry compression. While existing methods perform well on standard benchmarks, their performance collapses in real-world scenarios due to two fundamental limitations: the lack of context models that are robust across diverse data densities, and the inability to efficiently adapt to out-of-distribution (OOD) data. To overcome both challenges, we introduce AnyPcc, a universal point cloud compression framework. AnyPcc first employs a Universal Context Model that leverages coarse-grained spatial priors with fine-grained channel priors to ensure robust context modeling across the entire density spectrum. Second, our novel Instance-Adaptive Fine-Tuning (IAFT) strategy tackles OOD data by synergizing explicit and implicit compression paradigms. For each instance, it fine-tunes a small subset of network weights and transmits them within the bitstream. The minimal bitrate overhead from these weights is significantly outweighed by the resulting gains in geometry compression. Extensive experiments on a benchmark of 15 diverse datasets confirm that AnyPcc sets a new state-of-the-art in point cloud compression while maintaining low complexity. Our code and datasets will be released to encourage reproducible research.
PaperID: 1957,   Poster  https://arxiv.org/pdf/2512.12193    
Authors: Xuancheng Xu, Li Yaning, Sisi You, Bing-Kun Bao
Title: SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation
Abstract: Customized video generation aims to produce videos that faithfully preserve the subject's appearance from reference images while maintaining temporally consistent motion from reference videos. Existing methods struggle to ensure both subject appearance similarity and motion pattern consistency due to the lack of object-level guidance for subject and motion. To address this, we propose SMRABooth, which leverages a self-supervised encoder and an optical flow encoder to provide object-level subject appearance and motion representations. These representations are aligned with the model during the LoRA fine-tuning process. Our approach is structured in three core stages: (1) We exploit subject representations via a self-supervised encoder to guide subject alignment, enabling the model to capture the overall structure of the subject and enhance high-level semantic consistency. (2) We utilize motion representations from an optical flow encoder to capture structurally coherent and object-level motion trajectories independent of appearance. (3) We propose a subject-motion association decoupling strategy that applies sparse LoRA injection across both locations and timing, effectively reducing interference between subject and motion LoRAs. Extensive experiments show that SMRABooth excels in subject and motion customization, maintaining consistent subject appearance and motion patterns, proving its effectiveness in controllable text-to-video generation.
PaperID: 1958,   Poster  https://arxiv.org/pdf/2510.26213    
Authors: Hengrui Kang, Zhuangcheng Gu, Zhiyuan Zhao, Zichen Wen, Bin Wang, Weijia Li, Conghui He
Title: OmniDocLayout: Towards Diverse Document Layout Generation via Coarse-to-Fine LLM Learning
Abstract: Document AI has advanced rapidly and is attracting increasing attention. Yet, while most efforts have focused on document layout analysis (DLA), its generative counterpart, layout generation, remains underexplored. Distinct from traditional graphic layout design and room layout planning, document layout generation typically involves a larger number of elements per page and exhibits greater structural diversity and complexity. Currently, a major obstacle lies in the scarcity of diverse document layouts: academic papers with Manhattan-style structures dominate existing studies, while open-world genres such as newspapers and magazines remain severely underrepresented. To address this gap, we curate OmniDocLayout-1M, the first million-scale dataset of diverse document layouts, covering six common document types and comprising contemporary layouts collected from multiple sources. Moreover, since existing methods struggle in complex domains and often fail to arrange long sequences coherently, we introduce OmniDocLayout-LLM, a 0.5B model with a two-stage Coarse-to-Fine learning paradigm: 1) learning universal layout principles from our dataset with coarse category definitions, and 2) transferring the knowledge to a specific domain with a few fine-grained annotated samples. Extensive experiments demonstrate that our approach achieves strong performance on multiple domains in the M^6Doc dataset, substantially surpassing both existing layout generation experts and several recent general-purpose LLMs. Our code, dataset, and models will be publicly released.
PaperID: 1959,   Poster  https://arxiv.org/pdf/2603.21069    
Authors: Yupeng Zhang, Ruize Han, Zhiwei Chen, Wei Feng, Liang Wan
Title: NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection
Abstract: Despite the remarkable progress in open-vocabulary object detection (OVD), a significant gap remains between the training and testing phases. During training, the RPN and RoI heads often misclassify unlabeled novel-category objects as background, causing some proposals to be prematurely filtered out by the RPN while others are further misclassified by the RoI head. During testing, these proposals again receive low scores and are removed in post-processing, leading to a significant drop in recall and ultimately weakening novel-category detection performance. To address these issues, we propose NoOVD, a novel training framework that integrates a self-distillation mechanism grounded in the knowledge of frozen vision-language models (VLMs). Specifically, we design K-FPN, which leverages the pretrained knowledge of VLMs to guide the model in discovering novel-category objects and facilitates knowledge distillation without requiring additional data, thus preventing forced alignment of novel objects with background. Additionally, we introduce R-RPN, which adjusts the confidence scores of proposals during inference to improve the recall of novel-category objects. Cross-dataset evaluations on OV-LVIS, OV-COCO, and Objects365 demonstrate that our approach consistently achieves superior performance across multiple metrics.
PaperID: 1960,   Poster  https://arxiv.org/pdf/2604.10554    
Authors: Yapeng Meng, Lin Yang, Yuguo Chen, Xiangru Chen, Taoyi Wang, Lijian Wang, Zheyu Yang, Yihan Lin, Rong Zhao
Title: Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor
Abstract: Motion blur arises when rapid scene changes occur during the exposure period, collapsing rich intra-exposure motion into a single RGB frame. Without explicit structural or temporal cues, RGB-only deblurring is highly ill-posed and often fails under extreme motion. Inspired by the human visual system, neuromorphic sensors introduce temporally dense information to alleviate this problem; however, event cameras still suffer from event rate saturation under rapid motion, while the event modality entangles edge features and motion cues, which limits their effectiveness. As a recent breakthrough, the complementary vision sensor (CVS) captures synchronized RGB frames together with high-frame-rate, multi-bit spatial difference (\mathcal{SD}, encoding structural edges) and temporal difference (\mathcal{TD}, encoding motion cues) data within a single RGB exposure, offering a promising solution for RGB deblurring under extreme dynamic scenes. To fully leverage these complementary modalities, we propose Spatio-Temporal Difference Guided Deblur Net (STGDNet), which adopts a recurrent multi-branch architecture that iteratively encodes and fuses \mathcal{SD} and \mathcal{TD} sequences to restore structure and color details lost in blurry RGB inputs. Our method outperforms current RGB or event-based approaches on both a synthetic CVS dataset and real-world evaluations. Moreover, STGDNet exhibits strong generalization capability across over 100 extreme real-world scenarios. Our code, dataset and pre-trained weights will be fully publicly available.
PaperID: 1961,   Poster  https://arxiv.org/pdf/2603.04870    
Authors: Jaekyun Ko, Dongjin Kim, Soomin Lee, Guanghui Wang, Tae Hyun Kim
Title: Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning
Abstract: Denoising in the sRGB image space is challenging due to noise variability. Although end-to-end methods perform well, their effectiveness in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect. To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited data. These generative approaches often rely on camera metadata during both training and testing to synthesize real-world noise. However, the lack of metadata or inconsistencies between devices restricts their usability. Therefore, we propose a novel framework called Prompt-Driven Noise Generation (PNG). This model is capable of acquiring high-dimensional prompt features that capture the characteristics of real-world input noise and creating a variety of realistic noisy images consistent with the distribution of the input noise. By eliminating the dependency on explicit camera metadata, our approach significantly enhances the generalizability and applicability of noise synthesis. Comprehensive experiments reveal that our model effectively produces realistic noisy images and show the successful application of these generated images in removing real-world noise across various benchmark datasets.
PaperID: 1962,   Poster  https://arxiv.org/pdf/2604.10772    
Authors: Haiyan Jiang, Deyu Zhang, Dongdong Weng, Weitao Song, Henry Duh
Title: HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
Abstract: 3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires extensive and tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for automatic 3D scene synthesis. We present HOG-Layout, which enables text-driven hierarchical scene generation, optimization and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves scene semantic consistency and plausibility through retrieval-augmented generation (RAG), incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation to speed up inference and optimization, achieving real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments compared with existing baselines, while supporting fast and intuitive scene editing.
PaperID: 1963,   Poster  https://arxiv.org/pdf/2602.22025    
Authors: Shuang Song, Debao Huang, Deyan Deng, Haolin Xiong, Yang Tang, Yajie Zhao, Rongjun Qin
Title: Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments
Abstract: Intrinsic image decomposition (IID) of outdoor scenes is crucial for relighting, editing, and understanding large-scale environments, but progress has been limited by the lack of real-world datasets with reliable albedo and shading supervision. We introduce Olbedo, a large-scale aerial dataset for outdoor albedo-shading decomposition in the wild. Olbedo contains 5,664 UAV images captured across four landscape types, multiple years, and diverse illumination conditions. Each view is accompanied by multi-view consistent albedo and shading maps, metric depth, surface normals, sun and sky shading components, camera poses, and, for recent flights, measured HDR sky domes. These annotations are derived from an inverse-rendering refinement pipeline over multi-view stereo reconstructions and calibrated sky illumination, together with per-pixel confidence masks. We demonstrate that Olbedo enables state-of-the-art diffusion-based IID models, originally trained on synthetic indoor data, to generalize to real outdoor imagery: fine-tuning on Olbedo significantly improves single-view outdoor albedo prediction on the MatrixCity benchmark. We further illustrate applications of Olbedo-trained models to multi-view consistent relighting of 3D assets, material editing, and scene change analysis for urban digital twins. We release the dataset, baseline models, and an evaluation protocol to support future research in outdoor intrinsic decomposition and illumination-aware aerial vision.
PaperID: 1964,   Poster  https://arxiv.org/pdf/2604.03706    
Authors: Hongxia Gao, Yixin Chen, Jiali Wen, Litao Li, Qianyun Liu, Kaijie Zhang
Title: XSeg: A Large-scale X-ray Contraband Segmentation Benchmark For Real-World Security Screening
Abstract: X-ray contraband detection is critical for public safety. However, current methods primarily rely on bounding box annotations, which limit model generalization and performance due to the lack of pixel-level supervision and real-world data. To address these limitations, we introduce XSeg. To the best of our knowledge, XSeg is the largest X-ray contraband segmentation dataset to date, including 98,644 images and 295,932 instance masks, and covers the latest 30 common contraband categories. The images are sourced from public datasets and our synthesized data, filtered through a custom data cleaning pipeline to remove low-quality samples. To enable accurate and efficient annotation and reduce manual labeling effort, we propose Adaptive Point SAM (APSAM), a specialized mask annotation model built upon the Segment Anything Model (SAM). We address SAM’s poor cross-domain generalization and limited capability in detecting stacked objects by introducing an Energy-Aware Encoder that enhances the initialization of the mask decoder, significantly improving sensitivity to overlapping items. Additionally, we design an Adaptive Point Generator that allows users to obtain precise mask labels with only a single coarse point prompt. Extensive experiments on XSeg demonstrate the superior performance of APSAM.
PaperID: 1965,   Poster  https://arxiv.org/pdf/2509.24837    
Authors: Youngeun Kim, Youjia Zhang, Huiling Liu, Aecheon Jung, Sunwoo Lee, Sungeun Hong
Title: ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models
Abstract: Large Vision–Language Models (VLMs) enable strong multimodal reasoning but incur heavy inference costs from redundant visual tokens. Token pruning alleviates this issue, yet existing approaches face limitations. Attention-based methods rely on raw attention scores, which are often unstable across layers and heads and can lead to redundant selections. Diversity-based methods improve robustness by selecting tokens far apart in feature space, but risk dropping regions needed for accurate prediction. We propose ZOO-Prune, a training-free framework built on the intuition that highly sensitive tokens have a stronger influence on the model's output and capture complementary visual cues rather than redundant ones. To achieve this, we estimate token sensitivity using zeroth-order perturbations at the lightweight projection layer. This measures how small random perturbations affect the projected features and enables efficient approximation of each token’s influence without backpropagation. Extensive experiments across multiple VLMs and benchmarks show that ZOO-Prune consistently outperforms prior methods while pruning up to 94.4% of tokens without sacrificing accuracy. Our method also improves efficiency, reaching up to 2.30x faster end-to-end inference compared to the baseline.
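The zeroth-order sensitivity idea in the ZOO-Prune abstract above can be illustrated with a small sketch. It is not the authors' implementation: the tanh projector, the central finite difference, the probe count, and the names `project`, `token_sensitivity`, and `select_tokens` are all assumptions made for illustration, and the sketch ranks tokens by sensitivity only, without the complementarity aspect the abstract mentions.

```python
import math
import random


def project(token, weight):
    """Hypothetical stand-in for the projection layer: tanh(W @ token)."""
    return [math.tanh(sum(w * x for w, x in zip(row, token))) for row in weight]


def token_sensitivity(token, weight, num_probes=8, eps=1e-2, rng=None):
    """Estimate how much the projected feature moves under small random
    perturbations of one token, using forward passes only (no gradients)."""
    if rng is None:
        rng = random.Random(0)
    total = 0.0
    for _ in range(num_probes):
        # Draw a random unit direction in token space.
        direction = [rng.gauss(0.0, 1.0) for _ in token]
        norm = math.sqrt(sum(d * d for d in direction)) or 1.0
        direction = [d / norm for d in direction]
        plus = project([x + eps * d for x, d in zip(token, direction)], weight)
        minus = project([x - eps * d for x, d in zip(token, direction)], weight)
        # Central finite difference of the projected feature.
        diff = math.sqrt(sum((p - m) ** 2 for p, m in zip(plus, minus)))
        total += diff / (2.0 * eps)
    return total / num_probes


def select_tokens(tokens, weight, keep):
    """Return the (sorted) indices of the `keep` most sensitive tokens."""
    rng = random.Random(0)
    scores = [token_sensitivity(t, weight, rng=rng) for t in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

In this toy setting a token that saturates the nonlinearity barely changes the projected feature when perturbed, so it receives a low score and is pruned first.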
PaperID: 1966,   Poster  https://arxiv.org/pdf/2603.12078    
Authors: Hiran Sarkar, Liming Kuang, Yordanka Velikova, Benjamin Busam
Title: Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs
Abstract: Predicting scene dynamics from visual observations is challenging. Existing methods capture dynamics only within observed boundaries, failing to extrapolate far beyond the training sequence. Node-RF (Neural ODE-based NeRF) overcomes this limitation by integrating Neural Ordinary Differential Equations (NODEs) with dynamic Neural Radiance Fields (NeRFs), enabling a continuous-time, spatiotemporal representation that generalizes beyond observed trajectories at constant memory cost. From visual input, Node-RF learns an implicit scene state that evolves over time via an ODE solver, propagating feature embeddings via differential calculus. A NeRF-based renderer interprets the calculated embeddings to synthesize arbitrary views for long-range extrapolation. Training on multiple motion sequences with shared dynamics allows for generalization to unseen conditions. Our experiments demonstrate that Node-RF can characterize abstract system behavior without an explicit model to identify critical points for future predictions. Our code will be made publicly available.
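To make the "scene state that evolves over time via an ODE solver" idea concrete, here is a minimal sketch that rolls a latent state forward with a classical fourth-order Runge-Kutta solver. The choice of RK4, the function names, and the harmonic-oscillator dynamics (standing in for a learned network) are assumptions for illustration, not details taken from the paper.

```python
def rk4_step(dynamics, state, t, dt):
    """Advance `state` by one classical Runge-Kutta (RK4) step of size dt."""
    def shifted(base, slope, scale):
        return [b + scale * s for b, s in zip(base, slope)]

    k1 = dynamics(t, state)
    k2 = dynamics(t + dt / 2, shifted(state, k1, dt / 2))
    k3 = dynamics(t + dt / 2, shifted(state, k2, dt / 2))
    k4 = dynamics(t + dt, shifted(state, k3, dt))
    return [
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    ]


def rollout(dynamics, state, t_start, t_end, steps):
    """Integrate the latent state from t_start to t_end in `steps` RK4 steps."""
    dt = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        state = rk4_step(dynamics, state, t, dt)
        t += dt
    return state


def toy_dynamics(t, state):
    """Hypothetical stand-in for a learned dynamics network: dz/dt = (z1, -z0)."""
    return [state[1], -state[0]]
```

Because the state is defined for any time t, the same `rollout` call can query times well past the last observed frame; in the setting the abstract describes, a renderer would then decode the resulting embedding into an image.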
PaperID: 1967,   Poster  https://arxiv.org/pdf/2510.23497    
Authors: Walid Bousselham, Hilde Kuehne, Cordelia Schmid
Title: VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
Abstract: Training vision-language models (VLMs) for complex reasoning remains a challenging task, among other reasons due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD outperforms the baseline model significantly and improves over the state of the art by a margin. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
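For readers unfamiliar with GRPO, the group-relative part can be sketched in a few lines: several responses are sampled for the same prompt, and each reward is normalized against the group's mean and standard deviation. The sketch covers only this advantage computation, not VOLD's distillation term; the function name, the use of the population standard deviation, and the `eps` constant are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that score above the group average get a positive advantage and those below get a negative one, so no separate value network is needed to form a baseline.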
PaperID: 1968,   Poster  https://arxiv.org/pdf/2603.24322    
Authors: Shiqin Wang, Haoyang Chen, Huaizhou Huang, Yinkan He, Dongfang Sun, Xiaoqing Chen, Xingyu Liu, Zheng Wang, Kaiyan Zhao
Title: Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions
Abstract: The learning order of semantic classes significantly impacts unsupervised domain adaptation for semantic segmentation, especially under adverse weather conditions. Most existing curricula rely on handcrafted heuristics (e.g., fixed uncertainty metrics) and follow a static schedule, which fails to adapt to a model's evolving, high-dimensional training dynamics, leading to category bias. Inspired by Reinforcement Learning, we cast curriculum learning as a sequential decision problem and propose an autonomous class scheduler. This scheduler consists of two components: (i) a high-dimensional state encoder that maps the model's training status into a latent space and distills key features indicative of progress, and (ii) a category-fair policy-gradient objective that ensures balanced improvement across classes. Coupled with mixed source–target supervision, the learned class rankings direct the network’s focus to the most informative classes at each stage, enabling more adaptive and dynamic learning. Our method achieves state-of-the-art performance on three widely used benchmarks (ACDC, Dark Zurich, and Nighttime Driving), and shows generalization ability in synthetic-to-real semantic segmentation (i.e., SYNTHIA → Cityscapes).
PaperID: 1969,   Poster  https://arxiv.org/pdf/2507.06993    
Authors: Jieren Deng, Zhizhang Hu, Ziyan He, Aleksandar Cvetkovic, Pak Chung, Dragomir Yankov, Chiqun Zhang
Title: IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence
Abstract: Most mapping tools remain point-and-click, making it hard to ask spatial questions or relate what a camera sees to its surrounding geography in a view-aware way. We present IMAIA, the Interactive Maps AI Assistant, which enables natural-language interaction with both vector (street) maps and satellite imagery, while enriching camera inputs with geospatial intelligence to help users interpret the world around them. IMAIA consists of two complementary modules. Maps Plus treats the map as primary context by converting tiled vector or satellite views into a grid-aligned format that language models can query to resolve deictic references (e.g., “the flower-shaped building next to the park in the top-right”). Places AI Smart Assistant (PAISA) performs camera-aware place reasoning by fusing image–place embeddings with geospatial signals such as location, heading, and distance to ground the scene, highlight key attributes, and produce concise explanations. A lightweight multi-agent design ensures low latency and transparent intermediate reasoning. Across map-centric question answering and camera-to-place grounding tasks, IMAIA consistently improves accuracy and responsiveness over strong baselines while remaining efficient for real-world use. By uniting language, maps, and geospatial cues, IMAIA advances from scripted interactions to conversational mapping that is both spatially grounded and widely accessible.
PaperID: 1970,   Poster  https://arxiv.org/pdf/2604.06631    
Authors: Zheng Jiang, Nan He, Yiming Chen, Lifeng Sun
Title: Submodel Extraction for Efficient and Personalized Federated Learning via Optimal Transport
Abstract: Federated Learning (FL) enables collaborative model training while preserving data privacy, but its practical deployment is hampered by system and statistical heterogeneity. While federated network pruning offers a path to mitigate these issues, existing methods face a critical dilemma: server-side pruning lacks personalization, whereas client-side pruning is computationally prohibitive for resource-constrained devices. Furthermore, the pruning process itself induces significant parametric divergence among heterogeneous submodels, destabilizing training and hindering global convergence. To address these challenges, we propose SubFLOT, a novel framework for server-side personalized federated pruning. SubFLOT introduces an Optimal Transport-enhanced Pruning (OTP) module that treats historical client models as proxies for local data distributions, formulating the pruning task as a Wasserstein distance minimization problem to generate customized submodels without accessing raw data. Concurrently, to counteract parametric divergence, our Scaling-based Adaptive Regularization (SAR) module adaptively penalizes a submodel's deviation from the global model, with the penalty's strength scaled by the client's pruning rate. Comprehensive experiments demonstrate that SubFLOT consistently and substantially outperforms state-of-the-art methods, underscoring its potential for deploying efficient and personalized models on resource-constrained edge devices.
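One way to picture a regularizer whose strength is scaled by the pruning rate, as the SAR description above suggests, is the sketch below. The abstract does not give the formula, so everything here is an assumption: the squared L2 distance over retained parameters, the linear scaling with `pruning_rate`, the `base_strength` coefficient, and the function name.

```python
def scaled_deviation_penalty(sub_weights, global_weights, mask, pruning_rate,
                             base_strength=0.1):
    """Penalize a submodel's drift from the global model on retained parameters.

    mask[i] is 1 if parameter i survives pruning, 0 if it was removed.
    The penalty grows linearly with the client's pruning rate (assumed form).
    """
    squared_deviation = sum(
        (s - g) ** 2
        for s, g, keep in zip(sub_weights, global_weights, mask)
        if keep
    )
    return base_strength * pruning_rate * squared_deviation
```

Under this assumed form, a heavily pruned client is pulled more strongly toward the global model than a lightly pruned one; the actual scaling used by SubFLOT may differ.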
PaperID: 1971,   Poster  https://arxiv.org/pdf/2602.22571    
Authors: Tianyu Chen, Wei Xiang, Kang Han, Lu Yu, Di Wu, Gaowen Liu, Ramana Kompella
Title: GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views
Abstract: Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still leave room for performance gains, especially on out-of-domain data, and struggle to retain second-level inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm in existing feed-forward pipelines: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine the current 3D scene using rendering evidence, achieving a favorable balance between efficiency and quality. Furthermore, we distill a frozen diffusion prior into Gaussian-level cues from enhanced novel renderings without gradient backpropagation or ever-increasing view-set expansion, thereby enabling per-scene adaptation with a generative prior while preserving feed-forward efficiency. Across DL3DV, RealEstate10K, and DTU, GIFSplat consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, and it maintains second-scale inference time without requiring camera poses or any test-time gradient optimization.
PaperID: 1972,   Poster  https://arxiv.org/pdf/2602.21977    
Authors: Liangwei Lyu, Jiaqi Xu, Jianwei Ding, Qiyao Deng
Title: When LoRA Betrays: Backdooring Text-to-Image Models by Masquerading as Benign Adapters
Abstract: Low-Rank Adaptation (LoRA) has emerged as a leading technique for efficiently fine-tuning text-to-image diffusion models, and its widespread adoption on open-source platforms has fostered a vibrant culture of model sharing and customization. However, the same modular and plug-and-play flexibility that makes LoRA appealing also introduces a broader attack surface. To highlight this risk, we propose Masquerade-LoRA (MasqLoRA), the first backdoor attack that leverages the LoRA mechanism to stealthily inject malicious behavior into text-to-image diffusion models. MasqLoRA operates by freezing the base model parameters and updating only the low-rank adapter weights using a small number of “trigger word-target image” pairs. This enables the attacker to train a standalone backdoor LoRA module that embeds a hidden cross-modal mapping: when the module is loaded and a specific textual trigger is provided, the model produces a predefined visual output; otherwise, it behaves indistinguishably from the clean model, ensuring the stealthiness of the attack. Experimental results demonstrate that MasqLoRA can be trained with minimal resource overhead and achieves a high attack success rate of 99.8%. MasqLoRA reveals a severe and unique threat in the AI supply chain, underscoring the urgent need for dedicated defense mechanisms for the LoRA-centric sharing ecosystem.
PaperID: 1973,   Poster  https://arxiv.org/pdf/2601.04033    
Authors: Yuan Wang, Borui Liao, Huijuan Huang, Jinda Lu, Ouxiang Li, Kuien Liu, Meng Wang, Xiang Wang
Title: Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model
Abstract: Recent advances in video reward models and post-training strategies have improved text-to-video (T2V) generation. While these models typically assess visual quality, motion quality, and text alignment, they often overlook key structural distortions, such as abnormal object appearances and interactions, which can degrade the overall quality of the generated video. To address this gap, we introduce REACT, a frame-level reward model designed specifically for structural distortion evaluation in generative videos. REACT assigns point-wise scores and attribution labels by reasoning over video frames, focusing on recognizing distortions. To support this, we construct a large-scale human preference dataset, annotated based on our proposed taxonomy of structural distortions, and generate additional data using an efficient Chain-of-Thought (CoT) synthesis pipeline. REACT is trained with a two-stage framework: (1) supervised fine-tuning with masked loss for domain knowledge injection, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) and pairwise rewards to enhance reasoning capability and align output scores with human preferences. During inference, a dynamic sampling mechanism is introduced to focus on frames most likely to exhibit distortion. We also present REACT-Bench, a benchmark for generative video distortion evaluation. Experimental results demonstrate that REACT complements existing reward models in assessing structural distortion, achieving both accurate quantitative evaluations and interpretable attribution analysis.
PaperID: 1974,   Poster  https://arxiv.org/pdf/2604.09955    
Authors: Tzu-Ling Liu, Ian Stavness, Mrigank Rochan
Title: Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
Abstract: Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.
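The intuition of keeping motion-rich patch tokens and discarding static ones can be sketched with a simple frame-difference score. This is a hand-set proxy, not LMFT itself, which learns which tokens to drop; the mean absolute difference score, the `keep_ratio` parameter, and the function names are assumptions for illustration.

```python
def motion_scores(prev_patches, curr_patches):
    """Mean absolute change of each patch between two consecutive frames."""
    return [
        sum(abs(c - p) for p, c in zip(prev_patch, curr_patch)) / len(curr_patch)
        for prev_patch, curr_patch in zip(prev_patches, curr_patches)
    ]


def keep_motion_tokens(prev_patches, curr_patches, keep_ratio):
    """Return the (sorted) indices of the patches with the largest motion score."""
    scores = motion_scores(prev_patches, curr_patches)
    num_keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])
```

Only the retained indices would be passed on to the video transformer, which is where the computational savings described in the abstract would come from.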
PaperID: 1975,   Poster  https://arxiv.org/pdf/2603.19770    
Authors: Zekai Wu, Shuqi Fan, Mengyin Liu, Yuhua Luo, Xincheng Lin, Ming Yan, Junhao Wu, Xiuhong Lin, Yuexin Ma, Chenglu Wen, Lan Xu, Siqi Shen, Cheng Wang
Title: FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision
Abstract: Precise motion timing (PMT) is crucial for swift motion analysis. A millisecond difference may determine victory or defeat in sports competitions. Despite substantial progress in human pose estimation (HPE), PMT remains largely overlooked by the HPE community due to the limited availability of high-temporal-resolution labeled datasets. Today, PMT is achieved using high-speed RGB cameras in specialized scenarios such as the Olympic Games; however, their high costs, light sensitivity, bandwidth, and computational complexity limit their feasibility for daily use. We developed FlashCap, the first flashing LED-based MoCap system for PMT. With FlashCap, we collect a millisecond-resolution human motion dataset, FlashMotion, comprising the event, RGB, LiDAR, and IMU modalities, and demonstrate its high quality through rigorous validation. To evaluate the merits of FlashMotion, we perform two tasks: precise motion timing and high-temporal-resolution HPE. For these tasks, we propose ResPose, a simple yet effective baseline that learns residual poses based on events and RGBs. Experimental results show that ResPose reduces pose estimation errors by ~40% and achieves millisecond-level timing accuracy, enabling new research opportunities. The dataset and code will be shared with the community.
PaperID: 1976,   Poster  https://arxiv.org/pdf/2512.12080    
Authors: Ryan Po, Eric Ryan Chan, Changan Chen, Gordon Wetzstein
Title: BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models
Abstract: Autoregressive video models are promising for world modeling via next-frame prediction, but they suffer from exposure bias: a mismatch between training on clean contexts and inference on self-generated frames, causing errors to compound and quality to drift over time. We introduce Backwards Aggregation (BAgger), a self-supervised scheme that constructs corrective trajectories from the model’s own rollouts, teaching it to recover from its mistakes. Unlike prior approaches that rely on few-step distillation and distribution-matching losses, which can hurt quality and diversity, BAgger trains with standard score or flow matching objectives, avoiding large teachers and long-chain backpropagation through time. We instantiate BAgger on causal diffusion transformers and evaluate on text-to-video, video extension, and multi-prompt generation, observing more stable long-horizon motion and better visual consistency with reduced drift.
PaperID: 1977,   Poster  https://arxiv.org/pdf/2509.24899    
Authors: Mohsen Ghafoorian, Denis Korzhenkov, Amirhossein Habibian
Title: Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer
Abstract: Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, previous approaches have failed to match the expressiveness of softmax attention unless retrained at significant computational cost. We introduce Attention Surgery, an efficient framework that enables linear or hybrid attention in pretrained VDMs, eliminating the need for training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism, which mixes softmax and linear tokens, with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art efficient transformer VDM, and evaluated on VBench, VBench2.0 and a human preference study, Attention Surgery achieves competitive results. Furthermore, measurements of on-mobile latency, memory usage, and FLOPs demonstrate notable improvements in scaling behavior for longer videos.
PaperID: 1978,   Poster  https://arxiv.org/pdf/2511.15700    
Authors: Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu, Fuxiao Liu, Cornelia Fermuller, Brandon Y. Feng, Yiannis Aloimonos
Title: First Frame Is the Place to Go for Video Content Customization
Abstract: What role does the first frame play in video generation models? Traditionally, it’s viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it's possible to achieve robust and generalized video content customization in diverse scenarios, using only 20–50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.
PaperID: 1979,   Poster  https://arxiv.org/pdf/2508.03088    
Authors: Kai Zhang, Zekai Zhang, Xihe Sun, Anpeng Wang, Jingmeng Nie, Qinghui Chen, Han Hao, Jianyuan Guo, Jinglin Zhang
Title: ADSeeker: A Knowledge-Grounded Reasoning Framework for Industry Anomaly Detection and Reasoning
Abstract: Automatic vision inspection holds significant importance in industry inspection. While multimodal large language models (MLLMs) exhibit strong language understanding capabilities and hold promise for this task, their performance remains significantly inferior to that of human experts. In this context, we identify two key challenges: (i) insufficient integration of anomaly detection (AD) knowledge during pretraining and (ii) the lack of technically precise and context-aware language generation for anomaly reasoning. To address these issues, we propose ADSeeker, an anomaly task assistant designed to enhance inspection performance through knowledge-grounded reasoning. ADSeeker first leverages a curated visual document knowledge base, SEEK-MVTec&VisA (SEEK-M&V), which we construct to address the limitations of existing resources that rely solely on unstructured text. SEEK-M&V includes semantic-rich descriptions and image-document pairs, enabling more comprehensive anomaly understanding. To effectively retrieve and utilize this knowledge, we introduce the Query Image-Knowledge Retrieval-Augmented Generation (Q2K RAG) framework. To further enhance the performance in zero-shot anomaly detection (ZSAD), ADSeeker leverages the Hierarchical Sparse Prompt mechanism and type-level features to efficiently extract anomaly patterns. Furthermore, to tackle the challenge of limited industry anomaly detection (IAD) data, we introduce the largest-scale AD dataset, Multi-type Anomaly (MulA), encompassing 72 multi-scale defect types across 26 categories. Extensive experiments show that our plug-and-play framework, ADSeeker, achieves state-of-the-art zero-shot performance on several benchmark datasets.
PaperID: 1980,   Poster  https://arxiv.org/pdf/2512.16740    
Authors: Yunkai Yang, Yudong Zhang, Kunquan Zhang, Jinxiao Zhang, Xinying Chen, Haohuan Fu, Runmin Dong
Title: Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation
Abstract: With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Built upon the powerful DiT-based generative foundation model, we systematically evaluate different control schemes, showing that a text–image–mask joint attention scheme combined with full fine-tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene scenarios. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high-plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.
PaperID: 1981,   Poster  https://arxiv.org/pdf/2511.19965    
Authors: Hongji Yang, Yucheng Zhou, Wencheng Han, Runzhou Tao, Zhongying Qiu, Jianfei Yang, Jianbing Shen
Title: HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning
Abstract: Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
PaperID: 1982,   Poster  https://arxiv.org/pdf/2603.16340    
Authors: Xinhao Cai, Gensheng Pei, Zeren Sun, Yazhou Yao, Fumin Shen, Wenguan Wang
Title: Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation
Abstract: In this paper, we propose Iris, a deterministic framework for Monocular Depth Estimation (MDE) that integrates real-world priors into the diffusion model. Conventional feed-forward methods rely on massive training data, yet still miss details. Previous diffusion-based methods leverage rich generative priors yet struggle with synthetic-to-real domain transfer. Iris, in contrast, preserves fine details, generalizes strongly from synthetic to real scenes, and remains efficient with limited training data. To this end, we introduce a two-stage Priors-to-Geometry Deterministic (PGD) schedule: the prior stage uses Spectral-Gated Distillation (SGD) to transfer low-frequency real priors while leaving high-frequency details unconstrained, and the geometry stage applies Spectral-Gated Consistency (SGC) to enforce high-frequency fidelity while refining with synthetic ground truth. The two stages share weights and are executed with a high-to-low timestep schedule. Extensive experimental results confirm that Iris achieves significant improvements in MDE performance with strong in-the-wild generalization.
PaperID: 1983,   Poster  https://arxiv.org/pdf/2603.29954    
Authors: JunWoo Heo, Keonhee Park, Gyeong-Moon Park
Title: Detecting Unknown Objects via Energy-based Separation for Open World Object Detection
Abstract: In this work, we tackle the problem of Open World Object Detection (OWOD). This challenging scenario requires the detector to incrementally learn to classify given known objects without forgetting while identifying unknown objects without supervision. Previous OWOD methods have enhanced the unknown discovery process and employed memory replay to mitigate catastrophic forgetting. However, since existing methods heavily rely on the detector's known class prediction information for detecting unknown objects, they struggle to effectively learn and recognize unknown object representations. Moreover, while memory replay mitigates forgetting of old classes, it often sacrifices the knowledge of newly learned classes. To resolve these limitations, we propose DEUS (Detecting Unknowns via energy-based Separation), a novel framework that addresses the challenges of Open World Object Detection. DEUS consists of ETF-Subspace Unknown Separation (EUS) and an Energy-based Known Distinction (EKD) loss. EUS leverages ETF-based geometric properties to create orthogonal subspaces, enabling cleaner separation between known and unknown object representations, and leverages energies from both spaces to better capture distinct patterns of unknown objects, in contrast to prior energy-based approaches that consider only the energy within the known space. Furthermore, EKD loss enforces the separation between previous and current classifiers, thus minimizing knowledge interference between previous and newly learned classes during memory replay. We thoroughly validate DEUS on OWOD benchmarks, demonstrating outstanding performance improvements in unknown detection while maintaining competitive known class performance.
PaperID: 1984,   Poster  https://arxiv.org/pdf/2511.16928    
Authors: Jingyi Xu, Meisong Zheng, Ying Chen, Minglang Qiao, Xin Deng, Mai Xu
Title: Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features
Abstract: Diffusion model (DM) based Video Super-Resolution (VSR) approaches achieve impressive perceptual quality. However, they suffer from error accumulation, spatial artifacts, and a trade-off between perceptual quality and fidelity, primarily caused by inaccurate alignment and insufficient compensation between video frames. In this paper, within the DM-based VSR pipeline, we revisit the role of alignment and compensation between adjacent video frames and reveal two crucial observations: (a) the feature domain is better suited than the pixel domain for information compensation due to its stronger spatial and temporal correlations, and (b) warping at an upscaled resolution better preserves high-frequency information, but this benefit is not necessarily monotonic. Therefore, we propose a novel Densely Guided diffusion model with Aligned Features for Video Super-Resolution (DGAF-VSR), with an Optical Guided Warping Module (OGWM) to maintain high-frequency details in the aligned features and a Feature-wise Temporal Condition Module (FTCM) to deliver dense guidance in the feature domain. Extensive experiments on synthetic and real-world datasets demonstrate that DGAF-VSR surpasses state-of-the-art methods in key aspects of VSR, including perceptual quality (35.82% DISTS reduction), fidelity (0.20 dB PSNR gain), and temporal consistency (30.37% tLPIPS reduction).
PaperID: 1985,   Poster  https://arxiv.org/pdf/2603.19722    
Authors: Tian Wen, Zhiqin Yang, Yonggang Zhang, Xuefeng Jiang, Hao Peng, Yuwei Wang, Bo Han
Title: FedRG: Unleashing the Representation Geometry for Federated Learning with Noisy Clients
Abstract: Federated learning (FL) suffers from performance degradation due to the inevitable presence of noisy annotations in distributed scenarios. Existing approaches have advanced in distinguishing noisy samples from the dataset for label correction by leveraging loss values. However, noisy-sample recognition that relies on scalar loss lacks reliability for FL under heterogeneous scenarios. In this paper, we rethink this paradigm from a representation perspective and propose FedRG (Federated under Representation Geometry), which follows the principle of "representation geometry priority" to recognize noisy labels. Firstly, FedRG creates label-agnostic spherical representations by using self-supervision. It then iteratively fits a spherical von Mises-Fisher (vMF) mixture model to this geometry using previously identified clean samples to capture semantic clusters. This geometric evidence is integrated with a semantic-label soft mapping mechanism to derive a distribution divergence between the label-free and annotated label-conditioned feature space, which robustly identifies noisy samples and updates the vMF mixture model with the newly separated clean dataset. Lastly, we employ an additional personalized noise absorption matrix on noisy labels to achieve robust optimization. Extensive experimental results demonstrate that FedRG significantly outperforms state-of-the-art methods for FL with data heterogeneity under diverse noisy client scenarios.
PaperID: 1986,   Poster  https://arxiv.org/pdf/2510.05057    
Authors: Mingyu Liu, Jiuhe Shu, Hui Chen, Zeju Li, Canyu Zhao, Jiange Yang, Shenyuan Gao, Hao Chen, Chunhua Shen
Title: StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation
Abstract: A fundamental challenge in embodied intelligence is developing expressive and compact state representations for efficient world modeling and decision making. However, existing methods often fail to achieve this balance, yielding representations that are either overly redundant or lacking in task-critical information. We propose an unsupervised approach that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder, capitalizing on its strong generative prior. Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models, improving performance by 11.6% on LIBERO and 31% in real-world task success with minimal inference overhead. More importantly, we find that the difference between these tokens, obtained via latent interpolation, naturally serves as a highly effective latent action, which can be further decoded into executable robot actions. This emergent capability reveals that our representation captures structured dynamics without explicit supervision. We name our method StaMo for its ability to learn generalizable robotic Motion from compact State representation, which is encoded from static images, challenging the prevalent dependence of latent action learning on complex architectures and video data. The resulting latent actions also enhance policy co-training, outperforming prior methods by 10.4% with improved interpretability. Moreover, our approach scales effectively across diverse data sources, including real-world robot data, simulation, and human egocentric video.
PaperID: 1987,   Poster  https://arxiv.org/pdf/2510.27684    
Authors: Xiangyu Fan, Zesong Qiu, Zhuguanyu Wu, Fanzhou Wang, Zhiqian Lin, Tianxiang Ren, Dahua Lin, RUIHAO GONG, Lei Yang
Title: Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Abstract: Distribution Matching Distillation (DMD) distills score-based generative models into efficient one-step generators, without requiring a one-to-one correspondence with the sampling trajectories of their teachers. Yet, the limited capacity of one-step distilled models compromises generative diversity and degrades performance in complex generative tasks, e.g., generating intricate object motions in text-to-video task. Directly extending DMD to multi-step distillation increases memory usage and computational depth, leading to instability and reduced efficiency. While prior works propose stochastic gradient truncation as a potential solution, we observe that it substantially reduces the generative diversity in text-to-image generation and slows motion dynamics in video generation, reducing performance to the level of one-step models. To address these limitations, we propose Phased DMD, a multi-step distillation framework that bridges the idea of phase-wise distillation with Mixture-of-Experts (MoE), reducing learning difficulty while enhancing model capacity. Phased DMD incorporates two key ideas: progressive distribution matching and score matching within subintervals. First, our model divides the SNR range into subintervals, progressively refining the model to higher SNR levels, to better capture complex distributions. Next, to ensure accurate training within each subinterval, we derive rigorous mathematical formulations for the objective. We validate Phased DMD by distilling state-of-the-art image and video generation models, including Qwen-Image-20B and Wan2.2-28B. Experiments demonstrate that Phased DMD enhances motion dynamics, improves visual fidelity in video generation, and increases output diversity in image generation. We will release our code and models.
PaperID: 1988,   Poster  https://arxiv.org/pdf/2603.16570    
Authors: Amirhossein Kazerouni, Maitreya Suin, Tristan T Aumentado-Armstrong, Sina Honari, Amanpreet Walia, Iqbal Mohomed, Kosta Derpanis, Babak TAATI, Alex Levinshtein
Title: Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration
Abstract: Recent advances in image restoration have enabled high-fidelity recovery of faces from degraded inputs using reference-based face restoration models (Ref-FR). However, such methods focus solely on facial regions, neglecting degradation across the full scene, including body and background, which limits practical usability. Meanwhile, full-scene restorers often ignore degradation cues entirely, leading to underdetermined predictions and visual artifacts. In this work, we propose Face2Scene, a two-stage restoration framework that leverages the face as a perceptual oracle to estimate degradation and guide the restoration of the entire image. Given a degraded image and one or more identity references, we first apply a Ref-FR model to reconstruct high-quality facial details. From the restored–degraded face pair, we extract a face-derived degradation code that captures degradation attributes (e.g., noise, blur, compression), which is then transformed into multi-scale degradation-aware tokens. These tokens condition a diffusion model to restore the full scene in a single step, including the body and background. Extensive experiments demonstrate the superior effectiveness of the proposed method compared to state-of-the-art methods.
PaperID: 1989,   Poster  https://arxiv.org/pdf/2511.19004    
Authors: Wentao Qu, Guofeng Mei, Yang Wu, Yongshun Gong, Xiaoshui Huang, Liang Xiao
Title: A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation
Abstract: Text-to-LiDAR generation can customize 3D data with rich structures and diverse scenes for downstream tasks. However, the scarcity of Text-LiDAR pairs often causes insufficient training priors, generating overly smooth 3D scenes. Moreover, low-quality text descriptions may degrade generation quality and controllability. In this paper, we propose a Text-to-LiDAR Diffusion Model for scene generation, named T2LDM, with a Self-Conditioned Representation Guidance (SCRG). Specifically, SCRG, by aligning to the real representations, provides soft supervision with reconstruction details for the Denoising Network (DN) in training, while being decoupled in inference. In this way, T2LDM can perceive rich geometric structures from the data distribution, generating detailed objects in scenes. Meanwhile, we construct a content-composable Text-LiDAR benchmark, T2nuScenes, along with a controllability metric. Based on this, we analyze the effects of different text prompts on LiDAR generation quality and controllability, providing practical prompt paradigms and insights. Furthermore, a directional position prior is designed to mitigate street distortion, further improving scene fidelity. Additionally, by learning a conditional encoder via frozen DN, T2LDM can support multiple conditional tasks, including Sparse-to-Dense, Dense-to-Sparse, and Semantic-to-LiDAR generation. Extensive experiments in unconditional and conditional generation demonstrate that T2LDM outperforms existing methods, achieving state-of-the-art scene generation.
PaperID: 1990,   Poster  https://arxiv.org/pdf/2512.15508    
Authors: Arthur Moreau, Richard Shaw, Michal Nazarczuk, Jisu Shin, Thomas Tanay, Zhensong Zhang, Songcen Xu, Eduardo Pérez-Pellitero
Title: Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting
Abstract: Feed-forward 3D Gaussian Splatting (3DGS) models enable real-time scene generation but are hindered by suboptimal pixel-aligned primitive placement, which relies on a dense, rigid grid and limits both quality and efficiency. We introduce a new feed-forward architecture that detects 3D Gaussian primitives at a sub-pixel level, replacing the pixel grid with an adaptive, "Off The Grid" distribution. Inspired by keypoint detection, our multi-resolution decoder learns to distribute primitives across image patches. This module is trained end-to-end with a 3D reconstruction backbone using self-supervised learning. Our resulting pose-free model generates photorealistic scenes in seconds, achieving state-of-the-art novel view synthesis for feed-forward models. It outperforms competitors while using far fewer primitives, demonstrating a more accurate and efficient allocation that captures fine details and reduces artifacts. Moreover, we observe that by learning to render 3D Gaussians, our 3D reconstruction backbone improves camera pose estimation, suggesting opportunities to train these foundational models without labels.
PaperID: 1991,   Poster  https://arxiv.org/pdf/2603.19224    
Authors: YANG FU, Yike Zheng, Ziyun Dai, Henghui Ding
Title: EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
Abstract: Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60k high-quality video pairs from captured and synthetic sources, covers five effect types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching. In addition, an insertion–removal consistency objective encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.
PaperID: 1992,   Poster  https://arxiv.org/pdf/2604.07990    
Authors: Yunnan Wang, Kecheng Zheng, Jianyuan Wang, Minghao Chen, David Novotny, Christian Rupprecht, Yinghao Xu, Xing Zhu, Wenjun Zeng, Xin Jin, Yujun Shen
Title: SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
Abstract: The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.
PaperID: 1993,   Poster  https://arxiv.org/pdf/2603.28363    
Authors: Jiho Park, Sieun Choi, Jaeyoon Seo, Minho Sohn, Yeana Kim, Jihie Kim
Title: SEA: Evaluating Sketch Abstraction Efficiency via Element-level Common-sense Visual Question Answering
Abstract: A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite its expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy do not capture abstraction, the defining property of sketches. To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score that reflects semantic retention under simple visual representations. To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark providing systematic evaluation of element-level sketch understanding across various vision-language models.
PaperID: 1994,   Poster  https://arxiv.org/pdf/2510.02898    
Authors: Lorenzo Bianchi, Giacomo Pacini, Fabio Carrara, Nicola Messina, Giuseppe Amato, Fabrizio Falchi
Title: One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
Abstract: Zero-shot captioners are recently proposed models that utilize common-space vision-language representations to caption images without relying on paired image-text data. To caption an image, they proceed by textually decoding a text-aligned image feature, but they limit their scope to global representations and whole-image captions. We present a unified framework for zero-shot captioning that shifts from an image-centric to a patch-centric paradigm, enabling the captioning of arbitrary regions without the need for region-level supervision. Instead of relying on global image representations, we treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions, from single patches to non-contiguous areas and entire images. We analyze the key ingredients that enable current latent captioners to work in our proposed framework. Experiments demonstrate that backbones producing meaningful, dense visual features, such as DINO, are key to achieving state-of-the-art performance in multiple region-based captioning tasks. Compared to other baselines and state-of-the-art competitors, our models achieve better performance on zero-shot dense captioning and region-set captioning. We also introduce a new trace captioning task that further demonstrates the effectiveness of patch-wise semantic representations for flexible caption generation.
PaperID: 1995,   Poster  https://arxiv.org/pdf/2604.10963    
Authors: Ruiyang Li, Fang Liu, Licheng Jiao, Xinglin Xie, Jiayao Hao, Shuo Li, Xu Liu, Jingyi yang, Lingling Li, Puhua Chen, Wenping Ma
Title: Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models
Abstract: Medical image segmentation provides critical support for clinical workflows by precisely delineating anatomical structures and lesions. However, medical image datasets are often affected by acquisition noise and annotation ambiguity, leading to pervasive data uncertainty that substantially undermines model robustness. Existing studies focus primarily on model architectural improvements and predictive reliability estimation, while systematic exploration of the intrinsic data uncertainty remains insufficient. To address this gap, this work proposes leveraging the universal representation capabilities of visual foundation models to estimate inherent data uncertainty. Specifically, we analyze the feature diversity of the model's decoded representations and quantify their singular value energy to define the semantic perception scale for each class, thereby measuring sample difficulty and aleatoric uncertainty. Based on this foundation, we design two uncertainty-driven application strategies: (1) an aleatoric uncertainty-aware data filtering mechanism to eliminate potentially noisy samples and enhance model learning quality; (2) a dynamic uncertainty-aware optimization strategy that adaptively adjusts class-specific loss weights during training based on the semantic perception scale, combined with a label denoising mechanism to improve training stability. Experimental results on five public datasets encompassing CT and MRI modalities and involving multi-organ and tumor segmentation tasks demonstrate that our method achieves significant and robust performance improvements across various mainstream network architectures (CNN, Transformer, and Mamba), revealing the broad application potential of aleatoric uncertainty in medical image understanding and segmentation tasks. The code will be released.
PaperID: 1996,   Poster  https://arxiv.org/pdf/2603.02438    
Authors: Aymen Lassoued, Mohamed Ali Souibgui, Yousri Kessentini
Title: ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering
Abstract: Document Visual Question Answering (DocVQA) remains challenging for existing Vision-Language Models (VLMs), especially under complex reasoning and multi-step workflows. Current approaches struggle to decompose intricate questions into manageable sub-tasks and often fail to leverage specialized processing paths for different document elements. We present ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering, a novel multi-agent framework that addresses these limitations through strategic agent coordination and iterative refinement. ORCA begins with a reasoning agent that decomposes queries into logical steps, followed by a routing mechanism that activates task-specific agents from a specialized agent dock. Our framework leverages a set of specialized AI agents, each dedicated to a distinct modality, enabling fine-grained understanding and collaborative reasoning across diverse document components. To ensure answer reliability, ORCA employs a debate mechanism with stress-testing, and when necessary, a thesis-antithesis adjudication process. This is followed by a sanity checker to ensure format consistency. Extensive experiments on three benchmarks demonstrate that our approach achieves significant improvements over state-of-the-art methods, establishing a new paradigm for collaborative agent systems in vision-language reasoning. The code will be available upon acceptance.
PaperID: 1997,   Poster  https://arxiv.org/pdf/2603.00938    
Authors: SHRESHTH SAINI, Bowen Chen, Yilin Wang, Neil Birkbeck, Balu Adsumilli, Alan Bovik
Title: Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos
Abstract: High Dynamic Range (HDR) user-generated (UGC) videos are rapidly proliferating across social platforms, yet most perceptual video quality assessment (VQA) systems remain tailored to Standard Dynamic Range (SDR). HDR’s higher bit depth, wide color gamut, and elevated luminance range expose distortions such as near-black crushing, highlight clipping, banding, and exposure flicker that amplify UGC artifacts and challenge SDR models. To catalyze progress, we curate HDR-UGC-44K, a large-scale subjective dataset of ~44K videos from 6.5K sources with >1.5M crowd ratings, spanning diverse scenes, capture conditions, and compression settings. We further introduce HDR-Q, the first Multimodal Large Language Model (MLLM) for HDR-UGC VQA. We propose (i) a novel HDR-aware vision encoder to produce HDR-sensitive embeddings, and (ii) HDR-Aware Policy Optimization (HAPO), an RL finetuning framework that anchors reasoning to HDR cues. HAPO augments GRPO via an HDR–SDR contrastive KL that encourages token reliance on HDR inputs and a Gaussian-weighted regression reward for fine-grained MOS calibration. Across HDR-UGC-44K and public HDR-VQA benchmarks, HDR-Q delivers state-of-the-art performance. The dataset and code will be released.
PaperID: 1998,   Poster  https://arxiv.org/pdf/2508.03081    
Authors: Bo Zhang, Xu Xinan, Shuo Yan, Yu Bai, Zheng Zhang, Wufan Wang, Hui Gao, Wendong Wang
Title: Contrastive Cross-Bag Augmentation for Multiple Instance Learning-based Whole Slide Image Classification
Abstract: Recent pseudo-bag augmentation methods for Multiple Instance Learning (MIL)-based Whole Slide Image (WSI) classification sample instances from a limited number of bags, resulting in constrained diversity. To address this issue, we propose Contrastive Cross-Bag Augmentation (C²Aug) to sample instances from all bags with the same class to increase the diversity of pseudo-bags. However, introducing new instances into the pseudo-bag increases the number of critical instances (e.g., tumor instances). This increase results in a reduced occurrence of pseudo-bags containing few critical instances, thereby limiting model performance, particularly on test slides with small tumor areas. To address this, we introduce a bag-level and group-level contrastive learning framework to enhance the discrimination of features with distinct semantic meanings, thereby improving model performance. Experimental results demonstrate that C²Aug consistently outperforms state-of-the-art approaches across multiple evaluation metrics.
PaperID: 1999,   Poster  https://arxiv.org/pdf/2603.14726    
Authors: Gyeongsik Moon
Title: Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator
Abstract: Accurately recovering hand poses within the body context remains a major challenge in 3D whole-body pose estimation. This difficulty arises from a fundamental supervision gap: whole-body pose estimators are trained on full-body datasets with limited hand diversity, while hand-only estimators, trained on hand-centric datasets, excel at detailed finger articulation but lack global body awareness. To address this, we propose WholeBody++, a modular framework that leverages the strengths of both pre-trained whole-body and hand pose estimators. We introduce CHAM (Conditional Hands Modulator), a lightweight module that modulates the whole-body feature stream using hand-specific features extracted from a pre-trained hand pose estimator. This modulation enables the whole-body model to predict wrist orientations that are both accurate and coherent with the upper-body kinematic structure, without retraining the full-body model. In parallel, we directly incorporate finger articulations and hand shapes predicted by the hand pose estimator, aligning them to the full-body mesh via differentiable rigid alignment. This design allows WholeBody++ to combine globally consistent body reasoning with fine-grained hand detail. Extensive experiments demonstrate that WholeBody++ substantially improves hand accuracy and enhances overall full-body pose quality. Code and pretrained models will be released publicly.
PaperID: 2000,   Poster  https://arxiv.org/pdf/2603.07559    
Authors: Weijia Feng, Jingyu Yang, Ruojia Zhang, Fengtao Sun, Qian Gao, Chenyang Wang, tongtong Su, Jia Guo, Xiaobai Li, Minglai Shao
Title: Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning
Abstract: Micro-gestures are subtle and transient movements triggered by unconscious neural and emotional activities, holding great potential for human–computer interaction and clinical monitoring. However, their low amplitude, short duration, and strong inter-subject variability make existing deep models prone to degradation under low-sample, noisy, and cross-subject conditions. This paper presents an active inference–based framework for micro-gesture recognition, featuring Expected Free Energy (EFE)-guided temporal sampling and uncertainty-aware adaptive learning. The model actively selects the most discriminative temporal segments under EFE guidance, enabling dynamic observation and information gain maximization. Meanwhile, sample weighting driven by predictive uncertainty mitigates the effects of label noise and distribution shift. Experiments on the SMG dataset demonstrate the effectiveness of the proposed method, achieving consistent improvements across multiple mainstream backbones. Ablation studies confirm that both the EFE-guided observation and the adaptive learning mechanism are crucial to the performance gains. This work offers an interpretable and scalable paradigm for temporal behavior modeling under low-resource and noisy conditions, with broad applicability to wearable sensing, HCI, and clinical emotion monitoring.
PaperID: 2001,   Poster  https://arxiv.org/pdf/2512.03004    
Authors: Xiaoxue Chen, Ziyi Xiong, Yuantao Chen, Gen Li, Nan Wang, Hongcheng Luo, Long Chen, Haiyang Sun, Bing Wang, Guang Chen, Hongyang Li, Ya-Qin Zhang, Hangjun Ye, Hao Zhao
Title: DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images
Abstract: Autonomous driving needs fast, scalable 4D reconstruction and resimulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that the existing formulations, treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion/interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.
PaperID: 2002,   Poster  https://arxiv.org/pdf/2603.20470    
Authors: Zhuoling Li, Hossein Rahmani, Jiarui Zhang, Yu Xue, Majid Mirmehdi, Jason Kuen, Jiuxiang Gu, Jun Liu
Title: DiffGraph: An Automated Agent-driven Model Merging Framework for In-the-Wild Text-to-Image Generation
Abstract: The rapid growth of the text-to-image (T2I) community has fostered a thriving online ecosystem of expert models, which are variants of pretrained diffusion models specialized for diverse generative capabilities. Yet, existing model merging methods remain limited in fully leveraging abundant online expert resources and still struggle to meet diverse in-the-wild user needs. We present DiffGraph, a novel automated agent-driven graph-based model merging framework, which automatically harnesses online experts and flexibly merges them for diverse user needs. Our DiffGraph constructs a scalable graph and organizes ever-expanding online experts within it through node registration and calibration. Then, DiffGraph dynamically activates specific subgraphs based on user needs, enabling flexible combinations of different experts to achieve user-desired generation. Extensive experiments show the efficacy of our method.
PaperID: 2003,   Poster  https://arxiv.org/pdf/2603.22689    
Authors: Mingrui Chen, Hexiong Yang, Haogeng Liu, Huaibo Huang, Ran He
Title: Think 360°: Evaluating the Width-centric Reasoning Capability of MLLMs Beyond Depth
Abstract: In this paper, we present a holistic multimodal benchmark that evaluates the reasoning capabilities of MLLMs with an explicit focus on reasoning width, a complementary dimension to the more commonly studied reasoning depth. Specifically, reasoning depth measures the model’s ability to carry out long-chain, sequential reasoning in which each step is tightly and rigorously linked to the next. Reasoning width tends to focus more on the model’s capacity for broad trial-and-error search or multi-constrained optimization: it must systematically traverse many possible and parallelized reasoning paths, apply diverse constraints to prune unpromising branches, and identify valid solution routes for efficient iteration or backtracking. To this end, we carefully curate 1200+ high-quality multimodal cases spanning heterogeneous domains, and propose a fine-grained tree-of-thought evaluation protocol that jointly quantifies reasoning width and depth. We evaluate 12 major model families (over 30 advanced MLLMs) across difficulty tiers, question types, and required skills. Results show that while current models exhibit strong performance on general or common-sense VQA tasks, they still struggle to combine deep sequential thought chains with wide exploratory search to perform genuine insight-based reasoning. Finally, we analyze characteristic failure modes to provide possible directions for building MLLMs that reason not only deeper but also wider. Our code is available in the supplementary materials.
PaperID: 2004,   Poster  https://arxiv.org/pdf/2511.14070    
Authors: Junsik Kim, Gun Bang, Soowoong Kim
Title: ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders
Abstract: Hierarchical LiDAR geometry compression encodes voxel occupancies from low to high bit-depths, yet prior methods treat each depth independently and re-estimate local context from coordinates at every level, limiting compression efficiency. We present ELiC, a real-time framework that combines cross-bit-depth feature propagation, a Bag-of-Encoders (BoE) selection scheme, and a Morton-order-preserving hierarchy. Cross-bit-depth propagation reuses features extracted at denser, lower depths to support prediction at sparser, higher depths. BoE selects, per depth, the most suitable coding network from a small pool, adapting capacity to observed occupancy statistics without training a separate model for each level. The Morton hierarchy maintains global Z-order across depth transitions, eliminating per-level sorting and reducing latency. Together these components improve entropy modeling and computation efficiency, yielding state-of-the-art compression at real-time throughput on Ford and SemanticKITTI. Code and models will be released upon publication.
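The Morton-order property mentioned in the abstract above can be illustrated with a minimal sketch. This is a generic Z-order example, not ELiC's implementation; the function names `morton3d` and `coarsen` are hypothetical. It shows that dropping the lowest three bits of a voxel's Morton code yields the code of its parent voxel one bit-depth lower, so a sorted list stays sorted across depth levels without re-sorting.

```python
def morton3d(x: int, y: int, z: int, bits: int) -> int:
    """Interleave the low `bits` bits of x, y, z into one Z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def coarsen(sorted_codes: list[int]) -> list[int]:
    """Map sorted child codes to their parent codes one bit-depth lower.

    Shifting right by 3 removes the finest x/y/z bits. Because the input is
    sorted, the output is already sorted, so only adjacent duplicates need
    to be dropped.
    """
    parents: list[int] = []
    for c in sorted_codes:
        p = c >> 3
        if not parents or parents[-1] != p:
            parents.append(p)
    return parents


# Example voxels at bit-depth 3 (coordinates in 0..7); values are illustrative.
voxels = [(5, 3, 1), (0, 0, 0), (7, 7, 7), (4, 2, 0), (1, 6, 2)]
fine_codes = sorted(morton3d(x, y, z, 3) for x, y, z in voxels)
coarse_codes = coarsen(fine_codes)
```

The same relation runs in the other direction: the children of a parent with code `p` occupy codes `8 * p` through `8 * p + 7`, so expanding sorted parents in octant order also produces sorted children.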
PaperID: 2005,   Poster  https://arxiv.org/pdf/2603.18991    
Authors: Zening Sun, Zhengpeng Xie, Lichen Bai, Shitong Shao, Shuo Yang, Zeke Xie
Title: CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think
Abstract: Aligning diffusion models has achieved remarkable breakthroughs in generating high-quality, human preference-aligned images. Existing techniques, such as supervised fine-tuning (SFT) and DPO-style preference optimization, have become principled tools for fine-tuning diffusion models. However, SFT relies on high-quality images that are costly to obtain, while DPO-style methods depend on large-scale preference datasets, which are often inconsistent in quality. Beyond data dependency, these methods are further constrained by computational inefficiency. To address these two challenges, we propose Composite Reward Assisted Fine-Tuning (CRAFT), a lightweight yet powerful fine-tuning paradigm that requires significantly reduced training data while maintaining computational efficiency. It first leverages a Composite Reward Filtering (CRF) technique to construct a high-quality and consistent training dataset and then performs an enhanced variant of SFT. We also theoretically prove that CRAFT actually optimizes the lower bound of group-based reinforcement learning, establishing a principled connection between SFT with selected data and reinforcement learning. Our extensive empirical results demonstrate that CRAFT with only 100 samples can easily outperform recent SOTA preference optimization methods with thousands of preference-paired samples. Moreover, CRAFT can even achieve 11-220× faster convergence than the baseline preference optimization methods, highlighting its extremely high efficiency.
PaperID: 2006,   Poster  https://arxiv.org/pdf/2602.19910    
Authors: Wei He, Xianghan Meng, Zhiyuan Huang, Xianbiao Qi, Rong Xiao, CHUNGUANG LI
Title: MultiModal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery
Abstract: Generalized Category Discovery (GCD) aims to identify both known and unknown categories, with only partial labels given for the known categories, posing a challenging open-set recognition problem. State-of-the-art approaches to the GCD task are usually built on multi-modality representation learning, which depends heavily on inter-modality alignment. However, few of them enforce proper intra-modality alignment to produce the desired underlying structure of the representation distributions. In this paper, we propose a novel and effective multi-modal representation learning framework for GCD via Semi-Supervised Rate Reduction, called SSR^2-GCD, which learns cross-modality representations with desired structural properties by emphasizing proper alignment of intra-modality relationships. Moreover, to boost knowledge transfer, we integrate prompt candidates by leveraging the inter-modal alignment offered by Vision Language Models. We conduct extensive experiments on generic and fine-grained benchmark datasets, demonstrating the superior performance of our approach.
PaperID: 2007,   Poster  https://arxiv.org/pdf/2512.19526    
Authors: Puyin Li, Tiange Xiang, Ella Mao, Shirley Wei, Xinye Chen, Adnan Masood, Li Fei-Fei, Ehsan Adeli
Title: QUANTIPHY: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
Abstract: Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can perform quantitative physical reasoning tasks. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from videos. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video–text instances with numerical ground truth, QuantiPhy evaluates a VLM's performance on estimating an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors like background noise, counterfactual priors, and strategic prompting and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs when reasoning about objects’ kinematic properties. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.
PaperID: 2008,   Poster  https://arxiv.org/pdf/2506.01085    
Authors: Qian Yang, Shivam Chandhok, Oscar Mañas, Kanishk Jain, Aishwarya Agrawal, Leonid Sigal
Title: Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
Abstract: Instruction tuning has been central to the success of recent vision-language models (VLMs), but it remains expensive, requiring large-scale datasets, high-quality annotations, and large compute budgets. We propose PRioritized cOncept learninG via Relative Error-driven Sample Selection (PROGRESS), a data- and compute-efficient framework that enables VLMs to dynamically select what to learn next based on their evolving needs during training. At each stage, the model tracks its learning progress across skills and selects the most informative samples: those it has not already mastered and that are not too difficult to learn at the current state of training. This strategy effectively controls skill acquisition and the order in which skills are learned. Specifically, we sample from skills showing the highest learning progress, prioritizing those with the most rapid improvement. Unlike prior methods, PROGRESS requires no upfront answer annotations, querying answers only on an as-needed basis, and avoids reliance on additional supervision from an auxiliary VLM or on compute-heavy gradient computations for data selection. Experiments across multiple instruction-tuning datasets of varying scales demonstrate that PROGRESS consistently outperforms state-of-the-art baselines with much less data and supervision. Additionally, we show strong cross-architecture generalization to different VLMs and transferability to larger models, validating PROGRESS as a scalable solution for efficient learning.
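The progress-driven skill selection described in the abstract above might be sketched as follows. The abstract does not specify the progress measure or the sampling rule, so everything here is an assumption for illustration: progress is taken as the change in per-skill accuracy between two training stages, and a softmax with a hypothetical `temperature` parameter turns it into sampling probabilities. The function name `skill_sampling_probs` and the skill names are likewise hypothetical.

```python
import math


def skill_sampling_probs(prev_acc: dict[str, float],
                         curr_acc: dict[str, float],
                         temperature: float = 0.1) -> dict[str, float]:
    """Turn per-skill accuracy changes into sampling probabilities.

    Assumption: "learning progress" is the accuracy gain since the previous
    stage. Skills already mastered and skills still too hard both show little
    gain, so both receive low probability.
    """
    progress = {s: curr_acc[s] - prev_acc[s] for s in curr_acc}
    top = max(progress.values())  # subtract the max for numerical stability
    weights = {s: math.exp((p - top) / temperature) for s, p in progress.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


# Illustrative numbers only, not results from the paper.
prev = {"counting": 0.20, "ocr": 0.50, "spatial": 0.90}
curr = {"counting": 0.50, "ocr": 0.55, "spatial": 0.90}
probs = skill_sampling_probs(prev, curr)
```

In this toy setting, the fast-improving skill dominates the next sampling round, while the saturated skill is sampled least often.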
PaperID: 2009,   Poster  https://arxiv.org/pdf/2511.14625    
Authors: Qingwei Ben, Botian Xu, Kailin Li, Feiyu Jia, Wentao Zhang, Jingping Wang, Jingbo Wang, Dahua Lin, Jiangmiao Pang
Title: Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3-D Constrained Terrains
Abstract: Robust humanoid locomotion requires accurate and globally consistent perception of the surrounding 3D environment. However, existing perception modules, mainly based on depth images or elevation maps, offer only partial and locally flattened views of the environment, failing to capture the full 3D structure. This paper presents Gallant, a voxel-grid–based framework for humanoid locomotion and local navigation in 3D constrained terrains. It leverages voxelized LiDAR data as a lightweight and structured perceptual representation, and employs a z-grouped 2D CNN to map this representation to the control policy, enabling fully end-to-end optimization. A high-fidelity LiDAR simulation that dynamically generates realistic observations is developed to support scalable, LiDAR-based training and ensure sim-to-real consistency. Experimental results show that Gallant’s broader perceptual coverage facilitates the use of a single policy that goes beyond the limitations of previous methods confined to ground-level obstacles, extending to lateral clutter, overhead constraints, multi-level structures, and narrow passages. Gallant is also the first to achieve near-100% success rates in challenging scenarios such as stair climbing and stepping onto elevated platforms through improved end-to-end optimization. This project will be fully open-sourced.
PaperID: 2010,   Poster  https://arxiv.org/pdf/2511.19524    
Authors: Boyu Chen, Zikang Wang, Zhengrong Yue, Kainan Yan, Chenyun Yu, Yi Huang, Zijun Liu, Yafei Wen, Xiaoxin Chen, Yang Liu, Peng Li, Yali Wang
Title: VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning
Abstract: By leveraging tool-augmented Multimodal Large Language Models (MLLMs), multi-agent frameworks are driving progress in video understanding. However, most of them adopt static and non-learnable tool invocation mechanisms, which limit the discovery of diverse clues essential for robust perception and reasoning regarding temporally or spatially complex videos. To address this challenge, we propose a novel multi-agent system for video understanding, namely VideoChat-M1. Instead of using a single and fixed policy, VideoChat-M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes: (1) Policy Generation: Each agent generates its unique tool invocation policy tailored to the user's query; (2) Policy Execution: Each agent sequentially invokes relevant tools to execute its policy and explore the video content; (3) Policy Communication: During the intermediate stages of policy execution, agents interact with one another to update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from peers to effectively respond to the user’s query. Moreover, we equip our CPP paradigm with a concise Multi-Agent Reinforcement Learning (MARL) method. Consequently, the team of policy agents can be jointly optimized to enhance VideoChat-M1's performance, guided by both the final answer reward and intermediate collaborative process feedback. Extensive experiments demonstrate that VideoChat-M1 achieves SOTA performance across eight benchmarks spanning four tasks. Notably, on LongVideoBench, our method outperforms the top baseline by 3.6% and surpasses GPT-4o by 15.6%.
PaperID: 2011,   Poster  https://arxiv.org/pdf/2511.15690    
Authors: yushi Huang, Zining Wang, Zhihang Yuan, Yifu Ding, RUIHAO GONG, Jinyang Guo, Xianglong Liu, Jun Zhang
Title: MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Abstract: Mixture-of-Experts (MoE) Multimodal large language models (MLLMs) excel at vision–language tasks, but they suffer from computational inefficiency. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods, originally designed for unimodal large language models (LLMs), to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% of experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16× and the decoding time by 1.26×.
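The two ideas named in the abstract above, modulating local routing probabilities by a layer-wise importance and thresholding each modality separately, can be sketched roughly as below. The abstract does not give the exact formulas, so this is an assumed simplification: importance is the product of a per-layer scalar and the routing probability, and an expert is kept when that product reaches the threshold for the token's modality. The function name `experts_to_keep` and all parameter names and values are hypothetical.

```python
def experts_to_keep(routing_probs: list[list[float]],
                    layer_importance: float,
                    is_vision: list[bool],
                    tau_text: float,
                    tau_vision: float) -> list[list[int]]:
    """Return, per token, the indices of routed experts that are not skipped.

    routing_probs[t][e] is the router probability of expert slot e for token t
    in one MoE layer. Assumed score: layer_importance * routing probability.
    """
    kept = []
    for probs, vision in zip(routing_probs, is_vision):
        tau = tau_vision if vision else tau_text  # modality-specific threshold
        kept.append([e for e, p in enumerate(probs)
                     if layer_importance * p >= tau])
    return kept


# One text token and one vision token with identical routing probabilities.
routing = [[0.6, 0.3, 0.1], [0.6, 0.3, 0.1]]
schedule = experts_to_keep(routing, layer_importance=0.5,
                           is_vision=[False, True],
                           tau_text=0.1, tau_vision=0.2)
```

With these illustrative thresholds, the vision token keeps fewer experts than the text token even though the router scored both identically, which is the kind of modality-dependent behavior a single shared threshold cannot express.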
PaperID: 2012,   Poster  https://arxiv.org/pdf/2603.20725    
Authors: Zihao Wang, Yuxiang Wei, Xinpeng Zhou, Tianyu Zhang, Tao Liang, Yalong Bai, Hongzhi Zhang, Wangmeng Zuo
Title: Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation
Abstract: Text-to-image generation has advanced rapidly, yet it still struggles to capture nuanced user preferences. Existing approaches typically rely on multimodal large language models to infer user preferences, but the derived prompts or latent codes rarely reflect them faithfully, leading to suboptimal personalization. We present Premier, a novel preference modulation framework for personalized image generation. Premier represents each user's preference as a learnable embedding and introduces a preference adapter that fuses the user embedding with the text prompt. To enable accurate and fine-grained preference control, the fused preference embedding is further used to modulate the generative process. To enhance the distinctness of individual preference and improve alignment between outputs and user-specific styles, we incorporate a dispersion loss that enforces separation among user embeddings. When user data are scarce, new users are represented as linear combinations of existing preference embeddings learned during training, enabling effective generalization. Experiments show that Premier outperforms prior methods under the same history length, achieving stronger preference alignment and superior performance on text consistency, ViPer proxy metrics, and expert evaluations.
PaperID: 2013,   Poster  https://arxiv.org/pdf/2603.16455    
Authors: Li Weiqing, Jinyue Guo, Yaqi Wang, HAIYANG XIAO, Yuewei Zhang, Guohua Liu, Hao Henry Wang
Title: Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval
Abstract: Visual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model’s dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel viewpoint-pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from the above collaboration scenario is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model’s evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%, proving the efficacy of our evolutionary approach.
PaperID: 2014,   Poster  https://arxiv.org/pdf/2511.05038    
Authors: Zhengxuan Li, Qinhui Yang, Yiyu Zhuang, Chuan Guo, Xinxin Zuo, Xiao-Xiao Long, Yao Yao, Xun Cao, Qiu Shen, Hao Zhu
Title: Pressure2Motion: Hierarchical Human Motion Reconstruction from Ground Pressure with Text Guidance
Abstract: We present Pressure2Motion, a novel motion capture algorithm that reconstructs human motion from a ground pressure sequence and text prompt. At inference time, Pressure2Motion requires only a pressure mat, eliminating the need for specialized lighting setups, cameras, or wearable devices, making it suitable for privacy-preserving, low-light, and low-cost motion capture scenarios. Such a task is severely ill-posed due to the indeterminacy of pressure signals with respect to full-body motion. To address this issue, we introduce Pressure2Motion, a generative model that leverages pressure features as input and utilizes a text prompt as a high-level guiding constraint to resolve ambiguities. Specifically, our model adopts a dual-level feature extractor to accurately interpret pressure data, followed by a hierarchical diffusion model that discerns broad-scale movement trajectories and subtle posture adjustments. Both the physical cues gained from the pressure sequence and the semantic guidance derived from descriptive texts are leveraged to guide the motion estimation with precision. To the best of our knowledge, Pressure2Motion is a pioneering work in leveraging both pressure data and linguistic priors for motion reconstruction, and the established MPL benchmark is the first benchmark for this novel motion capture task. Experiments show that our method generates high-fidelity, physically plausible motions, establishing a new state of the art for this task. The code and benchmarks will be publicly released upon publication.
PaperID: 2015,   Poster  https://arxiv.org/pdf/2604.12309    
Authors: Rong Wang, Ruyi Zha, Ziang Cheng, Jiayu Yang, Pulak Purkait, Hongdong Li
Title: Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
Abstract: We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such a mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as an overall structural guidance, and (ii) a set of latent images projected from volumetric features to provide view-dependent and fine-grained geometry details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features can model complete object shapes, and help to improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter to inject feature tokens into the base video model via cross-attention, which retains its capabilities from general video pretraining and enables a simple and model-agnostic fine-tuning process. Extensive experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism and multi-view consistency compared to state-of-the-art methods, and robustly generalizes to complex camera trajectories and in-the-wild images.
PaperID: 2016,   Poster  https://arxiv.org/pdf/2602.22376    
Authors: Hanyang Liu, Rongjun Qin
Title: AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction
Abstract: Recent advances in 4D scene reconstruction have greatly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed. To this end, we present AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos. AeroDGS introduces a Monocular Geometry Lifting module that reconstructs reliable static and dynamic geometry from a single aerial sequence, providing a robust basis for dynamic estimation. To further resolve monocular ambiguity, we propose a Physics-Guided Optimization module that incorporates differentiable ground-support, upright-stability, and trajectory-smoothness priors, transforming ambiguous image cues into physically consistent motion. The framework jointly refines static backgrounds and dynamic entities with stable geometry and coherent temporal evolution. We additionally build a real-world UAV dataset that spans various altitudes and motion conditions to evaluate dynamic aerial reconstruction. Experiments on synthetic and real UAV scenes demonstrate that AeroDGS outperforms state-of-the-art methods, achieving superior reconstruction fidelity in dynamic aerial environments.
PaperID: 2017,   Poster  https://arxiv.org/pdf/2507.14533    
Authors: Shuo Cao, Nan Ma, Jiayang Li, Xiaohui Li, Lihao Shao, Kaiwen Zhu, Yu Zhou, Yuandong Pu, Jiarui Wu, Jiaquan Wang, Bo.Qu Bo.Qu, Wenhai Wang, Yu Qiao, Dajuin Yao, Yihao Liu
Title: ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding
Abstract: The rapid advancement of educational applications, artistic creation, and AI-generated content (AIGC) technologies has substantially increased practical requirements for comprehensive Image Aesthetics Assessment (IAA), particularly demanding methods capable of delivering both quantitative scoring and professional understanding. Multimodal Large Language Model (MLLM)-based IAA methods demonstrate stronger perceptual and generalization capabilities compared to traditional approaches, yet they suffer from modality bias (score-only or text-only) and lack fine-grained attribute decomposition, thereby failing to support further aesthetic assessment. In this paper, we present: (1) ArtiMuse, an innovative MLLM-based IAA model with Joint Scoring and Expert-Level Understanding capabilities; (2) ArtiMuse-10K, the first expert-curated IAA dataset comprising 10,000 images spanning 5 main categories and 15 subcategories, each annotated by professional experts with 8-dimensional attribute analysis and a holistic score. Both the model and dataset will be made public.
PaperID: 2018,   Poster  https://arxiv.org/pdf/2512.04678    
Authors: Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang
Title: Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Abstract: Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
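The EMA-Sink update described in the abstract above can be illustrated with a toy cache. This is a minimal sketch under stated assumptions, not the paper's implementation: tokens are plain lists of floats, the sink is a single vector rather than a set of tokens, and the class name `EMASinkCache` and the `decay` value are hypothetical. The point is only the mechanism: when a token leaves the sliding window, it is blended into a fixed-size sink by an exponential moving average instead of being discarded.

```python
from collections import deque


class EMASinkCache:
    """Sliding-window token cache with a fixed-size EMA-updated sink."""

    def __init__(self, sink: list[float], window_size: int, decay: float):
        self.sink = list(sink)          # initialized from the initial frames
        self.window: deque[list[float]] = deque()
        self.window_size = window_size
        self.decay = decay              # assumed EMA coefficient in (0, 1)

    def push(self, token: list[float]) -> None:
        """Add a new token; fuse the evicted one into the sink if the window is full."""
        self.window.append(token)
        if len(self.window) > self.window_size:
            evicted = self.window.popleft()
            self.sink = [self.decay * s + (1.0 - self.decay) * e
                         for s, e in zip(self.sink, evicted)]


cache = EMASinkCache(sink=[1.0, 1.0], window_size=2, decay=0.5)
for tok in ([0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0]):
    cache.push(tok)
```

The sink stays the same size no matter how long the stream runs, and each update is one elementwise blend, which is consistent with the abstract's statement that the scheme adds no extra computation cost to attention.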
PaperID: 2019,   Poster  https://arxiv.org/pdf/2512.11336    
Authors: Hewen Pan, Cong Wei, liang da shuang, Zepeng Huang, Gaopengfei Gaopengfei, Ziqi Zhou, Lulu Xue, Pengfei Yan, Xiaoming Wei, Minghui Li, Shengshan Hu
Title: UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models
Abstract: With the advancement of multimodal Large Language Models (LLMs), Video LLMs have been further developed to perform holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks, failing to achieve comprehensive, multi-grained video perception. To bridge this gap, we introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities. Specifically, we design unified visual-language guided alignment to flexibly handle video understanding across global, pixel and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates the textual response, temporal localization, or grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct UFVideo-Bench, consisting of three distinct collaborative tasks within the scales, which demonstrates UFVideo's flexibility and advantages over GPT-4o. Furthermore, we validate the effectiveness of our model across 9 public benchmarks covering various common video understanding tasks, providing valuable insights for future Video LLMs.
PaperID: 2020,   Poster  https://arxiv.org/pdf/2601.11508    
Authors: Emily Steiner, Jianhao Zheng, Henry Howard-Jenkins, Chris Xie, Iro Armeni
Title: ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
Abstract: Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining consistent instance identities across intermittently captured 3D scans with unobserved change or, equivalently, performing 4D indoor semantic instance segmentation (SIS), the joint task of segmenting, identifying, and temporally associating object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and 4D LiDAR approaches, which show limited performance due to their reliance on continuous temporal measurements, which are uncommon in indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. It explores temporal fusion strategies to share information across observations, demonstrating that this shared context not only enables consistent instance tracking but also improves standard 3DSIS quality. To rigorously evaluate this task, we define a new metric that extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.
PaperID: 2021,   Poster  https://arxiv.org/pdf/2512.21865    
Authors: Yihan Hu, Xuelin Chen, Xiaodong Cun
Title: EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition
Abstract: Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. Our key insight is that, if a video inpainting model can be finetuned to remove the foreground-associated effects, then it must be inherently capable of perceiving these effects, and hence can also be finetuned for the complementary task: foreground layer decomposition with associated effects. However, although naïvely finetuning the inpainting model with LoRA applied to all blocks can produce high-quality alpha mattes, it fails to capture associated effects. Our systematic analysis reveals this arises because effect-related cues are primarily encoded in specific DiT blocks and become suppressed when LoRA is applied across all blocks. To address this, we introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. Concretely, we finetune a pretrained video inpainting diffusion model to learn dual complementary experts while keeping its original weights intact: an Effect Expert, where LoRA is applied only to effect-sensitive DiT blocks to capture the coarse structure of the foreground and associated effects, and a fully LoRA-finetuned Quality Expert that learns to refine the alpha matte. During sampling, the Effect Expert is used for denoising at early, high-noise steps, while the Quality Expert takes over at later, low-noise steps. This design eliminates the need for two full diffusion passes, significantly reducing computational cost without compromising output quality. Ablation studies validate the effectiveness of this Dual-Expert strategy. Experiments demonstrate that EasyOmnimatte sets a new state-of-the-art for video omnimatte and enables various downstream tasks, significantly outperforming baselines in both quality and efficiency.
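The hand-over between the two experts during sampling can be sketched as a step schedule. This is only an illustration of the scheduling idea: the abstract does not state where the switch happens, so `switch_fraction` is a hypothetical parameter, and `pick_expert`, `sample`, and the `experts` mapping are invented names standing in for the real denoisers.

```python
def pick_expert(step, num_steps, switch_fraction=0.5):
    """Return which expert denoises at `step` (0 = first, highest-noise step)."""
    if not 0 <= step < num_steps:
        raise ValueError("step out of range")
    return "effect" if step < switch_fraction * num_steps else "quality"


def sample(latent, num_steps, experts, switch_fraction=0.5):
    """Run one denoising pass, handing over from the effect to the quality expert."""
    for step in range(num_steps):
        denoise = experts[pick_expert(step, num_steps, switch_fraction)]
        latent = denoise(latent, step)
    return latent
```

Because each step calls exactly one expert, the total number of denoiser calls equals `num_steps`, which matches the abstract's point that two full diffusion passes are not needed.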
PaperID: 2022,   Poster  https://arxiv.org/pdf/2506.07985    
Authors: Tuomas Oikarinen, Ge Yan, Akshay R. Kulkarni, Tsui-Wei Weng
Title: Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability
Abstract: Interpreting individual neurons or directions in activation space is an important topic in mechanistic interpretability. Numerous automated interpretability methods have been proposed to generate such explanations, but it remains unclear how reliable these explanations are, and which methods produce the most accurate descriptions. While crowdsourced evaluations are commonly used, existing pipelines are noisy, costly, and typically assess only the highest-activating inputs, leading to unreliable results. In this paper, we introduce two techniques to enable cost-effective and accurate crowdsourced evaluation of automated interpretability methods beyond top activating inputs. First, we propose Model-Guided Importance Sampling (MG-IS) to select the most informative inputs to show human raters. In our experiments, we show this reduces the number of inputs needed to reach the same evaluation accuracy by ~13×. Second, we address label noise in crowdsourced ratings through Bayesian Rating Aggregation (BRAgg), which allows us to reduce the number of ratings per input required to overcome noise by ~3×. Together, these techniques reduce the evaluation cost by ~40×, making large-scale evaluation feasible. Finally, we use our methods to conduct a large-scale crowdsourced study comparing recent automated interpretability methods for vision networks.
PaperID: 2023,   Poster  https://arxiv.org/pdf/2512.17495    
Authors: Rang Li, Lei Li, Shuhuai Ren, Hao Tian, Shuhao Gu, Shicheng Li, Zihao Yue, Yudong Wang, Wenhan Ma, Zhe Yang, Jingyuan Ma, Zhifang Sui, Fuli Luo
Title: GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
Abstract: Visual grounding—localizing objects from natural language descriptions—represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly ground language in vision with humanlike sophistication, or are they merely pattern-matching on simplified datasets? Current benchmarks fail to capture real-world complexity where humans effortlessly navigate ambiguous references and recognize when grounding is impossible. To rigorously assess MLLMs' true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative—distinguishing highly similar objects, (2) Spatial—understanding complex relational descriptions, (3) Limited—handling occlusions or tiny objects, and (4) Rejection—recognizing ungroundable queries. Through careful curation combining automated generation with human verification, we create 1,005 challenging examples mirroring real-world complexity. Evaluating 25 state-of-the-art MLLMs reveals a profound capability gap: the best model achieves only 45.1% accuracy, while most score 0% on rejection tasks—reflexively hallucinating objects rather than acknowledging their absence, raising critical safety concerns for deployment. We explore two strategies for improvements: (1) test-time scaling selects optimal response by thinking trajectory to improve complex grounding by up to 2.9%, and (2) data-mixture training teaches models to recognize ungroundable queries, boosting rejection accuracy from 0% to 27.9%. GroundingME thus serves as both a diagnostic tool revealing current limitations in MLLMs and a roadmap toward human-level visual grounding.
PaperID: 2024,   Poster  https://arxiv.org/pdf/2603.23885    
Authors: Gengluo Li, Pengyuan Lyu, Chengquan Zhang, Huawen Shen, Liang Wu, Xingyu Wan, Gangyan Zeng, Han Hu, Can Ma, Yu ZHOU
Title: Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
Abstract: Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or nonstandard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions—primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data–training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.
PaperID: 2025,   Poster  https://arxiv.org/pdf/2602.20794    
Authors: Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, Long Chen
Title: VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving
Abstract: The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in their mediocre performance. While some promising approaches attempt to mitigate this by constructing Q&A data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the VLM's 2D visual features, we introduce a plug-and-play Cross-View 3D Geometric Enabler (CVGE). The CVGE decouples the base VLM architecture and effectively empowers the VLM with 3D features through a hierarchical adaptive injection mechanism. Extensive experiments show that VGGDrive enhances base VLM performance across five autonomous driving benchmarks, including tasks like cross-view risk perception, motion prediction, and trajectory planning. It’s our belief that mature 3D foundation models can empower autonomous driving tasks through effective integration, and we hope our initial exploration demonstrates the potential of this paradigm to the autonomous driving community.
PaperID: 2026,   Poster  https://arxiv.org/pdf/2603.10354    
Authors: Boyu He, Yunfan Ye, Chang Liu, Weishang Wu, FANG LIU, Zhiping Cai
Title: StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References
Abstract: Despite the advancements in diffusion-based image style transfer, existing methods are commonly limited by 1) semantic gap: the style reference could miss proper content semantics, causing uncontrollable stylization; 2) reliance on extra constraints (e.g., semantic masks) restricting applicability; 3) rigid feature associations lacking adaptive global-local alignment, failing to balance fine-grained stylization and global content preservation. These limitations, particularly the inability to flexibly leverage style inputs, fundamentally restrict style transfer in terms of personalization, accuracy, and adaptability. To address these, we propose StyleGallery, a training-free and semantic-aware framework that supports arbitrary reference images as input and enables effective personalized customization. It comprises three core stages: semantic region segmentation (adaptive clustering on latent diffusion features to divide regions without extra inputs); clustered region matching (block filtering on extracted features for precise alignment); and style transfer optimization (energy function–guided diffusion sampling with regional style loss to optimize stylization). Experiments on our introduced benchmark demonstrate that StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when leveraging multiple style references. The source code and dataset have been released.
PaperID: 2027,   Poster  https://arxiv.org/pdf/2603.21864    
Authors: Yuyang You, Yongzhi Li, Jiahui Li, Yadong Mu, Quan Chen, Peng Jiang
Title: Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation
Abstract: Video generation has recently emerged as a central task in the field of generative AI. However, the substantial computational cost inherent in video synthesis makes model distillation a critical technique for efficient deployment. Despite its significance, there is a scarcity of methods specifically designed for video diffusion models. Prevailing approaches often directly adapt image distillation techniques, which frequently lead to artifacts such as oversaturation, temporal inconsistency, and mode collapse. To address these challenges, we propose a novel distillation framework tailored specifically for video diffusion models. Its core innovations include: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts arising from excessive distribution shifts; (2) a temporal regularization loss to counteract temporal collapse, promoting smooth and physically plausible sampling trajectories; and (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Extensive experiments and ablation studies on the VBench and VBench2 benchmarks demonstrate that our method achieves stable few-step video synthesis, significantly enhancing perceptual fidelity and motion realism. It consistently outperforms existing distillation baselines across multiple metrics.
PaperID: 2028,   Poster  https://arxiv.org/pdf/2603.21208    
Authors: Haolun Zheng, Yu He, Tailun Chen, Shuo Shao, Zhixuan Chu, Hongbin zhou, Lan Tao, Zhan Qin, Kui Ren
Title: JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization
Abstract: Text-to-image (T2I) models such as Stable Diffusion and DALL-E remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale and costly RL-trained generators. Motivated by these limitations, we propose JANUS, a lightweight framework that formulates jailbreak as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.
PaperID: 2029,   Poster  https://arxiv.org/pdf/2603.25977    
Authors: Gustavo Chau, Mohammad H. Abbasi, Camila Blank, Juze Zhang, Alan Q. Wang, Sophie Ostmeier, Akshay Chaudhari, Kilian Pohl, Ehsan Adeli
Title: Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)
Abstract: Diffusion Magnetic Resonance Imaging (dMRI) plays a critical role in studying microstructural changes in the brain. It is, therefore, widely used in clinical practice; yet progress in learning general-purpose representations from dMRI has been limited. A key challenge is that existing deep learning approaches are not well-suited to capture the unique properties of diffusion signals. Brain dMRI is normally composed of several brain volumes, each with different attenuation characteristics dependent on the direction and strength of the diffusion-sensitized gradients. Thus, there is a need to jointly model spatial, diffusion-weighting, and directional dependencies in dMRI. Furthermore, varying acquisition protocols (e.g., differing numbers of directions) further limit traditional models. To address these gaps, we introduce a diffusion space rotary positional embedding (D-RoPE) plugged into our dMRI transformer to capture both the spatial structure and directional characteristics of diffusion data, enabling robust and transferable representations across diverse acquisition settings and an arbitrary number of diffusion directions. After self-supervised masked autoencoding pretraining, tests on several downstream tasks show that the learned representations and the pretrained model can provide competitive or superior performance compared to several baselines (even compared to a fully trained baseline); the finetuned features from our pretrained encoder resulted in a 6% higher accuracy in classifying mild cognitive impairment and a 0.05 increase in the correlation coefficient when predicting cognitive scores.
PaperID: 2030,   Poster  https://arxiv.org/pdf/2603.29941    
Authors: Vanessa Emanuela Guarino, Claudia Winklmayr, Jannik Franzen, Josef Rumberger, Manuel Pfeuffer, Sonja Greven, Klaus Maier-Hein, Dagmar Kainmueller, Christoph Karg, Carsten T. Lüth
Title: Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance
Abstract: Uncertainty quantification (UQ) is crucial for ensuring the reliability of automated image segmentations in safety-critical domains like biomedical image analysis or autonomous driving. UQ generates pixel-wise uncertainty maps that must be aggregated into scalar scores for downstream tasks like OoD- or failure-detection. Despite widespread use of aggregation strategies, their properties and impact on downstream task performance have not yet been comprehensively studied. Global Average is the default choice, yet it does not account for spatial and structural features of uncertainty estimates. Alternatives like patch-, class- and threshold-based strategies exist, but lack systematic comparison, leading to inconsistent reporting and unclear best practices. We address this gap by (1) formally analyzing properties, limitations, and pitfalls of common strategies; (2) proposing novel strategies that incorporate spatial uncertainty structure and (3) benchmarking their performance on OoD and failure detection across ten datasets that vary in image geometry and structure. We find that aggregators leveraging spatial structure yield stronger performance in both downstream tasks studied. However, performance of individual aggregators is highly dependent on dataset characteristics, thus we propose a meta aggregator that integrates multiple aggregators and shows robust performance across datasets. To foster reproducibility, we release an open-source Python package for benchmarking uncertainty aggregation methods.
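The contrast between Global Average and a patch-based aggregator can be shown with a small sketch. It is a generic illustration, not the paper's proposed aggregators or its released package: the function names are invented, and the patch strategy shown (maximum over non-overlapping tile means) is one assumed variant among many.

```python
def global_average(umap):
    """Mean uncertainty over every pixel of a 2D map (list of rows)."""
    values = [v for row in umap for v in row]
    return sum(values) / len(values)


def max_patch_average(umap, patch):
    """Largest mean uncertainty over non-overlapping patch x patch tiles."""
    height, width = len(umap), len(umap[0])
    best = float("-inf")
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [
                umap[r][c]
                for r in range(top, min(top + patch, height))
                for c in range(left, min(left + patch, width))
            ]
            best = max(best, sum(tile) / len(tile))
    return best


# A 4x4 map whose uncertainty is concentrated in one 2x2 corner.
example_map = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
```

On `example_map`, `global_average` gives 0.25 while `max_patch_average(example_map, 2)` gives 1.0, so the localized region that the average dilutes stays visible to the patch-based score.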
PaperID: 2031,   Poster  https://arxiv.org/pdf/2512.16921    
Authors: Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan L. Yuille, Wen-Sheng Chu
Title: Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
Abstract: Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM finetunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
PaperID: 2032,   Poster  https://arxiv.org/pdf/2601.10553    
Authors: Jianhao Yuan, Zhang Xiaofeng, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano
Title: Inference-time Physics Alignment of Video Generative Models with Latent World Models
Abstract: State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, as validated by a human preference study. Notably, on the challenging PhysicsIQ benchmark we achieve a final score of 62.00%, outperforming the previous state of the art by 6.78%. Our work demonstrates the viability of using latent world models to improve physical plausibility of video generation, beyond this specific instantiation or parameterization.
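The simplest form of reward-guided search over candidates is best-of-N selection, sketched below. This is a generic illustration rather than the WMReward procedure: the abstract also mentions steering trajectories, which this sketch does not cover, and `best_of_n` and `reward_fn` are hypothetical names (in practice `reward_fn` would stand in for a score derived from a latent world model).

```python
def best_of_n(candidates, reward_fn):
    """Return the candidate with the highest reward, plus that reward."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(reward_fn(c), i) for i, c in enumerate(candidates)]
    # max() keeps the first entry on ties, so earlier candidates win ties.
    best_reward, best_index = max(scored, key=lambda pair: pair[0])
    return candidates[best_index], best_reward
```

Raising the number of candidates is one way to spend more test-time compute: every extra candidate costs one more generation and one more reward evaluation.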
PaperID: 2033,   Poster  https://arxiv.org/pdf/2507.10610    
Authors: Zihe Yan, Zhuosheng Zhang, Jiaping Gui, Gongshen Liu
Title: LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents
Abstract: Graphical user interface (GUI) agents built on multimodal large language models (MLLMs) have recently demonstrated strong decision-making abilities in screen-based interaction tasks. However, they remain highly vulnerable to pop-up-based environmental injection attacks, where malicious visual elements divert model attention and lead to unsafe or incorrect actions. Existing defense methods either require costly retraining or perform poorly under inductive interference. In this work, we systematically study how such attacks alter the attention behavior of GUI agents and uncover a layer-wise attention divergence pattern between correct and incorrect outputs. Based on this insight, we propose LaSM, a Layer-wise Scaling Mechanism that selectively amplifies attention and MLP modules in critical layers. LaSM improves the alignment between model saliency and task-relevant regions without additional training. Extensive experiments across multiple datasets demonstrate that our method significantly improves the defense success rate and exhibits strong robustness, while having negligible impact on the model's general capabilities. Our findings reveal that attention misalignment is a core vulnerability in MLLM agents and can be effectively addressed through selective layer-wise modulation.
PaperID: 2034,   Poster  https://arxiv.org/pdf/2603.01685    
Authors: Shitong Shao, Yufei Gu, Zeke Xie
Title: FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters
Abstract: The recent advent of powerful video generation models, such as Hunyuan, WanX, Veo3, and Kling, has inaugurated a new era in the field. However, the practical deployment of these models is severely impeded by their substantial computational overhead, which stems from enormous parameter counts and the iterative, multi-step sampling process required during inference. Prior research on accelerating generative models has predominantly followed two distinct trajectories: reducing the number of sampling steps (e.g., LCM, DMD, and MagicDistillation) or compressing the model size for more efficient inference (e.g., ICMD). The potential of simultaneously compressing both to create a fast and lightweight model remains an unexplored avenue. In this paper, we propose FastLightGen, an algorithm that transforms large, computationally expensive models into fast, lightweight counterparts. The core idea is to construct an optimal teacher model, one engineered to maximize student performance, within a synergistic framework for distilling both model size and inference steps. Our extensive experiments on HunyuanVideo-ATI2V and WanX-TI2V reveal that a generator using 4-step sampling and 30% parameter pruning achieves optimal visual quality under a constrained inference budget. Furthermore, FastLightGen consistently outperforms all competing methods, establishing a new state-of-the-art in efficient video generation.
PaperID: 2035,   Poster  https://arxiv.org/pdf/2603.25131    
Authors: Yaowen Chang, Zhen Cao, Xu Zheng, Xiaoxin Mi, Zhen Dong
Title: Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation
Abstract: Panoramic semantic segmentation is pivotal for comprehensive 360° scene understanding in critical applications like autonomous driving and virtual reality. However, progress in this domain is constrained by two key challenges: the severe geometric distortions inherent in panoramic projections and the prohibitive cost of dense annotation. While Unsupervised Domain Adaptation (UDA) from label-rich pinhole-camera datasets offers a viable alternative, many real-world tasks impose a stricter source-free (SFUDA) constraint where source data is inaccessible for privacy or proprietary reasons. This constraint significantly amplifies the core problems of domain shift, leading to unreliable pseudo-labels and dramatic performance degradation, particularly for minority classes. To overcome these limitations, we propose the DAPASS framework. DAPASS introduces two synergistic modules to robustly transfer knowledge without source data. First, our Panoramic Confidence-Guided Denoising (PCGD) module generates high-fidelity, class-balanced pseudo-labels by enforcing perturbation consistency and incorporating neighborhood-level confidence to filter noise. Second, a Contextual Resolution Adversarial Module (CRAM) explicitly addresses scale variance and distortion by adversarially aligning fine-grained details from high-resolution crops with global semantics from low-resolution contexts. DAPASS achieves state-of-the-art performance on outdoor (Cityscapes-to-DensePASS) and indoor (Stanford2D3D) benchmarks, yielding 55.04% (+2.05%) and 70.38% (+1.54%) mIoU, respectively.
PaperID: 2036,   Poster  https://arxiv.org/pdf/2512.05597    
Authors: Ruihong Yin, Xuepeng Shi, Oleksandr Bailo, Marco Manfredi, Theo Gevers
Title: Fast SceneScript: Accurate and Efficient Structured Language Model via Multi-Token Prediction
Abstract: Recent perception-generalist approaches based on language models have achieved state-of-the-art results across diverse tasks, including 3D scene layout estimation, via unified architecture and interface. However, these approaches rely on autoregressive next-token prediction, which is inherently slow. In this work, we introduce Fast SceneScript, a novel structured language model for accurate and efficient 3D scene layout estimation. Our method employs multi-token prediction (MTP) to reduce the number of autoregressive iterations and significantly accelerate inference. While MTP improves speed, unreliable token predictions can significantly reduce accuracy. To filter out unreliable tokens, we adapt self-speculative decoding (SSD) for structural language models and introduce confidence-guided decoding (CGD) with an improved scoring mechanism for token reliability. Furthermore, we design a parameter-efficient mechanism that reduces the parameter overhead of MTP. Extensive experiments on the ASE and Structured3D benchmarks demonstrate that Fast SceneScript can generate up to 9 tokens per decoder inference step without compromising accuracy, while adding only 7.5% additional parameters.
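Filtering unreliable tokens from a multi-token prediction can be sketched as accepting a confident prefix. This is a generic illustration, not the paper's CGD scoring mechanism, which the abstract does not specify: the `threshold` value, the rule of always keeping the first token, and the name `accept_confident_prefix` are all assumptions.

```python
def accept_confident_prefix(tokens, confidences, threshold=0.9):
    """Keep the first token, then later tokens until one falls below the threshold.

    The first token is always kept (an assumption of this sketch) so decoding
    makes progress, matching ordinary next-token prediction in the worst case.
    """
    if len(tokens) != len(confidences) or not tokens:
        raise ValueError("need equally sized, non-empty inputs")
    accepted = 1
    for conf in confidences[1:]:
        if conf < threshold:
            break
        accepted += 1
    return tokens[:accepted]
```

Tokens after the first low-confidence position are discarded rather than skipped over, because later predictions were conditioned on the rejected one.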
PaperID: 2037,   Poster  https://arxiv.org/pdf/2512.17320    
Authors: Lu Wei, Yuta Nakashima, Noa Garcia
Title: EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories
Abstract: The widespread adoption of text-to-image (T2I) generation has raised concerns about privacy, bias, and copyright violations. Concept erasure techniques offer a promising solution by selectively removing undesired concepts from pre-trained models without requiring full retraining. However, these methods are often evaluated on a limited set of concepts, relying on overly simplistic and direct prompts. To test the boundaries of concept erasure techniques, and assess whether they truly remove targeted concepts from model representations, we introduce EMMA, a benchmark that evaluates five key dimensions of concept erasure over 12 metrics. EMMA goes beyond standard metrics like image quality and time efficiency, testing robustness under challenging conditions, including indirect descriptions, visually similar non-target concepts, and potential gender and ethnicity bias, providing a socially aware analysis of method behavior. Using EMMA, we analyze five concept erasure methods across five domains (objects, celebrities, art styles, NSFW, and copyright). Our results show that existing methods struggle with implicit prompts (i.e., generating the erased concept when it is indirectly referenced) and visually similar non-target concepts (i.e., failing to generate non-targeted concepts resembling the erased one), while some amplify gender and ethnicity bias compared to the original model.
PaperID: 2038,   Poster  https://arxiv.org/pdf/2603.29272    
Authors: Soomin Park, Eunseong Lee, Kwang Bin Lee, Sung-Hee Lee
Title: MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters
Abstract: We present MaskAdapt, a framework for flexible motion adaptation in physics-based humanoid control. The framework follows a two-stage residual learning paradigm. In the first stage, we train a mask-invariant base policy using stochastic body-part masking and a regularization term that enforces consistent action distributions across masking conditions. This yields a robust motion prior that remains stable under missing observations, anticipating later adaptation in those regions. In the second stage, a residual policy is trained atop the frozen base controller to modify only the targeted body parts while preserving the original behaviors elsewhere. We demonstrate the versatility of this design through two applications: (i) motion composition, where varying masks enable multi-part adaptation within a single sequence, and (ii) text-driven partial goal tracking, where designated body parts follow kinematic targets provided by a pre-trained text-conditioned autoregressive motion generator. Through experiments, MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work.
PaperID: 2039,   Poster  https://arxiv.org/pdf/2509.24897    
Authors: Yang Shi, Yuhao Dong, Yue Ding, Yuran Wang, Xuanyu Zhu, Sheng Zhou, Wenting Liu, Haochen Tian, rundong wang, Huanqian Wang, Zuyan Liu, Bohan Zeng, Ruizhe Chen, Qixun Wang, Zhuoran Zhang, Xinlong Chen, Chengzhuo Tong, bozhou li, Qiang Liu, Haotian Wang, Wenjing Yang, Yuanxing Zhang, Pengfei Wan, YiFan Zhang, Ziwei Liu
Title: RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
Abstract: The integration of visual understanding and generation into unified multimodal models represents a significant stride toward general-purpose AI. However, a fundamental question remains unanswered by existing benchmarks: does this architectural unification actually enable synergetic interaction between the constituent capabilities? Existing evaluation paradigms, which primarily assess understanding and generation in isolation, are insufficient for determining whether a unified model can leverage its understanding to enhance its generation, or use generative simulation to facilitate deeper comprehension. To address this critical gap, we introduce RealUnify, a benchmark specifically designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. It is structured around two core axes: 1) Understanding Enhances Generation, which requires reasoning (e.g., commonsense, logic) to guide image generation, and 2) Generation Enhances Understanding, which necessitates mental simulation or reconstruction (e.g., of transformed or disordered visual inputs) to solve reasoning tasks. A key contribution is our dual-evaluation protocol, which combines direct end-to-end assessment with a diagnostic stepwise evaluation that decomposes tasks into distinct understanding and generation phases. This protocol allows us to precisely discern whether performance bottlenecks stem from deficiencies in core abilities or from a failure to integrate them. Through large-scale evaluations of 12 leading unified models and 6 specialized baselines, we find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient. These results highlight the need for new training strategies and inductive biases to fully unlock the potential of unified modeling.
PaperID: 2040,   Poster  https://arxiv.org/pdf/2512.05198    
Authors: Rowan Bradbury, Elea Zhong
Title: Your Latent Mask is Wrong: Pixel-Equivalent Latent Compositing for Diffusion Models
Abstract: Linearly interpolating between VAE latents using a downsampled mask field remains a common heuristic for diffusion inpainting. However, this approach systematically violates a key principle: latent compositing must respect pixel equivalence; compositing latents must approximate compositing pixels. Because VAE latents capture global context rather than pixel-local structure, linear interpolation fails this requirement, producing seams, color shifts, and halos that diffusion subsequently amplifies into larger artifacts. We propose Pixel-Equivalent Latent Compositing (PELC) and instantiate it with DecFormer, a 7.7M-parameter transformer that predicts per-channel blend weights and a nonlinear residual to realize mask-consistent latent fusion. DecFormer is trained so that decoding after fusion matches pixel-space alpha compositing, is plug-compatible with existing diffusion pipelines, requires no backbone finetuning, and adds only 0.07% of FLUX.1-Dev’s parameters and 3.5% FLOP overhead. On the FLUX.1 family, DecFormer restores global color consistency, soft-mask support, sharp boundaries, and high-fidelity masking, reducing error metrics around edges by up to 53% over standard mask interpolation. Used as an inpainting prior, a lightweight LoRA on FLUX.1-Dev with DecFormer achieves fidelity comparable to FLUX.1-Fill, a fully finetuned inpainting model. While we focus on inpainting, PELC is a general recipe for pixel-equivalent latent editing (e.g., overlays, tone/relighting, warps), as we demonstrate on a complex color-correction task.
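The pixel-equivalence requirement can be made concrete with a toy example: blending latents and then decoding should match decoding and then blending pixels, and with a nonlinear decoder it generally does not. The elementwise `toy_decode` below is a hypothetical stand-in for a VAE decoder (a real decoder also mixes spatial context), and none of these names come from the paper.

```python
from typing import List


def alpha_composite(a: List[float], b: List[float], mask: List[float]) -> List[float]:
    """Standard alpha compositing: mask * a + (1 - mask) * b, elementwise."""
    return [m * x + (1.0 - m) * y for x, y, m in zip(a, b, mask)]


def toy_decode(z: List[float]) -> List[float]:
    """Hypothetical nonlinear decoder (squares each value); not a real VAE."""
    return [v * v for v in z]


latent_a, latent_b = [1.0, 2.0], [3.0, 0.0]
soft_mask = [0.5, 0.5]

# Heuristic: interpolate latents, then decode.
naive = toy_decode(alpha_composite(latent_a, latent_b, soft_mask))
# Target: decode each latent, then composite in pixel space.
target = alpha_composite(toy_decode(latent_a), toy_decode(latent_b), soft_mask)

print(naive)   # [4.0, 1.0]
print(target)  # [5.0, 2.0]
```

The gap between `naive` and `target` is the kind of mismatch a learned fusion module would be trained to close. In this toy setting a hard 0/1 mask makes the two agree, but only because the toy decoder is elementwise.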
PaperID: 2041,   Poster  https://arxiv.org/pdf/2506.01783    
Authors: Honglu Zhang, Zhiqin Fang, Ningning Zhao, Saihui Hou, Long Ma, Renwang Pei, Zhaofeng He
Title: Harnessing Chain-of-Thought Reasoning in Multimodal Large Language Models for Face Anti-Spoofing
Abstract: Face Anti-Spoofing (FAS) typically depends on a single visual modality when defending against presentation attacks such as print attacks, screen replays, and 3D masks, resulting in limited generalization across devices, environments, and attack types. Meanwhile, Multimodal Large Language Models (MLLMs) have recently achieved breakthroughs in image–text understanding and semantic reasoning, suggesting that integrating visual and linguistic co-inference into FAS can substantially improve both robustness and interpretability. However, the lack of a high-quality vision–language multimodal dataset has been a critical bottleneck. To address this, we introduce FaceCoT (Face Chain-of-Thought), the first large-scale Visual Question Answering (VQA) dataset tailored for FAS. FaceCoT covers 14 spoofing attack types and enriches model learning with high-quality CoT VQA annotations. Meanwhile, we develop a caption model refined via reinforcement learning to expand the dataset and enhance annotation quality. Furthermore, we introduce a CoT-Enhanced Progressive Learning (CEPL) strategy to better leverage the CoT data and boost model performance on FAS tasks. Extensive experiments demonstrate that models trained with FaceCoT and CEPL outperform state-of-the-art methods on multiple benchmark datasets.
PaperID: 2042,   Poster  https://arxiv.org/pdf/2603.20808    
Authors: Enguang Wang, Qiang Wang, Yuanchen Wu, Ke Yan, Xinbin Yuan, Shouhong Ding, Xialei Liu, Ming-Ming Cheng
Title: Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models
Abstract: While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost of their language-driven training on internal visual foundational competence remains unclear. In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that, compared to the initial visual features, the visual representation in the middle layers of the LLM exhibits degradation in both global function and patch structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective, where the model compromises its visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and propose Predictive Regularization (PRe) to force degraded intermediate features to predict initial visual features, thereby maintaining the inherent visual attributes of the MLLM's internal representations. Extensive experiments confirm that mitigating this visual degradation effectively boosts vision-language performance, underscoring the critical importance of fostering robust internal visual representations within MLLMs for comprehensive multimodal understanding.
PaperID: 2043,   Poster  https://arxiv.org/pdf/2511.17487    
Authors: Mark Endo, Serena Yeung
Title: Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Abstract: Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. Our initial findings reveal an interesting trend: LLM downscaling disproportionately affects visual capabilities, rather than abilities inherited from the LLM. We then examine whether this drop mainly reflects the expected decline in visual reasoning or a more fundamental loss of perceptual abilities. Isolating the effect of LLM downscaling on perception, we find performance still drops sharply, often matching or exceeding the impact on reasoning. To address this bottleneck, we introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. With these extracted visual details, we then apply step-by-step reasoning to generate answers. Together, these components form our Extract+Think approach, setting a new standard for efficiency and performance in this space.
PaperID: 2044,   Poster  https://arxiv.org/pdf/2603.00512    
Authors: Wang Chen, Yuhui zeng, Yongdong Luo, Tianyu Xie, Luojun Lin, Jiayi Ji, Yan Zhang, Xiawu Zheng
Title: Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding
Abstract: Frame selection is crucial due to high frame redundancy and limited context windows when applying Large Vision-Language Models (LVLMs) to long videos. Current methods typically select frames with high relevance to a given query, resulting in a disjointed set of frames that disregards the narrative structure of the video. In this paper, we introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts, the pivotal moments of narrative change that are essential to comprehending the holistic storyline of the video. However, direct detection of abrupt changes in the query-frame similarity signal is often unreliable due to high-frequency noise arising from model uncertainty and transient visual variations. To address this, we leverage the wavelet transform, which provides an ideal solution through its multi-resolution analysis in both time and frequency domains. By applying this transform, we decompose the noisy signal into multiple scales and extract a clean semantic change signal from the coarsest scale. We identify the local extrema of this signal as semantic boundaries, which segment the video into coherent clips. Building on this, WFS-SB comprises a two-stage strategy: first, adaptively allocating a frame budget to each clip based on a composite importance score; and second, within each clip, employing the Maximal Marginal Relevance approach to select a diverse yet relevant set of frames. Extensive experiments show that WFS-SB significantly boosts LVLM performance, e.g., improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench, consistently outperforming state-of-the-art methods.
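The within-clip selection step names Maximal Marginal Relevance (MMR), which trades off relevance against redundancy. The sketch below is a generic greedy MMR in plain Python; the trade-off weight `lam`, the toy relevance scores, and the similarity matrix are illustrative assumptions, and the wavelet and budget-allocation stages are not covered.

```python
from typing import List


def mmr_select(relevance: List[float], similarity: List[List[float]],
               k: int, lam: float = 0.5) -> List[int]:
    """Greedy MMR: repeatedly pick the item maximizing
    lam * relevance - (1 - lam) * (max similarity to already selected items)."""
    selected: List[int] = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Frames 0 and 1 are near-duplicates; frame 2 is less relevant but distinct.
relevance = [0.9, 0.85, 0.4]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(mmr_select(relevance, similarity, k=2))  # [0, 2]
```

With pure relevance ranking the two near-duplicate frames would both be chosen; the redundancy penalty is what lets the distinct frame displace the second duplicate.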
PaperID: 2045,   Poster  https://arxiv.org/pdf/2511.19394    
Authors: Rachit Saluja, Asli Cihangir, Ruining Deng, Johannes C. Paetzold, Fengbei Liu, Mert Sabuncu
Title: BackSplit: The Importance of Sub-dividing the Background in Biomedical Lesion Segmentation
Abstract: Segmenting small lesions in medical images remains notoriously difficult. Most prior work tackles this challenge by designing better architectures, loss functions, or data augmentation schemes, or by collecting more labeled data. We take a different view, arguing that part of the problem lies in how the background is modeled. Common lesion segmentation collapses all non-lesion pixels into a single “background” class, ignoring the rich anatomical context in which lesions appear. In reality, the background is highly heterogeneous, composed of tissues, organs, and other structures that can now be labeled manually or inferred automatically using existing segmentation models. In this paper, we argue that training with fine-grained labels that sub-divide the background class, which we call BackSplit, is a simple yet powerful paradigm that can offer a significant performance boost without increasing inference costs. From an information-theoretic standpoint, we prove that BackSplit increases the expected Fisher Information relative to conventional binary training, leading to tighter asymptotic bounds and more stable optimization. With extensive experiments across multiple datasets and architectures, we empirically show that BackSplit consistently boosts small-lesion segmentation performance, even when auxiliary labels are generated automatically using pretrained segmentation models. Additionally, we demonstrate that auxiliary labels derived from interactive segmentation frameworks exhibit the same beneficial effect, underscoring BackSplit's robustness, simplicity, and broad applicability.
PaperID: 2046,   Poster  https://arxiv.org/pdf/2512.00903    
Authors: Chaojun Ni, Chen Cheng, Xiaofeng Wang, Zheng Zhu, Wenzhao Zheng, Boyuan Wang, Tianrun Chen, Guosheng Zhao, Haoyun Li, Zhehao Dong, Qiang Zhang, Yun Ye, Yang Wang, Guan Huang, Wenjun Mei
Title: SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead
Abstract: Vision–Language–Action (VLA) models built on pretrained Vision–Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that incrementally extracts 4D features from 2D images. Then, to enhance the VLM’s ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that randomly masks 4D inputs to the VLM and trains the VLA to reconstruct the masked features. This self-reconstruction objective helps learn effective 4D representations, allowing the 4D branch to be dropped at inference with minimal performance loss. Extensive experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7× larger. On edge devices, SwiftVLA achieves comparable performance while being 18× faster than π0 and reducing the memory footprint by 12×.
PaperID: 2047,   Poster  https://arxiv.org/pdf/2511.15186    
Authors: Geon Choi, Hangyul Yoon, Hyunju Shin, Hyunki Park, Sang Seo, Eunho Yang, Edward Choi
Title: Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
Abstract: The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on long, detailed expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce a new paradigm: instruction-guided lesion segmentation (ILS), which is designed to segment diverse lesion types based on simple, user-friendly instructions. Under this paradigm, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from chest X-ray images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high segmentation and textual accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding.
PaperID: 2048,   Poster  https://arxiv.org/pdf/2602.21599    
Authors: Weisheng Xu, Qiwei Wu, Jiaxi Zhang, Jing Tan, Yangfan Li, Yuetong Fang, Jiaqi Xiong, Kai Wu, Rong OU, Renjing Xu
Title: Iterative Closed-Loop Motion Synthesis for Scaling the Capabilities of Humanoid Control
Abstract: Physics-based humanoid control relies on training with motion datasets that have diverse data distributions. However, the fixed difficulty distribution of datasets limits the performance ceiling of the trained control policies. Additionally, acquiring high-quality data through professional motion capture systems is constrained by cost, making it difficult to scale. To address these issues, we propose a closed-loop framework for automated motion data generation and iteration. It can generate high-quality motion data with rich action semantics, including martial arts, dance, combat, sports, gymnastics, and more. Furthermore, our framework enables difficulty iteration of policies and data through physical metrics and objective evaluations, allowing the trained tracker to break through its original difficulty limits. On the PHC single-primitive tracker, using only approximately 1/10 of the AMASS dataset size, the average failure rate on the test set (2201 clips) is reduced by 45% compared to the baseline. Finally, we conduct comprehensive ablation and comparative experiments to highlight the rationality and advantages of our framework.
PaperID: 2049,   Poster  https://arxiv.org/pdf/2603.22094    
Authors: Xingyu Zhu, Beier Zhu, Shuo Wang, Junfeng Fang, Kesen Zhao, Hanwang Zhang, Xiangnan He
Title: Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models
Abstract: As vision-language models (VLMs) are increasingly deployed in open-world scenarios, they can be easily induced by visual jailbreak attacks to generate harmful content, posing serious risks to model safety and trustworthy usage. Recent activation steering methods inject directional vectors into model activations during inference to induce refusal behaviors and have demonstrated effectiveness. However, a steering vector may both enhance refusal ability and cause over-refusal, thereby degrading model performance on benign inputs. Moreover, due to the lack of theoretical interpretability, these methods still suffer from limited robustness and effectiveness. To better balance safety and utility, we propose NullSteer, a null-space projected activation defense framework. Our method constructs refusal directions within model activations through a linear transformation: it maintains zero perturbation within the benign subspace while dynamically inducing refusal along potentially harmful directions, thereby theoretically achieving safety enhancement without impairing the model’s general capabilities. Extensive experiments show that NullSteer significantly reduces harmful outputs under various jailbreak attacks (average ASR reduction over 15% on MiniGPT-4) while maintaining comparable performance to the original model on general benchmarks.
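The property "zero perturbation within the benign subspace" can be illustrated with a small linear-algebra sketch. This is only one way to obtain that property and is not taken from the paper: the orthonormal `benign_basis`, the `probe` vector, the `refusal_dir` vector, and the rank-one form of the update are all hypothetical choices made for illustration.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def project_out(h: Vector, basis: List[Vector]) -> Vector:
    """Remove the components of h that lie along an orthonormal basis.
    The result lies in the orthogonal complement (null space) of that basis."""
    out = list(h)
    for b in basis:
        coeff = dot(h, b)
        out = [o - coeff * bi for o, bi in zip(out, b)]
    return out


def steer(h: Vector, basis: List[Vector], probe: Vector,
          refusal_dir: Vector, strength: float = 1.0) -> Vector:
    """Linear steering that only reacts to the off-subspace part of h.
    Hypothetical rank-one update: h + strength * <probe, P h> * refusal_dir."""
    gate = dot(probe, project_out(h, basis))
    return [hi + strength * gate * r for hi, r in zip(h, refusal_dir)]


benign_basis = [[1.0, 0.0, 0.0]]  # assumed orthonormal basis of benign activations
probe = [0.0, 0.0, 1.0]           # assumed detector of the harmful direction
refusal_dir = [0.0, 1.0, 0.0]     # assumed refusal direction

print(steer([2.0, 0.0, 0.0], benign_basis, probe, refusal_dir))  # [2.0, 0.0, 0.0]
print(steer([1.0, 0.0, 3.0], benign_basis, probe, refusal_dir))  # [1.0, 3.0, 3.0]
```

An activation lying entirely in the benign subspace is projected to zero before the update is computed, so it passes through unchanged, while an activation with an off-subspace component is pushed along the refusal direction in proportion to that component.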
PaperID: 2050,   Poster  https://arxiv.org/pdf/2604.08645    
Authors: Makanjuola Adekunmi Ogunleye, Eman Abdelrahman, Ismini Lourentzou
Title: 3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
Abstract: Large Language Models are increasingly integrated as the cognitive core of 3D embodied agents to enable complex environmental reasoning. However, these agents tend to inherit the critical flaw of hallucination, often failing to ground their responses in their 3D view. While Visual Contrastive Decoding (VCD) is a powerful training-free method for mitigating hallucinations in 2D image-based models, it has not been adapted to the complex 3D embodied environment. In this paper, we bridge this gap by introducing a VCD framework for 3D embodied agents. Our method operates at inference time by generating a "negative" 3D context, not by blurring an image, but by applying novel distortions directly to a 3D scene graph, such as swapping object category labels or noising positional coordinates. We evaluate our approach on standard evaluation benchmarks and find that it consistently outperforms existing models. For example, in the random category of 3D-POPE, our 3D-VCD method reduces the Yes-rate from 99.9% to 75.1% while simultaneously increasing precision from 50.0% to 62.2%. These results demonstrate that our training-free approach effectively curbs hallucination, yielding 3D agents that are significantly more reliable and grounded.
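The contrast step behind visual contrastive decoding can be sketched in a few lines. The formula below follows the commonly used VCD form, (1 + alpha) * original - alpha * distorted on the logits; whether 3D-VCD uses exactly this form is an assumption, and the logit values and token order are made up for illustration.

```python
from typing import List


def contrastive_logits(original: List[float], distorted: List[float],
                       alpha: float = 1.0) -> List[float]:
    """Amplify what the faithful context supports and subtract what the model
    would say anyway under the distorted ("negative") context."""
    return [(1.0 + alpha) * o - alpha * d for o, d in zip(original, distorted)]


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


tokens = ["yes", "no"]
original = [2.0, 1.0]   # logits given the true scene graph (illustrative)
distorted = [2.5, 0.0]  # logits given the distorted scene graph (illustrative)

adjusted = contrastive_logits(original, distorted, alpha=1.0)
print(adjusted)                  # [1.5, 2.0]
print(tokens[argmax(original)])  # yes
print(tokens[argmax(adjusted)])  # no
```

In this toy case the model prefers "yes" even more strongly when the scene is corrupted, which suggests the answer comes from a language prior and not from the scene; the contrast penalizes it and the decision flips.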
PaperID: 2051,   Poster  https://arxiv.org/pdf/2512.19687    
Authors: Apoorv Vyas, Heng-Jui Chang, Cheng-Fu Yang, Po-Yao Huang, Luya Gao, Julius Richter, Sanyuan Chen, Matthew Le, Piotr Dollár, Christoph Feichtenhofer, Ann Lee, Wei-Ning Hsu
Title: Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning
Abstract: We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and natively supports joint embeddings across audio–video, audio–text, and video–text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio–video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects, avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. Our models and code will be available.
PaperID: 2052,   Poster  https://arxiv.org/pdf/2511.20158    
Authors: Ziqi Wang, Chang Che, Qi Wang, Hui Ma, Zenglin Shi, Cees G. M. Snoek, Meng Wang
Title: Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs
Abstract: While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on their focus on safety or task performance, and selects the focused ones to preserve from a balanced perspective. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA maintains safety and mitigates forgetting better than existing baselines.
PaperID: 2053,   Poster  https://arxiv.org/pdf/2603.25791    
Authors: Zikai Wang, Zhilu Zhang, Yiqing Wang, Hui Li, Wangmeng Zuo
Title: ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions
Abstract: Existing hand-object interaction (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods for articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical implausibility of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize the object's metric scale and pose for grounding its normalized mesh in world space. Furthermore, we propose a Multimodal Large Language Model (MLLM) guided hand-object alignment method, utilizing contact reasoning information as constraints for hand-object mesh composition optimization. To facilitate a comprehensive evaluation, we also contribute two new datasets, ArtHOI-RGBD and ArtHOI-Wild. Extensive experiments validate the robustness and effectiveness of ArtHOI across diverse objects and interactions. The code and datasets will be made publicly available.
PaperID: 2054,   Poster  https://arxiv.org/pdf/2602.20943    
Authors: Kaiyuan Tan, Yingying Shen, Ziyue Zhu, Mingfei Tu, HAOHUI ZHU, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye
Title: UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling
Abstract: Dynamic driving scene reconstruction is critical for autonomous driving simulation and closed-loop learning. While recent feed-forward methods have shown promise for 3D reconstruction, they struggle with long-range driving sequences due to quadratic complexity in sequence length and challenges in modeling dynamic objects over extended durations. We propose UFO, a novel recurrent paradigm that combines the benefits of optimization-based and feed-forward methods for efficient long-range 4D reconstruction. Our approach maintains a 4D scene representation that is iteratively refined as new observations arrive, using a visibility-based filtering mechanism to select informative scene tokens and enable efficient processing of long sequences. For dynamic objects, we introduce an object pose-guided modeling approach that supports accurate long-range motion capture. Experiments on the Waymo Open Dataset demonstrate that our method significantly outperforms both per-scene optimization and existing feed-forward methods across various sequence lengths. Notably, our approach can reconstruct 16-second driving logs within 0.5 seconds while maintaining superior visual quality and geometric accuracy.
PaperID: 2055,   Poster  https://arxiv.org/pdf/2505.20967    
Authors: Jiarui Zhang, Zhihao Li, Chong Wang, Bihan Wen
Title: RF4D: Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes
Abstract: Neural fields (NFs) have achieved remarkable success in scene reconstruction and novel view synthesis. However, existing NF approaches that rely on RGB or LiDAR inputs often struggle under adverse weather conditions, limiting their robustness in real-world outdoor environments such as autonomous driving. In contrast, millimeter-wave radar is inherently resilient to environmental variations, yet its integration with NFs remains largely underexplored. Moreover, outdoor driving scenes frequently involve dynamic objects, making spatiotemporal modeling crucial for temporally consistent novel view synthesis. To address these challenges, we present RF4D, a radar-based neural field framework tailored for novel view synthesis in outdoor dynamic scenes. RF4D explicitly incorporates temporal information into its representation, enabling more accurate modeling of object motion. A dedicated scene flow module further predicts temporal offsets between adjacent frames, enforcing temporal occupancy coherence during dynamic scene reconstruction. Moreover, we propose a radar-specific power rendering formulation grounded in radar sensing physics, improving both synthesis accuracy and interpretability. Extensive experiments on public radar datasets demonstrate that RF4D substantially outperforms existing methods in radar measurement synthesis and occupancy estimation accuracy, with particularly strong gains in dynamic outdoor environments.
PaperID: 2056,   Poster  https://arxiv.org/pdf/2602.22932    
Authors: Wenhui Tan, Xiaoyi Yu, Jiaze Li, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Ruihua Song, Jian Luan
Title: MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding
Abstract: Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint-Evolution (MiSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MiSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question about a video. Specifically, MiSJoE first reasons out several queries, which describe diverse visual perspectives relevant to the question. Then, these queries interact with a frozen CLIP model to produce a query–frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames, which are then fed into the MLLM for answer generation. Both the MLLM and sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query-reasoning, frame-sampling, and key-frame understanding. A new long-video QA dataset containing 2.8k videos with 7k question–answer pairs is collected to support the training process. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MiSJoE achieves an 8.0% accuracy gain over the base MLLM, and 1.1% higher accuracy than the strongest baseline method.
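The query-to-frame pipeline described above (similarity matrix, then sampling weights, then a compact frame set) can be sketched with toy vectors. In the paper the embeddings come from a frozen CLIP model and the sampler is learned with reinforcement learning; here the 2-D embeddings are invented and the "sampler" is a hypothetical fixed rule (best-matching query per frame, then top-k), purely for illustration.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def similarity_matrix(queries: List[Vector], frames: List[Vector]) -> List[List[float]]:
    """Rows are queries, columns are frames."""
    return [[cosine(q, f) for f in frames] for q in queries]


def pick_frames(sim: List[List[float]], k: int) -> List[int]:
    """Stand-in sampler: score each frame by its best-matching query,
    keep the k highest-scoring frames, and return them in temporal order."""
    scores = [max(row[j] for row in sim) for j in range(len(sim[0]))]
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


queries = [[1.0, 0.0], [0.0, 1.0]]  # toy query embeddings
frames = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]  # toy frame embeddings

sim = similarity_matrix(queries, frames)
print(pick_frames(sim, k=2))  # [0, 2]
```

Taking the maximum over queries lets each query "vote" for the frames it matches best, so frames relevant to different visual perspectives can both survive the cut.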
PaperID: 2057,   Poster  https://arxiv.org/pdf/2512.12378    
Authors: Fan Junqiao, Yunjiao Zhou, Yizhuo Yang, Xinyuan Cui, Jiarui Zhang, Lihua Xie, Jianfei Yang, Chris Xiaoxuan Lu, Fangqiang Ding
Title: M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction
Abstract: Human mesh reconstruction (HMR) provides direct insights into body-environment interaction, which enables various immersive applications. While existing large-scale HMR datasets rely heavily on line-of-sight RGB input, vision-based sensing is limited by occlusion, lighting variation, and privacy concerns. To overcome these limitations, recent efforts have explored radio-frequency (RF) mmWave radar for privacy-preserving indoor human sensing. However, current radar datasets are constrained by sparse skeleton labels, limited scale, and simple in-place actions. To advance the HMR research community, we introduce M4Human, the largest multimodal benchmark to date (661K frames, 9× the prior largest), featuring high-resolution mmWave radar, RGB, and depth data. M4Human provides both raw radar tensors (RT) and processed radar point clouds (RPC) to enable research across different levels of RF signal granularity. M4Human includes high-quality motion capture (MoCap) annotations with 3D meshes and global trajectories, and spans 20 subjects and 50 diverse actions, including in-place, sit-in-place, and free-space sports or rehabilitation movements. We establish benchmarks on both RT and RPC modalities, as well as multimodal fusion with RGB-D modalities. Extensive results highlight the significance of M4Human for radar-based human modeling while revealing persistent challenges under fast, unconstrained motion. The dataset and code will be released after the paper's publication.
PaperID: 2058,   Poster  https://arxiv.org/pdf/2512.06251    
Authors: Fangzhou Lin, Yuping Wang, Yuliang Guo, Zixun Huang, Xinyu Huang, Haichong Zhang, Kazunori Yamada, Zhengzhong Tu, Liu Ren, Ziming Zhang
Title: NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks
Abstract: Partially Supervised Multi-Task Learning (PS-MTL) aims to leverage knowledge across tasks when annotations are incomplete. Existing approaches, however, have largely focused on the simpler setting of homogeneous, dense prediction tasks, leaving the more realistic challenge of learning from structurally diverse tasks unexplored. To this end, we introduce NexusFlow, a novel, lightweight, and plug-and-play framework effective in both settings. NexusFlow introduces a set of surrogate networks with invertible coupling layers to align the latent feature distributions of tasks, creating a unified representation that enables effective knowledge transfer. The coupling layers are bijective, preserving information while mapping features into a shared canonical space. This invertibility avoids representational collapse and enables alignment across structurally different tasks without reducing expressive capacity. We first evaluate NexusFlow on the core challenge of domain-partitioned autonomous driving, where dense map reconstruction and sparse multi-object tracking are supervised in different geographic regions, creating both structural disparity and a strong domain gap. NexusFlow sets a new state-of-the-art result on nuScenes, outperforming strong partially supervised baselines. To demonstrate generality, we further test NexusFlow on NYUv2 using three homogeneous dense prediction tasks (segmentation, depth, and surface normals) as a representative N-task PS-MTL scenario. NexusFlow yields consistent gains across all tasks, confirming its broad applicability. Our code will be released upon acceptance.
PaperID: 2059,   Poster  https://arxiv.org/pdf/2503.23348    
Authors: Jiude Wei, Yuxuan Li, Cewu Lu, Jianhua Sun
Title: Physically Ground Commonsense Knowledge for Articulated Object Manipulation with Analytic Concepts
Abstract: We humans rely on a wide range of commonsense knowledge to interact with a vast number and variety of objects in the physical world. Likewise, such commonsense knowledge is crucial for robots to develop generalized object manipulation skills. While recent advancements in Multimodal Large Language Models (MLLMs) have showcased their impressive capabilities in acquiring commonsense knowledge and conducting commonsense reasoning, effectively grounding this semantic-level knowledge produced by MLLMs in the physical world to thoroughly guide robots in generalized articulated object manipulation remains a challenge that has not been sufficiently addressed. To this end, we introduce analytic concepts, procedurally defined upon mathematical symbolism that can be directly computed and simulated by machines. By leveraging the analytic concepts as a bridge between the semantic-level knowledge inferred by MLLMs and the physical world where real robots operate, we can derive knowledge of object structure and functionality with physics-informed representations, and then use the physically grounded knowledge to instruct robot control policies for generalized and accurate articulated object manipulation. Extensive experiments in both the real world and simulation demonstrate the superiority of our approach. Please refer to the Supplementary Material for more details; our code will be made publicly available.
PaperID: 2060,   Poster  https://arxiv.org/pdf/2512.14140    
Authors: Han Zou, Yan Zhang, Ruiqi Yu, Cong Xie, huangjie huangjie, Zhan Zhenpeng
Title: SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing
Abstract: Sketch editing is central to digital illustration, yet existing image editing systems struggle to preserve the sparse, style-sensitive structure of line art while supporting both high-level semantic changes and precise local redrawing. We present SketchAssist, an interactive sketch drawing assistant that accelerates creation by unifying instruction-guided global edits with line-guided region redrawing, while keeping unrelated regions and overall composition intact. To enable this assistant at scale, we introduce a controllable data generation pipeline that (i) constructs attribute-addition sequences from attribute-free base sketches, (ii) forms multi-step edit chains via cross-sequence sampling, and (iii) expands stylistic coverage with a style-preserving attribute-removal model applied to diverse sketches. Building on this data, SketchAssist employs a unified sketch editing framework with minimal changes to DiT-based editors. We repurpose the RGB channels to encode the inputs, enabling seamless switching between instruction-guided edits and line-guided redrawing within a single input interface. To further specialize behavior across modes, we integrate a task-guided mixture-of-experts into LoRA layers, routing by text and visual cues to improve semantic controllability, structural fidelity, and style preservation. Extensive experiments show state-of-the-art results on both tasks, with superior instruction adherence and style/structure preservation compared to recent baselines. Together, our dataset and SketchAssist provide a practical, controllable assistant for sketch creation and revision.
PaperID: 2061,   Poster  https://arxiv.org/pdf/2604.16540    
Authors: Weijie Wang, Songlong Xing, Zhengyu Zhao, Nicu Sebe, Bruno Lepri
Title: PoInit-of-View: Poisoning Initialization of Views Transfers Across Multiple 3D Reconstruction Systems
Abstract: Poisoning input views of 3D reconstruction systems has been recently studied. However, we identify that existing studies simply backpropagate adversarial gradients through the 3D reconstruction pipeline as a whole, without uncovering the new vulnerability rooted in specific modules of the 3D reconstruction pipeline. In this paper, we argue that the structure-from-motion (SfM) initialization, as the geometric core of many widely used reconstruction systems, can be targeted to achieve strong poisoning effects. To this end, we propose PoInit-of-View, which optimizes adversarial perturbations to intentionally introduce cross-view gradient inconsistencies at projections of corresponding 3D points. These inconsistencies disrupt keypoint detection and feature matching, thereby corrupting pose estimation and triangulation within SfM, eventually resulting in low-quality rendered views. We also provide a theoretical analysis that connects cross-view inconsistency to correspondence collapse. Experimental results demonstrate the effectiveness of PoInit-of-View on diverse 3D reconstruction systems and datasets, surpassing the single-view-based method by 25.1% in PSNR and 16.5% in SSIM in black-box transfer settings, such as 3DGS to NeRF.
PaperID: 2062,   Poster  https://arxiv.org/pdf/2604.12812    
Authors: Hao Yan, Yuliang Liu, Xingchen Liu, Yuyi Zhang, Minghui Liao, Jihao Wu, Wei Chen, Xiang Bai
Title: DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
Abstract: Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured "Analysis, Localization and Reasoning" workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization which jointly optimizes for both evidence localization and answer accuracy. Additionally, we introduce an Evidence-Guided Resolution Allocation strategy to mitigate the memory constraints of training on multi-page documents. Extensive experiments demonstrate that DocSeeker achieves superior performance on both in-domain and out-of-domain tasks. We show it robustly generalizes from short-page training to ultra-long documents and is naturally synergistic with visual Retrieval-Augmented Generation systems, serving as an ideal foundation for their implementation.
PaperID: 2063,   Poster  https://arxiv.org/pdf/2508.02034    
Authors: Ziling Wang, Shuya Yang, Jialin Lu, Ka-Ho Chow
Title: Protego: User-Centric Pose-Invariant Privacy Protection Against Face Recognition-Induced Digital Footprint Exposure
Abstract: Face recognition (FR) technologies are increasingly used to power large-scale image retrieval systems, raising serious privacy concerns. Services like Clearview AI and PimEyes allow anyone to upload a facial photo and retrieve a large amount of online content associated with that person. This not only enables identity inference but also exposes their digital footprint, such as social media activity, private photos, and news reports, often without their consent. In response to this emerging threat, we propose Protego, a user-centric privacy protection method that safeguards facial images from such retrieval-based privacy intrusions. Protego encapsulates a user’s 3D facial signatures into a pose-invariant 2D representation, which is dynamically deformed into a natural-looking 3D mask tailored to the pose and expression of any facial image of the user, and applied prior to online sharing. Motivated by a critical limitation of existing methods, Protego amplifies the sensitivity of FR models so that protected images cannot be matched even among themselves. Experiments show that Protego significantly reduces retrieval accuracy across a wide range of black-box FR models and performs at least 2× better than existing methods. It also offers unprecedented visual coherence, particularly in video settings where consistency and natural appearance are essential. Overall, Protego contributes to the fight against the misuse of FR for mass surveillance and identity tracing.
PaperID: 2064,   Poster  https://arxiv.org/pdf/2603.24558    
Authors: Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan
Title: LensWalk: Agentic Video Understanding by Planning How You See in Videos
Abstract: The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to control its own visual observation actively. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model-based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent's evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.
PaperID: 2065,   Poster  https://arxiv.org/pdf/2603.13366    
Authors: Zhongxing Xu, Zhonghua Wang, Zhe Qian, Dachuan Shi, feilong tang, Ming Hu, Shiyan Su, Xiaocheng Zou, Wei Feng, Dwarikanath Mahapatra, Yifan Peng, Mingquan Lin, Zongyuan Ge
Title: Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding
Abstract: Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.
PaperID: 2066,   Poster  https://arxiv.org/pdf/2601.07060    
Authors: Yuanzhe Liu, Jingyuan Zhu, Yuchen Mo, Gen Li, Xu Cao, Jin Jin, Yifan Shen, Zhengyuan Li, Tianjiao Yu, Wenzhen Yuan, Fangqiang Ding, Ismini Lourentzou
Title: PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
Abstract: Recent advancements in vision-language-action (VLA) models have shown promise in robotic manipulation, yet they continue to struggle with long-horizon, multi-step tasks. Existing methods lack internal reasoning mechanisms that can identify task-relevant interaction cues or track progress within a subtask, leading to critical execution errors such as repeated actions, missed steps, and premature termination. To address these challenges, we introduce PALM, a VLA framework that structures policy learning around interaction-centric affordance reasoning and subtask progress cues. PALM distills complementary affordance representations that capture object relevance, contact geometry, spatial placements, and motion dynamics, and serve as task-relevant anchors for visuomotor control. To further stabilize long-horizon execution, PALM predicts continuous within-subtask progress, enabling seamless subtask transitions. Across extensive simulation and real-world experiments, PALM consistently outperforms baselines, achieving a 91.8% success rate on LIBERO-LONG, a 12.5% improvement in average length on CALVIN ABC→D, and a 2× improvement over real-world baselines across three long-horizon generalization settings.
PaperID: 2067,   Poster  https://arxiv.org/pdf/2603.19961    
Authors: Nassim ALI OUSALAH, Peyman Abendansari, Vincent Gaudillière, Emmanuel Koumandakis, Anis Kacem, Enjie Ghorbel, Djamila Aouada
Title: Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation
Abstract: In this paper, we address the problem of 6-DoF object pose estimation from a single RGB image. Indirect methods that typically predict intermediate 2D keypoints, followed by a Perspective-n-Point solver, have shown great performance. Direct approaches, which regress the pose in an end-to-end manner, are usually computationally more efficient but less accurate. However, direct heads rely on globally pooled features, ignoring spatial second-order statistics despite their informativeness in pose prediction. They also predict, in most cases, discontinuous pose representations that lack robustness. Herein, we therefore propose a covariance-pooled representation that encodes convolutional feature distributions as a symmetric positive definite (SPD) matrix. Moreover, we propose a novel pose encoding in the form of an SPD matrix via its Cholesky decomposition. Pose is then regressed in an end-to-end manner with a manifold-aware network head, taking into account the Riemannian geometry of SPD matrices. Experiments and ablations consistently demonstrate the relevance of second-order pooling and continuous representations for direct pose regression, including under partial occlusion.
PaperID: 2068,   Poster  https://arxiv.org/pdf/2603.21660    
Authors: meilin liu, Jiaying Wang, Jing Shan
Title: OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging
Abstract: Federated learning (FL) has become a promising paradigm for collaborative medical image analysis, yet existing frameworks remain tightly coupled to task-specific backbones and are fragile under heterogeneous imaging modalities. Such constraints hinder real-world deployment, where institutions vary widely in modality distributions and must support diverse downstream tasks. To address this limitation, we propose OmniFM, a modality- and task-agnostic FL framework that unifies training across classification, segmentation, super-resolution, visual question answering, and multimodal fusion without re-engineering the optimization pipeline. OmniFM builds on a key frequency-domain insight: low-frequency spectral components exhibit strong cross-modality consistency and encode modality-invariant anatomical structures. Accordingly, OmniFM integrates (i) Global Spectral Knowledge Retrieval to inject global frequency priors, (ii) Embedding-wise Cross-Attention Fusion to align representations, and (iii) Prefix–Suffix Spectral Prompting to jointly condition global and personalized cues, together regularized by a Spectral-Proximal Alignment objective that stabilizes aggregation. Experiments on real-world datasets show that OmniFM consistently surpasses state-of-the-art FL baselines across intra- and cross-modality heterogeneity, achieving superior results under both fine-tuning and training-from-scratch setups.
PaperID: 2069,   Poster  https://arxiv.org/pdf/2507.23685    
Authors: Zihan Cheng, Liangtai Zhou, Dian Chen, Ni Tang, Xiaotong Luo, Yuan Xie, Yanyun Qu
Title: UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration
Abstract: All-in-One Image Restoration (AiOIR) has emerged as a promising yet challenging research direction. To address the core challenges of diverse degradation modeling and detail preservation, we propose UniLDiff, a unified framework enhanced with degradation- and detail-aware mechanisms, unlocking the power of diffusion priors for robust image restoration. Specifically, we introduce a Degradation-Aware Feature Fusion (DAFF) to dynamically inject low-quality features into each denoising step via decoupled fusion and adaptive modulation, enabling implicit modeling of diverse and compound degradations. Furthermore, we design a Detail-Aware Expert Module (DAEM) in the decoder to enhance texture and fine-structure recovery through expert routing. Extensive experiments across multi-task and mixed degradation settings demonstrate that our method consistently achieves state-of-the-art performance, highlighting the practical potential of diffusion priors for unified image restoration. Our code will be released.
PaperID: 2070,   Poster  https://arxiv.org/pdf/2603.21138    
Authors: Wenjin Hou, Xiaoxiao Sun, Hehe Fan
Title: Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues
Abstract: Recent advances in zero-shot learning (ZSL) have demonstrated the potential of generative models. Typically, generative ZSL synthesizes visual features conditioned on semantic prototypes to model the data distribution of unseen classes, followed by training a classifier on the synthesized data. However, the synthesized features often remain task-agnostic, leading to degraded performance. Moreover, inferring a faithful distribution from semantic prototypes alone is insufficient for classes that are semantically similar but visually distinct. To address these issues and advance ZSL, we propose RLVC, an outcome-reward reinforcement learning (RL) framework with visual cues for generative ZSL. At its core, RL empowers the generative model to self-evolve, implicitly enhancing its generation capability. In particular, RLVC updates the generative model using an outcome-based reward, encouraging the synthesis of task-relevant features. Furthermore, we introduce class-wise visual cues that (i) align synthesized features with visual prototypes and (ii) stabilize the RL training updates. For the training process, we present a novel cold-start strategy. Comprehensive experiments and analyses on three prevalent ZSL benchmarks demonstrate that RLVC achieves state-of-the-art results with a 4.7% gain. All code, models, and complete features will be open-sourced upon publication to accelerate future research.
PaperID: 2071,   Poster  https://arxiv.org/pdf/2603.25053    
Authors: Liyuan Zhu, Manjunath Narayana, Michal Stary, Will Hutchcroft, Gordon Wetzstein, Iro Armeni
Title: GaussFusion: Improving 3D Reconstruction in the Wild with Geometry-Informed Video Generator
Abstract: We present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generalization. GaussFusion achieves state-of-the-art performance on novel-view synthesis benchmarks, and an efficient variant runs in real time at 21 FPS while maintaining similar performance, enabling interactive 3D applications. We plan to release our code and model.
PaperID: 2072,   Poster  https://arxiv.org/pdf/2603.08800    
Authors: Junyuan Mao, Qiankun Li, Linghao Meng, Zhicheng He, Xinliang Zhou, Kun Wang, Yang Liu, Yueming Jin
Title: Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
Abstract: Recent advances in multimodal large language models largely rely on CLIP-based visual encoders, which emphasize global semantic alignment but struggle with fine-grained visual understanding. In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. Granulon introduces a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level according to the semantic scope of the textual input, and an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens. This design enables unified "pixel-to-fine-to-coarse" reasoning within a single forward pass. Extensive and interpretable experiments demonstrate that Granulon improves accuracy by 30% and reduces hallucination by 20%, outperforming all visual encoders under identical settings. Code is available in the Supplementary Material.
PaperID: 2073,   Poster  https://arxiv.org/pdf/2507.16861    
Authors: Xiang Li, Zhangchi Hu, Xu Xiao, Bin Kong
Title: Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection
Abstract: Integrating LiDAR and camera inputs into a unified Bird’s-Eye-View (BEV) representation is crucial for enhancing 3D perception capabilities of autonomous vehicles. However, existing methods suffer from spatial misalignment between LiDAR and camera features, which causes inaccurate depth supervision in the camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors, stemming from calibration inaccuracies and the rolling shutter effect. The key insight of this work is that locations of these projection errors are not random but highly predictable, as they are concentrated at object-background boundaries which 2D detectors can reliably identify. Based on this, our main motivation is to utilize 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth Calibration (PGDC), which leverages 2D priors to alleviate misalignment and preserve correct cross-modal feature pairs. To resolve global misalignment, we introduce Discontinuity Aware Geometric Fusion (DAGF) to suppress residual noise from PGDC and explicitly enhance sharp depth transitions at object-background boundaries, yielding a structurally aware representation. To effectively utilize these aligned representations, we incorporate Structural Guidance Depth Modulator (SGDM), using a gated attention mechanism to efficiently fuse aligned depth and image features. Our method achieves SOTA performance on the nuScenes validation set, with its mAP and NDS reaching 71.5% and 73.6% respectively. Additionally, on the Argoverse 2 validation set, we achieve a competitive mAP of 41.3%.
PaperID: 2074,   Poster  https://arxiv.org/pdf/2604.05794    
Authors: Da Li, Dominik Engel, Deng Luo, Ivan Viola
Title: EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion
Abstract: Strand-level hair geometry reconstruction is a fundamental problem in virtual human modeling and the digitization of hairstyles. However, existing methods still suffer from a significant trade-off between accuracy and efficiency. Implicit neural representations can capture the global hair shape but often fail to preserve fine-grained strand details, while explicit optimization-based approaches achieve high-fidelity reconstructions at the cost of heavy computation and poor scalability. To address this issue, we propose EfficientMonoHair, a fast and accurate framework that combines the implicit neural network with multi-view geometric fusion for strand-level reconstruction from monocular video. Our method introduces a fusion-patch-based multi-view optimization that reduces the number of optimization iterations for point cloud direction, as well as a novel parallel hair-growing strategy that relaxes voxel occupancy constraints, allowing large-scale strand tracing to remain stable and robust even under inaccurate or noisy orientation fields. Extensive experiments on various real-world hairstyles demonstrate that our method can robustly and accurately reconstruct high-fidelity strand geometries. On synthetic benchmarks, our method achieves reconstruction quality comparable to state-of-the-art methods, while improving runtime efficiency by nearly an order of magnitude.
PaperID: 2075,   Poster  https://arxiv.org/pdf/2602.23739    
Authors: xiang deng, Feng Gao, Yong Zhang, Youxin Pang, Xu Xiaoming, Zhuoliang Kang, Xiaoming Wei, Yebin Liu
Title: U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
Abstract: Full-stack multimodal interaction in real-time is a central goal in building intelligent embodied agents capable of natural, dynamic communication. However, existing systems are either limited to unimodal generation or suffer from degraded reasoning and poor cross-modal alignment, preventing coherent and perceptually grounded interactions. In this work, we introduce U-Mind, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop. At its core, U-Mind implements a Unified Alignment and Reasoning Framework that addresses two key challenges: enhancing cross-modal synchronization via a segment-wise alignment strategy, and preserving reasoning abilities through Rehearsal-Driven Learning. During inference, U-Mind adopts a text-first decoding pipeline that performs internal chain-of-thought planning followed by temporally synchronized generation across modalities. To close the loop, we implement a real-time video rendering framework conditioned on pose and speech, enabling expressive and synchronized visual feedback. Extensive experiments demonstrate that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks, including question answering, instruction following, and motion generation, paving the way toward intelligent, immersive conversational agents.
PaperID: 2076,   Poster  https://arxiv.org/pdf/2509.22615    
Authors: Yasmine Omri, Connor Ding, Tsachy Weissman, Thierry Tambe
Title: GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
Abstract: Modern vision–language pipelines are driven by RGB vision encoders trained on massive image–text corpora. While these pipelines have enabled impressive zero-shot capabilities and strong transfer across tasks, they still inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy-intensive and costly, and (ii) patch-based tokenization explodes sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes images by a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance-aware pruning, and batched CUDA kernels, achieving over 90× faster fitting and ~97% GPU utilization compared to prior implementations. We further adapt contrastive language-image pre-training (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat-aware input stem and a perceiver resampler, training only ~9.7–13.8% of the total parameters. On a 12.8M dataset from DataComp, GS encoders yield competitive zero-shot performance on 38 datasets from the CLIP benchmark while compressing inputs 3–23.5× relative to pixels. Our results establish 2DGS as a viable multimodal substrate, pinpoint architectural bottlenecks, and open a path toward representations that are both semantically powerful and transmission-efficient for edge–cloud learning.
PaperID: 2077,   Poster  https://arxiv.org/pdf/2602.19575    
Authors: Minseo Kim, Minchan Kwon, Dongyeun Lee, Yunho Jeon, Junmo Kim
Title: ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization
Abstract: Personalized text-to-image generation suffers from concept entanglement, where irrelevant residual information from reference images is captured, leading to a trade-off between concept fidelity and text alignment. Recent disentanglement approaches attempt to solve this utilizing manual guidance, such as linguistic cues or segmentation masks, which limits their applicability and fails to fully articulate the target concept. In this paper, we propose ConceptPrism, a novel framework that automatically disentangles the shared visual concept from image-specific residuals by comparing images within a set. Our method jointly optimizes a target token and image-wise residual tokens using two complementary objectives: a reconstruction loss to ensure fidelity, and a novel exclusion loss that compels residual tokens to discard the shared concept. This process allows the target token to capture the pure concept without direct supervision. Extensive experiments demonstrate that ConceptPrism effectively resolves concept entanglement, achieving a significantly improved trade-off between fidelity and alignment.
PaperID: 2078,   Poster  https://arxiv.org/pdf/2601.05368    
Authors: Svitlana Morkva, Vaishakh Patil, Alessio Tonioni, Michael Oechsle, Maximum Wilder-Smith, Marco Hutter
Title: MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments
Abstract: We present MOSAIC-GS, a novel, fully explicit, and computationally efficient approach for high-fidelity dynamic scene reconstruction from monocular videos using Gaussian Splatting. Monocular reconstruction is inherently ill-posed due to the lack of sufficient multi-view constraints, making accurate recovery of object geometry and temporal coherence particularly challenging. To address this, we leverage multiple geometric cues, such as depth, optical flow, dynamic object segmentation, and point tracking. Combined with rigidity-based motion constraints, these cues allow us to estimate preliminary 3D scene dynamics during an initialization stage. Recovering scene dynamics prior to the photometric optimization reduces reliance on motion inference from visual appearance alone, which is often ambiguous in monocular settings. To enable compact representations, fast training, and real-time rendering while supporting non-rigid deformations, the scene is decomposed into static and dynamic components. Each Gaussian in the dynamic part of the scene is assigned a trajectory represented as a time-dependent Poly-Fourier curve for parameter-efficient motion encoding. We demonstrate that MOSAIC-GS achieves substantially faster optimization and rendering compared to existing methods, while maintaining reconstruction quality on par with state-of-the-art approaches across standard monocular dynamic scene benchmarks.
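Editorial sketch (not from the paper): the abstract does not spell out the Poly-Fourier parameterization, so the form below is an assumption. It models each coordinate of a Gaussian center as a base value plus a low-order polynomial in time plus a few sine/cosine terms. The names `poly_fourier` and `gaussian_center` and the coefficient layout are hypothetical.

```python
import math

def poly_fourier(t, base, poly_coeffs, sin_coeffs, cos_coeffs):
    """Evaluate one coordinate of an assumed Poly-Fourier curve at time t in [0, 1].

    value(t) = base + sum_k a_k * t**k
                    + sum_l (b_l * sin(2*pi*l*t) + c_l * cos(2*pi*l*t))
    """
    value = base
    # Polynomial part: captures slow drift of the Gaussian.
    for k, a in enumerate(poly_coeffs, start=1):
        value += a * t ** k
    # Fourier part: captures periodic or oscillating motion.
    for l, (b, c) in enumerate(zip(sin_coeffs, cos_coeffs), start=1):
        angle = 2.0 * math.pi * l * t
        value += b * math.sin(angle) + c * math.cos(angle)
    return value

def gaussian_center(t, per_axis_params):
    """Return the (x, y, z) center at time t; one parameter tuple per axis."""
    return tuple(poly_fourier(t, *axis) for axis in per_axis_params)

# Hypothetical coefficients: (base, poly_coeffs, sin_coeffs, cos_coeffs) per axis.
params = [
    (1.0, [2.0], [0.5], [0.0]),   # x: linear drift plus one sine term
    (0.0, [0.0], [0.0], [0.0]),   # y: static
    (3.0, [0.0, 1.0], [], []),    # z: quadratic drift, no Fourier terms
]
center_at_start = gaussian_center(0.0, params)
center_at_end = gaussian_center(1.0, params)
```

With this layout a trajectory costs a handful of scalars per axis rather than one position per frame, which is the sense in which such a curve is parameter-efficient.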
PaperID: 2079,   Poster  https://arxiv.org/pdf/2603.21332    
Authors: Haolan Xu, Keli Cheng, Lei Wang, Ning Bi, Xiaoming Liu
Title: EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization
Abstract: Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Few-shot methods enable instant personalization by reconstructing high-fidelity avatars from only a few seconds of video. However, achieving natural talking-head generation further requires strong emotion-aware motion modeling, and existing few-shot approaches exhibit geometric instability and audio-emotion mismatch under expressive facial motion. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, which introduces strong geometric priors for stable and interpretable motion. Building upon this, we propose a Gated Residual Motion Network (GRMN), which can capture emotional prosody from audio while supplementing head pose and upper-face cues absent in audio to enable expressive yet stable motion generation. Extensive experiments demonstrate that EmoTaG achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability.
PaperID: 2080,   Poster  https://arxiv.org/pdf/2603.00526    
Authors: Zhen Zhou, Jian Liu, Biwen Lei, Jing Xu, Haohan Weng, Yiling Zhu, Zhuo Chen, Junfeng Fan, Yunkai Ma, Dazhao Du, Song Guo, Fengshui Jing, Chunchao Guo
Title: Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation
Abstract: Reinforcement learning (RL) has demonstrated remarkable success in text and image generation, yet its potential in 3D generation remains largely unexplored. Existing attempts typically rely on offline direct preference optimization (DPO), which suffers from low training efficiency and limited generalization. In this work, we aim to enhance both the training efficiency and generation quality of RL in 3D mesh generation. Specifically, (1) we design the first asynchronous online RL framework tailored to improving post-training efficiency for 3D mesh generation, which is 3.75× faster than synchronous RL. (2) We propose Advantage-guided Ranking Preference Optimization (ARPO), a novel RL algorithm that achieves a better trade-off between training efficiency and generalization than current RL algorithms designed for 3D mesh generation, such as DPO and group relative policy optimization (GRPO). (3) Based on asynchronous ARPO, we propose Mesh-Pro, which additionally introduces a novel diagonal-aware mixed triangular-quadrilateral tokenization for mesh representation and a ray-based reward for geometric integrity. Mesh-Pro achieves state-of-the-art performance on artistic and dense meshes. Code will be available soon.
PaperID: 2081,   Poster  https://arxiv.org/pdf/2603.09173    
Authors: Sneha Paul, Zachary Patterson, Nizar Bouguila
Title: Point Cloud as a Foreign Language for Multi-modal Large Language Model
Abstract: Multimodal large language models (MLLMs) have shown remarkable progress in integrating visual and linguistic understanding. Recent efforts have extended these capabilities to 3D understanding through encoder-based architectures that rely on pre-trained 3D encoders to extract geometric features. However, such approaches suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and substantial computational overhead. In this work, we present SAGE, the first end-to-end 3D MLLM that directly processes raw point clouds without relying on a pre-trained 3D encoder. Our approach introduces a lightweight 3D tokenizer that combines geometric sampling and neighbourhood aggregation with vector quantization to convert point clouds into discrete tokens—treating 3D data as a foreign language that naturally extends the LLM’s vocabulary. Furthermore, to enhance the model’s reasoning capability on complex 3D tasks, we propose a preference optimization training strategy with a semantic alignment–based reward, specifically designed for open-ended 3D question answering where responses are descriptive. Extensive experiments across diverse 3D understanding benchmarks demonstrate that our end-to-end approach outperforms existing encoder-based methods while offering significant advantages in computational efficiency, generalization across LLM backbones, and robustness to input resolution variations. Code is available at: https://anonymous.4open.science/r/SAGE-3D.
PaperID: 2082,   Poster  https://arxiv.org/pdf/2603.10893    
Authors: Yuzhou Ji, Qijian Tian, He Zhu, Xiaoqi Jiang, Guangzhi Cao, Lizhuang Ma, Yuan Xie, Xin Tan
Title: S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs
Abstract: Explicit 3D representations have already become an essential medium for 3D simulation and understanding. However, the most commonly used representations, point clouds and 3D Gaussian Splatting (3DGS), suffer from non-photorealistic rendering and significant degradation under sparse inputs, respectively. In this paper, we introduce Sparse to Dense lifting (S2D), a novel pipeline that bridges the two representations and achieves high-quality 3DGS reconstruction with minimal inputs. Specifically, the S2D lifting is two-fold. We first present an efficient one-step diffusion model that lifts a sparse point cloud for high-fidelity image artifact fixing. Meanwhile, to reconstruct 3D consistent scenes, we also design a corresponding reconstruction strategy with random sample drop and weighted gradient for robust model fitting from sparse input views to dense novel views. Extensive experiments show that S2D achieves the best consistency in generating novel view guidance and first-tier sparse view reconstruction quality under different input sparsity. By reconstructing stable scenes with the least possible captures among existing methods, S2D enables minimal input requirements for 3DGS applications.
PaperID: 2083,   Poster  https://arxiv.org/pdf/2603.12711    
Authors: Jing Yang, Hui Xue, Shipeng Zhu, Pengfei Fang
Title: Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval
Abstract: This paper studies unsupervised cross-domain image retrieval (UCDIR), which aims to retrieve images of the same category across different domains without relying on labeled data. Existing methods typically utilize pseudo-labels, derived from clustering algorithms, as supervisory signals for intra-domain representation learning and cross-domain feature alignment. However, these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. Moreover, the alignment process frequently overlooks the entanglement between domain-specific and semantic information, leading to semantic degradation in the learned representations and ultimately impairing retrieval performance. This paper addresses these limitations by proposing a Text-Phase Synergy Network with Dual Priors (TPSNet). Specifically, we first employ CLIP to generate a set of class-specific prompts per domain, termed domain prompts, serving as a text prior that offers more precise semantic supervision. In parallel, we further introduce a phase prior, represented by domain-invariant phase features, which is integrated into the original image representations to bridge the domain distribution gaps while preserving semantic integrity. Leveraging the synergy of these dual priors, TPSNet significantly outperforms state-of-the-art methods on both the Office-Home and DomainNet datasets. Notably, when using ViT-B as the image encoder, TPSNet achieves an average P@15 improvement of 24.48% on Office-Home and a 13.86% improvement in P@200 on DomainNet. The code will be released.
PaperID: 2084,   Poster  https://arxiv.org/pdf/2603.20611    
Authors: Di Kong, Yikai Wang, Wenjie Guo, Yifan Bu, Boya Zhang, Yuexin Duan, Xiawei Yue, Wenbiao Du, Yiman Zhong, Yuwen Chen, Cheng Ma
Title: GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction
Abstract: Slice-based volumetric imaging is widely applied and demands representations that compress aggressively while preserving internal structure for analysis. This paper introduces GaussianPile, unifying 3D Gaussian splatting with an imaging system-aware focus model to address this challenge. Our new method introduces three key innovations: (i) a slice-aware piling strategy that positions anisotropic 3D Gaussians to model through-slice contributions, (ii) a differentiable projection operator that encodes the finite-thickness point spread function of the imaging acquisition system, and (iii) a compact encoding and joint optimization pipeline that simultaneously reconstructs and compresses the Gaussian sets. Our CUDA-based design retains the compression and real-time rendering efficiency of Gaussian primitives while preserving high-frequency internal volumetric detail. Experiments on microscopy and ultrasound datasets demonstrate that our method reduces storage and reconstruction cost, sustains diagnostic fidelity, and enables fast 2D visualization, along with 3D voxelization. In practice, it delivers high-quality results in as few as 3 minutes (up to 11× faster than NeRF-based approaches) and achieves consistent 16× compression over the original voxel grids, offering a practical path to deployable compression and exploration of slice-based volumetric datasets.
PaperID: 2085,   Poster  https://arxiv.org/pdf/2602.24096    
Authors: Yuxuan Zhang, Katarina Tothova, Zian Wang, Kangxue Yin, Haithem Turki, Riccardo de Lutio, Yen-Yu Chang, Or Litany, Sanja Fidler, Žan Gojčič
Title: DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer
Abstract: Simulation is essential to the development and evaluation of autonomous robots such as self-driving vehicles. Neural reconstruction is emerging as a promising solution as it enables simulating a wide variety of scenarios from real-world data alone in an automated and scalable way. However, while methods such as NeRF and 3D Gaussian Splatting can produce visually compelling results, they often exhibit artifacts, particularly when rendering novel views, and fail to realistically integrate inserted dynamic objects, especially when they were captured from different scenes. To overcome these limitations, we introduce DiffusionHarmonizer, an online generative enhancement framework that transforms renderings from such imperfect scenes into photorealistic, temporally consistent outputs. At its core is a single-step temporally-conditioned enhancer that is converted from a pretrained multi-step image diffusion model, capable of running in online simulators on a single GPU. The key to training it effectively is a custom data curation pipeline that constructs synthetic–real pairs emphasizing appearance harmonization, artifact correction, and lighting realism. Experiments show that DiffusionHarmonizer substantially improves perceptual realism, being chosen by 84.28% of users in our comparative study over the second-best method. Furthermore, it matches the temporal coherence of state-of-the-art video models while maintaining the inference efficiency of single-step image models, offering a scalable and practical solution for photorealistic simulation in both research and production settings.
PaperID: 2086,   Poster  https://arxiv.org/pdf/2602.23141    
Authors: Kan Ren, Gang Wan, TAO LIU
Title: No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors
Abstract: We propose a novel unsupervised framework for online video stabilization. Unlike deep learning-based stabilizers that require paired stable/unstable datasets, our method models the classical three-stage stabilization pipeline and integrates a multithreaded buffering mechanism, effectively addressing three key challenges of end-to-end learning: limited data, poor controllability, and inefficiency on resource-constrained hardware. Existing benchmarks mainly focus on handheld, forward-view, visible-light videos, restricting the application of stabilization in domains such as UAV nighttime remote sensing. To fill this gap, we introduce a new multimodal UAV aerial video dataset (UAV-Test). Experiments show that our approach consistently outperforms state-of-the-art online stabilizers in both quantitative metrics and visual quality, while achieving performance comparable to that of offline methods.
PaperID: 2087,   Poster  https://arxiv.org/pdf/2602.21858    
Authors: Dezhi Kong, Zhengzhao Feng, Qiliang Liang, Wang Hao, haofei Sun, Changpeng Yang, Yang Li, Peng Zhou, Shuai Nie, Hongzhen Wang, Linfeng Zhou, Hao Jia, Jiaming Xu, Runyu Shi, Ying Huang
Title: ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence On Mobile Devices
Abstract: Multimodal large language models (MLLMs) have made significant progress in mobile agent development, yet their capabilities are predominantly confined to a reactive paradigm, where they merely execute explicit user commands. The emerging paradigm of proactive intelligence, where agents autonomously anticipate needs and initiate actions, represents the next frontier for mobile agents. However, its development is critically bottlenecked by the lack of benchmarks that can address real-world complexity and enable objective, executable evaluation. To overcome these challenges, we introduce ProactiveMobile, a comprehensive benchmark designed to systematically advance research in this domain. ProactiveMobile formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals and generating an executable function sequence from a comprehensive function pool of 63 APIs. The benchmark features over 3,660 instances across 14 scenarios that embrace real-world complexity through multi-answer annotations. To ensure quality, a team of 30 experts conducts a final audit of the benchmark, verifying factual accuracy, logical consistency, and action feasibility, and correcting any non-compliant entries. Extensive experiments demonstrate that our fine-tuned Qwen2.5-VL-7B-Instruct achieves a success rate of 19.15%, outperforming o1 (15.71%) and GPT-5 (7.39%). This result indicates that proactivity is a critical competency widely lacking in current MLLMs, yet it is learnable, emphasizing the importance of the proposed benchmark for proactivity evaluation.
PaperID: 2088,   Poster  https://arxiv.org/pdf/2511.18513    
Authors: HE HUANG, Yujun Guo, Wei He
Title: LRDUN: A Low-Rank Deep Unfolding Network for Efficient Spectral Compressive Imaging
Abstract: Deep unfolding networks (DUNs) have achieved remarkable success and become the mainstream paradigm for spectral compressive imaging (SCI) reconstruction. Existing DUNs are derived from full-HSI imaging models, where each stage operates directly on the high-dimensional HSI, refining the entire data cube based on the single 2D coded measurement. However, this paradigm leads to computational redundancy and suffers from the ill-posed nature of mapping 2D residuals back to the 3D HSI space. In this paper, we propose two novel imaging models corresponding to the spectral basis and subspace image by explicitly integrating low-rank (LR) decomposition with the sensing model. Compared to recovering the full HSI, estimating these compact low-dimensional components significantly mitigates the ill-posedness. Building upon these novel models, we develop the Low-Rank Deep Unfolding Network (LRDUN), which jointly solves the two subproblems within an unfolded proximal gradient descent (PGD) framework. Furthermore, we introduce a Generalized Feature Unfolding Mechanism (GFUM) that decouples the physical rank in the data-fidelity term from the feature dimensionality in the prior module, enhancing the representational capacity and flexibility of the network. Extensive experiments on simulated and real datasets demonstrate that the proposed LRDUN achieves state-of-the-art (SOTA) reconstruction quality with significantly reduced computational cost.
PaperID: 2089,   Poster  https://arxiv.org/pdf/2602.21929    
Authors: JiaKui Hu, Jialun Liu, Liying Yang, Xinliang Zhang, Kaiwen Li, Shuang Zeng, Yuanwei Li, Haibin Huang, Chi Zhang, Yanye Lu
Title: Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context
Abstract: Scene-consistent video generation aims to create videos that explore 3D scenes based on a camera trajectory. Previous methods rely on video generation models with external memory for consistency, or iterative 3D reconstruction and inpainting, which accumulate errors during inference due to incorrect intermediary outputs, non-differentiable processes, and separate models. To overcome these limitations, we introduce "geometry-as-context". It iteratively completes the following steps using an autoregressive camera-controlled video generation model: (1) estimates the geometry of the current view necessary for 3D reconstruction, and (2) simulates and restores novel view images rendered by the 3D scene. Under this multi-task framework, we develop the camera gated attention module to enhance the model's capability to effectively leverage camera poses. During the training phase, text contexts are utilized to ascertain whether geometric or RGB images should be generated. To ensure that the model can generate RGB-only outputs during inference, the geometry context is randomly dropped from the interleaved text-image-geometry training sequence. The method has been tested on scene video generation with one-direction and back-and-forth trajectories. The results show its superiority over previous approaches in maintaining scene consistency and camera control.
PaperID: 2090,   Poster  https://arxiv.org/pdf/2603.10578    
Authors: Zhuangzi Li, Jian Jin, Shilv Cai, Weisi Lin
Title: R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment
Abstract: Immersive Computer Graphics (CGs) rendering has become ubiquitous in modern daily life. However, comprehensively evaluating CG quality remains challenging for two reasons: (1) existing CG datasets lack systematic descriptions of rendering quality; and (2) existing CG quality assessment methods cannot provide reasonable text-based explanations. To address these issues, we first identify six key perceptual dimensions of CG quality from the user perspective and construct a dataset of 3.5K CG images with corresponding quality descriptions. Each description covers CG style, content, and perceived quality along the selected dimensions. Furthermore, we use a subset of the dataset to build several question–answer benchmarks based on the descriptions in order to evaluate the responses of existing Vision Language Models (VLMs). We find that current VLMs are not sufficiently accurate in judging fine-grained CG quality, but that descriptions of visually similar images can significantly improve a VLM’s understanding of a given CG image. Motivated by this observation, we adopt retrieval-augmented generation and propose a two-stream retrieval framework that effectively enhances the CG quality assessment capabilities of VLMs. Experiments on several representative VLMs demonstrate that our method substantially improves their performance on CG quality assessment. The dataset and code will be publicly released to support future research in this area.
PaperID: 2091,   Poster  https://arxiv.org/pdf/2603.01515    
Authors: Hanxiao Wang, Yuanchen Guo, Ying-Tian Liu, Zi-Xin Zou, Biao Zhang, Weize Quan, Ding Liang, Yan-Pei Cao, Dong-Ming Yan
Title: FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation
Abstract: Autoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our "one-face-one-token" strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder, FACE achieves state-of-the-art reconstruction quality on standard benchmarks. The versatility of the learned latent space is further demonstrated by training a latent diffusion model that achieves high-fidelity, single-image-to-mesh generation. FACE provides a simple, scalable, and powerful paradigm that lowers the barrier to high-quality structured 3D content creation.
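Editorial sketch (not from the paper): the factor of nine and the 0.11 ratio quoted above are consistent with a simple count, assuming the baseline emits one token per vertex coordinate (3 vertices × 3 coordinates per triangle) and the face-level scheme emits one token per triangle. The function names are hypothetical and the code only reproduces the arithmetic, not the tokenizer itself.

```python
VERTICES_PER_TRIANGLE = 3
COORDS_PER_VERTEX = 3  # x, y, z

def coordinate_level_tokens(num_faces):
    """Sequence length when every coordinate of every face vertex is a token."""
    return num_faces * VERTICES_PER_TRIANGLE * COORDS_PER_VERTEX

def face_level_tokens(num_faces):
    """Sequence length under a one-face-one-token scheme."""
    return num_faces

def compression_ratio(num_faces):
    """Face-level length divided by coordinate-level length."""
    return face_level_tokens(num_faces) / coordinate_level_tokens(num_faces)

# Example with a hypothetical 4,000-face mesh.
faces = 4000
baseline_len = coordinate_level_tokens(faces)   # 36000 tokens
face_len = face_level_tokens(faces)             # 4000 tokens
ratio = compression_ratio(faces)                # 1/9, about 0.11
```

Under these assumptions the ratio is 1/9 regardless of mesh size, which matters for attention-based decoders whose cost grows faster than linearly with sequence length.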
PaperID: 2092,   Poster  https://arxiv.org/pdf/2601.17391    
Authors: Rui Fan, Weidong Hao, Juntao Guan, Lai Rui, Tong Wu, Fanhong Zeng, Lin Gu
Title: SMV-EAR: Bring Spatiotemporal Multi-View Representation Learning into Efficient Event-Based Action Recognition
Abstract: Event camera action recognition (EAR) offers compelling privacy-protecting and efficiency advantages, where temporal motion dynamics are of great importance. Existing spatiotemporal multi-view representation learning (SMVRL) methods for event-based object recognition (EOR) offer promising solutions by projecting H-W-T events along the spatial axes H and W, yet are limited by their translation-variant spatial binning representation and naive early concatenation fusion architecture. This paper reexamines the key SMVRL design stages for EAR and proposes: (i) a principled spatiotemporal multi-view representation through translation-invariant dense conversion of sparse events, (ii) a dual-branch, dynamic fusion architecture that models sample-wise complementarity between motion features from different views, and (iii) a bio-inspired temporal warping augmentation that mimics speed variability of real-world human actions. On three challenging EAR datasets (HARDVS, DailyDVS-200, and THU-EACT-50-CHL), we show +7.0%, +10.7%, and +10.2% Top-1 accuracy gains over the existing SMVRL EOR method with 30.1% fewer parameters and 35.7% less computation, establishing our framework as a novel and powerful EAR paradigm. Code will be released once accepted.
PaperID: 2093,   Poster  https://arxiv.org/pdf/2603.29185    
Authors: Huaqi Tao, Bingxi Liu, Guangcheng Chen, Fulin Tang, Li He, Hong Zhang
Title: Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting
Abstract: Visual relocalization is a fundamental task in the field of 3D computer vision, estimating a camera’s pose when it revisits a previously known scene. While point-based hierarchical localization methods have shown strong scalability and efficiency, they are often limited by sparse image observations and weak feature matching. In this work, we propose SplatHLoc, a novel hierarchical visual relocalization framework that uses Feature Gaussian Splatting as the scene representation. For feature matching, we observe that Gaussian-rendered features and those extracted directly from images exhibit different strengths across the two-stage matching process: the former performs better in the coarse stage, while the latter proves more effective in the fine stage. Therefore, we introduce a hybrid feature matching strategy, enabling more accurate and efficient pose estimation. Extensive experiments on both indoor and outdoor datasets show that SplatHLoc significantly enhances the robustness of visual relocalization, setting a new state-of-the-art. The code will be released upon acceptance.
PaperID: 2094,   Poster  https://arxiv.org/pdf/2601.16788    
Authors: Xuewei Li, Xinghan Bao, Zhimin Chen, Xi Li
Title: REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion
Abstract: As an important and challenging problem in computer vision, Panoramic Semantic Segmentation (PASS) aims to give complete scene perception based on an ultra-wide angle of view. Most PASS methods focus on spherical geometry with RGB input or use depth information in the original or HHA format, which does not make full use of panoramic image geometry. To address these shortcomings, we propose REL-SF4PASS with our REL depth representation based on cylindrical coordinates and Spherical-dynamic Multi-Modal Fusion (SMMF). REL is made up of Rectified Depth, Elevation-Gained Vertical Inclination Angle, and Lateral Orientation Angle, which fully represents 3D space in cylindrical coordinate style and the surface normal direction. SMMF aims to ensure the diversity of fusion for different panoramic image regions and reduce the breakage of cylinder side surface expansion in ERP projection, using different fusion strategies to match the different regions in panoramic images. Experimental results show that REL-SF4PASS considerably improves performance and robustness on the popular Stanford2D3D Panoramic benchmark. It gains a 2.35% average mIoU improvement across all 3 folds and reduces the performance variance by approximately 70% when facing 3D disturbance.
PaperID: 2095,   Poster  https://arxiv.org/pdf/2511.12370    
Authors: Chamuditha Jayanga Galappaththige, Jason Lai, Lloyd Windrim, Donald G. Dansereau, Niko Suenderhauf, Dimity Miller
Title: Changes in Real Time: Online Scene Change Detection with Multi-View Fusion
Abstract: Online Scene Change Detection (SCD) is an extremely challenging problem that requires an agent to detect relevant changes on the fly while observing the scene from unconstrained viewpoints. Existing online SCD methods are significantly less accurate than offline approaches. We present the first online SCD approach that is pose-agnostic, label-free, and ensures multi-view consistency, while operating at over 10 FPS and achieving new state-of-the-art performance, surpassing even the best offline approaches. Our method introduces a new self-supervised fusion loss to infer scene changes from multiple cues and observations, PnP-based fast pose estimation against the reference scene, and a fast change-guided update strategy for the 3D Gaussian Splatting scene representation. Extensive experiments on complex real-world datasets demonstrate that our approach outperforms both online and offline baselines. Code will be released upon acceptance.
PaperID: 2096,   Poster  https://arxiv.org/pdf/2603.11106    
Authors: Shijie Zhou, Bin Zhu, Jiarui Yang, Xiangyu Zhao, Jingjing Chen, Yu-Gang Jiang
Title: RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation
Abstract: Recent advances in Vision-Language-Action (VLA) models have enabled robots to execute increasingly complex tasks. However, VLA models trained through imitation learning struggle to operate reliably in dynamic environments and often fail under Out-of-Distribution (OOD) conditions. To address this issue, we propose Robot-Conditioned Normalizing Flow (RC-NF), a real-time monitoring model for robotic anomaly detection and intervention that ensures the robot's state and the object's motion trajectory align with the task. RC-NF decouples the processing of task-aware robot and object states within the normalizing flow. It requires only positive samples for unsupervised training and calculates accurate robotic anomaly scores during inference through the probability density function. We further present LIBERO-Anomaly-10, a benchmark comprising three categories of robotic anomalies for simulation evaluation. RC-NF achieves state-of-the-art performance across all anomaly types compared to previous methods in monitoring robotic tasks. Real-world experiments demonstrate that RC-NF operates as a plug-and-play module for VLA models (e.g., Pi0) by triggering task replanning for task-level OOD and task rollback for state-level OOD within 100 ms. These results demonstrate that RC-NF noticeably enhances the robustness and adaptability of VLA-based robotic systems in dynamic environments.
PaperID: 2097,   Poster  https://arxiv.org/pdf/2602.18867    
Authors: Zhuofan Xie, Zishan Lin, Jinliang Lin, Jie Qi, Shaohua Hong, Shuo Li
Title: Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning
Abstract: Active Learning (AL) reduces annotation costs in medical imaging by selecting only the most informative samples for labeling, but suffers from the cold-start problem when labeled data are scarce. Vision-Language Models (VLMs) address the cold-start problem via zero-shot predictions, yet their temperature-scaled softmax outputs treat text-image similarities as deterministic scores while ignoring inherent uncertainty, leading to overconfidence. This overconfidence misleads sample selection, wasting annotation budgets on uninformative cases. To overcome these limitations, the Similarity-as-Evidence (SaE) framework calibrates text–image similarities by introducing a Similarity Evidence Head (SEH), which reinterprets the similarity vector as evidence and parameterizes a Dirichlet distribution over labels. In contrast to a standard softmax that enforces confident predictions even under weak signals, the Dirichlet formulation explicitly quantifies lack of evidence (vacuity) and conflicting evidence (dissonance), thereby mitigating overconfidence caused by rigid softmax normalization. Building on this, SaE employs a dual-factor acquisition strategy: high-vacuity samples (e.g., rare diseases) are prioritized in early rounds to ensure coverage, while high-dissonance samples (e.g., ambiguous diagnoses) are prioritized later to refine boundaries, providing clinically interpretable selection rationales. Experiments on ten public medical imaging datasets with a 20% label budget show that SaE attains state-of-the-art macro-averaged accuracy of 82.57%. On the representative BTMRI dataset, SaE also achieves superior calibration, with a negative log-likelihood (NLL) of 0.425.
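Editorial sketch (not from the paper): vacuity and dissonance can be computed from a Dirichlet distribution using the standard subjective-logic definitions. Given non-negative evidence e_k for K classes, alpha_k = e_k + 1, S = sum(alpha), belief b_k = e_k / S, and vacuity = K / S. How the paper's Similarity Evidence Head maps similarities to evidence is not stated in the abstract; the softplus mapping and the `scale` parameter below are assumptions, and all function names are hypothetical.

```python
import math

def similarities_to_evidence(similarities, scale=10.0):
    """Assumed stand-in for a learned evidence head: softplus of scaled similarity."""
    return [math.log1p(math.exp(scale * s)) for s in similarities]

def dirichlet_uncertainty(evidence):
    """Return (expected class probabilities, vacuity, dissonance)."""
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    belief = [e / strength for e in evidence]
    # Vacuity: high when total evidence is small.
    vacuity = num_classes / strength
    # Dissonance: high when several classes hold similar, non-zero belief.
    dissonance = 0.0
    for k, b_k in enumerate(belief):
        others = [b_j for j, b_j in enumerate(belief) if j != k]
        total_other = sum(others)
        if b_k <= 0.0 or total_other <= 0.0:
            continue
        balance = sum(
            b_j * (1.0 - abs(b_j - b_k) / (b_j + b_k))
            for b_j in others if b_j > 0.0
        )
        dissonance += b_k * balance / total_other
    probs = [a / strength for a in alpha]
    return probs, vacuity, dissonance

# Three hand-picked evidence vectors for a 3-class problem.
_, vac_none, dis_none = dirichlet_uncertainty([0.0, 0.0, 0.0])          # no evidence
_, vac_conflict, dis_conflict = dirichlet_uncertainty([10.0, 10.0, 0.0])  # two rivals
_, vac_sure, dis_sure = dirichlet_uncertainty([20.0, 0.0, 0.0])          # one clear class
```

The first case has maximal vacuity and no dissonance, and the second has low vacuity but high dissonance, which is the distinction a dual-factor acquisition rule can exploit. Belief mass and vacuity always sum to one under these definitions.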
PaperID: 2098,   Poster  https://arxiv.org/pdf/2512.11099    
Authors: Weitai Kang, Jason Kuen, Mengwei Ren, Zijun Wei, Yan Yan, Kangning Liu
Title: VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction
Abstract: Current visual grounding models are either based on a Multimodal Large Language Model (MLLM) that performs autoregressive decoding, which is slow and risks hallucinations, or on re-aligning an LLM with vision features to learn new special or object tokens for grounding, which may undermine the LLM’s pretrained reasoning ability. In contrast, we propose VGent, a modular encoder–decoder architecture that explicitly disentangles high-level reasoning and low-level bounding box prediction. Specifically, a frozen MLLM serves as the encoder to provide untouched powerful reasoning capabilities, while a decoder takes high-quality boxes proposed by detectors as queries and selects target box(es) by cross-attending to the encoder's hidden states. This design fully leverages advances in both object detection and MLLMs, avoids the pitfalls of autoregressive decoding, and enables fast inference. Moreover, it supports modular upgrades of both the encoder and decoder to benefit the whole system: we introduce (i) QuadThinker, an RL-based training paradigm for enhancing the multi-target reasoning ability of the encoder; (ii) a mask-aware label for resolving detection–segmentation ambiguity; and (iii) global target recognition to improve the recognition of all targets, which benefits the selection among augmented proposals. Experiments on multi-target visual grounding benchmarks show that VGent achieves a new state of the art with a +20.6% F1 improvement over prior methods, and further boosts gIoU by +8.2% and cIoU by +5.8% under visual reference challenges, while maintaining constant, fast inference latency.
PaperID: 2099,   Poster  https://arxiv.org/pdf/2511.23002    
Authors: yunlong lin, Linqing Wang, Kunjie Lin, Zixu Lin, Kaixiong Gong, Wenbo Li, Bin Lin, Zhenxi Li, Shiyi Zhang, Yuyang Peng, Wenxun Dai, Xinghao Ding, Chunyu Wang, qinglin lu
Title: JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization
Abstract: Agent-based editing models have substantially advanced interactive experiences, processing quality, and creative flexibility. However, two critical challenges persist: (1) instruction hallucination: text-only chain-of-thought (CoT) reasoning cannot fully prevent factual errors due to inherent information bottlenecks; (2) reward hacking: dynamic policy optimization against static reward models allows agents to exploit flaws in reward functions. To address these issues, we propose JarvisEvo, a unified image editing agent that emulates an expert human designer by iteratively editing, selecting appropriate tools, evaluating results, and reflecting on its own decisions to refine outcomes. JarvisEvo offers three key advantages: (1) an interleaved multimodal chain-of-thought (iMCoT) reasoning mechanism that enhances instruction following and editing quality; (2) a synergistic editor–evaluator policy optimization (SEPO) framework that enables self-improvement without external rewards, effectively mitigating reward hacking; and (3) support for both preservative and generative editing through seamless integration of Adobe Lightroom and Qwen-Image-Edit tools. On ArtEdit-Bench, JarvisEvo outperforms Nano-Banana by an average of 18.95% on preservative editing metrics, including a substantial 44.96% improvement in pixel-level content fidelity, while maintaining competitive performance in generative editing tasks.
PaperID: 2100,   Poster  https://arxiv.org/pdf/2511.04570    
Authors: Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu
Title: Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Abstract: The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) text and vision are separated as distinct modalities, hindering unified multimodal understanding and generation. Therefore, we propose "Thinking with Video", a new paradigm that leverages video generation models such as Sora-2 to use video frames as a unified medium for multimodal reasoning. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench), which encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation on VideoThinkBench establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses GPT-5 by 10% on Eyeballing Puzzles. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 69.2% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2’s performance. In summary, our findings demonstrate that video generation models are potential unified multimodal understanding and generation models, positioning "Thinking with Video" as a unified multimodal reasoning paradigm.
PaperID: 2101,   Poster  https://arxiv.org/pdf/2604.08543    
Authors: Mayur Deshmukh, Hiroyasu Akada, Helge Rhodin, Christian Theobalt, Vladislav Golyanik
Title: E-3DPSM: A State Machine for Event-based Egocentric 3D Human Pose Estimation
Abstract: Event cameras offer multiple advantages in monocular egocentric 3D human pose estimation from head-mounted devices, such as millisecond temporal resolution, high dynamic range, and negligible motion blur. Existing methods effectively leverage these properties, but suffer from low 3D estimation accuracy, which is insufficient for many applications (e.g., immersive VR/AR). This is due to the design not being fully tailored towards event streams (e.g., their asynchronous and continuous nature), leading to high sensitivity to self-occlusions and temporal jitter in the estimates. This paper rethinks the setting and introduces E-3DPSM, an event-driven continuous pose state machine for event-based egocentric 3D human pose estimation. E-3DPSM aligns continuous human motion with fine-grained event dynamics; it evolves latent states and predicts continuous changes in 3D joint positions associated with observed events, which are fused with direct 3D human pose predictions, leading to stable and drift-free final 3D pose reconstructions. E-3DPSM runs in real-time at 80 Hz on a single workstation and sets a new state of the art in experiments on two benchmarks, improving accuracy by up to 19% (MPJPE) and temporal stability by up to 2.7×. Our source code will be publicly released upon publication.
PaperID: 2102,   Poster  https://arxiv.org/pdf/2603.00550    
Authors: Yu Wang, Hongli Liu
Title: Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning
Abstract: Weakly supervised video anomaly detection (WS-VAD) involves identifying the temporal intervals that contain anomalous events in untrimmed videos, where only video-level annotations are provided as supervisory signals. However, a key limitation persists in WS-VAD: dense frame-level annotations are absent, which often leaves existing methods struggling to learn anomaly semantics effectively. To address this issue, we propose a novel framework named LAS-VAD, short for Learning Anomaly Semantics for WS-VAD, which integrates an anomaly-connected component mechanism and an intention awareness mechanism. The former is designed to assign video frames into distinct semantic groups within a video, and frame segments within the same group are deemed to share identical semantic information. The latter leverages an intention-aware strategy to distinguish between similar normal and abnormal behaviors (e.g., taking items and stealing). To further model the semantic information of anomalies, as anomaly occurrence is accompanied by distinct characteristic attributes (e.g., explosions are characterized by flames and thick smoke), we additionally incorporate anomaly attribute information to guide accurate detection. Extensive experiments on two benchmark datasets, XD-Violence and UCF-Crime, demonstrate that our LAS-VAD outperforms current state-of-the-art methods with remarkable gains.
PaperID: 2103,   Poster  https://arxiv.org/pdf/2603.25767    
Authors: Xuanru Zhou, Yiwen Shao, Wei-Cheng Tseng, Dong Yu
Title: Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods
Abstract: Current audio pre-training seeks to learn unified representations for broad audio understanding tasks, but it remains fragmented and is fundamentally bottlenecked by its reliance on weak, noisy, and scale-limited labels. Drawing lessons from vision's foundational pre-training blueprint, we argue that the audio field must first establish its own large-scale, strong supervision framework. We introduce a new data-centric pipeline that leverages a high-fidelity captioner to create SOTA-quality captions and the first Unified Tag System (UTS) that bridges speech, music, and environmental sounds. We then conduct a systematic comparative study of different pre-training objectives on these strong source data. Our experiments suggest that data quality and coverage are the primary drivers of performance, while the choice of objective dictates downstream task specialization.
PaperID: 2104,   Poster  https://arxiv.org/pdf/2510.03101    
Authors: Irene Tenison, Soumyajit Chatterjee, Fahim Kawsar, Mohammad Malekzadeh
Title: AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks
Abstract: To utilize pretrained neural networks on edge and mobile devices, we often require efficient adaptation to user-specific runtime data distributions while operating under limited compute and memory resources. On-device retraining with a target dataset can facilitate such adaptations; however, it remains impractical due to the increasing depth of modern neural nets, as well as the computational overhead associated with gradient-based optimization across all layers. Current approaches reduce training cost by selecting a subset of layers for retraining; however, they rely on labeled data, at least one full-model backpropagation, or server-side meta-training, limiting their suitability for constrained devices. We introduce AdaBet, a gradient-free layer selection approach to rank important layers, followed by important channels of these layers, by analyzing topological features of their activation spaces through Betti Numbers and using forward passes alone. AdaBet allows selecting layers and channels with high learning capacity, which are important for retraining and adaptation, without requiring labels or gradients. Evaluating AdaBet on sixteen pairs of benchmark models and datasets shows AdaBet achieves an average gain of 5% more classification accuracy over gradient-based baselines while reducing average peak memory consumption by 40%. We open-source our code at https://anonymous.4open.science/r/adabet-37CF/.
PaperID: 2105,   Poster  https://arxiv.org/pdf/2604.04372    
Authors: Songyuan Yang, Weijiang Yu, Ziyu Liu, Guijian Tang, Wenjing Yang, Huibin Tan, Nong Xiao
Title: Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
Abstract: When video reasoning requires external knowledge, many systems with large multimodal models (LMMs) adopt retrieval augmentation to supply the missing context. Appending textual or multi-clip evidence, however, forces heterogeneous signals into a single attention space. We observe diluted attention and higher cognitive load even on non-long videos. The bottleneck is not only what to retrieve but how to represent and fuse external knowledge with the video backbone. We present Graph-to-Frame RAG (G2F-RAG), a training-free and auditable paradigm that delivers knowledge in the visual space. In the offline stage, an agent builds a problem-agnostic video knowledge graph that integrates entities, events, spatial relations, and linked world knowledge. In the online stage, a hierarchical multi-agent controller decides whether external knowledge is needed, retrieves a minimal sufficient subgraph, and renders it as a single reasoning frame appended to the video. LMMs then perform joint reasoning in a unified visual domain. This design reduces cognitive load and leaves an explicit, inspectable evidence trail. G2F-RAG is plug-and-play across backbones and scales. It yields consistent gains on diverse public benchmarks, with larger improvements in knowledge-intensive settings. Ablations further confirm that knowledge representation and delivery matter. G2F-RAG reframes retrieval as visual-space knowledge fusion for robust and interpretable video reasoning.
PaperID: 2106,   Poster  https://arxiv.org/pdf/2601.13719    
Authors: Xinlei Yin, Xiulian Peng, Xiao Li, Zhiwei Xiong, Yan Lu
Title: Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search
Abstract: Long video understanding presents significant challenges for vision-language models due to extremely long context windows. Existing solutions, which rely on naive chunking strategies with retrieval-augmented generation, suffer from information fragmentation and a loss of global coherence. We propose a unified framework that achieves coherent and comprehensive understanding of long videos. Our approach overcomes the limitations of current solutions by combining audiovisual entity cohesion with hierarchical video indexing and agentic search. First, we preserve semantic consistency by integrating entity-level representations across visual and auditory streams, while organizing content into a structured hierarchy spanning global summary, scene, segment, and entity levels. Then we employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves good temporal coherence, entity consistency, and retrieval efficiency, establishing a new state of the art with an overall accuracy of 81.0% on LVBench. Notably, it delivers exceptional performance in the challenging reasoning category (79.6%) and achieves 86.7% in temporal grounding. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.
PaperID: 2107,   Poster  https://arxiv.org/pdf/2503.07516    
Authors: Weize Li, Yunhao Du, Qixiang Yin, Zhicheng Zhao, Fei Su
Title: Rethinking Two-Stage Referring-by-Tracking in Referring Multi-Object Tracking: Make it Strong Again
Abstract: Referring Multi-Object Tracking (RMOT) aims to track multiple objects specified by natural language expressions in videos. With the recent significant progress of one-stage methods, the two-stage Referring-by-Tracking (RBT) paradigm has gradually lost its popularity. However, its lower training cost and flexible incremental deployment remain irreplaceable. Rethinking existing two-stage RBT frameworks, we identify two fundamental limitations: overly heuristic feature construction and fragile correspondence modeling. To address these issues, we propose FlexHook, a novel two-stage RBT framework. In FlexHook, the proposed Conditioning Hook (C-Hook) redefines the feature construction by a sampling-based strategy and language-conditioned cue injection. Then, we introduce a Pairwise Correspondence Decoder (PCD) that replaces CLIP-based similarity matching with active correspondence modeling, yielding a more flexible and robust strategy. Extensive experiments on multiple benchmarks (Refer-KITTI/v2, Refer-Dance, and LaMOT) demonstrate that FlexHook becomes the first two-stage RBT approach to comprehensively outperform current state-of-the-art methods. Code can be found in the Supplementary Materials.
PaperID: 2108,   Poster  https://arxiv.org/pdf/2512.03963    
Authors: Tao Wu, Li Yang, Gen Zhan, Yabin ZHANG, Yiting Liao, Junlin Li, Deliang Fu, Li zhang, Limin Wang
Title: TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning
Abstract: Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis, enabling tasks such as temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has recently been explored for improving temporal reasoning, existing approaches are often confined to limited task types and data, restricting their generalization across diverse temporal understanding scenarios. To address this challenge, we present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs’ temporal comprehension. We curate a multi-task corpus that exposes the model to diverse temporal structures and semantics, and build upon the Group Relative Policy Optimization (GRPO) algorithm to achieve stable and effective cross-task optimization. Specifically, we categorize temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and design tailored localization rewards for each, enabling TempR1 to capture fine-grained temporal dependencies and adapt to different temporal patterns. Extensive experiments demonstrate that TempR1 attains state-of-the-art performance across multiple benchmarks. Moreover, its joint optimization over complementary tasks yields a strong synergistic effect, enhancing both generalization and single-task performance, establishing a scalable and principled paradigm for temporal reasoning in MLLMs.
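To make the idea of a localization reward over predicted and ground-truth intervals concrete, the sketch below uses temporal IoU, a common building block for such rewards. The abstract does not state the exact reward definitions or the three correspondence types, so both functions here (`temporal_iou`, `mean_best_iou_reward`) are illustrative assumptions and not TempR1's actual rewards.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def mean_best_iou_reward(preds, gts):
    """Hypothetical reward for several predictions against several instances.

    Each ground-truth interval is scored by its best-overlapping prediction,
    and the scores are averaged. This matching rule is an assumption.
    """
    if not preds or not gts:
        return 0.0
    return sum(max(temporal_iou(p, g) for p in preds) for g in gts) / len(gts)
```

A one-to-one task can call `temporal_iou` directly, whereas a task with multiple ground-truth instances could use an aggregate such as `mean_best_iou_reward`, which penalizes missed instances by averaging in their zero overlap.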
PaperID: 2109,   Poster  https://arxiv.org/pdf/2602.21992    
Authors: Zekai Lin, Xu Zheng
Title: PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning
Abstract: 360° panoramic images are increasingly used in VR, autonomous driving, and robotics for holistic scene understanding. However, current Vision–Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce PanoEnv, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, volume comparison) grounded in accurate 3D annotations: depth, segmentation, and bounding boxes. Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding, achieving only 49.34% overall and 8.36% on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided reward combining five geometry-aware strategies (e.g., distance tolerance, spatial consistency). A two-stage curriculum further mitigates catastrophic forgetting: Stage 1 trains on structured tasks (T/F, MCQ), and Stage 2 fine-tunes on mixed OE data for generalization. Our 7B model sets a new SoTA performance, improving total accuracy to 52.93% (+3.59%) and OE accuracy to 14.83% while maintaining structured-task performance. It also achieves top semantic scores (Q-Score 6.24, P-Score 5.95), surpassing 32B models. These results demonstrate that PanoEnv-QA and our curriculum-based RL framework effectively instill 3D spatial intelligence in VLMs for omnidirectional perception.
PaperID: 2110,   Poster  https://arxiv.org/pdf/2601.01618    
Authors: Huajie Tan, Peterson Co, Yijie Xu, Shanyu Rong, Yuheng Ji, Cheng Chi, Xiansheng Chen, Zhongxia Zhao, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Title: Action-Sketcher: From Reasoning to Action via Visual Sketches for Robotic Manipulation
Abstract: Long-horizon, open-world robotic manipulation is increasingly important for real-world deployment, requiring spatial disambiguation in complex layouts and temporal resilience under dynamic interaction. However, existing end-to-end and hierarchical Vision–Language–Action (VLA) policies often rely on text-only cues while keeping plan intent latent, which undermines referential grounding in cluttered or underspecified scenes, impedes effective task decomposition of long-horizon goals with closed-loop interaction, and limits causal explanation by obscuring the rationale behind action choices. To address these issues, we first introduce Visual Sketch, a lightweight visual intermediate that renders points, boxes, arrows, and typed relations on the robot’s current views to externalize spatial intent, bind language to scene geometry, and provide a human-verifiable bridge between high-level reasoning and low-level control. Building on Visual Sketch, we present Action-Sketcher, a VLA framework that operates in a cyclic See → Think → Sketch → Act workflow coordinated by an adaptive token-gated strategy for reasoning triggers, sketch revision, and action issuance, thereby supporting reactive corrections and human interaction while preserving real-time action prediction. To enable scalable training and evaluation, we curate a 2.3M-sample corpus with interleaved images, text, Visual Sketch supervision, and action sequences, and train Action-Sketcher with a multi-stage curriculum recipe that combines interleaved sequence alignment for modality unification, language-to-sketch consistency for precise linguistic grounding, and imitation learning augmented with sketch-to-action reinforcement for robustness. Experiments on cluttered tabletops and multi-object tasks, in simulation and on real robots, show improved long-horizon success, stronger robustness to dynamic scene changes, and enhanced interpretability via editable sketches and step-wise plans.
PaperID: 2111,   Poster  https://arxiv.org/pdf/2511.15258    
Authors: Yitong Yang, Yinglin Wang, Changshuo Wang, Yongjun Zhang, Ziyang Chen, Shuting He
Title: SplitFlux: Learning to Decouple Content and Style from a Single Image
Abstract: Disentangling image content and style is essential for customized image generation. Existing SDXL-based methods struggle to achieve high-quality results, while the recently proposed Flux model fails to achieve effective content–style separation due to its underexplored characteristics. To address these challenges, we conduct a systematic analysis of Flux and make two key observations: (1) single stream blocks are essential for image generation; and (2) early single stream blocks mainly control content, whereas later blocks govern style. Based on these insights, we propose SplitFlux, which disentangles content and style by fine-tuning the single stream blocks via LoRA, enabling the disentangled content to be re-embedded into new contexts. It includes two key components: (1) Rank-Constrained Adaptation. To preserve content identity and structure, we compress the rank and amplify the magnitude of updates within specific blocks, preventing content leakage into style blocks. (2) Visual-Gated LoRA. We split the content LoRA into two branches with different ranks, guided by image saliency. The high-rank branch preserves primary subject information, while the low-rank branch encodes residual details, mitigating content overfitting and enabling seamless re-embedding. Extensive experiments demonstrate that SplitFlux consistently outperforms state-of-the-art methods, achieving superior content preservation and stylization quality across diverse scenarios.
PaperID: 2112,   Poster  https://arxiv.org/pdf/2602.22666    
Authors: Xuelu Li, Zhaonan Wang, Xiaogang Wang, Lei Wu, Manyi Li, Changhe Tu
Title: ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals
Abstract: Reconstructing articulated objects into high-fidelity digital twins is crucial for applications such as robotic manipulation and interactive simulation. Recent self-supervised methods using differentiable rendering frameworks like 3D Gaussian Splatting remain highly sensitive to the initial part segmentation. Their reliance on heuristic clustering or pre-trained models often causes optimization to converge to local minima, especially for complex multi-part objects. To address these limitations, we propose ArtPro, a novel self-supervised framework that introduces adaptive integration of mobility proposals. Our approach begins with an over-segmentation initialization guided by geometry features and motion priors, generating part proposals with plausible motion hypotheses. During optimization, we dynamically merge these proposals by analyzing motion consistency among spatial neighbors, while a collision-aware motion pruning mechanism prevents erroneous kinematic estimation. Extensive experiments on both synthetic and real-world objects demonstrate that ArtPro achieves robust reconstruction of complex multi-part objects, significantly outperforming existing methods in accuracy and stability.
PaperID: 2113,   Poster  https://arxiv.org/pdf/2512.14180    
Authors: Francesco Di Sario, Daniel Rebain, Dor Verbin, Marco Grangetto, Andrea Tagliasacchi
Title: Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere
Abstract: Radiance field methods (e.g., 3D Gaussian Splatting) have emerged as a powerful paradigm for novel view synthesis, yet their appearance modeling often relies on Spherical Harmonics (SH), which impose fundamental limitations. SH struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and critically fail to capture specular reflections, a key component of realistic rendering. While alternatives like Spherical Gaussians offer improvements, they introduce significant optimization complexity. We propose Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting. SV partitions the directional domain into learnable regions with smooth boundaries, providing an intuitive and stable parameterization for view-dependent effects. For diffuse appearance, SV achieves competitive results while maintaining simpler optimization compared to existing alternatives. For reflections, where SH fundamentally fail, we leverage SV as learnable reflection probes, taking reflected directions as input following principles from traditional graphics. This formulation achieves state-of-the-art results across both synthetic and real-world datasets, demonstrating that SV offers a principled, efficient, and general solution for appearance modeling in explicit 3D representations.
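One way to picture a partition of the sphere into learnable regions with smooth boundaries is a softmax over the dot products between a view direction and a set of site directions: the nearest site on the sphere has the largest dot product, and a finite temperature blends neighbouring regions. The sketch below is only an illustration of that idea. The softmax form, the `temperature` parameter, and the function names are assumptions, not the formulation used in the paper.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def soft_voronoi_color(direction, sites, colors, temperature=20.0):
    """Blend per-site RGB colors with soft Voronoi weights on the sphere.

    sites:  list of 3D site directions (would be learnable in practice)
    colors: list of RGB triples, one per site
    A higher temperature gives sharper region boundaries (assumed form).
    """
    d = normalize(direction)
    logits = [
        temperature * sum(a * b for a, b in zip(d, normalize(s))) for s in sites
    ]
    m = max(logits)                              # subtract max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    weights = [w / z for w in weights]
    return [sum(w * c[ch] for w, c in zip(weights, colors)) for ch in range(3)]
```

With two opposite sites, a direction aligned with one site returns essentially that site's color, and a direction on the equator between them returns an even blend. For reflections, the same lookup could be queried with the reflected direction instead of the view direction, as the abstract describes.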
PaperID: 2114,   Poster  https://arxiv.org/pdf/2604.15678    
Authors: Eunju Lee, MiHyeon Kim, Junehyoung Kwon, Yoonji Lee, JiHyun Kim, Soojin Jang, YoungBin Kim
Title: HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning
Abstract: Pretrained Vision–Language Models (VLMs) like CLIP show promise in continual learning, but existing Few-Shot Class-Incremental Learning (FSCIL) methods assume homogeneous domains and balanced data distributions, limiting real-world applicability where data arises from heterogeneous disciplines with imbalanced sample availability and varying visual complexity. We identify Domain Gravity, a representational asymmetry where data imbalance across heterogeneous domains causes overrepresented or low-entropy domains to disproportionately influence the embedding space, leading to prototype drift and degraded performance on underrepresented or high-entropy domains. To address this, we introduce Cross-Discipline Variable Few-Shot Class-Incremental Learning (XD-VSCIL), a benchmark capturing real-world heterogeneity and imbalance where Domain Gravity naturally intensifies. We propose Hybrid Prototype Calibration (HyCal), a training-free method combining cosine similarity and Mahalanobis distance to capture complementary geometric properties (directional alignment and covariance-aware magnitude), yielding stable prototypes under imbalanced heterogeneous conditions. Operating on frozen CLIP embeddings, HyCal achieves consistent retention–adaptation improvements while maintaining efficiency. Experiments show HyCal effectively mitigates Domain Gravity and outperforms existing methods in imbalanced cross-domain incremental learning.
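The complementarity of cosine similarity and Mahalanobis distance can be seen in a small prototype classifier. The sketch below is a simplified illustration, not HyCal itself: the diagonal covariance, the linear combination with weight `lam`, and all function names are assumptions introduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity: directional alignment, ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def diag_mahalanobis(x, mean, var, eps=1e-6):
    """Mahalanobis distance under an assumed diagonal covariance."""
    return math.sqrt(sum((xi - mi) ** 2 / (vi + eps)
                         for xi, mi, vi in zip(x, mean, var)))


def hybrid_score(x, proto, var, lam=0.5):
    """Hypothetical hybrid score: reward alignment, penalize scaled distance."""
    return lam * cosine(x, proto) - (1.0 - lam) * diag_mahalanobis(x, proto, var)


def classify(x, prototypes, variances, lam=0.5):
    """Return the class label whose prototype has the highest hybrid score."""
    return max(prototypes,
               key=lambda c: hybrid_score(x, prototypes[c], variances[c], lam))
```

When two prototypes point in the same direction, cosine similarity alone cannot separate them, and the distance term decides. When prototypes differ in direction, the cosine term contributes as well. In practice the embeddings would be frozen CLIP features and the covariance would be estimated from the few available samples.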
PaperID: 2115,   Poster  https://arxiv.org/pdf/2602.04268    
Authors: Siyu Jiang, Feiyang Chen, Xiaojin Zhang, Kun He
Title: KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing
Abstract: Despite the significant progress of Multimodal Large Language Models (MLLMs) across diverse tasks, hallucination, which corresponds to the generation of visually inconsistent objects, attributes, or relations, remains a major obstacle to their reliable deployment. Unlike pure language models, MLLMs must ground their generation process in visual inputs; however, existing models often suffer from semantic drift during decoding, causing outputs to diverge from visual facts as the sequence length increases. To address this, we propose KVSmooth, a training-free, plug-and-play method that mitigates hallucination by performing attention-entropy-guided adaptive smoothing on hidden states. Specifically, KVSmooth applies an exponential moving average (EMA) to both keys and values in the KV-Cache while dynamically quantifying the sink degree of each token through its attention distribution entropy to adaptively adjust the smoothing strength. Unlike computationally expensive retraining or contrastive decoding methods, KVSmooth operates efficiently during inference without additional training or model modification. Extensive experiments demonstrate that KVSmooth significantly reduces hallucination (CHAIR_S from 41.8 to 18.2) while improving overall performance (F1 score from 77.5 to 79.2), achieving higher precision and recall simultaneously, whereas prior methods often sacrifice one for the other, thereby validating the effectiveness and generality of our method.
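The mechanism described here, an EMA over cached keys and values whose strength depends on attention entropy, can be sketched in a few lines. The abstract does not say how entropy maps to smoothing strength, so the rule below (strength proportional to normalized entropy, capped by `max_strength`) is purely an assumption, as are all function names. It should be read as a toy illustration rather than the KVSmooth algorithm.

```python
import math


def normalized_entropy(attn):
    """Shannon entropy of an attention distribution, scaled to [0, 1]."""
    n = len(attn)
    if n <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in attn if p > 0)
    return h / math.log(n)


def ema_update(prev, current, strength):
    """Blend the previous smoothed vector into the current one."""
    return [strength * p + (1.0 - strength) * c for p, c in zip(prev, current)]


def smooth_kv(keys, values, attn_rows, max_strength=0.5):
    """Apply an entropy-scaled EMA along the token axis of a toy KV cache.

    keys, values: lists of per-token vectors
    attn_rows:    attn_rows[t] is token t's attention distribution
    Assumed rule: higher entropy at step t gives stronger smoothing.
    """
    sk, sv = [list(keys[0])], [list(values[0])]
    for t in range(1, len(keys)):
        s = max_strength * normalized_entropy(attn_rows[t])
        sk.append(ema_update(sk[-1], keys[t], s))
        sv.append(ema_update(sv[-1], values[t], s))
    return sk, sv
```

Because the update touches only cached tensors at inference time, a scheme of this shape needs no retraining, which matches the training-free, plug-and-play property claimed in the abstract.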
PaperID: 2116,   Poster  https://arxiv.org/pdf/2503.06100    
Authors: Xianjie Liu, Keren Fu, Qijun Zhao
Title: High-Precision Dichotomous Image Segmentation via Depth Integrity-Prior and Fine-Grained Patch Strategy
Abstract: High-precision dichotomous image segmentation (DIS) is the task of extracting fine-grained objects from high-resolution images. Existing methods trade efficiency for accuracy: non-diffusion methods are fast but suffer from weak semantics and unstable spatial priors, causing false detections; diffusion-based methods offer high accuracy via strong generative priors but are computationally expensive. In depth maps, a complete object appears as a low-variance region with a smooth interior and sharp boundaries, whereas the background exhibits a chaotic, high-variance pattern due to disconnected surfaces at varying depths. We refer to this as the depth integrity-prior. Inspired by this, and noting that DIS currently lacks depth maps, we leverage pseudo-depth information from monocular depth estimation models to obtain essential semantic understanding, thereby rapidly revealing spatial differences across target objects and the background. To exploit this prior, we propose the Prior-guided Depth Fusion Network (PDFNet), which fuses RGB and pseudo-depth features for depth-aware structure perception. We further introduce a novel depth integrity-prior loss to enforce depth consistency in segmentation and a fine-grained enhancement module with adaptive patch selection to sharpen boundaries. Notably, PDFNet with DAM-v2 achieves SOTA (F_β^max of 0.915 on DIS-VD and 0.915 on DIS-TE) using less than half the parameters of diffusion-based methods. Code is provided in the supplementary material.
PaperID: 2117,   Poster  https://arxiv.org/pdf/2602.18845    
Authors: Chengwei Xia, Fan Ma, Ruijie Quan, Yunqiu Xu, Kun Zhan, Yi Yang
Title: Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs
Abstract: With the rapid deployment and widespread adoption of multimodal large language models (MLLMs), disputes regarding model version attribution and ownership have become increasingly frequent, raising significant concerns about intellectual property protection. In this paper, we propose a framework for generating copyright triggers for MLLMs, enabling model publishers to embed verifiable ownership information into the model. The goal is to construct trigger images that elicit ownership-related textual responses exclusively in fine-tuned derivatives of the original model, while remaining inert in other non-derivative models. Our method constructs a tracking trigger image by treating the image as a learnable tensor, performing adversarial optimization with dual injection of ownership-relevant semantic information. The first injection is achieved by enforcing textual consistency between the output of an auxiliary MLLM and a predefined ownership-relevant target text; the consistency loss is backpropagated to inject this ownership-related information into the image. The second injection is performed at the semantic level by minimizing the distance between the CLIP features of the image and those of the target text. Furthermore, we introduce an additional adversarial training stage involving the auxiliary model derived from the original model itself. This model is trained to resist generating ownership-relevant target text, thereby enhancing robustness in heavily fine-tuned derivative models. Extensive experiments demonstrate the effectiveness of our dual-injection approach in tracking model lineage under various fine-tuning and domain-shift scenarios.
PaperID: 2118,   Poster  https://arxiv.org/pdf/2511.20295    
Authors: Chao Wang, Chengan Che, Xinyue Chen, Sophia Tsoka, Luis Carlos Garcia Peraza Herrera
Title: Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations
Abstract: Counterfactual explanations (CFEs) are minimal and semantically meaningful modifications of the input of a model that alter the model predictions. They highlight the decisive features the model relies on, providing contrastive interpretations for classifiers. State-of-the-art visual counterfactual explanation methods are designed to explain image classifiers. The generation of CFEs for video classifiers remains largely underexplored. For the counterfactual videos to be useful, they have to be physically plausible, temporally coherent, and exhibit smooth motion trajectories. Existing image-based CFE methods, designed to explain image classifiers, lack the capacity to generate temporally coherent, smooth and physically plausible video CFEs. To address this, we propose Back To The Feature (BTTF), an optimization framework that generates video CFEs. Our method introduces two novel features: 1) an optimization scheme to retrieve the initial latent noise conditioned by the first frame of the input video, and 2) a two-stage optimization strategy to enable the search for counterfactual videos in the vicinity of the input video. Both optimization processes are guided solely by the target classifier, ensuring the explanation is faithful. To accelerate convergence, we also introduce a progressive optimization strategy that incrementally increases the number of denoising steps. Extensive experiments on video datasets such as Shape-Moving (motion classification), MEAD (emotion classification), and NTU RGB+D (action classification) show that our BTTF effectively generates valid, visually similar and realistic counterfactual videos that provide concrete insights into the classifier's decision-making mechanism.
PaperID: 2119,   Poster  https://arxiv.org/pdf/2604.04500    
Authors: Shizhan Gong, Minda Hu, Qiyuan Zhang, Chen Ma, Qi Dou
Title: Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Abstract: Vision-language models (VLMs) have achieved remarkable success across diverse tasks. However, concerns about their trustworthiness persist, particularly regarding tendencies to lean more on textual cues than visual evidence and the risk of producing ungrounded or fabricated responses. To address these issues, we propose Saliency-R1, a framework for improving the interpretability and faithfulness of VLM reasoning. Specifically, we introduce a novel saliency map technique that efficiently highlights critical image regions contributing to generated tokens without additional computational overhead. This can further be extended to trace how visual information flows through the reasoning process to the final answers, revealing the alignment between the thinking process and the visual context. We use the overlap between the saliency maps and human-annotated bounding boxes as the reward function, and apply Group Relative Policy Optimization (GRPO) to align the salient parts and critical regions, encouraging models to focus on relevant areas when conducting reasoning. Experiments show Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance. The dataset, code and pretrained models of this paper will be released.
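One plausible way to score the overlap between a saliency map and annotated bounding boxes is the fraction of saliency mass that falls inside the boxes. The abstract does not specify the overlap measure, so the definition below, the function name, and the box convention are assumptions for illustration only.

```python
def saliency_box_reward(saliency, boxes):
    """Fraction of total saliency mass lying inside any bounding box.

    saliency: 2D list of non-negative floats, indexed [row][col].
    boxes: list of (x0, y0, x1, y1) pixel boxes, x1 and y1 exclusive
           (a hypothetical convention chosen for this sketch).
    """
    total = sum(sum(row) for row in saliency)
    if total <= 0:
        return 0.0  # no saliency at all: no reward
    inside = 0.0
    for y, row in enumerate(saliency):
        for x, value in enumerate(row):
            if any(x0 <= x < x1 and y0 <= y < y1 for x0, y0, x1, y1 in boxes):
                inside += value
    return inside / total
```

A reward of this shape lies in [0, 1] and grows as the salient pixels concentrate inside the annotated regions, which is the behavior a policy-gradient method such as GRPO would then reinforce.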
PaperID: 2120,   Poster  https://arxiv.org/pdf/2604.19432    
Authors: Xinwei He, Yansong Zheng, Qianru Han, Zhichuan Wang, Yuxuan Cai, Yang Zhou, Jingbo Xia, Yulong Wang, Jinhai Xiang, Xiang Bai
Title: DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval
Abstract: Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images. Leveraging semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP's strong generalization ability, its lack of fine-grainedness prompted us to explore the potential of a more recent self-supervised encoder, DINO. To this end, we propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet, further adaptation causes severe overfitting on average view patterns of known classes. To combat this, we design the Chunking and Adapting Module (CAM). It segments multi-view images into chunks and dynamically integrates local view relations, yielding more robust features than the standard pooling strategy. Finally, we propose a Virtual Feature Synthesis (VFS) module to mitigate bias towards known categories explicitly. Under the hood, VFS leverages CLIP's broad, pre-aligned vision-language space to synthesize virtual features for unseen classes. By exposing DEC to these virtual features, we greatly enhance its open-set discrimination capacity. Extensive experiments on standard open-set 3DOR benchmarks demonstrate its superior efficacy.
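The mean-pooling baseline mentioned in the abstract can be sketched as follows: average the per-view features of each object into one descriptor, then rank gallery objects by cosine similarity. The feature vectors here stand in for frozen-backbone outputs; the function names and the dictionary-based gallery are illustrative assumptions, not the paper's code.

```python
import math

def mean_pool(view_features):
    """Average a list of per-view feature vectors into one descriptor."""
    n = len(view_features)
    dim = len(view_features[0])
    return [sum(v[i] for v in view_features) / n for i in range(dim)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_views, gallery):
    """Rank gallery object names by similarity to the pooled query.
    gallery: dict mapping object name -> list of per-view feature vectors."""
    q = mean_pool(query_views)
    scored = [(cosine(q, mean_pool(views)), name) for name, views in gallery.items()]
    return [name for _, name in sorted(scored, reverse=True)]
```

Because every view contributes equally to the average, this baseline has no way to weight informative views, which is the limitation that chunk-wise, dynamic integration is described as addressing.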
PaperID: 2121,   Poster  https://arxiv.org/pdf/2602.18887    
Authors: Jungho Kim, Jiyong Oh, Seunghoon Yu, Hongjae Shin, Donghyuk Kwak, Jun Won Choi
Title: SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World
Abstract: The end-to-end (E2E) paradigm, which maps sensor inputs directly to driving decisions, has recently attracted significant attention due to its unified modeling capability and scalability. However, ensuring safety in this unified framework remains one of the most critical challenges. In this work, we propose SafeDrive, an E2E planning framework designed to perform explicit and interpretable safety reasoning through a trajectory-conditioned Sparse World Model. SafeDrive comprises two complementary networks: the Sparse World Network (SWNet) and the Fine-grained Reasoning Network (FRNet). SWNet constructs trajectory-conditioned sparse worlds that simulate the future behaviors of critical dynamic agents and road entities, providing interaction-centric representations for downstream reasoning. FRNet then evaluates agent-specific collision risks and temporal adherence to drivable regions, enabling precise identification of safety-critical events across future timesteps. SafeDrive achieves state-of-the-art performance on both open-loop and closed-loop benchmarks. On NAVSIM, it records a PDMS of 91.6 and an EPDMS of 87.5, with only 61 collisions out of 12,146 scenarios (0.5%). On Bench2Drive, SafeDrive attains a 66.6% driving score.
PaperID: 2122,   Poster  https://arxiv.org/pdf/2603.01099    
Authors: Jiashu Li, Xumeng Han, Zhaoyang Wei, Zipeng Wang, Kuiran Wang, Guorong Li, Zhenjun Han, Jianbin Jiao
Title: HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views
Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a promising approach in novel view synthesis, combining photorealistic rendering with real-time efficiency. However, its success heavily relies on dense camera coverage; under sparse-view conditions, insufficient supervision leads to irregular Gaussian distributions, characterized by globally sparse coverage, blurred background, and distorted high-frequency areas. To address this, we propose HeroGS (Hierarchical Guidance for Robust 3D Gaussian Splatting), a unified framework that establishes hierarchical guidance across the image, feature, and parameter levels. At the image level, sparse supervision is converted into pseudo-dense guidance, globally regularizing the Gaussian distributions and forming a consistent foundation for subsequent optimization. Building upon this, Feature-Adaptive Densification and Pruning (FADP) at the feature level leverages low-level features to refine high-frequency details and adaptively densifies Gaussians in background regions. The optimized distributions then support Co-Pruned Geometry Consistency (CPG) at the parameter level, which guides geometric consistency through parameter freezing and co-pruning, effectively removing inconsistent splats. The hierarchical guidance strategy effectively constrains and optimizes the overall Gaussian distributions, thereby enhancing both structural fidelity and rendering quality. Extensive experiments demonstrate that HeroGS achieves high-fidelity reconstructions and consistently surpasses state-of-the-art baselines under sparse-view conditions.
PaperID: 2123,   Poster  https://arxiv.org/pdf/2604.12537    
Authors: Ruoxiang Huang, Zhen Yuan
Title: MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in multimodal understanding, yet their positional encoding mechanisms remain fundamentally limited. Current approaches apply uniform positional indices across all tokens, failing to account for dramatic variations in information density between and within modalities. This uniform treatment leads to suboptimal attention allocation and inefficient cross-modal fusion. We introduce MODIX (Multimodal Information-Driven Positional Index Scaling), a training-free framework that dynamically adapts positional granularity based on information-theoretic analysis of modality contributions. By jointly quantifying intrinsic information density within each modality and cross-modal interaction strength, MODIX assigns finer positional strides to information-rich content and coarser strides to redundant regions. Operating purely at inference time, our method requires no architectural modifications or retraining, enabling plug-and-play integration with existing VLMs. Comprehensive experiments across multiple state-of-the-art architectures and six benchmarks demonstrate that MODIX consistently improves multimodal reasoning, achieving up to 8.4% gains on ScienceQA and 6.8% on RealWorldQA, while dynamically adapting positional resolution to task-specific information distributions.
PaperID: 2124,   Poster  https://arxiv.org/pdf/2603.19610    
Authors: Quan Kong, Yuhao Shen, Yicheng Ji, Huan Li, Cong Wang
Title: ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding
Abstract: Although current Video-LLMs achieve impressive performance in video understanding tasks, their autoregressive decoding efficiency remains constrained by the massive number of video tokens. Visual token pruning can partially ease this bottleneck, yet existing approaches still suffer from information loss and yield only modest acceleration in decoding. In this paper, we propose ParallelVLM, a training-free draft-then-verify speculative decoding framework that overcomes both mutual waiting and limited speedup-ratio problems between draft and target models in long-video settings. ParallelVLM features two parallelized stages that maximize hardware utilization and incorporates an Unbiased Verifier-Guided Pruning strategy to better align the draft and target models by eliminating the positional bias in attention-guided pruning. Extensive experiments demonstrate that ParallelVLM effectively expands the draft window by 1.6× to 1.8× with high accepted lengths, and accelerates various video understanding benchmarks by 3.36× on LLaVA-Onevision-72B and 2.42× on Qwen2.5-VL-32B compared with vanilla autoregressive decoding.
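For readers unfamiliar with draft-then-verify decoding, the generic greedy variant can be sketched as below. This is the textbook sequential scheme, not ParallelVLM's parallelized pipeline or its pruning strategy; `draft_next` and `target_next` are hypothetical callables that return the next token for a given context.

```python
def speculative_step(prefix, draft_next, target_next, window):
    """One greedy draft-then-verify step.

    The draft model proposes `window` tokens; the target model checks them
    in order. The accepted output is always exactly what the target model
    would have produced on its own, which is why the scheme is lossless.
    """
    # Drafting: the cheap model proposes tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(window):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Verification: a real system scores all drafted positions in one
    # batched target forward pass; a loop is used here for clarity.
    accepted, ctx = [], list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace first mismatch, then stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all accepted: one bonus token
    return accepted
```

The number of tokens accepted per step (the accepted length) determines the speedup, which is why better alignment between draft and target models matters.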
PaperID: 2125,   Poster  https://arxiv.org/pdf/2603.23711    
Authors: Morui Zhu, Yongqi Zhu, Song Fu, Qing Yang
Title: Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks
Abstract: Autonomous trucking poses unique challenges due to articulated tractor-trailer geometry and time-varying sensor poses caused by the fifth-wheel joint and trailer flex. Existing perception and calibration methods assume static baselines or rely on high-parallax and texture-rich scenes, limiting their reliability under real-world settings. We propose dCAP (dynamic Calibration and Articulated Perception), a vision-based framework that continuously estimates the 6-DoF (degree of freedom) relative pose between tractor and trailer cameras. dCAP employs a transformer with cross-view and temporal attention to robustly aggregate spatial cues while maintaining temporal consistency, enabling accurate perception under rapid articulation and occlusion. Integrated with BEVFormer, dCAP improves 3D object detection by replacing static calibration with dynamically predicted extrinsics. To facilitate evaluation, we introduce STT4AT, a CARLA-based benchmark simulating semi-trailer trucks with synchronized multi-sensor suites and time-varying inter-rig geometry across diverse environments. Experiments demonstrate that dCAP achieves stable, accurate perception while addressing the limitations of static calibration in autonomous trucking. The dataset, development kit, and source code will be publicly released.
PaperID: 2126,   Poster  https://arxiv.org/pdf/2603.27665    
Authors: Minh-Tuan Tran, Xuan-May Le, Quan Hung Tran, Mehrtash Harandi, Dinh Phung, Trung Le
Title: Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling
Abstract: Existing generative models, such as diffusion and autoregressive networks, are inherently static, relying on a fixed set of pretrained parameters to handle all inputs. In contrast, humans flexibly adapt their internal generative representations to each perceptual or imaginative context. Inspired by this capability, we introduce Composer, a new paradigm for adaptive generative modeling based on test-time instance-specific parameter composition. Composer generates input-conditioned parameter adaptations at inference time, which are injected into the pretrained model’s weights, enabling per-input specialization without fine-tuning or retraining. Adaptation occurs once prior to multi-step generation, yielding higher-quality, context-aware outputs with minimal computational and memory overhead. Experiments show that Composer substantially improves performance across diverse generative models and use cases, including lightweight/quantized models and test-time scaling. By leveraging input-aware parameter composition, Composer establishes a new paradigm for designing generative models that dynamically adapt to each input, moving beyond static parameterization. The code will be available at https://anonymous.4open.science/r/Composer-IPC.
PaperID: 2127,   Poster  https://arxiv.org/pdf/2603.24030    
Authors: Sa Zhu, Wanqian Zhang, Lin Wang, Xiaohua Chen, Chenxu Cui, Jinchao Zhang, Bo Li
Title: Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection
Abstract: Open-Vocabulary Temporal Action Detection (OV-TAD) aims to classify and localize action segments in untrimmed videos for unseen categories. Previous methods rely solely on global alignment between label-level semantics and visual features, which is insufficient to transfer temporally consistent visual knowledge from seen to unseen classes. To address this, we propose a Phase-wise Decomposition and Alignment (PDA) framework, which enables fine-grained action pattern learning for effective prior knowledge transfer. Specifically, we first introduce the CoT-Prompting Semantic Decomposition (CSD) module, which leverages the chain-of-thought (CoT) reasoning ability of large language models to automatically decompose action labels into coherent phase-level descriptions, emulating human cognitive processes. Then, the Text-infused Foreground Filtering (TIF) module is introduced to adaptively filter action-relevant segments for each phase by leveraging phase-wise semantic cues, producing semantically aligned visual representations. Furthermore, we propose the Adaptive Phase-wise Alignment (APA) module to perform phase-level visual-textual matching and adaptively aggregate alignment results across phases for the final prediction. This adaptive phase-wise alignment facilitates the capture of transferable action patterns and significantly enhances generalization to unseen actions. Extensive experiments on two OV-TAD benchmarks demonstrate the superiority of the proposed method.
PaperID: 2128,   Poster  https://arxiv.org/pdf/2603.26071    
Authors: Kyungwon Kim, Dosik Hwang
Title: MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality
Abstract: Accurate survival prediction from multimodal medical data is essential for precision oncology, yet clinical deployment faces a persistent challenge: modalities are frequently incomplete due to cost constraints, technical limitations, or retrospective data availability. While recent methods attempt to address missing modalities through feature alignment or joint distribution learning, they fundamentally lack explicit modeling of the unique contributions of each modality as opposed to the information derivable from other modalities. We propose a novel framework that explicitly decomposes each modality's representation into modality-specific and cross-modal contextualized components through algebraic constraints in a learned low-rank shared subspace. This decomposition enables us to identify precisely what information is missing when a modality is absent. For the truly modality-specific information that cannot be inferred from available modalities, we employ conditional latent diffusion models to generate high-quality representations conditioned on recovered shared information and learned structural priors. Extensive experiments on five TCGA cancer datasets demonstrate that the proposed method achieves state-of-the-art performance with complete data while maintaining robust predictions in both missing pathology and missing genomics conditions.
PaperID: 2129,   Poster  https://arxiv.org/pdf/2508.13309    
Authors: Abdullah Al Nomaan Nafi, Habibur Rahaman, Zafaryab Haider, Tanzim Mahfuz, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty
Title: DASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples
Abstract: Numerous techniques have been proposed for generating adversarial examples under strict $\ell_p$-norm constraints. However, such norm-bounded examples often fail to align well with human perception, and only recently have a few methods begun specifically exploring perceptually aligned adversarial examples. Moreover, it remains unclear whether insights from $\ell_p$-constrained attacks can be effectively leveraged to improve perceptual efficacy. In this paper, we introduce DASH, a differentiable meta-attack framework that generates effective and perceptually aligned adversarial examples by strategically composing existing $\ell_p$-based attack methods. DASH operates in a multi-stage fashion: at each stage, it aggregates candidate adversarial examples from multiple base attacks using learned, adaptive weights and propagates the result to the next stage. A novel meta-loss function guides this process by jointly minimizing misclassification loss and perceptual distortion, enabling the framework to dynamically modulate the contribution of each base attack throughout the stages. We evaluate DASH on adversarially trained robust models across CIFAR-10, CIFAR-100, and ImageNet while considering visual perception metrics (e.g., SSIM, FID, LPIPS) in the perturbation budget (instead of the $\ell_p$-norm). Despite relying solely on $\ell_p$-constrained base methods, DASH significantly outperforms state-of-the-art perceptual attacks such as AdvAD, achieving higher attack success rates (e.g., 20.63% improvement) and superior visual quality, as measured by SSIM, LPIPS, and FID (improvements of $\approx 11$, 0.015, and 5.7, respectively). DASH generalizes well to unseen defenses and different white-box/black-box scenarios, making it a practical and strong baseline for evaluating robustness.
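The multi-stage aggregation described above can be sketched as a weighted combination of candidates that is fed forward stage by stage. The softmax parameterization of the weights, the function names, and the flat-list image representation are assumptions for illustration; the abstract does not state how the weights are parameterized, and the meta-loss that would train them is omitted here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def aggregate(candidates, logits):
    """Convex combination of candidate examples (flat lists of pixels),
    weighted by softmax(logits). Softmax is an assumed parameterization."""
    w = softmax(logits)
    dim = len(candidates[0])
    return [sum(wk * c[i] for wk, c in zip(w, candidates)) for i in range(dim)]

def run_stages(x, base_attacks, stage_logits):
    """At each stage, every base attack perturbs the current example and
    the weighted aggregate is propagated to the next stage."""
    for logits in stage_logits:
        candidates = [attack(x) for attack in base_attacks]
        x = aggregate(candidates, logits)
    return x
```

In a differentiable setting the per-stage logits would be optimized against a loss combining misclassification and perceptual distortion, so that the weights shift toward whichever base attack is most useful at each stage.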
PaperID: 2130,   Poster  https://arxiv.org/pdf/2603.21085    
Authors: Qifan Li, Xingyu Zhou, Jinhua Zhang, Weiyi You, Shuhang Gu
Title: Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models
Abstract: Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we observe that another critical factor, robustness to diffusion sampling perturbations, also plays a crucial role in determining generation quality. Through empirical and theoretical analyses, we show that the commonly used $\beta$-VAE-based tokenizers in latent diffusion models tend to produce overly compact latent manifolds that are highly sensitive to stochastic perturbations during diffusion sampling, leading to visual degradation. To address this issue, we propose a simple yet effective solution that constructs a latent space robust to sampling perturbations while maintaining strong reconstruction fidelity. This is achieved by introducing a Variance Expansion loss that counteracts variance collapse and leverages the adversarial interplay between reconstruction and variance expansion to achieve an adaptive balance that preserves reconstruction accuracy while improving robustness to stochastic sampling. Extensive experiments demonstrate that our approach consistently enhances generation quality across different latent diffusion architectures, confirming that robustness in latent space is a key missing ingredient for stable and faithful diffusion sampling.
PaperID: 2131,   Poster  https://arxiv.org/pdf/2602.18792    
Authors: Changlu Guo, Anders Nymark Christensen, Anders Dahl, Morten Hannemose
Title: MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations
Abstract: Visual counterfactual explanations aim to reveal the minimal semantic modifications that can alter a model’s prediction, providing causal and interpretable insights into deep neural networks. However, existing diffusion-based counterfactual generation methods are often computationally expensive, slow to sample, and imprecise in localizing the modified regions. To address these limitations, we propose MaskDiME, a simple, fast, yet effective diffusion framework that unifies semantic consistency and spatial precision through localized sampling. Our approach adaptively focuses on decision-relevant regions to achieve localized and semantically consistent counterfactual generation while preserving high image fidelity. Our training-free framework, MaskDiME, performs inference over 30× faster than the baseline and achieves comparable or state-of-the-art performance across five benchmark datasets spanning diverse visual domains, establishing a practical and generalizable solution for efficient counterfactual explanation. Our code will be publicly available upon acceptance.
PaperID: 2132,   Poster  https://arxiv.org/pdf/2508.13305    
Authors: Minhao Xiong, Zichen Wen, Zhuangcheng Gu, Xuyang Liu, Rui Zhang, Hengrui Kang, Jiabing Yang, Junyuan Zhang, Weijia Li, Conghui He, Linfeng Zhang
Title: Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving
Abstract: Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), offering a unified framework for perception, reasoning, and decision-making by jointly modeling visual inputs and natural language instructions. However, their real-world deployment is hindered by the significant computational overhead incurred when processing high-resolution, multi-view images, a standard setup in AD systems that utilize six or even more synchronized cameras to perceive the environment comprehensively. This overhead stems from the large number of visual tokens generated during encoding, which significantly increases inference latency and memory consumption when passed to large language models, owing to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework specifically designed for multi-view VLMs in autonomous driving. Prune2Drive introduces two core innovations: (i) a diversity-aware token selection mechanism inspired by farthest point sampling, which prioritizes semantic and spatial coverage across views rather than relying solely on attention scores; and (ii) a view-adaptive pruning controller that automatically learns optimal pruning ratios for each camera view based on their importance to downstream driving tasks. Unlike prior methods, Prune2Drive does not require model retraining or access to attention maps, making it compatible with modern efficient attention implementations. Extensive experiments on two large-scale multi-view driving benchmarks, DriveLM and DriveLMM-o1, demonstrate that Prune2Drive achieves significant speedups and memory savings while maintaining, and in some cases improving, task performance. Our results establish Prune2Drive as a practical and generalizable solution for efficient vision-language reasoning in autonomous driving.
When retaining only 10% of the visual tokens, our method achieves a 6.40× speedup in the prefilling phase and consumes 13.4% of the original FLOPs, with only a 3% average performance drop compared to the original model on the DriveLM benchmark.
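Classic farthest point sampling, which the abstract names as the inspiration for the token selection mechanism, can be sketched as below. This is the standard greedy algorithm applied to token feature vectors, not Prune2Drive's actual selector; starting from the first token and using squared Euclidean distance are assumptions of this sketch.

```python
def farthest_point_sampling(tokens, keep):
    """Greedy farthest point sampling over token feature vectors.

    Repeatedly picks the token whose distance to the already-selected set
    is largest, so the kept tokens spread out over feature space.
    Returns the kept indices in ascending order.
    """
    n = len(tokens)
    if keep <= 0 or n == 0:
        return []
    keep = min(keep, n)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    chosen = {0}  # assumption: seed with the first token
    min_d = [sqdist(t, tokens[0]) for t in tokens]
    while len(chosen) < keep:
        # Farthest remaining token from the selected set.
        nxt = max((i for i in range(n) if i not in chosen), key=lambda i: min_d[i])
        chosen.add(nxt)
        # Update each token's distance to its nearest selected token.
        for i, t in enumerate(tokens):
            d = sqdist(t, tokens[nxt])
            if d < min_d[i]:
                min_d[i] = d
    return sorted(chosen)
```

Because selection depends only on the token features, a procedure of this kind needs no attention maps, which is consistent with the compatibility claim made in the abstract.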
PaperID: 2133,   Poster  https://arxiv.org/pdf/2604.05780    
Authors: Yu Xue, Longjun Gao, Yuanqi Su, HaoAng Lu, Xiaoning Zhang
Title: Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion
Abstract: Monocular Semantic Scene Completion (SSC) aims to reconstruct complete 3D semantic scenes from a single RGB image, offering a cost-effective solution for autonomous driving and robotics. However, the inherently imbalanced nature of voxel distributions, where over 93% of voxels are empty and foreground classes are rare, poses significant challenges. Existing methods often suffer from redundant emphasis on uninformative voxels and poor generalization to long-tailed categories. To address these issues, we propose VoxSAMNet (Voxel Sparsity-Aware Modulation Network), a unified framework that explicitly models voxel sparsity and semantic imbalance. Our approach introduces: (1) a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention; (2) a Foreground Modulation Strategy combining Foreground Dropout (FD) and Text-Guided Image Filter (TGIF) to alleviate overfitting and enhance class-relevant features. Extensive experiments on the public benchmarks SemanticKITTI and SSCBench-KITTI-360 demonstrate that VoxSAMNet achieves state-of-the-art performance, surpassing prior monocular and stereo baselines with mIoU scores of 18.2% and 20.2%, respectively. Our results highlight the importance of sparsity-aware and semantics-guided design for efficient and accurate 3D scene completion, offering a promising direction for future research.
PaperID: 2134,   Poster  https://arxiv.org/pdf/2604.00849    
Authors: Shuang Li, Chao Deng, Hang Chen, Liqun Liu, Zhenyu Hu, Cao Te, Mengge Xue, Yuan Chen, Peng Shu, Huan Yu, Jie Jiang
Title: Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation
Abstract: Subject-Driven Text-to-Image (T2I) Generation aims to preserve a subject's identity while editing its context based on a text prompt. A core challenge in this task is the "similarity-controllability paradox", where enhancing textual control often degrades the subject's fidelity, and vice versa. We argue this paradox stems from the ambiguous role of text prompts, which are often tasked with describing both the subject and the desired modifications, leading to conflicting signals for the model. To resolve this, we propose DisCo, a novel framework that first Disentangles and then re-Couples visual and textual information. First, our textual-visual decoupling module isolates the sources of information: subject identity is extracted exclusively from the reference image with the entity word of the subject, while the text prompt is simplified to contain only the modification command, in which the subject is referred to by a general pronoun, eliminating descriptive ambiguity. However, this strict separation can lead to unnatural compositions between the subject and its contexts. We address this by designing a dedicated reward signal and using reinforcement learning to seamlessly recouple the visually-defined subject and the textually-generated context. Our approach effectively resolves the paradox, enabling simultaneous high-fidelity subject preservation and precise textual control. Extensive experiments demonstrate that our method achieves state-of-the-art performance, producing highly realistic and coherent images.
PaperID: 2135,   Poster  https://arxiv.org/pdf/2603.01063    
Authors: Yuechen Luo, Fang Li, Qimao Chen, Shaoqing Xu, Jiaxin Liu, Ziying Song, Zhixin Yang, Fuxi Wen
Title: Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures
Abstract: Vision-Language-Action (VLA) models for autonomous driving often hit a performance plateau during Reinforcement Learning (RL) optimization. This stagnation arises from exploration capabilities constrained by previous Supervised Fine-Tuning (SFT), leading to "persistent failures" in long-tail scenarios. In these critical situations, all explored actions yield a zero-value driving score. This information-sparse reward signals a failure, yet fails to identify its root cause: whether it is due to incorrect planning, flawed reasoning, or poor trajectory execution. To address this limitation, we propose VLA with Explicit Learning from Failures (ELF-VLA), a framework that augments RL with structured diagnostic feedback. Instead of relying on a vague scalar reward, our method produces detailed, interpretable reports that identify the specific failure mode. The VLA policy then leverages this explicit feedback to generate a Feedback-Guided Refinement. By injecting these corrected, high-reward samples back into the RL training batch, our approach provides a targeted gradient, which enables the policy to solve critical scenarios that unguided exploration cannot. Extensive experiments demonstrate that our method unlocks the latent capabilities of VLA models, achieving state-of-the-art (SOTA) performance on the public NAVSIM benchmark for overall PDMS, EPDMS score and high-level planning accuracy.
PaperID: 2136,   Poster  https://arxiv.org/pdf/2509.18600    
Authors: Zhuoxiao Chen, Hongyang Yu, Ying Xu, Yadan Luo, Long Duong, Yuan-Fang Li
Title: OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation
Abstract: Radiology report generation (RRG) aims to automatically produce clinically faithful reports from chest X-ray images. Prevailing work typically follows a scale-driven paradigm, with multi-stage training over large paired corpora and oversized backbones, making pipelines highly data- and compute-intensive. In this paper, we propose Oracle-educated GRPO (OraPO) with a FactScore-based reward (FactS) to tackle the RRG task under constrained budgets. OraPO enables single-stage, RL-only training by converting failed GRPO explorations on rare or difficult studies into direct preference supervision via a lightweight oracle step. FactS grounds learning in diagnostic evidence by extracting atomic clinical facts and checking entailment against ground-truth labels, yielding dense, interpretable sentence-level rewards. Together, OraPO and FactS create a compact and powerful framework that significantly improves learning efficiency on clinically challenging cases, setting the new SOTA performance on the CheXpert Plus dataset (0.341 in F1) with 2-3 orders of magnitude less training data using a small base VLM on modest hardware.
PaperID: 2137,   Poster  https://arxiv.org/pdf/2604.05931    
Authors: Jingbo Sun, Qichao Zhang, Songjun Tu, Xing Fang, Yupeng Zheng, Haoran Li, Ke Chen, Dongbin Zhao
Title: Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning
Abstract: Zero-shot unsupervised reinforcement learning (URL) offers a promising direction for building generalist agents capable of generalizing to unseen tasks without additional supervision. Among existing approaches, successor representations (SR) have emerged as a prominent paradigm due to their effectiveness in structured, low-dimensional settings. However, SR methods struggle to scale to high-dimensional visual environments. Through empirical analysis, we identify two key limitations of SR in visual
PaperID: 2138,   Poster  https://arxiv.org/pdf/2603.16944    
Authors: Yujia Yang, Yuanxiang Wang, Zhenyu Guan, Tiankun Yang, Chenxi Bao, Haopeng Jin, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Haijin Liang, Jin Ma, Xinming Wang, RuiwenTao RuiwenTao, Hongzhu Yi
Title: Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models
Abstract: While Instruction-based Image Editing (IIE) has achieved significant progress, existing benchmarks pursue task breadth via mixed evaluations. This paradigm obscures a failure mode that is critical in professional applications: the inconsistent performance of models across tasks of varying semantic scales. To address this gap, we introduce Omni IIE Bench, a high-quality, human-annotated benchmark specifically designed to diagnose the editing consistency of IIE models in practical application scenarios. Omni IIE Bench features an innovative dual-track diagnostic design: (1) Single-turn Consistency, comprising shared-context task pairs of attribute modification and entity replacement; and (2) Multi-turn Coordination, involving continuous dialogue tasks that traverse semantic scales. The benchmark is constructed via an exceptionally rigorous multi-stage human filtering process, incorporating a quality standard enforced by computer vision graduate students and an industry relevance review conducted by professional designers. We perform a comprehensive evaluation of 8 mainstream IIE models using Omni IIE Bench. Our analysis quantifies, for the first time, a prevalent performance gap: nearly all models exhibit a significant performance degradation when transitioning from low-semantic-scale to high-semantic-scale tasks. Omni IIE Bench provides critical diagnostic tools and insights for the development of next-generation, more reliable, and stable IIE models.
PaperID: 2139,   Poster  https://arxiv.org/pdf/2510.00430    
Authors: Suhyeon Lee, Jong Chul Ye
Title: PromptLoop: Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment
Abstract: Despite recent progress, reinforcement learning (RL)-based fine-tuning of diffusion models often struggles with generalization, composability, and robustness against reward hacking. Recent studies have explored prompt refinement as a modular alternative, but most adopt a feed-forward approach that applies a single refined prompt throughout the entire sampling trajectory, thereby failing to fully leverage the sequential nature of reinforcement learning. To address this, we introduce PromptLoop, a plug-and-play RL framework that incorporates latent feedback into step-wise prompt refinement. Rather than modifying diffusion model weights, a multimodal large language model (MLLM) is trained with RL to iteratively update prompts based on intermediate latent states of diffusion models. This design achieves a structural analogy to the Diffusion RL approach, while retaining the flexibility and generality of prompt-based alignment. Extensive experiments across diverse reward functions and diffusion backbones demonstrate that PromptLoop (i) achieves effective reward optimization, (ii) generalizes seamlessly to unseen models, (iii) composes orthogonally with existing alignment methods, and (iv) mitigates over-optimization and reward hacking while introducing only a practically negligible inference overhead.
PaperID: 2140,   Poster  https://arxiv.org/pdf/2511.19878    
Authors: Chengyue Huang, Mellon Zhang, Robert Azarcon, Glen Chou, Zsolt Kira
Title: MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization
Abstract: Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naïve fine-tuning often disrupts these representations and harms generalization. Existing fixes (freezing modules or applying uniform regularization) either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS is parameter-free, data-free, and plug-and-play with existing architectures. Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and benchmarks like LIBERO, CALVIN, and SimplerEnv, MAPS boosts both in- and out-of-distribution performance (up to +25%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for scalable VLA adaptation.
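Illustrative sketch (not from the paper): the module-wise linear relaxation described in this abstract could look roughly like the following. It assumes, hypothetically, that modules are listed from most to least constrained and that proximity is enforced with a squared-distance penalty to the pretrained weights. The function names, the coefficient range, and the penalty form are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, Sequence


def linear_proximity_schedule(
    module_order: Sequence[str], lam_max: float, lam_min: float
) -> Dict[str, float]:
    """Assign each module a proximity coefficient that falls linearly
    from lam_max (first module) to lam_min (last module).

    The ordering itself is an input; the abstract says it is found
    empirically, so it is not derived here.
    """
    n = len(module_order)
    if n == 0:
        return {}
    if n == 1:
        return {module_order[0]: lam_max}
    step = (lam_max - lam_min) / (n - 1)
    return {name: lam_max - i * step for i, name in enumerate(module_order)}


def proximity_penalty(
    weights: Dict[str, Sequence[float]],
    pretrained: Dict[str, Sequence[float]],
    coeffs: Dict[str, float],
) -> float:
    """Sum over modules of coeff * squared distance to the pretrained weights
    (an assumed penalty form, with weights flattened to plain float lists)."""
    total = 0.0
    for name, lam in coeffs.items():
        sq_dist = sum((w - w0) ** 2 for w, w0 in zip(weights[name], pretrained[name]))
        total += lam * sq_dist
    return total


if __name__ == "__main__":
    # Hypothetical module names, ordered from most to least constrained.
    order = ["vision_encoder", "projector", "language_layers"]
    coeffs = linear_proximity_schedule(order, lam_max=1.0, lam_min=0.0)
    print(coeffs)
```

With this schedule the first module in the list is held closest to its pretrained weights and the last is left free, which mirrors the abstract's description of visual encoders staying near their priors while language layers adapt.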
PaperID: 2141,   Poster  https://arxiv.org/pdf/2602.20583    
Authors: Wonyong Seo, Jaeho Moon, Jaehyup Lee, Soo Ye Kim, Munchurl Kim
Title: PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
Abstract: Propagation-based video editing enables precise user control by propagating a single edited frame into following frames while maintaining the original context such as motion and structures. However, training such models requires large-scale, paired (source and edited) video datasets, which are costly and complex to acquire. Hence, we propose PropFly, a training pipeline for Propagation-based video editing, relying on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets. Specifically, PropFly leverages one-step clean latent estimations from intermediate noised latents with varying Classifier-Free Guidance (CFG) scales to synthesize diverse pairs of 'source' (low-CFG) and 'edited' (high-CFG) latents on-the-fly. The source latent serves as structural information of the video, while the edited latent provides the target transformation for learning propagation. Our pipeline enables an additional adapter attached to the pre-trained VDM to learn to propagate edits via a Guidance-Modulated Flow Matching (GMFM) loss, which guides the model to replicate the target transformation. Our on-the-fly supervision ensures the model learns temporally consistent and dynamic transformations. Extensive experiments demonstrate that PropFly significantly outperforms the state-of-the-art methods on various video editing tasks, producing high-quality editing results.
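Illustrative note (not from the paper): the "one-step clean latent estimation ... with varying CFG scales" mentioned in this abstract can be written out under one common set of assumptions. The rectified-flow interpolation, the velocity parameterization $v_\theta$, and the symbols $s_{\text{low}}$, $s_{\text{high}}$ below are assumptions chosen for illustration; the abstract does not specify the parameterization used.

```latex
% Assumed rectified-flow interpolation between clean latent x_0 and noise \epsilon:
x_t = (1 - t)\,x_0 + t\,\epsilon, \qquad v = \epsilon - x_0 .

% Standard classifier-free guidance with scale s, condition c, null condition \varnothing:
\hat{v}_s(x_t) = v_\theta(x_t, \varnothing) + s\,\bigl(v_\theta(x_t, c) - v_\theta(x_t, \varnothing)\bigr).

% One-step clean-latent estimate (exact when \hat{v}_s equals the true velocity v):
\hat{x}_0^{(s)} = x_t - t\,\hat{v}_s(x_t).

% A hypothetical source/edited pair obtained from the same noised latent x_t:
x^{\text{src}} = \hat{x}_0^{(s_{\text{low}})}, \qquad
x^{\text{edit}} = \hat{x}_0^{(s_{\text{high}})}, \qquad s_{\text{low}} < s_{\text{high}} .
```

Substituting the interpolation confirms the estimate: $x_t - t(\epsilon - x_0) = (1-t)x_0 + t\epsilon - t\epsilon + t x_0 = x_0$.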
PaperID: 2142,   Poster  https://arxiv.org/pdf/2604.04379    
Authors: Songyuan Yang, Weijiang Yu, Jilin Ma, Ziyu Liu, Guijian Tang, Wenjing Yang, Huibin Tan, Nong Xiao
Title: Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
Abstract: Video reasoning has advanced with large multimodal models (LMMs), yet their inference is often a single pass that returns an answer without verifying whether the reasoning is evidence-aligned. We introduce Reinforce to Learn, Elect to Reason (RLER), a dual paradigm that decouples learning to produce evidence from obtaining a reliable answer. In RLER-Training, we optimize the policy with group-relative reinforcement learning (RL) and 3 novel task-driven rewards: Frame-sensitive reward grounds reasoning on explicit key frames, Think-transparency reward shapes readable and parsable reasoning traces, and Anti-repetition reward boosts information density. These signals teach the model to emit structured, machine-checkable evidence and potentiate reasoning capabilities. In RLER-Inference, we apply a train-free orchestrator that generates a small set of diverse candidates, parses their answers and cited frames, scores them by evidence consistency, confidence, transparency, and non-redundancy, and then performs a robust evidence-weighted election. This closes the loop between producing and using evidence, improving reliability and interpretability without enlarging the model. We comprehensively evaluate RLER against various open-source and RL-based LMMs on 8 representative benchmarks. RLER achieves state of the art across all benchmarks and delivers an average improvement of 6.3% over base models, while using on average 3.1 candidates per question, indicating a favorable balance between compute and quality. The results support a simple thesis: making evidence explicit during learning and electing by evidence during inference is a robust path to trustworthy video reasoning.
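Illustrative sketch (not from the paper): the evidence-weighted election described in this abstract can be pictured as a weighted vote over candidate answers. The `Candidate` fields, the equal-weight averaging of the four scores, and the sum-by-answer rule are all assumptions made for illustration; the abstract names the four criteria but not how they are combined.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    """One sampled response; every score is assumed to lie in [0, 1]."""
    answer: str
    evidence_consistency: float
    confidence: float
    transparency: float
    non_redundancy: float


def candidate_score(c: Candidate) -> float:
    # Equal-weight average of the four criteria (an assumption).
    return (
        c.evidence_consistency + c.confidence + c.transparency + c.non_redundancy
    ) / 4.0


def elect(candidates: List[Candidate]) -> str:
    """Return the answer whose supporting candidates carry the most total score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    votes: Dict[str, float] = defaultdict(float)
    for c in candidates:
        votes[c.answer] += candidate_score(c)
    return max(votes, key=lambda answer: votes[answer])


if __name__ == "__main__":
    pool = [
        Candidate("A", 0.9, 0.9, 0.9, 0.9),
        Candidate("B", 0.2, 0.2, 0.2, 0.2),
        Candidate("B", 0.2, 0.2, 0.2, 0.2),
    ]
    print(elect(pool))
```

In this toy pool a single well-supported candidate outweighs two weakly supported ones, which is the behavior that separates a score-weighted election from a plain majority vote.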
PaperID: 2143,   Poster  https://arxiv.org/pdf/2603.00490    
Authors: Hengjian Gao, Kaiwei Zhang, Shibo Wang, Mingjie Chen, Caoqihang Caoqihang, Xianfeng Wang, Yucheng Zhu, Xiongkuo Min, Wei Sun, Dandan Zhu, Guangtao Zhai
Title: LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks
Abstract: The rapid progress of Multimodal Large Language Models (MLLMs) marks a significant step toward artificial general intelligence, offering great potential for augmenting human capabilities. However, their ability to provide effective assistance in dynamic, real-world environments remains largely underexplored. Existing video benchmarks predominantly assess passive understanding through retrospective analysis or isolated perception tasks, failing to capture the interactive and adaptive nature of real-time user assistance. To bridge this gap, we introduce LifeEval, a multimodal benchmark designed to evaluate real-time, task-oriented human–AI collaboration in daily life from an egocentric perspective. LifeEval emphasizes three key aspects: task-oriented holistic evaluation, egocentric real-time perception from continuous first-person streams, and human–assistant collaborative interaction through natural dialogues. Constructed via a rigorous annotation pipeline, the benchmark comprises 4,075 high-quality question–answer pairs across 6 core capability dimensions. Extensive evaluations of 26 state-of-the-art MLLMs on LifeEval reveal substantial challenges in achieving timely, effective and adaptive interaction, highlighting essential directions for advancing human-centered interactive intelligence.
PaperID: 2144,   Poster  https://arxiv.org/pdf/2512.20557    
Authors: Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi
Title: Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
Abstract: Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolution of object geometry and relationships in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across the aspects of dataset, benchmark and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and the further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.
PaperID: 2145,   Poster  https://arxiv.org/pdf/2511.23225    
Authors: Guang Liang, Jie Shao, Ningyuan Tang, Xinyao Liu, Jianxin Wu
Title: TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies
Abstract: Native FP8 support in modern hardware is essential for training large Transformers, but is severely hindered by extreme activation outliers. Existing solutions either rely on complex mixed-precision engineering or invasive architectural modifications. This paper fundamentally challenges the conventional wisdom that outliers are data-driven. We demonstrate that extreme outliers are a data-independent, mechanically-produced artifact of training, originating from specific structural properties of the weight matrices (i.e., collinearity). Based on this insight, we propose TWEO (Transformers Without Extreme Outliers), a novel, non-invasive loss function. TWEO effectively prevents extreme outliers via a very simple loss term, which reduces outliers from 10000+ to less than 20. TWEO then enables full-model FP8 pre-training with neither engineering tricks nor architectural changes for both LLM and ViT. When standard FP8 training catastrophically collapses, TWEO achieves performance comparable to the BF16 baseline while delivering a 36% increase in training throughput. Also, TWEO enables a new quantization paradigm. Hardware-friendly W8A8 per-tensor static quantization of LLMs, previously considered completely unusable due to outliers, achieves SOTA performance for the first time on TWEO-trained models.
PaperID: 2146,   Poster  https://arxiv.org/pdf/2509.15130    
Authors: Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, Chi Zhang
Title: Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control
Abstract: Video diffusion models have rich world priors, but their use in spatial tasks is limited by poor control, spatiotemporally inconsistent results, and entangled scene-camera dynamics. Current approaches, such as per-task fine-tuning or post-process warping strategies, are insufficient, often introducing visual artifacts, failing to generalize, or incurring high computational costs. We introduce a novel, training-free framework that operates purely at inference time to resolve these issues. Our method comprises three synergistic components. First, an intra-step refinement loop injects fine-grained motion guidance during the denoising process, iteratively correcting the output to ensure strict adherence to the target camera path. Second, an optical flow-based analysis identifies and isolates motion-related channels within the latent space. This allows our framework to selectively apply guidance, thereby decoupling motion from appearance and preserving visual fidelity. Third, a dual-path guidance strategy adaptively corrects for drift by comparing the guided generation against an unguided, reference denoising path, effectively neutralizing artifacts caused by misaligned structural inputs. These components work in concert to inject precise, trajectory-aligned control without any model retraining, achieving both accurate motion guidance and photorealistic synthesis. Our plug-and-play, model-agnostic solution demonstrates broad applicability for 3D/4D tasks. Extensive experiments confirm state-of-the-art performance in trajectory adherence and perceptual quality, outperforming both training-dependent and other inference-only methods.
PaperID: 2147,   Poster  https://arxiv.org/pdf/2601.08325    
Authors: Zhenyang Liu, Yongchong Gu, Yikai Wang, Xiangyang Xue, Yanwei Fu
Title: ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation
Abstract: Recent advances in robot manipulation have leveraged pretrained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising vision-language-action (VLA) paradigm. However, most existing approaches overlook the importance of active perception: they typically rely on static, wrist-mounted cameras that provide an end-effector-centric viewpoint. As a result, these models are unable to adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance in long-horizon tasks and fine-grained manipulation scenarios. To address these limitations, we propose ActiveVLA, a novel vision-language-action framework that empowers robots with active perception capabilities for high-precision, fine-grained manipulation. ActiveVLA adopts a coarse-to-fine paradigm, dividing the process into two stages: (1) Critical region localization. ActiveVLA projects 3D inputs onto multi-view 2D projections, identifies critical 3D regions, and supports dynamic spatial awareness. (2) Active perception optimization. Drawing on the localized critical regions, ActiveVLA uses an active view selection strategy to choose optimal viewpoints. These viewpoints aim to maximize amodal relevance and diversity while minimizing occlusions. Additionally, ActiveVLA applies a 3D zoom-in to improve resolution in key areas. Together, these steps enable finer-grained active perception for precise manipulation. Extensive experiments demonstrate that ActiveVLA achieves precise 3D manipulation and outperforms state-of-the-art baselines on three simulation benchmarks. Moreover, ActiveVLA transfers seamlessly to real-world scenarios, enabling robots to learn high-precision tasks in complex environments.
PaperID: 2148,   Poster  https://arxiv.org/pdf/2603.10335    
Authors: Yuedong Yang, Xiwen Wei, Mustafa Munir, Radu Marculescu
Title: Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models
Abstract: Reasoning Large Multimodality Models (LMMs) have become the de facto choice for many applications. However, these models rely on a Chain-of-Thought (CoT) process that is lengthy and unpredictable at runtime, often resulting in inefficient use of computational resources (due to memory fragmentation) and sub-optimal accuracy (due to under- and over-thinking). We observe empirically that the CoT process follows a Bernoulli process, whose behavior is independent of the specific generated samples. This suggests that the CoT length can be estimated ahead of time based on a hidden parameter representing the amount of "fuel" available to support the reasoning process. Based on this insight, we propose Fuel Gauge, the first method that extracts this hidden signal and predicts CoT length ahead of time. We demonstrate the utility of Fuel Gauge on two downstream tasks: predictive KV cache allocation, which addresses memory fragmentation in LMM serving systems, and CoT length modulation, which mitigates under-thinking and over-thinking. Extensive experiments on LMMs across text-only, image-text, and video-text question answering benchmarks demonstrate the effectiveness, generalizability, and practical value of Fuel Gauge. For example, on the GPQA-Diamond benchmark, Fuel Gauge achieves less than half the CoT length prediction error compared to the baseline; this translates into a 13.37× reduction in the memory allocation frequency.
PaperID: 2149,   Poster  https://arxiv.org/pdf/2603.20721    
Authors: Yifei Deng, Chenglong Li, Yuyang Zhang, Guyue Hu, Jin Tang
Title: Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark
Abstract: Text-aerial person retrieval aims to identify targets in UAV-captured images from eyewitness descriptions, supporting intelligent transportation and public security applications. Compared to ground-view text–image person retrieval, UAV-captured images often suffer from degraded visual information due to drastic variations in viewing angles and flight altitudes, making semantic alignment with textual descriptions very challenging. To address this issue, we propose a novel Cross-modal Fuzzy Alignment Network (CFANet) for text–aerial person retrieval, which quantifies token-level reliability by fuzzy logic to achieve accurate fine-grained alignment and incorporates ground-view images as a bridge agent to further mitigate the gap between aerial images and text descriptions. In particular, we design the Fuzzy Token Alignment module, which employs a fuzzy membership function to dynamically model token-level association strength and suppress the influence of unobservable or noisy tokens. It can alleviate the semantic inconsistencies caused by missing visual cues and significantly enhance the robustness of token-level semantic alignment. Moreover, to further mitigate the gap between aerial images and text descriptions, we design a Context-Aware Dynamic Alignment module that incorporates the ground-view agent as a bridge in text–aerial alignment and adaptively combines direct alignment and agent-assisted alignment to improve robustness. In addition, we construct a large-scale benchmark dataset called AERI-PEDES by using a chain-of-thought to decompose text generation into attribute parsing, initial captioning, and refinement, thus boosting textual accuracy and semantic consistency. Experiments on AERI-PEDES and TBAPR demonstrate the superiority of our method. The code and dataset will be publicly released.
PaperID: 2150,   Poster  https://arxiv.org/pdf/2211.16780    
Authors: Quyen Tran, Ngoc-Hai Nguyen, Minh Quan Dao, Hoang Phan, Linh Ngo Van, Khoat Than, Dinh Phung, Dimitris N. Metaxas, Trung Le
Title: An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning
Abstract: In online incremental learning, data continuously arrives with substantial shifts in distribution, creating a significant challenge since previous samples cannot be revisited. Prior research has typically relied on either a single adaptive centroid or fixed multiple centroids to represent each class in the latent space. However, such methods struggle when class data streams are inherently multimodal and require continual centroid updates. To overcome this, we introduce an online Mixture Model learning framework grounded in Optimal Transport theory (MMOT), where centroids evolve incrementally with new data. This approach offers two main advantages: (i) it provides a more precise characterization of complex data streams, and (ii) it enables improved class similarity estimation for unseen samples during inference through MMOT-derived centroids. Furthermore, to strengthen representation learning and mitigate catastrophic forgetting, we design a Dynamic Preservation strategy that regulates the latent space and maintains class separability over time. Experimental evaluations on benchmark datasets confirm the superior effectiveness of our proposed method.
PaperID: 2151,   Poster  https://arxiv.org/pdf/2603.15800    
Authors: Ce Zhang, Jinxi He, Junyi He, Katia Sycara, Yaqi Xie
Title: Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent. In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image–text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding. Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs. All benchmark data and code will be made publicly accessible upon acceptance.
PaperID: 2152,   Poster  https://arxiv.org/pdf/2603.23495    
Authors: Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Yassine Ouali, Georgios Tzimiropoulos
Title: VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions
Abstract: Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text and image tokens, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.
PaperID: 2153,   Poster  https://arxiv.org/pdf/2505.22564    
Authors: Jaehyun Choi, Jiwan Hur, Gyojin Han, Jaemyung Yu, Junmo Kim
Title: PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion
Abstract: Video dataset condensation aims to mitigate the immense computational cost of video processing, but faces the unique challenge of preserving the complex interplay between spatial content and temporal dynamics. Prior work often unnaturally disentangles these elements, overlooking their essential interdependence. We introduce Progressive Refinement and Insertion for Sparse Motion (PRISM), a novel approach that preserves this critical coupling. PRISM begins with a minimal set of key frames and dynamically synthesizes new ones by identifying moments of high motion complexity, where simple interpolation fails, through gradient misalignments. This adaptive process allocates new frames only where such complexity exists, creating highly efficient and temporally coherent synthetic datasets. Extensive experiments show PRISM achieves highly competitive performance on standard action recognition benchmarks, often matching or exceeding prior methods, while creating powerful representations with significantly less storage
PaperID: 2154,   Poster  https://arxiv.org/pdf/2602.20792    
Authors: Muhammad Saif Ullah Khan, Didier Stricker
Title: SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking
Abstract: Modeling spinal motion is fundamental to understanding human biomechanics, yet remains underexplored in computer vision due to the spine’s complex multi-joint kinematics and the lack of large-scale 3D annotations. We present a biomechanics-aware keypoint simulation framework that augments existing human pose datasets with anatomically consistent 3D spinal keypoints derived from musculoskeletal modeling. Using this framework, we create the first open dataset, named SIMSPINE, which provides sparse vertebra-level 3D spinal annotations for natural full-body motions in indoor multi-camera capture without external restraints. With 2.14 million frames, this enables data-driven learning of vertebral kinematics from subtle posture variations and bridges the gap between musculoskeletal simulation and computer vision. In addition, we release pretrained baselines covering fine-tuned 2D detectors, monocular 3D pose lifting models, and multi-view reconstruction pipelines, establishing a unified benchmark for biomechanically valid spine motion estimation. Specifically, our 2D spine baselines improve the state-of-the-art from 0.63 to 0.80 AUC in controlled environments, and from 0.91 to 0.93 AP for in-the-wild spine tracking. Together, the simulation framework and SIMSPINE dataset advance research in vision-based biomechanics, motion analysis, and digital human modeling by enabling reproducible, anatomically grounded 3D spine estimation under natural conditions.
PaperID: 2155,   Poster  https://arxiv.org/pdf/2603.03857    
Authors: Yangfu Li, Hongjian Zhan, Jiawei Chen, Yuning Gong, Qi Liu, Yue Lu
Title: DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models
Abstract: Humans can robustly localize visual evidence and provide grounded answers even in noisy environments by identifying critical cues and then relating them to the full context in a bottom-up manner. Inspired by this, we propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning for visually grounded reasoning in Large Vision–Language Models (LVLMs). Unlike existing methods that pursue one-shot localization of complete evidence, Hierarchical Scanning performs local cue exploration and multi-scale evidence extraction to recover evidence in a bottom-up manner, effectively mitigating the impact of distracting context. Refocusing then optimizes the localized evidence view through collaboration of LVLMs and visual experts. Finally, Evidence-Enhanced Reasoning aggregates multi-granular views via a hybrid evidence memory and yields accurate and interpretable answers. Experimental results demonstrate that DeepScan significantly boosts LVLMs in diverse visual tasks, especially in fine-grained visual understanding. It achieves 90.6% overall accuracy on V* when integrated with Qwen2.5-VL-7B. Moreover, DeepScan provides consistent improvements for LVLMs across various architectures and model scales without additional adaptation cost. The code will be open-sourced soon.
PaperID: 2156,   Poster  https://arxiv.org/pdf/2603.26092    
Authors: Youngjun Song, Hyeongyu Kim, Dosik Hwang
Title: CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection
Abstract: Test-time adaptation (TTA) enables real-time adaptation to domain shifts without offline retraining. Recent TTA methods have predominantly explored additive approaches that introduce lightweight modules for feature refinement. Very recently, a subtractive approach that removes domain-sensitive channels has emerged as an alternative direction. We observe that these paradigms exhibit complementary effectiveness patterns: subtractive methods excel under severe shifts by removing corrupted features, while additive methods are effective under moderate shifts requiring refinement. However, each paradigm operates effectively only within limited shift severity ranges, failing to generalize across diverse corruption levels. This motivates a fundamental question: can we adaptively balance both strategies based on measured feature-level domain shift? We propose CD-Buffer, a novel complementary dual-buffer framework where subtractive and additive mechanisms operate in opposite yet coordinated directions driven by a unified discrepancy metric. Our key innovation is this discrepancy-driven coupling: the framework couples removal and refinement through the shared metric, automatically balancing both strategies based on feature-level shift severity. This establishes automatic channel-wise balancing that applies differentiated treatment to heterogeneous shift magnitudes without manual tuning. Extensive experiments on KITTI, Cityscapes, and ACDC datasets demonstrate state-of-the-art performance, consistently achieving superior results across diverse weather conditions and severity levels.
PaperID: 2157,   Poster  https://arxiv.org/pdf/2603.03762    
Authors: Junhan Chen, Zilu Zhou, Yujun Tong, Dongliang Chang, Yitao Luo, Zhanyu Ma
Title: Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding
Abstract: Fine-grained visual understanding is shifting from static classification to knowledge-augmented reasoning, where models must justify as well as recognise. Existing approaches remain limited by closed-set taxonomies and single-label prediction, leading to significant degradation under open-set or context-dependent conditions. We present the Knowledge-Augmented Fine-Grained Reasoning Agent (KFRA), a unified framework that transforms fine-grained perception into evidence-driven reasoning. KFRA operates through a three-stage closed reasoning loop that emulates expert analysis. It first performs open-vocabulary detection and web-scale retrieval to generate category hypotheses. It then conducts discriminative region localisation by aligning textual knowledge with visual evidence through a global-to-local focusing mechanism. Finally, it integrates all multimodal evidence within a large multimodal model to perform interpretable reasoning. Unlike existing agents that treat retrieval and reasoning as independent processes, KFRA establishes a retrieval–grounding coupling that converts retrieved knowledge into spatially grounded evidence for verification. This design enables factual, interpretable, and task-agnostic reasoning across diverse fine-grained scenarios. To evaluate this capability, we construct FGExpertBench, a benchmark designed to assess reasoning depth and cross-task generalisation across six knowledge dimensions. Extensive experiments demonstrate that KFRA consistently surpasses both standalone large multimodal models and current agent frameworks, achieving up to 19 percent improvement in reasoning accuracy and delivering evidence-grounded interpretability in open-set fine-grained visual understanding.
PaperID: 2158,   Poster  https://arxiv.org/pdf/2511.09675    
Authors: Felix B Mueller, Jan Frederik Meier, Timo Lüddecke, Richard Vogg, Roger Freixanet, Valentin Hassler, Tiffany Bosshard, Elif Karakoc, William O'Hearn, Sofia Pereira, Sandro Sehner, Kaja Wierucka, Judith Burkart, Claudia Fichtel, Julia Fischer, Alexander Gail, Catherine Hobaiter, Julia Ostner, Liran Samuni, Oliver Schülke, Neda Shahidi, Erin G. Wessling, Alexander Ecker
Title: PriVi: Towards a General-Purpose Video Model for Primate Behavior in the Wild
Abstract: Nonhuman primates are our closest living relatives, and analyzing their behavior is central to research in cognition, evolution, and conservation. Computer vision could greatly aid this research, but existing methods often rely on human-centric pretrained models and focus on single datasets, which limits generalization. We address this limitation by shifting from a model-centric to a data-centric approach and introduce PriVi, a large-scale primate-centric video pretraining dataset. PriVi contains 424 hours of curated video, combining 174 hours from behavioral research across 11 settings with 250 hours of diverse web-sourced footage, assembled through a scalable data curation pipeline. We continue pretraining V-JEPA on PriVi to learn primate-specific representations and evaluate it using a lightweight frozen classifier. Across four benchmark datasets — ChimpACT, PanAf500, BaboonLand, and ChimpBehave — our approach consistently outperforms prior work, including fully finetuned baselines, and scales favorably with fewer labels. These results demonstrate that primate-centric pretraining substantially improves data efficiency and generalization, making it a promising approach for low-label applications. Code, models, and the majority of the dataset will be made available.
PaperID: 2159,   Poster  https://arxiv.org/pdf/2511.18734    
Authors: Keyang Lu, Sifan Zhou, Hongbin Xu, Gang Xu, Zhifei Yang, Yikai Wang, Zhen Xiao, Jieyi Long, Ming Li
Title: Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion
Abstract: Realistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualizes the city through a top-down planning strategy that defines a hierarchical “City–District–Grid” structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a produce–refine–evaluate isometric image synthesis loop, followed by image-to-3D generation. To simulate continuous city evolution, Yo'City further introduces a user-interactive, relationship-guided expansion mechanism, which performs scene graph–based distance- and semantics-aware layout optimization, ensuring spatially coherent city growth. To comprehensively evaluate our method, we construct a diverse benchmark dataset and design six multi-dimensional metrics that assess generation quality from the perspectives of semantics, geometry, texture, and layout. Extensive experiments demonstrate that Yo'City consistently outperforms existing state-of-the-art methods across all evaluation aspects.
PaperID: 2160,   Poster  https://arxiv.org/pdf/2511.21192    
Authors: Hui Lu, Yi Yu, Yiming Yang, Chenyu Yi, Qixin Zhang, Bingquan Shen, Alex C. Kot, Xudong Jiang
Title: When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models
Abstract: Vision-Language-Action (VLA) models are vulnerable to adversarial attacks, yet universal and transferable attacks remain underexplored, as most existing patches overfit to a single model and fail in black-box settings. To address this gap, we present a systematic study of universal, transferable adversarial patches against VLA-driven robots under unknown architectures, finetuned variants, and sim-to-real shifts. We introduce UPA-RFAS (Universal Patch Attack via Robust Feature, Attention, and Semantics), a unified framework that learns a single physical patch in a shared feature space while promoting cross-model transfer. UPA-RFAS combines (i) a feature-space objective with an ℓ1 deviation prior and repulsive InfoNCE loss to induce transferable representation shifts, (ii) a robustness-augmented two-phase min-max procedure where an inner loop learns invisible sample-wise perturbations and an outer loop optimizes the universal patch against this hardened neighborhood, and (iii) two VLA-specific losses: Patch Attention Dominance to hijack text-to-vision attention and Patch Semantic Misalignment to induce image-text mismatch without labels. Experiments across diverse VLA models, manipulation suites, and physical executions show that UPA-RFAS consistently transfers across models, tasks, and viewpoints, exposing a practical patch-based attack surface and establishing a strong baseline for future defenses.
PaperID: 2161,   Poster  https://arxiv.org/pdf/2508.18753    
Authors: Qinqian Lei, Bo Wang, Robby T. Tan
Title: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods
Abstract: Human-object interaction (HOI) detection has traditionally been addressed using task-specific models, sometimes augmented by early vision-language models such as CLIP. With the emergence of large, generative VLMs, a natural question arises: can standalone VLMs perform HOI detection effectively, and how do they compare to specialized HOI methods? Existing benchmarks like HICO-DET rely on exact label matching under incomplete annotations, counting any unmatched prediction as wrong. This leads to incorrect penalization, especially for VLMs whose outputs are less constrained, making fair comparison between the two paradigms difficult. To address this limitation, we introduce a multi-choice HOI benchmark with explicitly defined positives and curated negatives, enabling unified and correct evaluation of both VLMs and HOI-specific models. We further focus on challenging scenarios, such as multi-person scenes and fine-grained interaction distinctions, which are crucial for revealing real differences between the two paradigms. Experiments show that large VLMs achieve competitive, sometimes superior, zero-shot performance, yet they struggle with multiple concurrent actions and with correctly assigning interactions to the target person. Conversely, HOI-specific methods remain weaker in general HOI reasoning but demonstrate stronger multi-action recognition and more reliable identification of which person performs which action. These findings expose complementary strengths and weaknesses of VLMs and HOI-specific methods, which existing benchmarks fail to reveal due to incorrect penalization.
PaperID: 2162,   Poster  https://arxiv.org/pdf/2603.05769    
Authors: Ruidong Chen, Yancheng Bai, Xuanpu Zhang, Jianhao Zeng, Lanjun Wang, Dan Song, Lei Sun, Xiangxiang Chu, An-An Liu
Title: Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers
Abstract: Region-instructed layout control in text-to-image generation is highly practical, yet existing methods suffer from limitations: (i) training-based approaches inherit data bias and often degrade image quality, and (ii) current techniques struggle with occlusion order, limiting real-world usability. To address these issues, we propose LayerBind. By modeling regional generation as distinct layers and binding them during generation, our method enables precise regional and occlusion controllability. Our motivation stems from the observation that spatial layout and occlusion are established at a very early denoising stage, suggesting that rearranging the early latent structure is sufficient to modify the final output. Building on this, we structure the scheme into two phases: instance initialization and subsequent semantic nursing. (1) First, leveraging the contextual sharing mechanism in multimodal joint attention, Layer-wise Instance Initialization creates per-instance branches that attend to their own regions while anchoring to the shared background. At a designated early step, these branches are fused according to the layer order to form a unified latent with a pre-established layout. (2) Then, Layer-wise Semantic Nursing reinforces regional details and maintains the occlusion order via a layer-wise attention enhancement. Specifically, a sequential layered attention path operates alongside the standard global path, with updates composited under a layer-transparency scheduler. LayerBind is training-free and plug-and-play, serving as a regional and occlusion controller across Diffusion Transformers. Beyond generation, it natively supports editable workflows, allowing for flexible modifications like changing instances or rearranging visible orders. Both qualitative and quantitative results demonstrate LayerBind's effectiveness, highlighting its strong potential for creative applications.
PaperID: 2163,   Poster  https://arxiv.org/pdf/2604.08627    
Authors: Yongchan Chun, Chanhee Park, Jeongho Yoon, Jaehyung Seo, Heuiseok Lim
Title: Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Uncertainty Estimation
Abstract: Pretrained models have become standard in both vision and language, yet they typically do not provide reliable measures of confidence. Existing uncertainty estimation methods, such as deep ensembles and MC dropout, are often too computationally expensive to deploy in practice. Evidential Deep Learning (EDL) offers a more efficient alternative, but it requires models to be trained to output evidential quantities from the start, which is rarely true for pretrained networks. To enable EDL-style uncertainty estimation in pretrained models, we propose the Evidential Transformation Network (ETN), a lightweight post-hoc module that converts a pretrained predictor into an evidential model. ETN operates in logit space: it learns a sample-dependent affine transformation of the logits and interprets the transformed outputs as parameters of a Dirichlet distribution for uncertainty estimation. We evaluate ETN on image classification and large language model question-answering benchmarks, under both in-distribution and out-of-distribution settings. ETN consistently improves uncertainty estimation over post-hoc baselines, while preserving accuracy and adding only minimal computational overhead.
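The logit-space idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how transformed logits become Dirichlet parameters, so the softplus evidence and the "alpha = evidence + 1" convention below are assumptions borrowed from common EDL setups, and `scale` and `shift` are hand-picked stand-ins for the outputs of the learned, sample-dependent module.

```python
import math

def dirichlet_from_logits(logits, scale, shift):
    """Affine-transform logits and read them as Dirichlet concentrations.

    scale and shift are hypothetical stand-ins for what a small learned
    module would predict per sample; here they are passed in by hand.
    """
    transformed = [scale * z + shift for z in logits]
    # Assumption: softplus evidence, alpha = evidence + 1 (common EDL convention).
    evidence = [math.log1p(math.exp(t)) for t in transformed]
    return [e + 1.0 for e in evidence]

def summarize(alpha):
    """Return expected class probabilities and a vacuity-style uncertainty."""
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    vacuity = len(alpha) / strength  # in (0, 1]; higher means less evidence
    return probs, vacuity

logits = [2.0, 0.5, -1.0]
confident_probs, confident_u = summarize(dirichlet_from_logits(logits, scale=3.0, shift=0.0))
cautious_probs, cautious_u = summarize(dirichlet_from_logits(logits, scale=0.2, shift=-2.0))
```

In this sketch a positive `scale` keeps the ranking of the classes unchanged, so the predicted label is the same in both calls while the uncertainty differs.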
PaperID: 2164,   Poster  https://arxiv.org/pdf/2512.02729    
Authors: Yuhong Zhang, Zihan Gao, Shengpeng Li, Ling-Hao Chen, Kaisheng Liu, Runqing Cheng, Xiao Lin, Junjia Liu, Zhuoheng Li, Jingyi Feng, Ziyan He, Jintian Lin, Zheyan Huang, Zhifang Liu, Haoqian Wang
Title: RoboWheel: A Data Engine from Real-World Human Demonstrations for Cross-Embodiment Robotic Learning
Abstract: We introduce RoboWheel, a data engine that converts human hand–object interaction (HOI) videos into training-ready supervision for cross-morphology robotic learning. From monocular RGB/RGB-D inputs, we perform high-precision HOI reconstruction and enforce physical plausibility via a reinforcement learning (RL) optimizer that refines hand–object relative poses under contact and penetration constraints. The reconstructed, contact-rich trajectories are then retargeted across embodiments (robot arms with simple end-effectors, dexterous hands, and humanoids), yielding executable actions and rollouts. To scale coverage, we build a simulation-augmented framework on Isaac Sim with diverse domain randomization (embodiments, trajectories, object retrieval, background textures, hand motion mirroring), which enriches the distributions of trajectories and observations while preserving spatial relationships and physical plausibility. The result is an end-to-end pipeline: video → reconstruction → retargeting → augmentation → data acquisition. We validate the data on mainstream vision–language–action (VLA) and imitation learning architectures, demonstrating that trajectories produced by our pipeline are as stable as those from teleoperation and yield comparable continual performance gains. To our knowledge, this provides the first quantitative evidence that HOI modalities can serve as effective supervision for robotic learning. Compared with teleoperation, RoboWheel is lightweight: a single monocular RGB(D) camera is sufficient to extract a universal, embodiment-agnostic motion representation that can be flexibly retargeted across embodiments. We further assemble a large-scale multimodal dataset combining multi-camera captures, monocular videos, and public HOI corpora for training and evaluating embodied models.
PaperID: 2165,   Poster  https://arxiv.org/pdf/2601.01483    
Authors: Xinyu Qiu, Heng Jia, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Yi Yang, Linchao Zhu
Title: Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization
Abstract: Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward improving verification capability and a decoupled optimization mechanism enabling synergistic optimization of generation and verification. Specifically, the preference verification reward computes mean verification scores from positive and negative samples as decision thresholds, providing positive feedback when prediction correctness aligns with answer correctness. Meanwhile, the advantage decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines masked GRPO objectives, preserving generation quality while calibrating verification scores. ADPO achieves up to +34.1% higher verification AUC and -53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.
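The decoupled-advantage step described above can be sketched as follows. This is an illustrative reading of the abstract, not the ADPO code: the reward values, the `is_verification_token` mask, and both function names are hypothetical, and the only standard ingredient is the GRPO-style standardization of rewards within a sampled group.

```python
import statistics

def group_advantages(rewards):
    """GRPO-style advantages: rewards standardized within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def per_token_advantages(gen_adv, ver_adv, is_verification_token):
    """Give each token the advantage of the segment it belongs to.

    Answer tokens only see the generation advantage and verification tokens
    only see the verification advantage, which is one way to keep the two
    learning signals from mixing.
    """
    return [ver_adv if flag else gen_adv for flag in is_verification_token]

# Hypothetical group of four sampled responses.
gen_rewards = [1.0, 0.0, 1.0, 0.0]  # was the answer correct?
ver_rewards = [1.0, 1.0, 0.0, 1.0]  # was the self-verification right?
gen_adv = group_advantages(gen_rewards)
ver_adv = group_advantages(ver_rewards)

# First response: three answer tokens followed by two verification tokens.
mask = [False, False, False, True, True]
tokens_adv = per_token_advantages(gen_adv[0], ver_adv[0], mask)
```

Each per-token advantage would then weight that token's term in the policy-gradient objective; the abstract's preference verification reward is not modeled here.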
PaperID: 2166,   Poster  https://arxiv.org/pdf/2602.20412    
Authors: Aayush Dhakal, Subash Khanal, Srikumar Sastry, Jacob Arndt, Philipe Ambrozio Dias, Dalton Lunga, Nathan Jacobs
Title: SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images
Abstract: The rapid advancement of generative models has made the detection of AI-generated images a critical challenge for both research and society. Recent works have shown that most state-of-the-art fake image detection methods overfit to their training data and catastrophically fail when evaluated on curated hard test sets with strong distribution shifts. In this work, we argue that it is more principled to learn a tight decision boundary around the real image distribution and treat the fake category as a sink class. To this end, we propose SimLBR, a simple and efficient framework for fake image detection using Latent Blending Regularization (LBR). Our method significantly improves cross-generator generalization, achieving up to +24.85% accuracy and +69.62% recall on the challenging Chameleon benchmark. SimLBR is also highly efficient, training orders of magnitude faster than existing approaches. Furthermore, we emphasize the need for reliability-oriented evaluation in fake image detection, introducing risk-adjusted metrics and worst-case estimates to better assess model robustness. All code and models will be released on HuggingFace and GitHub.
PaperID: 2167,   Poster  https://arxiv.org/pdf/2604.20715    
Authors: Yuxuan Xue, Ruofan Liang, Egor Zakharov, Timur Bagautdinov, Chen Cao, Giljoo Nam, Shunsuke Saito, Gerard Pons-Moll, Javier Romero
Title: GeoRelight: Learning Joint Geometrical Reconstruction and Relighting with Flexible Multi-Modal Diffusion Transformers
Abstract: Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequential pipelines that suffer from error accumulation, or they do not explicitly leverage 3D geometry during relighting, which limits physical consistency. Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: GeoRelight. We make this possible through two key technical contributions: isotropic NDC-Orthographic Depth (iNOD), a distortion-free 3D representation compatible with latent diffusion models; and a strategic mixed-data training method that combines synthetic and auto-labeled real data. By solving geometry and relighting jointly, GeoRelight achieves better performance than both sequential models and previous systems that ignored geometry.
PaperID: 2168,   Poster  https://arxiv.org/pdf/2604.05497    
Authors: Keuntae Kim, Mingyu Kang, Yong Suk Choi
Title: Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
Abstract: Diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive (AR) LLMs. Recently, this paradigm has been extended to multimodal tasks, leading to the development of diffusion multimodal large language models (dMLLMs). These models are expected to retain the reasoning capabilities of LLMs while enabling faster inference through parallel generation. However, when combined with Chain-of-Thought (CoT) reasoning, dMLLMs exhibit two critical issues. First, we observe that dMLLMs often generate the final answer token at a very early timestep. This trend indicates that the model determines the answer before sufficient reasoning, leading to degraded reasoning performance. Second, during the initial timesteps, dMLLMs show minimal dependency on visual prompts, exhibiting a fundamentally different pattern of visual information utilization compared to AR vision-language models. In summary, these findings indicate that dMLLMs tend to generate premature final answers without sufficient grounding in visual inputs. To address these limitations, we propose Position & Step Penalty (PSP) and Visual Reasoning Guidance (VRG). PSP penalizes tokens in later positions during early timesteps, delaying premature answer generation and encouraging progressive reasoning across timesteps. VRG, inspired by classifier-free guidance, amplifies visual grounding signals to enhance the model’s alignment with visual evidence. Extensive experiments across various dMLLMs demonstrate that our method achieves up to 7.5% higher accuracy while delivering more than 3× speedup compared to reasoning with four times more diffusion steps. Our code will be released after publication.
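The two mechanisms named above can be sketched in a few lines. Both functions are assumptions made for illustration, since the abstract gives no formulas: `guided_logits` uses the standard classifier-free-guidance combination with the image-free prediction as the baseline, and `position_step_penalty` is one hypothetical penalty shape that is largest for late positions at early steps.

```python
def guided_logits(with_image, without_image, weight):
    """Classifier-free-guidance-style mix of two logit vectors.

    weight = 1 returns the image-conditioned logits unchanged; larger
    weights push further in the direction the image moves the prediction.
    """
    return [u + weight * (c - u) for c, u in zip(with_image, without_image)]

def position_step_penalty(position, seq_len, step, total_steps, strength):
    """Hypothetical penalty: largest for late positions at early steps."""
    late = position / max(seq_len - 1, 1)         # 0 at first token, 1 at last
    early = 1.0 - step / max(total_steps - 1, 1)  # 1 at first step, 0 at last
    return strength * late * early

amplified = guided_logits([1.0, 3.0], [0.5, 1.0], weight=2.0)
first_step_last_token = position_step_penalty(9, 10, 0, 8, strength=1.0)
last_step_last_token = position_step_penalty(9, 10, 7, 8, strength=1.0)
```

One plausible use, again an assumption, is to subtract the penalty from a position's confidence score before the sampler decides which masked positions to reveal at the current step.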
PaperID: 2169,   Poster  https://arxiv.org/pdf/2512.20409    
Authors: Junho Yoon, Jaemo Jeong, Hyunju Kim, Dongman Lee
Title: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning
Abstract: Aligning egocentric video with wearable sensors has shown promise for human action recognition, but faces practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. While prior egocentric-wearable works predominantly adopt Global Alignment by encoding entire sequences into unified representations, this approach fails in exocentric-ambient settings due to two problems: (P1) inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, causing misalignment between actions sharing similar temporal patterns with different spatio-semantic contexts. To resolve these problems, we propose DETACH, a decomposed spatio-temporal framework. This explicit decomposition preserves local details, while our novel sensor-spatial features discovered via online clustering provide semantic grounding for context-aware alignment. To align the decomposed features, our two-stage approach establishes spatial correspondence through mutual supervision, then performs temporal alignment via a spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives. Comprehensive experiments with downstream tasks on Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.
PaperID: 2170,   Poster  https://arxiv.org/pdf/2604.00969    
Authors: Yiyao Zhu, Ying Xue, Haiming Zhang, Guangfeng Jiang, Wending Zhou, Xu Yan, Jiantao Gao, Yingjie CAI, Bingbing Liu, Zhen Li, Shaojie Shen
Title: DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving
Abstract: Vision-based autonomous driving has gained much attention due to its low cost and excellent performance. Compared with dense BEV (Bird’s Eye View) or sparse query models, the Gaussian-centric method offers a comprehensive yet sparse representation by describing the scene with 3D semantic Gaussians. In this paper, we introduce DLWM, a novel paradigm with Dual Latent World Models specifically designed to enable holistic Gaussian-centric pre-training in autonomous driving using two stages. In the first stage, DLWM predicts 3D Gaussians from queries by reconstructing multi-view semantic and depth images in a self-supervised manner. Equipped with fine-grained contextual features, in the second stage, two latent world models are trained separately for temporal feature learning, including Gaussian-flow-guided latent prediction for downstream occupancy perception and forecasting tasks, and ego-planning-guided latent prediction for motion planning. Extensive experiments on the SurroundOcc and nuScenes benchmarks demonstrate that DLWM shows significant performance gains across Gaussian-centric 3D occupancy perception, 4D occupancy forecasting and motion planning tasks.
PaperID: 2171,   Poster  https://arxiv.org/pdf/2508.20066    
Authors: Zheng Li, Xueyi Zhang, Yanming Guo, Yuxiang Xie, Ding Zhaoyun, Siqi Cai, Haizhou Li, Mingrui Lao
Title: PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence
Abstract: Cross-view geo-localization is a critical task for UAV navigation, event detection, and aerial surveying, establishing correspondence between drone-captured and satellite imagery. Most existing approaches embed cross-view data into a joint feature space to maximize similarity between paired images. However, these methods typically assume perfect alignment of image pairs in training data, an assumption that rarely holds in practical scenarios. In real-world conditions, factors such as urban canyon effects, electromagnetic interference, and adverse weather frequently induce GPS drift, resulting in systematic alignment shifts where only partial correspondences exist between image pairs. Despite its prevalence, this source of noisy correspondence has received limited attention in current research. To the best of our knowledge, this work presents the first systematic investigation of the Noisy Correspondence in Cross-View Geo-Localization (NC-CVGL) problem, specifically addressing the practical scenario where a significant portion of training pairs exhibit spatial misalignment due to GPS inaccuracies. To this end, we propose PAUL (Partition and Augmentation by Uncertainty Learning), a framework that achieves noise-robust learning through three coordinated mechanisms: Co-partition separates noisy from clean samples using data uncertainty and loss patterns; Co-augmentation enhances high-confidence regions via local assessment; and Co-training refines learning through mutual supervision between dual networks. Unlike conventional noise-handling methods that filter or relabel noisy samples, PAUL effectively utilizes noisy data through this triple collaborative mechanism. Comprehensive experiments validate the effectiveness of individual components in PAUL, which consistently achieves superior performance over other competitive noisy-correspondence-driven methods in various noise ratios.
PaperID: 2172,   Poster  https://arxiv.org/pdf/2602.19944    
Authors: Yilong Yang, Jianxin Tian, Shengchuan Zhang, Liujuan Cao
Title: Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation
Abstract: Current zero-shot camouflaged object segmentation methods typically employ a two-stage pipeline (discover-then-segment): using MLLMs to obtain visual prompts, followed by SAM segmentation. However, relying solely on MLLMs for camouflaged object discovery often leads to inaccurate localization, false positives, and missed detections. To address these issues, we propose the Discover-Segment-Select (DSS) mechanism, a three-stage framework that progressively refines the segmentation process. The proposed method contains a Feature-driven Object Discovery (FOD) module that leverages visual features to generate diverse object proposals, a segmentation module that refines these proposals through SAM segmentation, and a Semantic-driven Mask Selection (SMS) module that employs MLLMs to evaluate and select the optimal segmentation mask from multiple candidates. Extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art performance with lower GPU memory consumption.
PaperID: 2173,   Poster  https://arxiv.org/pdf/2601.05688    
Authors: Muye Huang, Lingling Zhang, Yifei Li, Yaqiang Wu, Jun Liu
Title: SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More
Abstract: Charts are high-density visual carriers of complex data and a medium for information extraction and analysis. Due to the need for precise and complex visual reasoning, automated chart understanding poses a significant challenge to existing Multimodal Large Language Models (MLLMs). Many MLLMs trained with reinforcement learning (RL) face the challenge of credit assignment. Their advantage estimation, typically performed at the trajectory level, cannot distinguish between correct and incorrect reasoning steps within a single generated response. To address this limitation, we introduce SketchVL, a novel MLLM that is optimized with FinePO, a new RL algorithm designed for fine-grained credit assignment within each trajectory. SketchVL's methodology involves drawing its intermediate reasoning steps as markers on the image and feeding the annotated image back to itself, creating a robust, multi-step reasoning process. During training, the FinePO algorithm leverages a Fine-grained Process Reward Model (FinePRM) to score each drawing action within a trajectory, thereby precisely assigning credit for each step. This mechanism allows FinePO to more strongly reward correct tokens when a trajectory is globally successful, and more heavily penalize incorrect tokens when the trajectory is globally suboptimal, thus achieving fine-grained reinforcement signals. Experiments show that SketchVL learns to align its step-level behavior with the FinePRM, achieving an average performance gain of 7.23% over its base model across chart datasets, natural image datasets, and mathematics, providing a promising new direction for training powerful reasoning models.
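The credit-assignment behaviour described above can be illustrated with a toy redistribution rule. This is a hypothetical sketch, not FinePO itself: the abstract does not give the weighting formula, so the mean-preserving reweighting below and the name `step_advantages` are assumptions, with `step_scores` standing in for per-step process-reward scores in [0, 1].

```python
def step_advantages(trajectory_advantage, step_scores):
    """Redistribute one trajectory-level advantage over its steps.

    In a successful trajectory (advantage >= 0), high-scoring steps earn
    more of the credit; in an unsuccessful one, low-scoring steps take
    more of the blame. The mean over steps equals the original advantage.
    """
    if trajectory_advantage >= 0:
        raw = list(step_scores)
    else:
        raw = [1.0 - s for s in step_scores]
    mean = sum(raw) / len(raw)
    if mean == 0:
        # No step stands out, so fall back to uniform credit.
        return [trajectory_advantage] * len(raw)
    return [trajectory_advantage * r / mean for r in raw]

rewarded = step_advantages(1.0, [0.9, 0.1])    # good step gets most credit
penalized = step_advantages(-1.0, [0.9, 0.1])  # bad step gets most blame
```

With a trajectory-level estimate alone, both steps in each example would receive the same value; the per-step scores are what separate them.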
PaperID: 2174,   Poster  https://arxiv.org/pdf/2602.21754    
Authors: Aditya Ranjan Dash, Ramy Battrawy, René Schuster, Didier Stricker
Title: LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration
Abstract: Advanced autonomous systems rely on multi-sensor fusion for safer and more robust perception. To enable effective fusion, calibrating directly from natural driving scenes (i.e., target-free) with high accuracy is crucial for precise multi-sensor alignment. Existing learning-based calibration methods are typically designed for only a single pair of sensor modalities (i.e., a bi-modal setup). Unlike these methods, we propose LiREC-Net, a target-free, learning-based calibration network that jointly calibrates multiple sensor modality pairs, including LiDAR, RGB, and event data, within a unified framework. To reduce redundant computation and improve efficiency, we introduce a shared LiDAR representation that leverages features from both its 3D nature and projected depth map, ensuring better consistency across modalities. Trained and evaluated on established datasets, such as KITTI and DSEC, our LiREC-Net achieves performance competitive with bi-modal models and sets a new strong baseline for the tri-modal use case.
PaperID: 2175,   Poster  https://arxiv.org/pdf/2601.07396    
Authors: Guantao Chen, Shikang Zheng, Yuqi Lin, Linfeng Zhang
Title: Forecast the Principal, Stabilize the Residual: Subspace-Aware Feature Caching for Diffusion Transformers
Abstract: Diffusion Transformer (DiT) models have achieved unprecedented quality in image and video generation, yet their iterative sampling process remains computationally prohibitive. To accelerate inference, feature caching methods have emerged by reusing intermediate representations across timesteps. However, existing caching approaches treat all feature components uniformly. We reveal that DiT feature spaces contain distinct principal and residual subspaces with divergent temporal behavior: the principal subspace evolves smoothly and predictably, while the residual subspace exhibits volatile, low-energy oscillations that resist accurate prediction. Building on this insight, we propose SVD-Cache, a subspace-aware caching framework that decomposes diffusion features via Singular Value Decomposition (SVD), applies exponential moving average (EMA) prediction to the dominant low-rank components, and directly reuses the residual subspace. Extensive experiments demonstrate that SVD-Cache achieves near-lossless acceleration across diverse models and methods, including 5.55× speedup on FLUX and HunyuanVideo, and compatibility with model acceleration techniques including distillation, quantization and sparse attention. Our code is in supplementary material and will be released on GitHub.
PaperID: 2176,   Poster  https://arxiv.org/pdf/2511.13945    
Authors: Zachary Shinnick, Liangze Jiang, Hemanth Saratchandran, Damien Teney, Anton van den Hengel
Title: Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers
Abstract: Transformers show remarkable versatility across domains, suggesting the existence of generic inductive biases beneficial across modalities. In this work, we explore a new way to instil such generic biases in vision transformers (ViTs) by pretraining on procedurally generated data devoid of visual or semantic content. We generate this data with simple algorithms such as formal grammars, so the results bear no relationship to either natural or synthetic images. We use this procedurally generated data to pretrain ViTs in a warm-up phase that bypasses their visual patch embedding mechanisms, thus encouraging the models to internalise abstract computational priors. When followed by standard image-based training, this warm-up significantly improves data efficiency, convergence speed, and downstream performance. On ImageNet-1k for example, allocating just 1% of the training budget to procedural data improves final accuracy by over 1.7%. In terms of its effect on performance, 1% procedurally generated data is thus equivalent to 28% of the ImageNet-1k data. These findings suggest a promising path toward new data-efficient and domain-agnostic pretraining strategies.
PaperID: 2177,   Poster  https://arxiv.org/pdf/2603.17779    
Authors: Yizheng Song, Yiyu Zhuang, Qipeng Xu, Haixiang Wang, Jiahe Zhu, Jing Tian, Siyu Zhu, Hao Zhu
Title: CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image
Abstract: Single-view 3D human reconstruction has garnered significant attention in recent years. Despite numerous advancements, prior research has concentrated on reconstructing 3D models from clear, close-up images of individual subjects, often yielding subpar results in the more prevalent multi-person scenarios. Reconstructing 3D human crowd models is a highly intricate task, laden with challenges such as: 1) extensive occlusions, 2) low clarity, and 3) numerous and various appearances. To address this task, we propose CrowdGaussian, a unified framework that directly reconstructs multi-person 3D Gaussian Splatting (3DGS) representations from single-image inputs. To handle occlusions, we devise a self-supervised adaptation pipeline that enables the pretrained large human model to reconstruct complete 3D humans with plausible geometry and appearance from heavily occluded inputs. Furthermore, we introduce Self-Calibrated Learning (SCL). This training strategy enables single-step diffusion models to adaptively refine coarse renderings to optimal quality by blending identity-preserving samples with clean/corrupted image pairs. The outputs can be distilled back to enhance the quality of multi-person 3DGS representations. Extensive experiments demonstrate that CrowdGaussian generates photorealistic, geometrically coherent reconstructions of multi-person scenes.
PaperID: 2178,   Poster  https://arxiv.org/pdf/2603.29080    
Authors: Rhea Chowers, Oshri Naparstek, Udi Barzelay, Yair Weiss
Title: Is the Modality Gap a Bug or a Feature? A Robustness Perspective
Abstract: Many modern multimodal models (e.g. CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists nor whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss will lead to a representation in which the two modalities are separated by a global gap vector that is orthogonal to the embeddings of both modalities. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models but makes it less likely that a model will change its output when small, semantically inconsequential changes are made to the input. Our experiments show that for many real-world VLMs we can significantly increase robustness by a simple post-processing step that moves one modality towards the mean of the other modality, without any loss to clean accuracy.
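The post-processing step described in this abstract (moving one modality towards the mean of the other) can be illustrated with a minimal sketch. The function names, the `alpha` interpolation parameter, and the toy 2D embeddings are assumptions made for illustration; the abstract does not specify how far embeddings are moved or whether they are re-normalized afterwards.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def shift_towards(source, target, alpha=1.0):
    """Translate every source embedding by alpha * (mean(target) - mean(source)).

    alpha is a hypothetical knob: 1.0 makes the two means coincide,
    0.0 leaves the source embeddings untouched.
    """
    src_mean = mean_vector(source)
    tgt_mean = mean_vector(target)
    gap = [t - s for s, t in zip(src_mean, tgt_mean)]
    return [[x + alpha * g for x, g in zip(row, gap)] for row in source]


# Toy embeddings: the two modalities are separated by a constant offset.
image_emb = [[1.0, 0.0], [3.0, 0.0]]
text_emb = [[0.0, 2.0], [0.0, 4.0]]
shifted = shift_towards(image_emb, text_emb)
```

Because the same translation is applied to every source embedding, differences between source embeddings are unchanged; only their position relative to the other modality moves.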
PaperID: 2179,   Poster  https://arxiv.org/pdf/2603.05437    
Authors: Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, minju Jeon, HyunGee Kim, Dong-Jin Kim
Title: SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning
Abstract: Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, existing methods focus merely on generating non-overlapping masks without considering their semantic relationship to corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.
PaperID: 2180,   Poster  https://arxiv.org/pdf/2512.17206    
Authors: Rujiao Long, Yang Li, Xingyao Zhang, Weixun Wang, Tianqianjin Lin, Xi Zhao, Yuchi Xu, Wenbo Su, Junchi Yan, Bo Zheng
Title: Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
Abstract: Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question–answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection of diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable, controllable modulation of the (vision-) language model's strategic behavior, thereby achieving consistent performance gains over standard RL methods.
PaperID: 2181,   Poster  https://arxiv.org/pdf/2602.22727    
Authors: Yangguang Lin, Quan Fang, Yufei Li, Jiachen Sun, Junyu Gao, Jitao Sang
Title: HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models
Abstract: Object hallucination in Large Vision-Language Models (LVLMs) significantly hinders their reliable deployment. Existing methods struggle to balance efficiency and accuracy: they often require expensive reference models and multiple forward passes, or apply static edits that risk suppressing genuine visual evidence. To address this, we introduce HulluEdit, a single-pass, reference-free intervention framework. Our core innovation is orthogonal subspace editing: we decompose the hidden states of the model into orthogonal subspaces (visual evidence, conflicting priors, and residual uncertainty), enabling selective suppression of hallucinatory patterns without interfering with visual grounding. This approach mathematically guarantees that edits applied to the prior subspace leave the visual component entirely unaffected. Extensive experiments show that HulluEdit achieves state-of-the-art hallucination reduction on benchmarks including POPE and CHAIR across diverse architectures, while preserving general capabilities on MME and maintaining efficient inference. Our method consistently outperforms contrastive decoding and static subspace editing baselines, offering a new pathway toward more trustworthy LVLMs.
PaperID: 2182,   Poster  https://arxiv.org/pdf/2603.22466    
Authors: Weitong Cai, Hang Zhang, Yukai Huang, Shitong Sun, Jiankang Deng, Songcen Xu, Jifei Song, Zhensong Zhang
Title: Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing
Abstract: Always-on sensing is essential for next-generation edge/wearable AI systems, yet continuous high-fidelity RGB video capture remains prohibitively expensive for resource-constrained mobile and edge platforms. We present a new paradigm for efficient streaming video understanding: grayscale-always, color-on-demand. Through preliminary studies, we discover that color is not always necessary. Sparse RGB frames suffice for comparable performance when temporal structure is preserved via continuous grayscale streams. Building on this insight, we propose ColorTrigger, an online training-free trigger that selectively activates color capture based on windowed grayscale affinity analysis. Designed for real-time edge deployment, ColorTrigger uses lightweight quadratic programming to detect chromatic redundancy causally, coupled with credit-budgeted control and dynamic token routing to jointly reduce sensing and inference costs. On streaming video understanding benchmarks, ColorTrigger achieves 91.6% of full-color baseline performance while using only 8.1% RGB frames, demonstrating substantial color redundancy in natural videos and enabling practical always-on video sensing on resource-constrained devices.
PaperID: 2183,   Poster  https://arxiv.org/pdf/2603.26984    
Authors: Mujtaba Hussain Mirza, Antonio D Orazio, Odelia Melamed, Iacopo Masi
Title: A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models
Abstract: Despite the rapid progress in multimodal models and Large Visual-Language Models (LVLMs), they remain highly susceptible to adversarial perturbations, raising serious concerns about their reliability in real-world use. While adversarial training has become the leading paradigm for building models that are robust to adversarial attacks, Test-Time Transformations (TTT) have emerged as a promising strategy to boost robustness at inference. In light of this, we propose Energy-Guided Test-Time Transformation (ET3), a lightweight, training-free defense that enhances robustness by minimizing the energy of the input samples. Our method is grounded in a theory that proves our transformation succeeds in classification under reasonable assumptions. We present extensive experiments demonstrating that ET3 provides a strong defense for classifiers, zero-shot classification with CLIP, and also for boosting the robustness of LVLMs in tasks such as Image Captioning and Visual Question Answering. Code will be released upon acceptance of the paper.
PaperID: 2184,   Poster  https://arxiv.org/pdf/2511.13026    
Authors: Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan
Title: REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
Abstract: Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1) long-form video understanding involves richer and more dynamic visual input, meaning rethinking only the text information is insufficient and necessitates a further rethinking process specifically targeting visual information; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model’s reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks including VideoMME, LongVideoBench, MLVU, and LVBench.
PaperID: 2185,   Poster  https://arxiv.org/pdf/2604.04490    
Authors: Anuvab Sen, Mir Sayeed Mohammad, Saibal Mukhopadhyay
Title: RAVEN: Radar Adaptive Vision Encoders for Efficient Chirp-wise Object Detection and Segmentation
Abstract: We introduce RAVEN, a deep learning architecture for processing frequency-modulated continuous-wave (FMCW) radar data that is designed for high computational efficiency. RAVEN reduces computation by using a learnable antenna mixer module on independent receiver state space model (SSM) encoders to compress the virtual MIMO array into a compact set of learned features and by performing per-chirp inference with a calibrated early-exit rule, so the model reaches a decision using only a subset of chirps in a radar frame. These design choices yield up to 170× lower computation and 4× lower end-to-end latency than conventional frame-based radar backbones, while achieving state-of-the-art detection and BEV free-space segmentation performance on automotive radar datasets.
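As a rough illustration of per-chirp inference with an early-exit rule, the sketch below consumes one confidence value per chirp and stops at the first chirp whose confidence clears a threshold. The threshold value, the function name, and the use of a plain confidence score are assumptions; the abstract does not describe how RAVEN's exit rule is calibrated.

```python
def early_exit(confidences, threshold=0.9):
    """Process chirps in order and stop once confidence reaches the threshold.

    `confidences` stands in for per-chirp model outputs (hypothetical values
    in [0, 1]). Returns the confidence at the exit point and the number of
    chirps consumed.
    """
    used = 0
    latest = 0.0
    for used, latest in enumerate(confidences, start=1):
        if latest >= threshold:
            break  # decision reached before the frame is complete
    return latest, used


# A frame of four chirps: the decision is reached after the third.
confidence, chirps_used = early_exit([0.4, 0.7, 0.95, 0.99])
```

If no chirp clears the threshold, the loop simply runs to the end of the frame and returns the final confidence.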
PaperID: 2186,   Poster  https://arxiv.org/pdf/2603.07590    
Authors: Chenxi Li, Xianggan Liu, Dake Shen, Yaosong Du, Zhibo Yao, Hao Jiang, Linyi Jiang, Chengwei Cao, Jingzhe Zhang, RanYi Peng, Peiling Bai, Xiande Huang
Title: Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints
Abstract: Despite the rapid progress of Large Vision-Language Models (LVLMs), the integration of visual modalities introduces new safety vulnerabilities that adversaries can exploit to elicit biased or malicious outputs. In this paper, we demonstrate an underexplored vulnerability via semantic slot filling, where LVLMs complete missing slot values with unsafe content even when the slot types are deliberately crafted to appear benign. Building on this finding, we propose \ours, a simple yet effective single-query jailbreak framework under black-box settings. \ours decomposes a harmful query into a central topic and a set of benign-looking slot types, then embeds them as structured visual prompts (e.g., mind maps, tables, or sunburst diagrams) with small random perturbations. Paired with a completion-guided instruction, LVLMs automatically recompose the concealed semantics and generate unsafe outputs without triggering safety mechanisms. Although each slot appears benign in isolation (local benignness), \ours exploits LVLMs’ reasoning to assemble these slots into coherent harmful semantics. Extensive experiments on multiple models across two widely used benchmarks demonstrate the effectiveness of our proposed \ours.
PaperID: 2187,   Poster  https://arxiv.org/pdf/2604.03179    
Authors: Gengwei Zhang, Jie Peng, Zhen Tan, Mufan Qiu, Hossein Nourkhiz Mahjoub, Vaishnav Tadiparthi, Kwonjoon Lee, Yanyong Zhang, Tianlong Chen
Title: Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
Abstract: The recent success of reinforcement learning (RL) in large reasoning models has inspired the growing adoption of RL for post-training Multimodal Large Language Models (MLLMs) to enhance their visual reasoning capabilities. Although many studies have reported improved performance, it remains unclear whether RL training truly enables models to learn from and reason with visual information. In this work, we propose the Hallucination-as-Cue Framework, an analytical framework designed to investigate the effects of RL-based post-training on multimodal reasoning models from the perspective of model hallucination. Specifically, we introduce hallucination-inductive, modality-specific corruptions that remove or replace essential information required to derive correct answers, thereby forcing the model to reason by hallucination. By applying these corruptions during both training and evaluation, our framework provides a unique perspective for diagnosing RL training dynamics and understanding the intrinsic properties of reasoning datasets. Through extensive experiments and analyses across multiple multimodal reasoning benchmarks, we reveal that the role of model hallucination is more significant than previously recognized. For instance, we find that RL post-training under purely hallucination-inductive settings can achieve performance comparable to, or even better than, standard training. These findings challenge prevailing assumptions about MLLM reasoning and motivate the development of more modality-aware RL-based post-training strategies.
PaperID: 2188,   Poster  https://arxiv.org/pdf/2602.12127    
Authors: Sixiang Chen, Jianyu LAI, Jialin Gao, Hengyu Shi, Zhongying Liu, Tian Ye, Junfeng Luo, Xiaoming Wei, Lei Zhu
Title: PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback
Abstract: Image-to-poster generation is a high-demand task requiring not only local adjustments but also high-level design understanding. Models must generate text, layout, style, and visual elements while preserving semantic fidelity and aesthetic coherence. The process spans two regimes: local editing, where ID-driven generation, rescaling, filling, and extending must preserve concrete visual entities; and global creation, where layout- and style-driven tasks rely on understanding abstract design concepts. These intertwined demands make image-to-poster a multi-dimensional process coupling entity-preserving editing with concept-driven creation under image–prompt control. To address these challenges, we propose PosterOmni, a generalized artistic poster creation framework that unlocks the potential of a base edit model for multi-task image-to-poster generation. PosterOmni integrates the two regimes, namely local editing and global creation, within a single system through an efficient data–distillation–reward pipeline, which includes: (i) constructing multi-scenario image-to-poster datasets covering six task types across entity-based and concept-based creation; (ii) distilling knowledge between local and global experts for supervised fine-tuning; and (iii) applying unified PosterOmni Reward Feedback to jointly align visual entity preservation and aesthetic preference across all tasks. Additionally, we establish PosterOmni-Bench, a unified benchmark for evaluating both local editing and global creation. Extensive experiments show that PosterOmni significantly enhances reference adherence, global composition quality, and aesthetic harmony, outperforming all open-source baselines and even surpassing several proprietary systems.
PaperID: 2189,   Poster  https://arxiv.org/pdf/2512.13043    
Authors: Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye
Title: GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
Abstract: Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision–language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR, which matches the performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a "free" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10–30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
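The checkpoint-merging idea can be sketched as a uniform average of parameters across checkpoints. Uniform averaging is an assumption here, since the abstract does not state the merging rule; the parameter names and the representation of weights as flat lists of floats are likewise illustrative.

```python
def merge_checkpoints(checkpoints):
    """Uniformly average parameters with matching names across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats (a stand-in for real tensors).
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    merged = {}
    for name in checkpoints[0]:
        tensors = [ckpt[name] for ckpt in checkpoints]
        merged[name] = [sum(vals) / len(tensors) for vals in zip(*tensors)]
    return merged


# Two toy checkpoints saved at different points of the same training run.
ckpt_a = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 4.0], "b": [1.0]}
teacher = merge_checkpoints([ckpt_a, ckpt_b])
```

The merged dictionary would then be loaded as the teacher whose outputs guide the next stage of training.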
PaperID: 2190,   Poster  https://arxiv.org/pdf/2602.22716    
Authors: Koonting Yip, Qiyan Zhao, Wenhao Yu, Liangyu Yuan, Mingkai LI, Xiaofeng Zhang, Jianmin Ji, Yanyong Zhang, Qing Jiang, Ka-Veng Yuen
Title: SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs
Abstract: 3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies, hindering the model’s ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate–based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning. In addition, we introduce a multi-scale frequency mixing strategy to fuse feature information across different frequency domains. Experimental results on multiple 3D scene benchmarks validate the effectiveness of our approach, while real-world deployment experiments further demonstrate its strong generalization capability.
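For readers unfamiliar with the coordinate system involved, the sketch below shows the standard Cartesian-to-spherical conversion that yields a radial distance plus two directional angles. It assumes each point-cloud token has a Cartesian position (x, y, z); how SoPE derives positions from token indices and feeds the result into the rotary embedding is not specified in the abstract and is not modeled here.

```python
import math


def to_spherical(x, y, z):
    """Convert a Cartesian position to (r, theta, phi).

    r     : distance from the origin
    theta : polar angle measured from the +z axis, in [0, pi]
    phi   : azimuth in the x-y plane, in (-pi, pi]
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # angles are undefined at the origin; use zeros
    theta = math.acos(z / r)
    phi = math.atan2(y, x)
    return r, theta, phi


# A point on the +x axis lies in the equatorial plane with zero azimuth.
r, theta, phi = to_spherical(1.0, 0.0, 0.0)
```

Two tokens at the same distance but in different directions share `r` and differ in `theta` or `phi`, which is the directional information a distance-only encoding would miss.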
PaperID: 2191,   Poster  https://arxiv.org/pdf/2602.23802    
Authors: Yiyang Fang, Wenke Huang, Pei Fu, Yihao Yang, Kehua Su, Zhenbo Luo, Jian Luan, Mang Ye
Title: EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models
Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual reasoning and understanding tasks but still struggle to capture the complexity and subjectivity of human emotions. Existing approaches based on supervised fine-tuning often suffer from limited generalization and poor interpretability, while reinforcement learning methods such as Group Relative Policy Optimization fail to align with the intrinsic characteristics of emotional cognition. To address these challenges, we propose Reflective Reinforcement Learning for Emotional Reasoning (EMO-R3), a framework designed to enhance the emotional reasoning ability of MLLMs. Specifically, we introduce Structured Emotional Thinking to guide the model to perform step-by-step emotional reasoning in a structured and interpretable manner, and design a Reflective Emotional Reward that enables the model to re-evaluate its reasoning based on visual-text consistency and emotional coherence. Extensive experiments demonstrate that EMO-R3 significantly improves both the interpretability and emotional intelligence of MLLMs, achieving superior performance across multiple visual emotional understanding benchmarks.
PaperID: 2192,   Poster  https://arxiv.org/pdf/2603.20739    
Authors: Jincen Jiang, Qianyu Zhou, Yuhang Li, Kui Su, Meili Wang, Jian Chang, Jian Zhang, Xuequan Lu
Title: Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding
Abstract: While recent Transformer and Mamba architectures have advanced point cloud representation learning, they are typically developed for single-task or single-domain settings. Directly applying them to multi-task domain generalization (DG) leads to degraded performance. Transformers effectively model global dependencies but suffer from quadratic attention cost and lack explicit structural ordering, whereas Mamba offers linear-time recurrence yet often depends on coordinate-driven serialization, which is sensitive to viewpoint changes and missing regions, causing structural drift and unstable sequential modeling. In this paper, we propose Structure-Aware Domain Generalization (SADG), a Mamba-based In-Context Learning framework that preserves structural hierarchy across domains and tasks. We design structure-aware serialization (SAS) that generates transformation-invariant sequences using centroid-based topology and geodesic curvature continuity. We further devise hierarchical domain-aware modeling (HDM) that stabilizes cross-domain reasoning by consolidating intra-domain structure and fusing inter-domain relations. At test time, we introduce a lightweight spectral graph alignment (SGA) that shifts target features toward source prototypes in the spectral domain without updating model parameters, ensuring structure-preserving test-time feature shifting. In addition, we introduce MP3DObject, a real-scan object dataset for multi-task DG evaluation. Comprehensive experiments demonstrate that the proposed approach improves structural fidelity and consistently outperforms state-of-the-art methods across multiple tasks including reconstruction, denoising, and registration.
PaperID: 2193,   Poster  https://arxiv.org/pdf/2603.05438    
Authors: Dongwon Kim, Gawon Seo, Jinsung Lee, Minsu Cho, Suha Kwak
Title: Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model
Abstract: World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned simulators, but their application to decision-time planning remains computationally prohibitive for real-time control. A key bottleneck lies in latent representations: conventional tokenizers encode each observation into hundreds of tokens, making planning both slow and resource-intensive. To address this, we propose CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving essential information for planning. An action-conditioned world model that adopts the CompACT tokenizer achieves competitive planning performance with orders-of-magnitude faster planning, offering a practical step toward real-world deployment of world models.
PaperID: 2194,   Poster  https://arxiv.org/pdf/2604.08916    
Authors: yibo zhao, Yigong Zhang, Jin Xie
Title: MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation
Abstract: Conventional 3D instance segmentation methods rely on labor-intensive 3D annotations for supervised training, which limits their scalability and generalization to novel objects. Recent approaches leverage multi-view 2D masks from the Segment Anything Model (SAM) to guide the merging of 3D geometric primitives, thereby enabling zero-shot 3D instance segmentation. However, these methods typically process each frame independently and rely solely on 2D metrics, such as SAM prediction scores, to produce segmentation maps. This design overlooks multi-view correlations and inherent 3D priors, leading to inconsistent 2D masks across views and ultimately fragmented 3D segmentation. In this paper, we propose MV3DIS, a coarse-to-fine framework for zero-shot 3D instance segmentation that explicitly incorporates 3D priors. Specifically, we introduce a 3D-guided mask matching strategy that uses coarse 3D segments as a common reference to match 2D masks across views and consolidates multi-view mask consistency via 3D coverage distributions. Guided by these view-consistent 2D masks, the coarse 3D segments are further refined into precise 3D instances. Additionally, we introduce a depth consistency weighting scheme that quantifies projection reliability to suppress ambiguities from inter-object occlusions, thereby improving the robustness of 3D-to-2D correspondence. Extensive experiments on the ScanNetV2, ScanNet200, ScanNet++, Replica, and Matterport3D datasets demonstrate the effectiveness of MV3DIS, which achieves superior performance over previous methods.
PaperID: 2195,   Poster  https://arxiv.org/pdf/2603.19616    
Authors: Chuanrui Zhang, Yingshuang Zou, ZhengXian Wu, Yonggen Ling, Yuxiao Yang, Ziwei Wang
Title: UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair
Abstract: Perceiving and reconstructing objects from images are critical for real-to-sim transfer tasks, which are widely used in the robotics community. Existing methods rely on multiple submodules such as detection, segmentation, shape reconstruction, and pose estimation to complete the pipeline. However, such modular pipelines suffer from inefficiency and cumulative error, as each stage operates on only partial or locally refined information while discarding global context. To address these limitations, we propose UniPR, the first end-to-end object-level real-to-sim perception and reconstruction framework. Operating directly on a single stereo image pair, UniPR leverages geometric constraints to resolve the scale ambiguity. We introduce Pose-Aware Shape Representation to eliminate the need for per-category canonical definitions and to bridge the gap between reconstruction and pose estimation tasks. Furthermore, we construct a large-vocabulary stereo dataset, LVS6D, comprising over 6,300 objects, to facilitate large-scale research in this area. Extensive experiments demonstrate that UniPR reconstructs all objects in a scene in parallel within a single forward pass, achieving significant efficiency gains and preserving true physical proportions across diverse object types, highlighting its potential for practical robotic applications.
PaperID: 2196,   Poster  https://arxiv.org/pdf/2603.00493    
Authors: Yuchen Che, Jingtu Wu, Hao Zheng, Asako Kanezaki
Title: COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation
Abstract: Estimating the 6DoF pose of a novel object with a single reference view is challenging due to occlusions, viewpoint changes, and outliers. A core difficulty lies in finding robust cross-view correspondences, as existing methods often rely on discrete one-to-one matching that is non-differentiable and tends to collapse onto sparse keypoints. We propose Confidence-aware Optimal Geometric Correspondence (COG), an unsupervised framework that formulates correspondence estimation as a confidence-aware optimal transport problem. COG produces balanced soft correspondences by predicting point-wise confidences and injecting them as target marginals, naturally suppressing non-overlapping regions. Semantic priors from vision foundation model features further regularize the correspondences, leading to stable pose estimation. This design integrates confidence into the end-to-end correspondence finding and pose estimation pipeline, enabling fully unsupervised learning. Experiments show unsupervised COG achieves comparable performance to supervised methods, while the supervised variant outperforms them.
PaperID: 2197,   Poster  https://arxiv.org/pdf/2601.02730    
Authors: Xuchang Zhong, Xu Cao, Jinke Feng, Hao Fang
Title: HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps
Abstract: Visual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and SD maps. We construct input pairs that satisfy a homography constraint by projecting ground-view features into the BEV domain and enforcing semantic alignment with map features. Then we leverage homography relationships to guide feature fusion and restrict the pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy compared to prior methods relying on attention-based fusion and direct 3-DoF pose regression. To the best of our knowledge, this is the first work to unify BEV semantic reasoning with homography learning for image-to-map localization. Furthermore, by explicitly modeling homography transformations, the proposed framework naturally supports cross-resolution inputs, enhancing model flexibility. Extensive experiments on the nuScenes dataset demonstrate that our approach significantly outperforms existing state-of-the-art visual localization methods. Code and pretrained models will be publicly released to foster future research.
PaperID: 2198,   Poster  https://arxiv.org/pdf/2511.20629    
Authors: Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi, Yue Zhao, Yinan Zhao, Hui Qu, Wei-An Lin, Yiru Shen, Ajinkya Kale, Irfan Essa, Humphrey Shi
Title: MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models
Abstract: Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax: improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.
PaperID: 2199,   Poster  https://arxiv.org/pdf/2511.20587    
Authors: Karim Kadry, Abdalla Abdelwahed, Ajay Manicka, Naravich Chutisilp, Farhad R. Nezami, Elazer R Edelman
Title: Anatomica: Localized Control over Geometric and Topological Properties for Anatomical Diffusion Models
Abstract: We present an inference-time guidance framework for generating 3D multi-class anatomical voxel maps with localized geometric and topological control. During generation, we use cuboidal control domains of varying dimensionality, location, and shape to slice out relevant substructures. These local substructures are used to compute differentiable penalty functions that steer the sample towards target constraints. We penalize geometric features such as size, shape, position, and orientation through voxel-wise moments, while topological features such as connected components, loops, and voids are enforced through persistent homology. Lastly, we implement this guidance framework for latent diffusion models, where a neural field decoder can partially extract substructures, enabling efficient measurement and control of anatomical properties. This formulation unlocks a rich design space, where several constraints can be composed to control complex structures defined over arbitrary dimensions and coordinate systems. We show that Anatomica flexibly applies to a variety of anatomical systems, enabling the rational design of synthetic datasets for virtual simulation trials or machine learning workflows.
PaperID: 2200,   Poster  https://arxiv.org/pdf/2512.24146    
Authors: Chubin Chen, Sujie Hu, Jiashu Zhu, Meiqi Wu, Jintao Chen, Yanxun Li, Nisha Huang, Chengyu Fang, Jiahong Wu, Xiangxiang Chu, Xiu Li
Title: Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning
Abstract: Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC), a specific form of reward hacking where models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model's inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D^2-Align), a novel framework that mitigates PMC by directionally correcting the reward signal. Specifically, our method first learns a directional correction within the reward model's embedding space while keeping the model frozen. This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D^2-Align achieves superior alignment with human preference.
PaperID: 2201,   Poster  https://arxiv.org/pdf/2603.26052    
Authors: Zizhao Chen, Ping Wei, Ziyang Ren, Huan Li, Xiangru Yin
Title: Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification
Abstract: As the harm caused by fake news grows, the task of detecting and grounding multimodal media manipulation (DGM4) is gaining more attention. Existing multimodal methods overlook fine-grained semantic alignment between visual and textual modalities, thereby limiting their ability to detect sophisticated and subtle cross-modal manipulations. To address this challenge, we present MaLSF, a novel Mask-aware Local Semantic Fusion framework that explicitly bridges words and pixels via mask-label pairs, enabling the model to perform precise reasoning over fine-grained cross-modal correspondences. MaLSF captures cross-modal local semantics through two key innovations: 1) A Bidirectional Cross-modal Verification Module (BCV) that identifies semantic conflicts between masked regions and associated labels via a bidirectional query mechanism; 2) A Hierarchical Semantic Aggregation (HSA) Module that adaptively aggregates multi-granularity local semantics into decoupled features for task-specific verification. In addition, to extract fine-grained mask-label pairs, we introduce a set of diverse mask-label pair extraction parsers. The proposed model is evaluated on multiple datasets and achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results further verify its effectiveness and interpretability.
PaperID: 2202,   Poster  https://arxiv.org/pdf/2408.13516    
Authors: Yujin Lee, Sewon Kim, Daeun Moon, Seoyoon Jang, Hyunsoo Yoon
Title: Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection
Abstract: Few-shot multi-class anomaly detection is crucial in real industrial settings, where only a few normal samples are available while numerous object types must be inspected. This setting is particularly challenging because defect patterns vary widely across categories while normal data remain scarce. Existing vision–language model–based approaches typically depend on class-specific anomaly descriptions or auxiliary modules, limiting both scalability and computational efficiency. In this work, we propose AnoPLe, a lightweight multimodal prompt learning framework that removes reliance on anomaly-type textual descriptions and avoids any external modules. AnoPLe employs bidirectional interactions between textual and visual prompts, allowing class semantics and instance-level cues to refine one another and form class-grounded representations that capture shared normal patterns across categories. To enhance localization, we design a scale-aware prefix trained on both global and local views, enabling the prompts to capture both global context and fine-grained details. In addition, an alignment loss propagates local anomaly evidence to global features, strengthening the consistency between pixel- and image-level predictions. Despite its simplicity, AnoPLe achieves strong performance on MVTec-AD, VisA, and Real-IAD under the few-shot multi-class setting, surpassing prior approaches while remaining efficient and free from expert-crafted anomaly descriptions. Moreover, AnoPLe generalizes well to unseen anomalies and even extends effectively to the medical domain.
PaperID: 2203,   Poster  https://arxiv.org/pdf/2512.20033    
Authors: Andreas Zinonos, Michał Stypułkowski, Antoni Bigata Casademunt, Stavros Petridis, Maja Pantic, Nikita Drobyshev
Title: FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs
Abstract: We present FlashLips, a two-stage, mask-free lip-sync system that decouples lips control from rendering and achieves real-time performance running at over 100 FPS on a single GPU, while matching the visual quality of larger state-of-the-art models. Stage 1 is a compact, one-step latent-space editor that reconstructs an image using a reference identity, a masked target frame, and a low-dimensional lips-pose vector, trained purely with reconstruction losses, with no GANs or diffusion. To remove explicit masks at inference, we use self-supervision: we generate mouth-altered variants of the target image that serve as pseudo ground truth for fine-tuning, teaching the network to localize edits to the lips while preserving the rest. Stage 2 is an audio-to-pose transformer trained with a flow-matching objective to predict lips-pose vectors from speech. Together, these stages form a simple and stable pipeline that combines deterministic reconstruction with robust audio control, delivering high perceptual quality and faster-than-real-time speed.
PaperID: 2204,   Poster  https://arxiv.org/pdf/2509.10388    
Authors: Zeqing Yuan, Mani Ramanagopal, Aswin C. Sankaranarayanan, Srinivasa G. Narasimhan
Title: Physics-Based Decomposition of Reflectance and Shading using a Single Visible-Thermal Image Pair
Abstract: Decomposing an image into its underlying photometric factors—surface reflectance and shading—is a longstanding challenge due to the lack of extensive ground-truth data for real-world scenes. We introduce a novel physics-based approach for intrinsic image decomposition using a pair of visible and thermal images. We leverage the principle that light not reflected from an opaque surface is absorbed and detected as heat by a thermal camera. This allows us to relate the ordinalities (or relative magnitudes) between visible and thermal image intensities to the ordinalities of shading and reflectance, which enables a dense self-supervision of an optimizing neural network to recover shading and reflectance. We perform quantitative evaluations with known reflectance and shading under natural and artificial lighting, and qualitative experiments across diverse outdoor scenes. The results demonstrate superior performance over both classical physics-based and recent learning-based methods, providing a path toward scalable real-world data curation with supervision.
PaperID: 2205,   Poster  https://arxiv.org/pdf/2508.06656    
Authors: Denis Lukovnikov, Andreas Müller, Erwin Quiring, Asja Fischer
Title: ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering
Abstract: In-generation watermarking for latent diffusion models has recently shown high robustness for marking generated images for easier detection and attribution. However, its application to autoregressive (AR) image models is underexplored. Autoregressive models generate images by autoregressively predicting a sequence of visual tokens that are then decoded into pixels using a VQ-VAE decoder. Inspired by KGW watermarking for large language models, we examine token-level watermarking schemes that bias the next-token prediction based on prior tokens. We find that a direct transfer of these schemes works in principle, but the detectability of the watermarks decreases considerably under common image perturbations. As a remedy, we propose a watermarking approach based on visual token clustering, which assigns similar tokens to the same set (red or green). We investigate token clustering in a training-free setting, as well as in combination with a robust fine-tuned token or cluster predictor. Overall, our experiments show that cluster-based watermarks greatly improve robustness against perturbations and regeneration attacks while preserving image quality, outperforming a set of baselines and concurrent works. Moreover, our methods offer fast verification runtime, comparable to lightweight post-hoc watermarking.
PaperID: 2206,   Poster  https://arxiv.org/pdf/2604.03305    
Authors: Mingjin Chen, Junhao Chen, Zhaoxin Fan, Yujian Lee, Zichen Dang, Lili Wang, Yawen Cui, Lap-Pui Chau, Yi Wang
Title: HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
Abstract: Recent methods have made notable progress in the visual quality of hand-object interaction video synthesis. However, most approaches rely on 2D control signals that lack spatial expressiveness and limit the utilization of synthetic 3D conditional data. To address these limitations, we propose HVG-3D, a unified framework for 3D-aware hand-object interaction (HOI) video synthesis conditioned on explicit 3D representations. It builds on a diffusion-based architecture augmented with a 3D ControlNet, which encodes geometric and motion cues from 3D inputs to enable explicit 3D reasoning during video synthesis, together with the corresponding training and inference setting. To achieve high-quality synthesis, HVG-3D is designed with two core components: (i) a 3D-aware HOI video generation diffusion architecture that encodes geometric and motion cues from 3D inputs for explicit 3D reasoning; and (ii) a hybrid pipeline for constructing input and condition signals, enabling flexible and precise control during both training and inference. During inference, given a single real image and a 3D control signal from either simulation or real data, HVG-3D generates high-fidelity, temporally consistent videos with precise spatial and temporal control. Experiments on the TASTE-Rob dataset demonstrate that HVG-3D achieves state-of-the-art spatial fidelity, temporal coherence, and controllability, while enabling effective utilization of both real and simulated data.
PaperID: 2207,   Poster  https://arxiv.org/pdf/2508.04097    
Authors: Ngoc-Bao Nguyen, Sy-Tuyen Ho, Koh Jun Hao, Ngai-Man Cheung
Title: Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks
Abstract: Model inversion (MI) attacks pose significant privacy risks by reconstructing private training data from trained neural networks. While prior studies have primarily examined unimodal deep networks, the vulnerability of vision-language models (VLMs) remains largely unexplored. In this work, we present the first systematic study of MI attacks on VLMs to understand their susceptibility to leaking private visual training data. Our work makes two main contributions. First, tailored to the token-generative nature of VLMs, we introduce a suite of token-based and sequence-based model inversion strategies, providing a comprehensive analysis of VLMs' vulnerability under different attack formulations. Second, based on the observation that tokens vary in their visual grounding, and hence their gradients differ in informativeness for image reconstruction, we propose Sequence-based Model Inversion with Adaptive Token Weighting (SMI-AW) as a novel MI attack for VLMs. SMI-AW dynamically reweights each token's loss gradient according to its visual grounding, enabling the optimization to focus on visually informative tokens and more effectively guide the reconstruction of private images. Through extensive experiments and human evaluations on a range of state-of-the-art VLMs across multiple datasets, we show that VLMs are susceptible to training data leakage. Human evaluation of the reconstructed images yields an attack accuracy of 61.21%, underscoring the severity of these privacy risks. Notably, we demonstrate that publicly released VLMs are vulnerable to such attacks. Our study highlights the urgent need for privacy safeguards as VLMs become increasingly deployed in sensitive domains such as healthcare and finance. Code and additional experiments are provided in the supplementary material.
PaperID: 2208,   Poster  https://arxiv.org/pdf/2509.13938    
Authors: Yifan Mo, Youcheng Cai, Ligang Liu
Title: Plug-and-Play PDE Optimization for 3D Gaussian Splatting: Toward High-Quality Rendering and Reconstruction
Abstract: 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction by achieving high-quality novel view synthesis with fast rendering speed, introducing 3D Gaussian primitives to represent the scene. However, 3DGS encounters blurring and floaters when applied to complex scenes, caused by the reconstruction of redundant and ambiguous geometric structures. We attribute this issue to the unstable optimization of the Gaussians. To address this limitation, we present a plug-and-play PDE-based optimization method that overcomes the optimization constraints of 3DGS-based approaches in various tasks, such as novel view synthesis and surface reconstruction. Firstly, we theoretically derive that the 3DGS optimization procedure can be modeled as a PDE, and introduce a viscous term to ensure stable optimization. Secondly, we use the Material Point Method (MPM) to obtain a stable numerical solution of the PDE, which enhances both global and local constraints. Additionally, an effective Gaussian densification strategy and particle constraints are introduced to ensure fine-grained details. Extensive qualitative and quantitative experiments confirm that our method achieves state-of-the-art rendering and reconstruction quality.
PaperID: 2209,   Poster  https://arxiv.org/pdf/2512.04837    
Authors: Jikang Cheng, Renye Yan, Zhiyuan Yan, Yaozhong Gan, Xueyi Zhang, Wei Peng, Zhongyuan Wang, Ling Liang
Title: A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World
Abstract: Existing methods for deepfake detection aim to develop generalizable detectors. Although "generalizable" could be the ultimate target once and for all, with limited training forgeries and domains, it appears idealistic to expect generalization that covers entirely unseen variations, especially given the diversity, advancement, and vast volume of real-world deepfakes. Therefore, introducing large-scale multi-domain data for training can be feasible and important for real-world applications. However, within such a multi-domain scenario, the differences between multiple domains, rather than the subtle real/fake distinctions, dominate the feature space. As a result, although detectors can separate real and fake relatively well within each domain (i.e., high AUC), they struggle with single-image real/fake judgments in domain-unspecified conditions (i.e., low ACC). In this paper, we first define a new research paradigm named Multi-In-Domain Face Forgery Detection (MID-FFD), which includes sufficient volumes of real-fake domains for training. Then, the detector should provide definitive real-fake judgments for domain-unspecified inputs, which simulate the frame-by-frame independent detection scenario in the real world. Meanwhile, to address the domain-dominant issue, we propose a two-stage, model-agnostic framework termed DevDet (Developer for Detector) to amplify real/fake differences and make them dominant in the feature space. DevDet consists of a Face Forgery Developer (FFDev) and a Dose-Adaptive detector Fine-Tuning strategy (DAFT). Experiments demonstrate our superiority in effectively predicting real-fake under the MID-FFD scenario while maintaining original generalization ability to unseen data.
PaperID: 2210,   Poster  https://arxiv.org/pdf/2511.21690    
Authors: Seungjae Lee, Yoonkyo Jung, Inkook Chun, Yao-Chih Lee, Zikui Cai, Hongjia Huang, Aayush Talreja, Tan Dao, Yongyuan Liang, Jia-Bin Huang, Furong Huang
Title: TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos
Abstract: Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments (humans and different robots) are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small-data problem by introducing a unifying, symbolic representation, a compact 3D "trace-space" of scene-level trajectories, that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation-trace-language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50-600x faster inference than state-of-the-art video-based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen’s ability to adapt across embodiments without relying on object detectors or heavy pixel-space generation.
PaperID: 2211,   Poster  https://arxiv.org/pdf/2512.16811    
Authors: Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang, Shaoshuai Shi, Li Jiang
Title: GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
Abstract: Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
PaperID: 2212,   Poster  https://arxiv.org/pdf/2601.12781    
Authors: Hyejin Park, Junhyuk Kwon, Suha Kwak, Jungseul Ok
Title: VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension
Abstract: Referring Expression Comprehension (REC) aims to localize the image region corresponding to a natural-language query. Recent neuro-symbolic REC approaches leverage large language models (LLMs) and vision-language models (VLMs) to perform compositional reasoning, decomposing queries into structured programs and executing them step-by-step. While such approaches achieve interpretable reasoning and strong zero-shot generalization, they assume that intermediate reasoning steps are accurate. However, this assumption causes cascading errors: false detections and invalid relations propagate through the reasoning chain, yielding high-confidence false positives even when no target is present in the image. To address this limitation, we introduce Verification-Integrated Reasoning Operators (VIRO), a neuro-symbolic framework that embeds lightweight operator-level verifiers within reasoning steps. Each operator executes and validates its output, such as object existence or spatial relationship, thereby allowing the system to robustly handle no-target cases when verification conditions are not met. Our framework achieves state-of-the-art performance, reaching 61.1% balanced accuracy across target-present and no-target settings, and demonstrates generalization to real-world egocentric data. Furthermore, VIRO shows superior computational efficiency in terms of throughput, high reliability with a program failure rate of less than 0.3%, and scalability through decoupled program generation from execution.
PaperID: 2213,   Poster  https://arxiv.org/pdf/2511.17059    
Authors: Di Wu, Liu Liu, Anran Huang, 玉研 刘, Qiaojun Yu, Liu Shaofan, Liangtu Song, Cewu Lu
Title: REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting
Abstract: Articulated objects are pervasive in daily environments, such as drawers and refrigerators. Towards their part-level surface reconstruction and joint parameter estimation, REArtGS introduces a category-agnostic approach using multi-view RGB images at two different states. However, we observe that REArtGS still struggles with screw-joint or multi-part objects and lacks geometric constraints for unseen states. In this paper, we propose REArtGS++, a novel method towards generalizable articulated object reconstruction with temporal geometry constraint and planar Gaussian splatting. We first model a decoupled screw motion for each joint without type prior, and jointly optimize part-aware Gaussians with joint parameters through part motion blending. To introduce time-continuous geometric constraint for articulated modeling, we encourage Gaussians to be planar and propose a temporally consistent regularization between planar normal and depth through Taylor first-order expansion. Extensive experiments on both synthetic and real-world articulated objects demonstrate our superiority in generalizable part-level surface reconstruction and joint parameter estimation, compared to existing approaches. Codes can be found in the supplementary material and will be made publicly available.
PaperID: 2214,   Poster  https://arxiv.org/pdf/2603.28366    
Authors: Milton Zhou, Sizhong Qin, Yongzhi Li, Quan Chen, Peng Jiang
Title: AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation
Abstract: Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation. However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end video ad editing framework based on multimodal discretization and controllable generation. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video–audio–text token space. Built upon a foundation model, we further develop a multimodal large language model for intelligent video editing through combined multimodal alignment and supervised fine-tuning, supporting tasks covering video selection and ordering, script generation, and background music selection within a unified editing framework. Finally, a complete production pipeline converts generated tokens into deployable long video outputs. Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability, paving the way for scalable generative video creation.
PaperID: 2215,   Poster  https://arxiv.org/pdf/2602.22868    
Authors: Yushi Ye, Feng Hong, Huangjie Zheng, Xu Chen, Zhiyong Chen, Yanfeng Wang, Jiangchao Yao
Title: Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference
Abstract: Diffusion Large Language Models (DLLMs) promise fast non-autoregressive inference but suffer a severe quality-speed tradeoff in parallel decoding. This stems from the "combinatorial contradiction" phenomenon, where parallel tokens form semantically inconsistent combinations. We address this by integrating continuous representations into the discrete decoding process, as they preserve rich inter-position dependency. We propose ReMix (Rejection Mixing), a framework that introduces a novel Continuous Mixing State as an intermediate between the initial masked state and the final decoded token state. This intermediate state allows a token's representation to be iteratively refined in a continuous space, resolving mutual conflicts with other tokens before collapsing into a final discrete sample. Furthermore, a rejection rule reverts uncertain representations from the continuous state back to the masked state for reprocessing, ensuring stability and preventing error propagation. ReMix thus mitigates combinatorial contradictions by enabling continuous-space refinement during discrete diffusion decoding. Extensive experiments demonstrate that ReMix, as a training-free method, achieves a 2-8× inference speedup without any quality degradation.
PaperID: 2216,   Poster  https://arxiv.org/pdf/2603.26174    
Authors: Chonghuinan Wang, Zihan Chen, Yuxiang Wei, Tianyi Jiang, Xiaohe Wu, Fan Li, Wangmeng Zuo, Hongxun Yao
Title: CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions
Abstract: Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question–answer (QA)–based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Model (MLLM) scoring. Simultaneously, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval’s automated metrics and human judgments. Therefore, CREval provides a reliable foundation for evaluating image editing models on complex and creative image manipulation tasks, and highlights key challenges and opportunities for future research. All code and data will be released publicly.
PaperID: 2217,   Poster  https://arxiv.org/pdf/2510.27680    
Authors: Danyal Maqbool, Changhee Lee, Zachary Huemann, Samuel Church, Matthew Larson, Scott Perlman, Tomas Romero, Joshua Warner, Meghan Lubner, Xin Tie, Jameson Merkow, Junjie Hu, Steve Cho, Tyler Bradshaw
Title: PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
Abstract: Generating automated reports for 3D positron emission tomography (PET) is an important and challenging task in medical imaging. PET plays a vital role in oncology, but automating report generation is difficult due to the complexity of whole-body 3D volumes, the wide range of potential clinical findings, and the limited availability of annotated datasets. To address these challenges, we first introduce PETARSeg-11K, the first large-scale, publicly available dataset that provides lesion-level correspondence between 3D PET/CT volumes and free-text radiological findings. It comprises 11,356 lesion descriptions paired with 3D segmentations. Second, we propose PETAR-4B, a 3D vision-language model designed for mask-aware, spatially grounded PET/CT reporting. PETAR-4B jointly encodes PET, CT, and 3D lesion segmentation masks, using a 3D focal prompt to capture fine-grained details of lesions that normally comprise less than 0.1% of the volume. Evaluations using automated metrics show PETAR-4B substantially outperforming all 2D and 3D baselines. A human study involving five physicians, the first of its kind for automated PET reporting, confirms the model's clinical utility and establishes correlations between automated metrics and expert judgment. This work provides a foundational dataset and a novel architecture, advancing 3D medical vision-language understanding in PET.
PaperID: 2218,   Poster  https://arxiv.org/pdf/2511.20032    
Authors: Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng, Zhixing Tan
Title: Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention
Abstract: Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference. To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model’s focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described. In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead of just 4.36%. In addition, VGA is fully compatible with efficient attention implementations such as FlashAttention. Extensive experiments across diverse MLLMs and multiple hallucination benchmarks demonstrate that VGA achieves state-of-the-art dehallucination performance. Further analysis confirms that explicit visual guidance plays a crucial role in enhancing the visual understanding capabilities of MLLMs.
PaperID: 2219,   Poster  https://arxiv.org/pdf/2509.22225    
Authors: Jiayu Ding, Xinpeng Liu, Zhiyi Pan, Shiqiang Long, Ge Li
Title: ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting
Abstract: Lifting 2D open-vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. Mainstream methods, built on an embedding paradigm, suffer from three key flaws: (i) geometry-semantic inconsistency, where points, rather than objects, serve as the semantic basis, limiting semantic fidelity; (ii) semantic bloat from injecting gigabytes of feature data into the geometry; and (iii) semantic rigidity, as one feature per Gaussian struggles to capture rich polysemy. To overcome these limitations, we introduce ExtrinSplat, a framework built on the extrinsic paradigm that decouples geometry from semantics. Instead of embedding features, ExtrinSplat clusters Gaussians into multi-granularity, overlapping 3D object groups. A Vision-Language Model (VLM) then interprets these groups to generate lightweight textual hypotheses, creating an extrinsic index layer that natively supports complex polysemy. By replacing costly feature embedding with lightweight indices, ExtrinSplat reduces scene adaptation time from hours to minutes and lowers storage overhead by several orders of magnitude. On benchmark tasks for open-vocabulary 3D object selection and semantic segmentation, ExtrinSplat outperforms established embedding-based frameworks, validating the efficacy and efficiency of the proposed extrinsic paradigm.
PaperID: 2220,   Poster  https://arxiv.org/pdf/2503.01347    
Authors: Ruikun Zhang, Yan Yang, Liyuan Pan
Title: From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images
Abstract: Spatial transcriptomics (ST) measures gene expression at fine-grained spatial resolution, offering insights into tissue molecular landscapes. Previous methods for spatial gene expression prediction typically crop spots of interest from histopathology slide images, and train models to map each spot to a corresponding gene expression profile. However, these methods inherently lose the spatial resolution in gene expression: 1) each spot often contains multiple cells with distinct gene expression profiles; 2) spots are typically defined at fixed spatial resolutions, limiting the ability to predict gene expression at varying scales. To address these limitations, this paper presents PixNet, a dense prediction network capable of predicting spatially resolved gene expression across spots of varying sizes and scales directly from histopathology slide images. Different from previous methods that map individual spots to gene expression values, we generate a spatially dense continuous gene expression map from the histopathology slide image, and aggregate values within spots of interest to predict the gene expression. Our PixNet outperforms state-of-the-art methods on four common ST datasets in multiple spatial scales. The source code will be publicly available.
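Illustrative note: the "aggregate values within spots of interest" step described in the abstract above can be pictured with a minimal sketch. This is not the authors' code. The function name `aggregate_spot`, the circular spot shape, and the use of a mean (rather than a sum or a learned pooling) are all assumptions made only for illustration, and the map holds a single gene for simplicity.

```python
def aggregate_spot(expr_map, center, radius):
    """Average a dense per-pixel expression map inside one circular spot.

    expr_map: 2D list indexed as [row][col] (hypothetical single-gene map).
    center:   (row, col) of the spot center, in pixels.
    radius:   spot radius in pixels; changing it changes the spatial scale.
    Mean pooling is an assumption; the abstract only says "aggregate".
    """
    cy, cx = center
    values = [
        expr_map[y][x]
        for y in range(len(expr_map))
        for x in range(len(expr_map[0]))
        if (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2
    ]
    if not values:
        raise ValueError("spot does not overlap the map")
    return sum(values) / len(values)


# Toy 3x3 map: the same dense map serves spots of different sizes.
toy_map = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0],
           [7.0, 8.0, 9.0]]
small_spot = aggregate_spot(toy_map, center=(1, 1), radius=0)
large_spot = aggregate_spot(toy_map, center=(1, 1), radius=1)
```

Because the dense map is predicted once, spots of any size or position can be read out afterwards without re-cropping the slide image.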
PaperID: 2221,   Poster  https://arxiv.org/pdf/2512.06684    
Authors: Yumeng He, Zanwei Zhou, Yekun Zheng, Chen Liang, Yunbo Wang, Xiaokang Yang
Title: EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy
Abstract: Volume electron microscopy (vEM) enables nanoscale 3D imaging of biological structures but remains constrained by acquisition trade-offs, leading to anisotropic volumes with limited axial resolution. Existing deep learning methods seek to restore isotropy by leveraging lateral priors; yet their assumptions break down for morphologically anisotropic structures. We present EMGauss, a general framework for 3D reconstruction from planar scanned 2D slices with applications in vEM, which circumvents the inherent limitations of isotropy-based approaches. Our key innovation is to reframe slice-to-3D reconstruction as a 3D dynamic scene rendering problem based on Gaussian splatting, where the progression of axial slices is modeled as the temporal evolution of 2D Gaussian point clouds. To enhance fidelity in data-sparse regimes, we incorporate a Teacher–Student bootstrapping mechanism that uses high-confidence predictions on unobserved slices as pseudo-supervisory signals. Compared with diffusion- and GAN-based reconstruction methods, EMGauss substantially improves interpolation quality, enables continuous slice synthesis, and eliminates the need for large-scale pretraining. Beyond vEM, it potentially provides a generalizable slice-to-3D solution across diverse imaging domains.
PaperID: 2222,   Poster  https://arxiv.org/pdf/2602.01047    
Authors: Xinrong Chen, Xu Chu, Yingmin Qiu, Hengyuan Zhang, Jing Xiong, Shiyu Tang, Shuai Liu, Shaokang Yang, Cheng Yang, Hayden Kwok-Hay So, Ngai Wong
Title: Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance
Abstract: Large Vision-Language Models (LVLMs) can reason effectively from image-text inputs and perform well in various multimodal tasks. Despite this success, they are affected by language priors and often produce hallucinations. Hallucinations denote generated content that is grammatically and syntactically coherent, yet bears no match or direct relevance to actual visual input. To address this problem, we propose Residual Decoding (ResDec). It is a novel training-free method that uses historical information to aid decoding. The method relies on the internal implicit reasoning mechanism and token logits evolution mechanism of LVLMs to correct biases. Extensive experiments demonstrate that ResDec effectively suppresses hallucinations induced by language priors, significantly improves visual grounding, and reduces object hallucinations. In addition to mitigating hallucinations, ResDec also performs exceptionally well on comprehensive LVLM benchmarks, highlighting its broad applicability.
PaperID: 2223,   Poster  https://arxiv.org/pdf/2505.23343    
Authors: SIXIAN WANG, Zhiwei Tang, Tsung-Hui Chang
Title: Diffusion Sampling Path Tells More: An Efficient Plug-and-Play Strategy for Sample Filtering
Abstract: Diffusion models often exhibit inconsistent sample quality due to stochastic variations inherent in their sampling trajectories. Although training-based fine-tuning and inference-time alignment techniques aim to improve sample fidelity, they typically necessitate full denoising processes and external reward signals. This incurs substantial computational costs, hindering their broader applicability. In this work, we unveil an intriguing phenomenon: a previously unobserved yet exploitable link between sample quality and characteristics of the denoising trajectory during classifier-free guidance (CFG). Specifically, we identify a strong correlation between high-density regions of the sample distribution and the Accumulated Score Differences (ASD), the cumulative divergence between conditional and unconditional scores. Leveraging this insight, we introduce CFG-Rejection, an efficient, plug-and-play strategy that filters low-quality samples at an early stage of the denoising process, crucially without requiring external reward signals or model retraining. Importantly, our approach necessitates no modifications to model architectures or sampling schedules and maintains full compatibility with existing diffusion frameworks. We validate the effectiveness of CFG-Rejection in image generation through extensive experiments, demonstrating marked improvements on human preference scores (HPSv2, PickScore) and challenging benchmarks (GenEval, DPG-Bench). We anticipate that CFG-Rejection will offer significant advantages for diverse generative modalities beyond images, paving the way for more efficient and reliable high-quality sample generation.
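Illustrative note: a minimal sketch of how an accumulated score difference could be computed and used to filter samples early, assuming the per-step conditional and unconditional score (noise) predictions are already available as plain vectors. This is not the authors' implementation. The L2 norm, the function names, the `keep_fraction` parameter, and the choice of which end of the ranking to keep (`keep_high`) are assumptions, since the abstract states only that ASD correlates strongly with high-density regions.

```python
import math


def accumulated_score_difference(cond_scores, uncond_scores):
    """Sum, over the early denoising steps, of the L2 distance between the
    conditional and unconditional score predictions of one sample.

    cond_scores / uncond_scores: lists (one entry per step) of flat vectors.
    The L2 norm is an assumed choice of divergence.
    """
    total = 0.0
    for cond, uncond in zip(cond_scores, uncond_scores):
        total += math.sqrt(sum((c - u) ** 2 for c, u in zip(cond, uncond)))
    return total


def filter_by_asd(asd_values, keep_fraction=0.5, keep_high=True):
    """Return the indices of samples that continue denoising.

    keep_high is a hypothetical switch: set it according to the direction
    of the ASD/quality correlation observed for the model at hand.
    """
    n_keep = max(1, int(round(len(asd_values) * keep_fraction)))
    ranked = sorted(range(len(asd_values)),
                    key=lambda i: asd_values[i], reverse=keep_high)
    return sorted(ranked[:n_keep])


# Toy trajectory with two early steps and 2-dimensional scores.
toy_asd = accumulated_score_difference(
    cond_scores=[[1.0, 0.0], [0.0, 2.0]],
    uncond_scores=[[0.0, 0.0], [0.0, 0.0]],
)
kept = filter_by_asd([3.0, 0.5, 2.0, 1.0], keep_fraction=0.5)
```

Both predictions are already produced at every CFG step, so a statistic of this kind needs no extra network evaluations or external reward model.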
PaperID: 2224,   Poster  https://arxiv.org/pdf/2511.22690    
Authors: Shubhankar Borse, Phuc Pham, Farzad Farhadzadeh, Seokeon Choi, Phong Nguyen, Anh Tran, Sungrack Yun, Munawar Hayat, Fatih Porikli
Title: Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation
Abstract: Despite recent advances in text-to-image generation, existing models consistently fail to produce reliable multi-human scenes, often duplicating faces, merging identities, or miscounting individuals. We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. The Architect module predicts structured layouts, specifying where each person should appear. The Artist module then synthesizes photorealistic images, guided by a spatially-grounded face matching reward that combines Hungarian spatial alignment with ArcFace identity similarity. This approach ensures faces are rendered at correct locations and faithfully preserve reference identities. We develop two Architect variants, seamlessly integrated with our diffusion-based Artist model and optimized via Group Relative Policy Optimization (GRPO) using compositional rewards for count accuracy, image quality, and identity matching. Evaluated on the MultiHuman-Testbench, Ar2Can achieves substantial improvements in both count accuracy and identity preservation, while maintaining high perceptual quality. Notably, our method achieves these results using primarily synthetic data, without requiring real multi-human images.
PaperID: 2225,   Poster  https://arxiv.org/pdf/2506.07917    
Authors: Allen Tu, Haiyang Ying, Alex Hanson, Yonghan Lee, Tom Goldstein, Matthias Zwicker
Title: SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping
Abstract: Dynamic extensions of 3D Gaussian Splatting (3DGS) achieve high-quality reconstructions through neural motion fields, but per-Gaussian neural inference makes these models computationally expensive. Building on DeformableGS, we introduce Speedy Deformable 3D Gaussian Splatting (SpeeDe3DGS), which bridges this efficiency–fidelity gap through three complementary modules: Temporal Sensitivity Pruning (TSP) removes low-impact Gaussians via temporally aggregated sensitivity analysis, Temporal Sensitivity Sampling (TSS) perturbs timestamps to suppress floaters and improve temporal coherence, and GroupFlow distills the learned deformation field into shared SE(3) transformations for efficient groupwise motion. On the 50 dynamic scenes in MonoDyGauBench, integrating TSP and TSS into DeformableGS accelerates rendering by 6.78× on average while maintaining neural-field fidelity and using 10× fewer primitives. Adding GroupFlow culminates in 13.71× faster rendering and 2.53× shorter training, surpassing all baselines in speed while preserving superior image quality.
PaperID: 2226,   Poster  https://arxiv.org/pdf/2602.21917    
Authors: Chen Wu, Ling Wang, Zhuoran Zheng, Yuning Cui, Zhixiong Yang, Xiangyu Chen, Yue Zhang, Weidong Jiang, Jingyuan Xia
Title: Scan Clusters, Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration
Abstract: Ultra-High-Definition (UHD) image restoration is trapped in a scalability crisis: existing models, bound to pixel-wise operations, demand unsustainable computation. While state space models (SSMs) like Mamba promise linear complexity, their pixel-serial scanning remains a fundamental bottleneck for the millions of pixels in UHD content. We ask: must we process every pixel to understand the image? This paper introduces C^2SSM, a visual state space model that breaks this taboo by shifting from pixel-serial to cluster-serial scanning. Our core discovery is that the rich feature distribution of a UHD image can be distilled into a sparse set of semantic centroids via a neural-parameterized mixture model. C^2SSM leverages this to reformulate global modeling into a novel dual-path process: it scans and reasons over a handful of cluster centers, then diffuses the global context back to all pixels through a principled similarity distribution, all while a lightweight modulator preserves fine details. This cluster-centric paradigm achieves a decisive leap in efficiency, slashing computational costs while establishing new state-of-the-art results across five UHD restoration tasks. More than a solution, C^2SSM charts a new course for efficient large-scale vision: scan clusters, not pixels.
PaperID: 2227,   Poster  https://arxiv.org/pdf/2604.00648    
Authors: Zhengxian Yang, Fei Xie, Xutao Xue, Rui Zhang, Taicheng Huang, Yang Liu, Mengqi Ji, Tao Yu
Title: DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization
Abstract: 3D Gaussian Splatting (3DGS) has enabled efficient 3D scene reconstruction from everyday images with real-time, high-fidelity rendering, greatly advancing VR/AR applications. Fisheye cameras, with their wider field of view (FOV), promise high-quality reconstructions from fewer inputs and have recently attracted much attention. However, since 3DGS relies on rasterization, most subsequent works involving fisheye camera inputs first undistort images before training, which introduces two problems: 1) Black borders at image edges cause information loss and negate the fisheye’s large FOV advantage; 2) Undistortion’s stretch-and-interpolate resampling spreads each pixel’s value over a larger area, diluting detail density, which causes 3DGS to overfit these low-frequency zones, producing blur and floating artifacts. In this work, we integrate a fisheye camera model into the original 3DGS framework, enabling native fisheye image input for training without preprocessing. Despite correct modeling, we observed that the reconstructed scenes still exhibit floaters at image edges: distortion increases toward the periphery, and 3DGS's original per-iteration random-selecting-view optimization ignores the cross-view correlations of a Gaussian, leading to extreme shapes (e.g., oversized or elongated) that degrade reconstruction quality. To address this, we introduce a feature-overlap–driven cross-view joint optimization strategy that establishes consistent geometric and photometric constraints across views, a technique equally applicable to existing pinhole-camera-based pipelines. Our DirectFisheye-GS matches or surpasses state-of-the-art performance on public datasets.
PaperID: 2228,   Poster  https://arxiv.org/pdf/2508.05135    
Authors: Thinh Nguyen, Le Trung Phan, Binh Nguyen, Khoa D Doan, KOK SENG WONG
Title: HFedATM: Hierarchical Federated Domain Generalization via Optimal Transport and Regularized Mean Aggregation
Abstract: Federated Learning (FL) is a decentralized approach where multiple clients collaboratively train a shared global model without sharing their raw data. Despite its effectiveness, conventional FL faces scalability challenges due to excessive computational and communication demands placed on a single central server as the number of participating devices grows. Hierarchical Federated Learning (HFL) addresses these issues by distributing model aggregation tasks across intermediate nodes (stations), thereby enhancing system scalability and robustness against single points of failure. However, HFL still suffers from a critical yet often overlooked limitation: domain shift, where data distributions vary significantly across different clients and stations, reducing model performance on unseen target domains. While Federated Domain Generalization (FedDG) methods have emerged to improve robustness to domain shifts, their integration into HFL frameworks remains largely unexplored. In this paper, we formally introduce Hierarchical Federated Domain Generalization (HFedDG), a novel scenario designed to investigate domain shift within hierarchical architectures. Specifically, we propose HFedATM, a hierarchical aggregation method that first aligns the convolutional filters of models from different stations through Filterwise Optimal Transport Alignment and subsequently merges aligned models using a Shrinkage-aware Regularized Mean Aggregation. Our extensive experimental evaluations demonstrate that HFedATM significantly boosts the performance of existing FedDG baselines across multiple datasets and maintains computational and communication efficiency. Moreover, theoretical analyses indicate that HFedATM achieves tighter generalization error bounds compared to standard hierarchical averaging, resulting in faster convergence and stable training behavior.
PaperID: 2229,   Poster  https://arxiv.org/pdf/2510.20470    
Authors: Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun
Title: Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
Abstract: Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding, yet still struggle with inaccurate evidence localization. To address these limitations, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies context and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we 1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that include frame identification, evidence reasoning, and action decision, and 2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to progressively incentivize multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long video understanding tasks, validating its strong scalability and robustness.
PaperID: 2230,   Poster  https://arxiv.org/pdf/2512.15347    
Authors: Shiran Ge, Chenyi Huang, Yuang Ai, Qihang Fan, Huaibo Huang, Ran He
Title: Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models
Abstract: Group Relative Policy Optimization (GRPO) is a powerful technique for aligning generative models, but its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs. In this work, we investigate the trade-off through empirical studies, yielding two key observations. First, we discover the reward clustering phenomenon in which many trajectories collapse toward the group-mean reward, offering limited optimization value. Second, we design a heuristic strategy named Optimal Variance Filtering (OVF), and verify that a high-variance subset of trajectories, selected by OVF, can outperform the larger, unfiltered group. However, this static, post-sampling OVF approach still necessitates critical computational overhead, as it performs unnecessary sampling for trajectories that are ultimately discarded. To resolve this, we propose Pro-GRPO (Proactive GRPO), a novel dynamic framework that integrates latent feature-based trajectory pruning into the sampling process. Through the early termination of reward-clustered trajectories, Pro-GRPO reduces computational overhead. Leveraging its efficiency, Pro-GRPO employs an "Expand-and-Prune" strategy. This strategy first expands the size of the initial sampling group to maximize trajectory diversity, then applies multi-step OVF to the latents, avoiding prohibitive computational costs. Extensive experiments on both diffusion-based and flow-based models demonstrate the generality and effectiveness of our Pro-GRPO framework.
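Illustrative note: a minimal sketch of selecting a high-variance subset of trajectory rewards before computing group-relative advantages, in the spirit of the variance-filtering idea described in the abstract above. This is not the authors' OVF procedure. The function names, the fixed subset size `k`, and the search over "j lowest plus k-j highest" candidates are assumptions made for illustration; the advantage formula is the standard mean/std normalization used in GRPO.

```python
from statistics import mean, pstdev, pvariance


def select_high_variance_subset(rewards, k):
    """Return indices of k rewards whose population variance is largest.

    Candidates are restricted to the j lowest plus the (k - j) highest
    rewards, because values near the group mean add little variance.
    """
    if not 1 <= k <= len(rewards):
        raise ValueError("k must be between 1 and len(rewards)")
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    best_idx, best_var = None, -1.0
    for j in range(k + 1):
        idx = order[:j] + order[len(order) - (k - j):]
        var = pvariance([rewards[i] for i in idx])
        if var > best_var:
            best_idx, best_var = idx, var
    return sorted(best_idx)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group: (r - mean) / (std + eps)."""
    m, s = mean(rewards), pstdev(rewards)
    return [(r - m) / (s + eps) for r in rewards]


# Toy group: most rewards cluster near 0.5, two are informative outliers.
toy_rewards = [0.1, 0.5, 0.5, 0.5, 0.9, 0.52]
chosen = select_high_variance_subset(toy_rewards, k=2)
advantages = group_relative_advantages([toy_rewards[i] for i in chosen])
```

In the toy group the reward-clustered trajectories around 0.5 are dropped, and the retained pair yields advantages close to -1 and +1.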
PaperID: 2231,   Poster  https://arxiv.org/pdf/2511.15316    
Authors: Zhihan Ren, Lijun He, Jiaxi Liang, Xinzhu Fu, Haixia Bi, Fan Li
Title: What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs
Abstract: Split DNNs enable edge devices by offloading intensive computation to a cloud server, but this paradigm exposes privacy vulnerabilities, as the intermediate features can be exploited to reconstruct the private inputs via Feature Inversion Attack (FIA). Existing FIA methods often produce limited reconstruction quality, making it difficult to assess the true extent of privacy leakage. To reveal the privacy risk of the leaked features, we introduce FIA-Flow, a black-box FIA framework that achieves high-fidelity image reconstruction from intermediate features. To exploit the semantic information within intermediate features, we design a Latent Feature Space Alignment Module (LFSAM) to bridge the semantic gap between the intermediate feature space and the latent space. Furthermore, to rectify distributional mismatch, we develop Deterministic Inversion Flow Matching (DIFM), which projects off-manifold features onto the target manifold with one-step inference. This decoupled design simplifies learning and enables effective training with few image–feature pairs. To quantify privacy leakage from a human perspective, we also propose two metrics based on a large vision-language model. Experiments show that FIA-Flow achieves more faithful and semantically aligned feature inversion across various models (AlexNet, ResNet, Swin Transformer, DINO, and YOLO11) and layers, revealing a more severe privacy threat in Split DNNs than previously recognized.
PaperID: 2232,   Poster  https://arxiv.org/pdf/2511.13175    
Authors: Chao Yang, Boqian Zhang, Jinghao Xu, Guang Jiang
Title: HDW-SR: High-Frequency Guided Diffusion Model based on Wavelet Decomposition for Image Super-Resolution
Abstract: Diffusion-based methods have shown great promise in single image super-resolution (SISR); however, existing approaches often produce blurred fine details due to insufficient guidance in the high-frequency domain. To address this issue, we propose a High-Frequency Guided Diffusion Network based on Wavelet Decomposition (HDW-SR), which replaces the conventional U-Net backbone in diffusion frameworks. Specifically, we perform diffusion only on the residual map, allowing the network to focus more effectively on high-frequency information restoration. We then introduce wavelet-based downsampling in place of standard CNN downsampling to achieve multi-scale frequency decomposition, enabling sparse cross-attention between the high-frequency subbands of the pre-super-resolved image and the low-frequency subbands of the diffused image for explicit high-frequency guidance. Moreover, a Dynamic Thresholding Block (DTB) is designed to refine high-frequency selection during the sparse attention process. During upsampling, the invertibility of the wavelet transform ensures low-loss feature reconstruction. Experiments on both synthetic and real-world datasets demonstrate that HDW-SR achieves competitive super-resolution performance, excelling particularly in recovering fine-grained image details. The code will be available after acceptance.
PaperID: 2233,   Poster  https://arxiv.org/pdf/2604.02752    
Authors: Jinfan Liu, Wuze Zhang, Zhangli Hu, Zhehan Zhao, Ye Chen, Bingbing Ni
Title: Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation
Abstract: In stroke-based rendering, search methods often get trapped in local minima due to discrete stroke placement, while differentiable optimizers lack structural awareness and produce unstructured layouts. To bridge this gap, we propose a dual representation that couples discrete polylines with continuous Bézier control points via a bidirectional mapping mechanism. This enables collaborative optimization: local gradients refine global stroke structures, while content-aware stroke proposals help escape poor local optima. Our representation further supports Gaussian-splatting-inspired initialization, enabling highly parallel stroke optimization across the image. Experiments show that our approach reduces the number of strokes by 30–50%, achieves more structurally coherent layouts, and improves reconstruction quality, while cutting optimization time by 30–40% compared to existing differentiable vectorization methods.
PaperID: 2234,   Poster  https://arxiv.org/pdf/2603.04337    
Authors: Dacheng Qi, Chenyu Wang, Jingwei Xu, Tianzhe Chu, Zibo Zhao, Wen Liu, Wenrui Ding, Yi Ma, Shenghua Gao
Title: Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
Abstract: Constructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large Language Models (LLMs) have inspired LLM-based CAD generation by representing CAD as command sequences. However, these methods struggle in practical scenarios because the command sequence representation does not support entity selection (e.g., faces or edges), limiting its ability to support complex editing operations such as chamfer or fillet. Further, the discretization of a continuous variable during sketch and extrude operations may result in topological errors. To address these limitations, we present Pointer-CAD, a novel LLM-based CAD generation framework that leverages a pointer-based command sequence representation to explicitly incorporate the geometric information of B-rep models into sequential modeling. In particular, Pointer-CAD decomposes CAD model generation into steps, conditioning the generation of each subsequent step on both the textual description and the B-rep generated from previous steps. Whenever an operation requires the selection of a specific geometric entity, the LLM predicts a Pointer that selects the most feature-consistent candidate from the available set. Such a selection operation also reduces the quantization error in the command sequence-based representation. To support the training of Pointer-CAD, we develop a data annotation pipeline that produces expert-level natural language descriptions and apply it to build a dataset of approximately 575K CAD models. Extensive experimental results demonstrate that Pointer-CAD effectively supports the generation of complex geometric structures and reduces segmentation error to an extremely low level, achieving a significant improvement over prior command sequence methods, thereby significantly mitigating the topological inaccuracies introduced by quantization error.
PaperID: 2235,   Poster  https://arxiv.org/pdf/2603.04908    
Authors: Li'an Zhong, Ziqiang He, Jibin Zheng, Jin Li, Z. Wang, Xiangui Kang
Title: AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM
Abstract: Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). To mitigate hallucinations, one intuitive and effective way is to directly increase attention weights to image tokens during inference. Although this effectively reduces the hallucination rate, it often induces repetitive descriptions. To address this, we first conduct an analysis of attention patterns and reveal that real object tokens tend to assign higher attention to the generated text than hallucinated ones. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Increase Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions. To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT) that employs a layer-wise threshold to control intervention time and fine-grained amplification magnitude tailored to the characteristics of each attention head. Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results of several LVLMs show that AdaIAT effectively alleviates hallucination (reducing hallucination rates C_S and C_I on LLaVA-1.5 by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.
PaperID: 2236,   Poster  https://arxiv.org/pdf/2505.19888    
Authors: EunGyung Kong, Jewon Yeom, Yonghoon Jeon, Taesup Kim
Title: Generalized and Personalized Federated Learning with Black-Box Foundation Models via Orthogonal Transformations
Abstract: Federated Learning (FL) facilitates decentralized model training while preserving data privacy. However, achieving both robust generalization and effective personalization simultaneously in heterogeneous (non-IID) environments remains a formidable challenge. Furthermore, the widespread adoption of proprietary Foundation Models (FMs) introduces a critical requirement for dual privacy: (a) protecting sensitive client data and (b) securing the server's valuable intellectual property. This mandates strictly black-box access to the FM. To address these multifaceted challenges, we introduce FedOT, a novel FL framework optimized for black-box FMs. FedOT employs a shared global task-dependent classifier while facilitating local adaptation through client-specific orthogonal transformations applied externally to the FM embeddings. This architecture inherently guarantees that the FM's internal parameters remain inaccessible and unmodified. By enforcing orthogonality, FedOT effectively mitigates gradient conflicts across diverse clients, which is theoretically bounded, preserves the semantic integrity of the FM representations, and achieves robust performance under significant data heterogeneity. The synergy of global and local parameters optimally balances generalization and personalization, markedly outperforming baseline FL methods across diverse benchmarks. Extensive empirical analysis, including rigorous multi-seed validation and scalability assessments, substantiates the robustness, efficiency, and superior performance of FedOT.
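Editorial sketch (not from the paper): the abstract's claim that orthogonal transformations preserve the structure of the FM embeddings rests on a standard fact, namely that an orthogonal map leaves inner products and norms unchanged. The NumPy snippet below illustrates only that fact. The function names and the QR-based construction of the matrix are assumptions for illustration; the abstract does not say how FedOT parameterizes or trains its transformations.

```python
import numpy as np


def random_orthogonal(dim, seed=0):
    """Return a dim x dim orthogonal matrix (hypothetical helper).

    Assumption: the Q factor of a QR decomposition of a Gaussian matrix
    stands in for a learned client-specific transformation.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def transform_embeddings(embeddings, q):
    """Apply the orthogonal map q to each row of an (n, dim) embedding matrix."""
    return embeddings @ q.T


# Toy stand-in for frozen black-box FM embeddings: 5 samples, 8 dimensions.
rng = np.random.default_rng(1)
embeddings = rng.standard_normal((5, 8))
q = random_orthogonal(8)
transformed = transform_embeddings(embeddings, q)

# Pairwise inner products (the Gram matrix) are the same before and after.
gram_before = embeddings @ embeddings.T
gram_after = transformed @ transformed.T
```

Because the Gram matrix is unchanged, cosine similarities between samples are unchanged as well, which is one way to read "preserves the semantic integrity" in the abstract.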
PaperID: 2237,   Poster  https://arxiv.org/pdf/2603.01007    
Authors: Xubo Zhu, Haoyang Zhang, Fei He, Rui Wu, Yanhu Shan, Wen Yang, Huai Yu
Title: Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving
Abstract: 3D occupancy prediction is crucial for autonomous driving perception, offering comprehensive geometric scene understanding and semantic recognition. However, existing methods struggle with geometric misalignment in view transformation due to the lack of pixel-level accurate depth estimation, and with severe spatial class imbalance where semantic categories exhibit strong spatial anisotropy. To address these challenges, we propose Dr.Occ, a depth- and region-guided occupancy prediction framework. Specifically, we introduce a depth-guided 2D-to-3D View Transformer (D^2-VFormer) that effectively leverages high-quality dense depth cues from MoGe-2 to construct reliable geometric priors, thereby enabling precise geometric alignment of voxel features. Moreover, inspired by the Mixture-of-Experts (MoE) framework, we propose a region-guided Expert Transformer (R/R^2-EFormer) that adaptively allocates region-specific experts to focus on different spatial regions, effectively addressing spatial semantic variations. Thus, the two components make complementary contributions: depth guidance ensures geometric alignment, while region experts enhance semantic learning. Experiments on the Occ3D-nuScenes benchmark demonstrate that Dr.Occ improves the strong baseline BEVDet4D by 7.43% mIoU and 3.09% IoU under the full vision-only setting. Code will be made publicly available.
PaperID: 2238,   Poster  https://arxiv.org/pdf/2505.17006    
Authors: Jiange Yang, tom tomlinson, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, Limin Wang
Title: CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning
Abstract: Unsupervised learning of latent motion from Internet videos is crucial for building generalist robots. Existing discrete methods generally mitigate the shortcut learning problem caused by extracting excessive static background information through vector quantization with a small codebook size. However, they suffer from information loss and struggle to capture more complex and fine-grained dynamics. Moreover, there is an inherent gap between the distribution of discrete latent motion and continuous robot action, which hinders the joint learning of a unified policy. We propose CoMo, which aims to learn more precise continuous latent motion from internet-scale videos. CoMo employs an early temporal difference (Td) mechanism to increase the difficulty of shortcut learning and explicitly enhance motion cues. Additionally, to ensure that continuous latent motion better captures meaningful foreground information, we further propose a temporal contrastive learning (Tcn) scheme. Specifically, positive pairs are constructed from motion representations with a small future frame temporal offset, while negative pairs are formed by directly reversing the temporal direction. The proposed Td and Tcn work synergistically and effectively ensure that the latent motion focuses better on the foreground and reinforces motion cues. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate effective pseudo action labels for unseen videos. The shared continuous distribution of robot action and video latent motion also significantly benefits the joint learning of a unified policy. Extensive simulated and real-world experiments show that policies co-trained with CoMo pseudo action labels achieve superior performance with both diffusion and autoregressive architectures.
PaperID: 2239,   Poster  https://arxiv.org/pdf/2511.20520    
Authors: Xiang Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yuqian Zhou, Qing Liu, Shiwei Zhang, Yijun Li, Shaoteng Liu, Haitian Zheng, Jason Kuen, Yuehuan Wang, Changxin Gao, Nong Sang
Title: HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation
Abstract: Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert to another for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing attention sharing by over 40%, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.
PaperID: 2240,   Poster  https://arxiv.org/pdf/2604.03972    
Authors: Xueyang Kang, Zizhao Li, Tian Lan, Dong Gong, Kourosh Khoshelham, Liangliang Nan
Title: Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection
Abstract: 3D shape anomaly detection is a crucial task for industrial inspection and geometric analysis. Existing deep learning approaches typically learn representations of normal shapes and identify anomalies via out-of-distribution feature separation or decoder-based reconstruction. They often fail to generalize across diverse anomaly types and scales, such as global geometric errors (e.g., planar shifts, surface misalignments), and are sensitive to noisy or incomplete local points during training. To address these limitations, we propose a hierarchical point–patch anomaly scoring network that jointly models regional part features and local point features for robust anomaly reasoning. An adaptive patchification module integrates self-supervised decomposition to capture complex structural deviations. Beyond evaluations on public benchmarks (Anomaly-ShapeNet and Real3D-AD), we release an industrial test set with real CAD models exhibiting planar, angular, and structural defects. Experiments on public and industrial datasets show superior AUC-ROC and AUC-PR performance, including over 50% point-level improvement on the new industrial anomaly type and average object-level gains of 7% on Real3D-AD and 4% on Anomaly-ShapeNet, demonstrating strong robustness and generalization.
PaperID: 2241,   Poster  https://arxiv.org/pdf/2603.11680    
Authors: Thien Tan Cao, Phan Thi Thu Trang, Duc N. Do, Ho Anh, Nguyen Duc Dung, Duc Dung Nguyen
Title: UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution
Abstract: Hybrid CNN-Transformer architectures achieve strong results in image super-resolution, but scaling attention windows or convolution kernels significantly increases computational cost, limiting deployment on resource-constrained devices. We present UCAN, a lightweight network that unifies convolution and attention to expand the effective receptive field efficiently. UCAN combines window-based spatial attention with a Hedgehog Attention mechanism to model both local texture and long-range dependencies, and introduces a distillation-based large-kernel module to preserve high-frequency structure without heavy computation. In addition, we employ cross-layer parameter sharing to further reduce complexity. On Manga109 (4×), UCAN-L achieves 31.63 dB PSNR with only 48.4G MACs, surpassing recent lightweight models. On BSDS100, UCAN attains 27.79 dB, outperforming methods with significantly larger models. Extensive experiments show that UCAN achieves a superior trade-off between accuracy, efficiency, and scalability, making it well-suited for practical high-resolution image restoration. The code will be published later.
PaperID: 2242,   Poster  https://arxiv.org/pdf/2504.12606    
Authors: Changsheng Lv, Zijian Fu, Mengshi Qi
Title: Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution Can Improve Robust Scene Graph Generation
Abstract: In this paper, we propose Robo-SGG, a plug-and-play module for robust scene graph generation (SGG). Unlike standard SGG, robust scene graph generation aims to perform inference on a diverse range of corrupted images, with the core challenge being the domain shift between the clean and corrupted images. Existing SGG methods suffer from degraded performance due to shifted visual features (e.g., corruption interference or occlusions). To obtain robust visual features, we leverage layout information, which represents the global structure of an image and is robust to domain shift, to enhance the robustness of SGG methods under corruption. Specifically, we employ Instance Normalization (IN) to alleviate the domain-specific variations and recover the robust structural features (i.e., the positional and semantic relationships among objects) by the proposed Layout-Oriented Restitution. Furthermore, under corrupted images, we introduce a Layout-Embedded Encoder (LEE) that adaptively fuses layout and visual features via a gating mechanism, enhancing the robustness of positional and semantic representations for objects and predicates. Note that our proposed Robo-SGG module is designed as a plug-and-play component, which can be easily integrated into any baseline SGG model. Extensive experiments demonstrate that by integrating our proposed Robo-SGG into the state-of-the-art method, we achieve relative improvements of 6.3%, 11.1%, and 8.0% in mR@50 for PredCls, SGCls, and SGDet tasks on the VG-C benchmark, respectively, and achieve new state-of-the-art performance on the corrupted scene graph generation benchmarks (VG-C and GQA-C). We will release our source code and model.
PaperID: 2243,   Poster  https://arxiv.org/pdf/2512.16523    
Authors: Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, Qi Li
Title: TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models
Abstract: Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios. Previous training-time defenses rely on adversarial fine-tuning, which requires labeled data and costly retraining, while existing test-time strategies fail to reliably distinguish between clean and adversarial inputs, thereby preventing both adversarial robustness and clean accuracy from reaching their optimum. To address these limitations, we propose Test-Time Padding (TTP), a lightweight defense framework that performs adversarial detection followed by targeted adaptation at inference. TTP identifies adversarial inputs via the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold for reliable detection across architectures and datasets. For detected adversarial cases, TTP employs trainable padding to restore disrupted attention patterns, coupled with a similarity-aware ensemble strategy for a more robust final prediction. For clean inputs, TTP leaves them unchanged by default or optionally integrates existing test-time adaptation techniques for further accuracy gains. Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy. The code for this paper will be released soon.
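Editorial sketch (not from the paper): the detection step described above, comparing embeddings before and after spatial padding, can be illustrated as follows. This is a minimal sketch under stated assumptions: `embed_fn` is a hypothetical stand-in for a CLIP image encoder, zero padding is used, the shift is defined as one minus the cosine similarity, and a larger shift is taken to indicate an adversarial input. The pad width and threshold are placeholders, not values from the paper.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def pad_image(image, pad):
    """Zero-pad the height and width of an (H, W, C) image array."""
    return np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")


def detect_by_padding_shift(image, embed_fn, pad=16, threshold=0.1):
    """Flag an input whose embedding moves a lot under padding.

    Assumptions: embed_fn maps an (H, W, C) array to a feature vector;
    pad and threshold are illustrative placeholder values.
    """
    sim = cosine_similarity(embed_fn(image), embed_fn(pad_image(image, pad)))
    shift = 1.0 - sim
    return shift > threshold, shift


# Toy encoders standing in for CLIP, to exercise both branches.
def channel_sum_encoder(img):
    # Invariant to zero padding: padded pixels add nothing to the sums.
    return img.sum(axis=(0, 1))


def corner_encoder(img):
    # Reads only the top-left 2x2 patch, so padding changes it completely.
    return img[:2, :2].ravel()


image = np.ones((8, 8, 3))
stable_flag, stable_shift = detect_by_padding_shift(image, channel_sum_encoder)
fragile_flag, fragile_shift = detect_by_padding_shift(image, corner_encoder)
```

With a real encoder, the threshold would have to be calibrated on held-out clean and adversarial inputs; the abstract reports that a single threshold transfers across architectures and datasets.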
PaperID: 2244,   Poster  https://arxiv.org/pdf/2512.16727    
Authors: Haochen Chang, Pengfei Ren, Buyuan Zhang, Da Li, Tianhao Han, HaoYang ZHANG, Liang Xie, Hongbo Chen, Erwei Yin
Title: OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition
Abstract: Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, which make constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline to automatically generate skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Based on this pipeline, we introduce OMG-Bench, the first large-scale public benchmark for skeleton-based online micro gesture recognition. It features 40 fine-grained gesture classes with 13,948 instances across 1,272 sequences, characterized by subtle motions, rapid dynamics, and continuous execution. To tackle these challenges, we propose Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification by leveraging hierarchical memory banks which store frame-level details and window-level semantics to preserve historical context. In addition, it employs learnable position-aware queries initialized from the memory to implicitly encode gesture positions and semantics. Experiments show that HMATr outperforms state-of-the-art methods by 7.6% in detection rate, establishing a strong baseline for online micro gesture recognition. Our code is available in Suppl. Mat. and the dataset will be available later.
PaperID: 2245,   Poster  https://arxiv.org/pdf/2509.01644    
Authors: Yanqing Liu, Xianhang Li, Letian Zhang, Zirui Wang, Zeyu Zheng, Yuyin Zhou, Cihang Xie
Title: OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
Abstract: This paper provides a simplification of OpenVision's architecture and loss design for enhancing its training efficiency. Following the prior vision-language pretraining works CapPa and AIMv2, as well as modern multimodal designs like LLaVA, our changes are straightforward: we remove the text encoder (and therefore the contrastive loss), retaining only the captioning loss as a purely generative training signal. We name this new version OpenVision 2. The initial results are promising: despite this simplification, OpenVision 2 competitively matches the original model's performance on a broad set of multimodal benchmarks while substantially cutting both training time and memory consumption. For example, with ViT-L/14, it reduces training time by about 1.5x (from 83h to 57h), and memory usage by about 1.8x (from 24.5GB to 13.8GB, equivalently allowing the maximum batch size to grow from 2k to 8k). This superior training efficiency also allows us to scale far beyond the largest vision encoder used in OpenVision, reaching more than 1 billion parameters. We hold a strong belief that this lightweight, generative-only paradigm is compelling for future vision encoder development in multimodal foundation models.
PaperID: 2246,   Poster  https://arxiv.org/pdf/2603.25864    
Authors: Saelyne Yang, Jaesang Yu, Yi-Hao Peng, Kevin Qinghong Lin, Jae Won Cho, Yale Song, Juho Kim
Title: GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
Abstract: Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software. While prior research has primarily focused on automating user actions through clicks and keystrokes, this paradigm overlooks human intention, where users value the ability to explore, iterate, and refine their ideas while maintaining agency. To move beyond automation and toward collaboration, GUI agents must understand what users are doing and why. We introduce GUIDE (GUI Understanding, Intent, and Help Decision Evaluation), a benchmark that evaluates AI models on their ability to perceive user behavior, infer intent, and provide assistance in open-ended GUI tasks. GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations that surface user intent, across 10 complex software applications (e.g., PowerPoint, Photoshop). GUIDE defines three tasks, (i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction, that test a model’s ability to recognize behavior state, reason about goals, and decide when and how to help. Evaluations across eight state-of-the-art multimodal models reveal that all models struggled with the tasks, achieving only 44.6% and 55.0% accuracy on behavior state and help prediction. However, providing user context such as behavioral state and intent significantly improved the performance, raising help prediction by up to 50.2%. These results highlight the critical role of structured user understanding in effective assistance. Our benchmark provides a path toward GUI agents that go beyond automation to become truly user-aware collaborators.
PaperID: 2247,   Poster  https://arxiv.org/pdf/2506.02794    
Authors: Mijeong Kim, Gunhee Kim, Jungyoon Choi, WonJae Roh, Bohyung Han
Title: PhysGaia: A Physics-aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis
Abstract: We introduce PhysGaia, a novel physics-aware dataset specifically designed for Dynamic Novel View Synthesis (DyNVS), encompassing both structured objects and unstructured physical phenomena. Unlike existing datasets that primarily focus on photorealistic reconstruction, PhysGaia is created to actively support physics-aware dynamic scene modeling. Our dataset provides complex dynamic scenarios with rich interactions among multiple objects, where they realistically collide with each other and exchange forces. Furthermore, it contains a diverse range of physical materials, such as liquid, gas, textile, and rheological substances, moving beyond the rigid bodies prevalent in existing datasets. All scenes in PhysGaia are faithfully generated to strictly adhere to physical laws, leveraging carefully selected material-specific physics solvers. To enable quantitative evaluation of physical modeling, our dataset provides essential ground-truth information, including 3D particle trajectories and physics parameters, e.g., viscosity. To facilitate research adoption, we also provide essential integration pipelines for using recent 4D Gaussian Splatting models with our dataset and report their results. By addressing the critical lack of datasets for physics-aware modeling, PhysGaia will significantly advance research in dynamic view synthesis, physics-based scene understanding, and deep learning models integrated with physical simulation, ultimately enabling more faithful reconstruction and interpretation of complex dynamic scenes.
PaperID: 2248,   Poster  https://arxiv.org/pdf/2603.06043    
Authors: Jiadong Pan, Liang Li, Yuxin Peng, Yu-Ming Tang, Shuohuan Wang, Yu Sun, Hua Wu, Qingming Huang, Haifeng Wang
Title: Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models
Abstract: Recently, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, demonstrating strong potential for complex text-to-image (T2I) tasks. Despite their theoretical promise, a persistent capability gap exists: UMMs typically exhibit superior visual understanding but comparatively weaker generative capabilities. This discrepancy arises largely from the intrinsic decoupling between the understanding and generation processes. While a UMM can accurately interpret fine-grained visual details, it often struggles to produce semantically coherent images from complex textual prompts. To address this challenge, we explore UMMs' internal understanding capability to enhance generation quality. We propose a token-level intrinsic text-image alignment reward mechanism, GvU, enabling the UMM to act simultaneously as teacher and student: it evaluates its own outputs using the understanding branch to guide the generations accordingly. Building upon this, we design a self-supervised reinforcement learning framework, allowing UMMs to iteratively improve their generation quality through understanding-based intrinsic reward signals, without reliance on external supervision. Experimental results show that our method substantially boosts UMMs' generation, which in turn strengthens their fine-grained visual understanding, narrowing the capability gap between UMMs' visual understanding and generation.
PaperID: 2249,   Poster  https://arxiv.org/pdf/2511.22344    
Authors: Denis Huseljic, Marek Herde, Lukas Rauch, Paul Hahn, Bernhard Sick
Title: Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning
Abstract: Existing active learning (AL) strategies capture fundamentally different notions of data value, e.g., uncertainty or representativeness. Consequently, the effectiveness of strategies can vary substantially across datasets, models, and even AL cycles. Committing to a single strategy risks suboptimal performance, as no single strategy dominates throughout the entire AL process. We introduce REFINE, an ensemble AL method that combines multiple strategies without knowing in advance which will perform best. In each AL cycle, REFINE operates in two stages: (1) Progressive filtering iteratively refines the unlabeled pool by considering an ensemble of AL strategies, retaining promising candidates capturing different notions of value. (2) Coverage-based selection then chooses a final batch from this refined pool, ensuring all previously identified notions of value are accounted for. Extensive experiments across 6 classification datasets and 3 foundation models show that REFINE consistently outperforms individual strategies and existing ensemble methods. Notably, progressive filtering serves as a powerful preprocessing step that improves the performance of any individual AL strategy applied to the refined pool, which we demonstrate on an audio spectrogram classification use case. Finally, the ensemble of REFINE can be easily extended with upcoming state-of-the-art AL strategies.
PaperID: 2250,   Poster  https://arxiv.org/pdf/2602.21819    
Authors: Minghan Yang, LAN YANG, Ke Li, Honggang Zhang, Kaiyue Pang, Yi-Zhe Song
Title: SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance
Abstract: Reconstructing dynamic visual experiences from brain activity provides a compelling avenue for exploring the neural mechanisms of human visual perception. While recent progress in fMRI-based image reconstruction has been notable, extending this success to video reconstruction remains a significant challenge. Current fMRI-to-video reconstruction approaches consistently encounter two major shortcomings: (i) inconsistent visual representations of salient objects across frames, leading to appearance mismatches; (ii) poor temporal coherence, resulting in motion misalignment or abrupt frame transitions. To address these limitations, we introduce SemVideo, a novel fMRI-to-video reconstruction framework guided by hierarchical semantic information. At the core of SemVideo is SemMiner, a hierarchical guidance module that constructs three levels of semantic cues from the original video stimulus: static anchor descriptions, motion-oriented narratives, and holistic summaries. Leveraging this semantic guidance, SemVideo comprises three key components: a Semantic Alignment Decoder that aligns fMRI signals with CLIP-style embeddings derived from SemMiner, a Motion Adaptation Decoder that reconstructs dynamic motion patterns using a novel tripartite attention fusion architecture, and a Conditional Video Render that leverages hierarchical semantic guidance for video reconstruction. Experiments conducted on the CC2017 and HCP datasets demonstrate that SemVideo achieves superior performance in both semantic alignment and temporal consistency, setting a new state-of-the-art in fMRI-to-video reconstruction.
PaperID: 2251,   Poster  https://arxiv.org/pdf/2604.08896    
Authors: Aoran Xiao, Shihao Cheng, Yonghao Xu, Yexian Ren, Hongruixuan Chen, Naoto Yokoya
Title: GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
Abstract: Recent advances in multimodal large language models (MLLMs) have accelerated progress in domain-oriented AI, yet their development in geoscience and remote sensing (RS) remains constrained by distinctive challenges: wide-ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. To bridge these gaps, we introduce GeoMMBench, a comprehensive multimodal question-answering benchmark covering diverse RS disciplines, sensors, and tasks, enabling broader and more rigorous evaluation than prior benchmarks. Using GeoMMBench, we assess 36 open-source and proprietary large language models (LLMs), uncovering systematic deficiencies in domain knowledge, perceptual grounding, and reasoning, capabilities essential for expert-level geospatial interpretation. Beyond evaluation, we propose GeoMMAgent, a multi-agent framework that strategically integrates retrieval, perception, and reasoning through domain-specific RS models and tools. Extensive experimental results demonstrate that GeoMMAgent significantly outperforms standalone LLMs, underscoring the importance of tool-augmented agents for dynamically tackling complex geoscience and RS challenges.
PaperID: 2252,   Poster  https://arxiv.org/pdf/2512.22647    
Authors: Yidi Liu, Zihao Fan, Jie Huang, Jie Xiao, Dong Li, Wenlong Zhang, Lei Bai, Xueyang Fu, Zheng-Jun Zha
Title: FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution
Abstract: Reinforcement Learning with Human Feedback (RLHF) has proven effective in image generation, where reward models guide alignment with human preferences. Motivated by this, adapting RLHF to Image Super-Resolution (ISR) tasks has shown promise in optimizing perceptual quality, with Image Quality Assessment (IQA) models serving as reward models. However, traditional IQA models usually output a single global score, which is exceptionally insensitive to local and fine-grained distortions. This insensitivity allows ISR models to produce perceptually undesirable artifacts that yield spurious high scores, misaligning optimization objectives with perceptual quality and resulting in reward hacking. To address this, we propose a Fine-grained Perceptual Reward Model (FinPercep-RM) based on an Encoder-Decoder architecture. While providing a global quality score, it also generates a Perceptual Degradation Map that spatially localizes and quantifies local defects. We specifically introduce the FGR-30k dataset to train this model, consisting of diverse and subtle distortions from real-world super-resolution models. Despite the success of the FinPercep-RM model, its complexity introduces significant challenges in generator policy learning, leading to training instability. To address this, we propose a Co-evolutionary Curriculum Learning (CCL) mechanism, where both the reward model and the ISR model undergo synchronized curricula. The reward model progressively increases in complexity, while the ISR model starts with a simpler global reward for rapid convergence, gradually transitioning to the more complex model outputs. This easy-to-hard strategy enables stable training while suppressing reward hacking. Experiments validate the effectiveness of our method across ISR models in both global quality and local realism on RLHF methods.
PaperID: 2253,   Poster  https://arxiv.org/pdf/2512.13072    
Authors: Zizhi Chen, Yizhen Gao, Minghao Han, Yizhou Liu, Zhaoyu Chen, Dingkang Yang, Lihua Zhang
Title: Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models
Abstract: Multimodal biomedical Vision-Language Models (VLMs) exhibit immense potential in the field of Continual Learning (CL). However, they confront a core dilemma: how to preserve fine-grained intra-modality features while bridging the significant domain gap across different modalities. To address this challenge, we propose a comprehensive framework. Leveraging our 18-million multimodal and comprehensive medical retrieval database derived from PubMed scientific papers, we pioneer the integration of Retrieval-Augmented Generation (RAG) into CL. Specifically, we employ a multi-modal, multi-layer RAG system that provides real-time guidance for model fine-tuning through dynamic, on-demand knowledge retrieval. Building upon this, we introduce a dynamic knowledge distillation framework. This framework precisely resolves the aforementioned core dilemma by dynamically modulating the importance of the parameter space, the granularity of the distilled knowledge, and the data distribution of the reference dataset in accordance with the required level of detail. To thoroughly validate the clinical value of our strategy, we have designed a more rigorous Medical Generalist Task Incremental Learning (MGTIL) benchmark. This benchmark is engineered to simultaneously evaluate the model's capacity for adaptation to significant domain shifts, retention of subtle intra-domain features, and real-time learning of novel and complex medical tasks. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance across all metrics. The code is provided in the supplementary materials.
PaperID: 2254,   Poster  https://arxiv.org/pdf/2512.07472    
Authors: Siyu Xu, Zijian Wang, Yunke Wang, Chenghao Xia, Tao Huang, Chang Xu
Title: Affordance Field Intervention: Enabling VLAs to Escape Memory Traps in Robotic Manipulation
Abstract: Vision-Language-Action (VLA) models have shown great performance in robotic manipulation by mapping visual observations and language instructions directly to actions. However, they remain brittle under distribution shifts: when test scenarios change, VLAs often reproduce memorized trajectories instead of adapting to the updated scene, a failure mode we refer to as the "Memory Trap". This limitation stems from the end-to-end design, which lacks explicit 3D spatial reasoning and prevents reliable identification of actionable regions in unfamiliar environments. To compensate for this missing spatial understanding, 3D Spatial Affordance Fields (SAFs) can provide a geometric representation that highlights where interactions are physically feasible, offering explicit cues about regions the robot should approach or avoid. We therefore introduce Affordance Field Intervention (AFI), a lightweight hybrid framework that uses SAFs as an on-demand plug-in to guide VLA behavior. Our system detects memory traps through proprioception, repositions the robot to recent high-affordance regions, and proposes affordance-driven waypoints that anchor VLA-generated actions. A SAF-based scorer then selects trajectories with the highest cumulative affordance. Extensive experiments demonstrate that our method achieves an average improvement of 23.5% across different VLA backbones (π0 and π0.5) under out-of-distribution scenarios on real-world robotic platforms, and 20.2% on the LIBERO-Pro benchmark, validating its effectiveness in enhancing VLA robustness to distribution shifts.
PaperID: 2255,   Poster  https://arxiv.org/pdf/2603.04265    
Authors: Luigi Seminara, Davide Moltisanti, Antonino Furnari
Title: ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos
Abstract: Procedural planning aims to predict a sequence of actions that transforms an initial visual state into a desired goal, a fundamental ability for intelligent agents operating in complex environments. Existing approaches typically rely on large-scale models that learn procedural structures implicitly, resulting in limited sample efficiency and high computational cost. In this work, we introduce ViterbiPlanNet, a principled framework that explicitly integrates procedural knowledge into the learning process through a Differentiable Viterbi Layer (DVL). The DVL embeds a Procedural Knowledge Graph (PKG) directly into the Viterbi decoding algorithm, replacing non-differentiable operations with smooth relaxations that enable end-to-end optimization. This design allows the model to learn through graph-based decoding. Experiments on CrossTask, COIN, and NIV demonstrate that ViterbiPlanNet achieves state-of-the-art performance with an order of magnitude fewer parameters than diffusion- and LLM-based planners. Extensive ablations show that performance gains arise from our differentiable structure-aware training rather than post-hoc refinement, resulting in improved sample efficiency and robustness to shorter unseen horizons. We also address testing inconsistencies by establishing a unified testing protocol with consistent splits and evaluation metrics. With this new protocol, we run experiments multiple times and report results using bootstrapping to assess statistical significance.
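The phrase "replacing non-differentiable operations with smooth relaxations" can be illustrated with a small sketch. It is not the authors' DVL: it assumes that the hard `max` in the Viterbi recursion is relaxed with a temperature-scaled log-sum-exp, and that the Procedural Knowledge Graph is encoded as a matrix of log-transition scores with `-inf` for forbidden edges. The function names and the `temperature` parameter are hypothetical.

```python
import math


def logsumexp(values, temperature=1.0):
    """Smooth relaxation of max(values): T * log(sum(exp(v / T)))."""
    scaled = [v / temperature for v in values]
    m = max(scaled)
    if m == -math.inf:  # every candidate is forbidden
        return -math.inf
    return temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


def hard_viterbi_score(emissions, log_transitions):
    """Score of the best path using the usual (non-differentiable) max."""
    n = len(emissions[0])
    score = list(emissions[0])
    for t in range(1, len(emissions)):
        score = [
            emissions[t][j] + max(score[i] + log_transitions[i][j] for i in range(n))
            for j in range(n)
        ]
    return max(score)


def soft_viterbi_score(emissions, log_transitions, temperature=1.0):
    """Same recursion with max replaced by log-sum-exp (assumed relaxation).

    emissions[t][j]: log-score of action j at step t.
    log_transitions[i][j]: log-score of moving from action i to action j,
    -inf when the (hypothetical) knowledge graph has no edge i -> j.
    """
    n = len(emissions[0])
    score = list(emissions[0])
    for t in range(1, len(emissions)):
        score = [
            emissions[t][j]
            + logsumexp([score[i] + log_transitions[i][j] for i in range(n)], temperature)
            for j in range(n)
        ]
    return logsumexp(score, temperature)
```

As the temperature approaches zero, the soft score approaches the hard Viterbi score; at higher temperatures every permitted path contributes, which is what makes the score usable as a training signal in an autodiff framework.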
PaperID: 2256,   Poster  https://arxiv.org/pdf/2511.22555    
Authors: Yanbo Mao, Jianlong Fu, Ruoxuan Zhang, Hongxia Xie, Meibao Yao
Title: Beyond Success: Refining Elegant Robot Manipulation from Mixed-Quality Data via Just-in-Time Intervention
Abstract: Vision-Language-Action (VLA) models have enabled notable progress in general-purpose robotic manipulation, yet their learned policies often exhibit variable execution quality. We attribute this variability to the mixed-quality nature of human demonstrations, where the implicit principles that govern how actions should be carried out are only partially satisfied. To address this challenge, we introduce the LIBERO-Elegant benchmark with explicit criteria for evaluating execution quality. Using these criteria, we develop a decoupled refinement framework that improves execution quality without modifying or retraining the base VLA policy. We formalize Elegant Execution as the satisfaction of Implicit Task Constraints (ITCs) and train an Elegance Critic via offline Calibrated Q-Learning to estimate the expected quality of candidate actions. At inference time, a Just-in-Time Intervention (JITI) mechanism monitors critic confidence and intervenes only at decision-critical moments, providing selective, on-demand refinement. Experiments on LIBERO-Elegant and real-world manipulation tasks show that the learned Elegance Critic substantially improves execution quality, even on unseen tasks. The proposed model enables robotic control that values not only whether tasks succeed, but also how they are performed.
PaperID: 2257,   Poster  https://arxiv.org/pdf/2511.00846    
Authors: Zhihao Peng, Cheng Wang, Shengyuan Liu, Zhiying Liang, Zanting Ye, Min Ju, Peter Woo, Yixuan Yuan
Title: OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks
Abstract: Brain imaging analysis is crucial for diagnosing and treating brain disorders, and multimodal large language models (MLLMs) are increasingly supporting it. However, current brain imaging visual question-answering (VQA) benchmarks either cover a limited number of imaging modalities or are restricted to coarse-grained pathological descriptions, hindering a comprehensive assessment of MLLMs across the full clinical continuum. To address these limitations, we introduce OmniBrainBench, the first comprehensive multimodal VQA benchmark specifically designed to assess the multimodal comprehension capabilities of MLLMs in brain imaging analysis with closed- and open-ended evaluations. OmniBrainBench comprises 15 distinct brain imaging modalities collected from 30 verified medical sources, yielding 9,527 validated VQA pairs and 31,706 images. It simulates clinical workflows and encompasses 15 multi-stage clinical tasks rigorously validated by a professional radiologist. Evaluations of 24 state-of-the-art models, including open-source general-purpose, medical, and proprietary MLLMs, highlight the substantial challenges posed by OmniBrainBench. Experiments reveal that proprietary MLLMs like GPT-5 (63.37%) outperform open-source and medical MLLMs yet lag far behind physicians (91.35%), while medical MLLMs show wide variance in closed- and open-ended VQA. Open-source general-purpose MLLMs generally trail but excel in specific tasks, and all MLLMs fall short in complex preoperative reasoning, revealing a critical visual-to-clinical gap. OmniBrainBench establishes a new standard to assess MLLMs in brain imaging analysis, highlighting the gaps against physicians.
PaperID: 2258,   Poster  https://arxiv.org/pdf/2604.15941    
Authors: Haato Watanabe, Nobuyuki Umetani
Title: Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction
Abstract: Recent years have witnessed the rapid emergence of 3D Gaussian Splatting (3DGS) as a powerful approach for 3D reconstruction and novel view synthesis. Its explicit representation with Gaussian primitives enables fast training, real-time rendering, and convenient post-processing such as editing and surface reconstruction. However, 3DGS suffers from a critical drawback: the number of primitives grows drastically for scenes with high-frequency appearance details, since each primitive can represent only a single color, requiring multiple primitives for every sharp color transition. To overcome this limitation, we propose Neural Gabor Splatting, which augments each Gaussian primitive with a lightweight multi-layer perceptron (MLP) that models a wide range of color variations within a single primitive. To further control primitive numbers, we introduce a frequency-aware densification strategy that selects mismatched primitives for pruning and cloning based on frequency energy. Our method achieves accurate reconstruction of challenging high-frequency surfaces. We demonstrate its effectiveness through extensive experiments on both standard benchmarks, such as Mip-NeRF 360, and high-frequency surface datasets (e.g., checkered patterns), supported by comprehensive ablation studies.
PaperID: 2259,   Poster  https://arxiv.org/pdf/2511.14197    
Authors: Zitang Sun, Masakazu Yoshimura, Junji Otsuka, Atsushi Irie, Takeshi Ohashi
Title: Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision
Abstract: High-quality data has become a primary driver of progress under scaling laws, with curated datasets often outperforming much larger unfiltered ones at lower cost. Online data curation extends this idea by dynamically selecting training samples based on the model’s evolving state. While effective in classification and multimodal learning, existing online sampling strategies rarely extend to object detection because of its structural complexity and domain gaps. We introduce DetGain, an online data curation method specifically for object detection that estimates the marginal perturbation of each image to dataset-level Average Precision (AP) based on its prediction quality. By modeling global score distributions, DetGain efficiently estimates the global AP change and computes teacher-student contribution gaps to select informative samples at each iteration. The method is architecture-agnostic and minimally intrusive, enabling straightforward integration into diverse object detection architectures. Experiments on the COCO dataset with multiple representative detectors show consistent improvements in accuracy. DetGain also demonstrates strong robustness under low-quality data and can be effectively combined with knowledge distillation techniques to further enhance performance, highlighting its potential as a general and complementary strategy for data-efficient object detection.
PaperID: 2260,   Poster  https://arxiv.org/pdf/2508.06859    
Authors: Shuo Tang, Jian Xu, Jiadong Zhang, Yi Chen, Qizhao Jin, Lingdong Shen, Cheng-Lin Liu, Shiming Xiang
Title: MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction
Abstract: Timely and accurate forecasts of severe weather events are essential for early warning and for constraining downstream analysis and decision-making. Since severe weather event prediction still depends on subjective, time-consuming expert interpretation, end-to-end “AI weather station” systems are emerging but face three major challenges: (1) scarcity of severe weather event samples; (2) imperfect alignment between high-dimensional meteorological data and textual warnings; (3) the inability of current multimodal language models to effectively process high-dimensional meteorological inputs or capture their complex spatiotemporal dependencies. To address these challenges, we introduce MP-Bench, the first large-scale multimodal dataset for severe weather event prediction, comprising 421,363 pairs of raw multi-year meteorological data and corresponding text captions, covering a wide range of severe weather scenarios. On top of this dataset, we develop a Meteorology Multimodal Large Model (MMLM) that directly ingests 4D meteorological inputs. In addition, it is designed to accommodate the unique characteristics of 4D meteorological data flow, incorporating three plug-and-play adaptive fusion modules that enable dynamic feature extraction and integration across temporal sequences, vertical pressure layers, and spatial dimensions. Extensive experiments on MP-Bench show that MMLM achieves strong performance across multiple tasks, demonstrating effective severe weather understanding and representing a key step toward automated, AI-driven severe weather event forecasting systems. Our source code and dataset will be made publicly available.
PaperID: 2261,   Poster  https://arxiv.org/pdf/2508.03142    
Authors: Bai Chengyu, Jintao Chen, Xiang Bai, Yilong Chen, Qi She, Ming Lu, Shanghang Zhang
Title: UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying
Abstract: Recent advances in diffusion models and vision-language models (VLMs) have significantly enhanced the controllability of image editing. Methods like FlowEdit enable step-by-step editing along a visible, noise-free trajectory, where each intermediate result is a clear image, eliminating the need for full noise inversion. However, these approaches still operate in pixel space or VAE latent space, where intermediate outputs often suffer from visual artifacts, distortions, or unrealistic details, making reliable semantic evaluation difficult. Furthermore, they remain open-loop systems, applying static edits without feedback to guide or correct the editing process adaptively. We propose UniEdit-I, the first training-free, closed-loop image editing framework that operates entirely within the semantic latent space of a unified VLM by introducing an Understanding–Editing–Verifying (UEV) loop: (1) Understanding: parses the source image and editing instruction into a structured source prompt and a minimal target specification; (2) Editing: applies dynamic semantic offsets, with a configurable feedback weighting mechanism that adaptively modulates editing intensity based on real-time alignment feedback; (3) Verifying: leverages the VLM’s own multimodal reasoning capability to evaluate the intermediate output along multiple semantic dimensions and trigger early stopping or refinement. By transforming the VLM from a post-hoc evaluator into an in-process conductor, UniEdit-I establishes the first semantics-driven, self-correcting closed-loop image editing pipeline. Evaluated on GEdit-Bench, UniEdit-I achieves state-of-the-art performance without any fine-tuning or architectural modifications, and even surpasses several large-scale pre-trained editors.
PaperID: 2262,   Poster  https://arxiv.org/pdf/2603.25118    
Authors: Jiawei Lin, Wanrong Zhu, Vlad I Morariu, Christopher Tensmeyer
Title: AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization
Abstract: Document generation has emerged as a crucial task for automating the creation of visually appealing and well-structured content across diverse domains. Existing methods in this field, however, suffer from limitations in application scope, document representation, and dataset coverage, which greatly restrict the capabilities of document generation models. To address these challenges, we propose OmniDoc, a framework that introduces HTML/CSS as a novel document representation given its inherent advantages in hierarchical structure modeling. Leveraging HTML/CSS, OmniDoc establishes a scalable data synthesis pipeline to curate DocHTML, a large-scale document dataset containing 265,206 high-quality samples. Each document in DocHTML includes complete metadata annotations, structured HTML/CSS source code, synthesized visual assets, and rendered screenshots, spanning diverse categories, styles, and complexity levels to ensure comprehensive coverage. OmniDoc then utilizes DocHTML to fine-tune multimodal large language models, giving them remarkable document generation capabilities on three practical tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issues found in the fine-tuned models, we incorporate a height-aware post-training method within OmniDoc based on Group Relative Policy Optimization. By carefully designing the reward function to measure the alignment between predicted and target document heights, OmniDoc effectively alleviates the overflow problem, further enhancing model performance. Qualitative and quantitative results demonstrate the superiority of OmniDoc over baseline models across all three tasks. Extensive ablation studies confirm the effectiveness of the HTML/CSS representation, curated dataset, and height-aware reinforcement optimization.
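A reward that measures "the alignment between predicted and target document heights" could take many forms; the abstract does not give the formula. The sketch below assumes one simple choice, a relative height error clipped to [0, 1], and pairs it with the standard group-relative advantage normalization used in Group Relative Policy Optimization (reward minus group mean, divided by group standard deviation). The function names and the `eps` constant are hypothetical.

```python
import statistics


def height_reward(predicted_height, target_height):
    """Assumed reward: 1 for a perfect height match, decaying linearly to 0."""
    if target_height <= 0:
        raise ValueError("target_height must be positive")
    relative_error = abs(predicted_height - target_height) / target_height
    return max(0.0, 1.0 - relative_error)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, as in GRPO."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Three sampled documents rendered at different heights (in pixels)
# for a target page height of 1000 px.
rewards = [height_reward(h, 1000) for h in (1000, 1500, 3000)]
advantages = group_relative_advantages(rewards)
```

Under this assumed reward, the document that overflows to three times the target height receives the lowest advantage in its group, so the policy update pushes probability mass away from overflowing outputs.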
PaperID: 2263,   Poster  https://arxiv.org/pdf/2512.05111    
Authors: Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang
Title: ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
Abstract: Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
PaperID: 2264,   Poster  https://arxiv.org/pdf/2604.02877    
Authors: Yu ZHU, Kang LI, LI ZHENG, Pheng-Ann Heng
Title: Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework
Abstract: To continuously enhance model adaptability in surgical video scene parsing, recent studies incrementally update the model so that it progressively learns to segment an increasing number of surgical instruments over time. However, prior works have consistently overlooked the potential of positive forward knowledge transfer, i.e., how past knowledge could help learn new classes, and positive backward knowledge transfer, i.e., how learning new classes could help refine past knowledge. In this paper, we propose a self-reflection hierarchical prompt framework that unlocks the power of positive forward and backward knowledge transfer in class incremental segmentation, aiming to proficiently learn new instruments, improve existing skills of regular instruments, and avoid catastrophic forgetting of old instruments. Our framework is built on a frozen, pre-trained model that adaptively appends instrument-aware prompts for new classes throughout training episodes. To enable positive forward knowledge transfer, we organize instrument prompts into a hierarchical prompt parsing tree with the instrument-shared prompt partition as the root node, n-part-shared prompt partitions as intermediate nodes, and instrument-distinct prompt partitions as leaf nodes, to expose the reusable historical knowledge for new classes and simplify their learning. Conversely, to encourage positive backward knowledge transfer, we conduct self-reflection refining on existing knowledge by directed-weighted graph propagation, examining the knowledge associations recorded in the tree to improve its representativeness without causing catastrophic forgetting. Our framework is applicable to both CNN-based models and advanced transformer-based foundation models, yielding more than 5% and 11% improvements over the competing methods on two public benchmarks, respectively.
PaperID: 2265,   Poster  https://arxiv.org/pdf/2603.03192    
Authors: Ashutosh Chaubey, Jiacheng Pang, Mohammad Soleymani
Title: MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
Abstract: Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audio-visual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.
PaperID: 2266,   Poster  https://arxiv.org/pdf/2511.19854    
Authors: Jiankuo Zhao, Xiangyu Zhu, Zidu Wang, Zhen Lei
Title: STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction
Abstract: Reconstructing high-fidelity and animatable 3D head avatars from monocular videos remains a challenging yet essential task. Existing methods based on 3D Gaussian Splatting typically bind Gaussians to mesh triangles and model deformations solely via Linear Blend Skinning, which results in rigid motion and limited expressiveness. Moreover, they struggle to reconstruct frequently occluded regions (e.g., mouth interiors, eyelids). To address these limitations, we propose STAvatar, which consists of two key components: (1) a UV-Adaptive Soft Binding framework that leverages both image- and FLAME-based priors to learn per-Gaussian feature offsets within the UV space. This UV representation supports dynamic resampling, ensuring full compatibility with Adaptive Density Control (ADC) and enhanced adaptability to geometric and textural variations. (2) a Temporal ADC strategy, which first clusters structurally similar frames to facilitate more targeted computation of the densification criterion. It further introduces the fused perceptual error as the clone criterion to jointly capture geometric and textural discrepancies, encouraging densification in regions requiring finer details. Extensive experiments on four benchmark datasets demonstrate that STAvatar achieves state-of-the-art reconstruction performance, especially in capturing fine-grained details and reconstructing frequently occluded regions. The code will be publicly available.
PaperID: 2267,   Poster  https://arxiv.org/pdf/2511.20996    
Authors: Jingxi Chen, Yixiao Zhang, Xiaoye Qian, Zongxia Li, Cornelia Fermuller, Caren Chen, Yiannis Aloimonos
Title: From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition
Abstract: Images can be viewed as layered compositions, foreground objects over background, with potential occlusions. This layered representation enables independent editing of elements, offering greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightweight finetuning. To further preserve detail in the latent space, we introduce a novel multi-modality context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.
PaperID: 2268,   Poster  https://arxiv.org/pdf/2603.13795    
Authors: Minh-Duong Nguyen, Senura Hansaja Wanasekara, Le-Tuan Nguyen, Ken-Tye Yong, Quoc-Viet Pham, Nguyen H. Tran, Dung D. Le
Title: Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression
Abstract: Federated Unlearning (FUL) aims to remove specific participants' data contributions from a trained Federated Learning model, thereby ensuring data privacy and compliance with regulatory requirements. Despite its potential, progress in FUL has been limited by several challenges, including cross-client knowledge inaccessibility and high computational and communication costs. To overcome these challenges, we propose Federated On-server Unlearning (FOUL), a novel framework that comprises two key stages. The learning-to-unlearn stage serves as a preparatory learning phase, during which the model identifies and encodes the key features associated with the forget clients. This stage is communication-efficient and establishes the basis for the subsequent unlearning process. Subsequently, the on-server knowledge aggregation phase performs the unlearning process at the server without requiring access to client data, thereby preserving both efficiency and privacy. We introduce a new data setting for FUL, which enables a more transparent and rigorous evaluation of unlearning. To highlight the effectiveness of our approach, we propose a novel evaluation metric termed time-to-forget, which measures how quickly the model achieves optimal unlearning performance. Extensive experiments conducted on three datasets under various unlearning scenarios demonstrate that FOUL outperforms the Retraining baseline in FUL. Moreover, FOUL achieves competitive or superior results with significantly reduced time-to-forget, while maintaining low communication and computation costs.
PaperID: 2269,   Poster  https://arxiv.org/pdf/2601.00090    
Authors: Anne Harrington, A. Koepke, Shyamgopal Karthik, Trevor Darrell, Alexei A. Efros
Title: It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
Abstract: Contemporary text-to-image models exhibit a surprising degree of mode collapse, as can be seen when sampling several images given the same text prompt. While previous work has attempted to address this issue by steering the model using guidance mechanisms, or by generating a large pool of candidates and refining them, in this work we take a different direction and optimize for variation in generations via noise optimization. Specifically, we show that a simple noise optimization objective can mitigate mode collapse while preserving the fidelity of the base model. We also analyze the frequency characteristics of the noise and show that alternative noise initializations with different frequency profiles can improve both optimization and search. Our experiments demonstrate that noise optimization yields superior results in terms of generation quality and variety.
PaperID: 2270,   Poster  https://arxiv.org/pdf/2511.07738    
Authors: Donglai Xu, Hongzheng Yang, Yuzhi Zhao, Pingping Zhang, Jinpeng Chen, Wenao Ma, Zhijian Hou, Mengyang Wu, Xiaolei Li, Senkang Hu, Ziyi Guan, Jason Chun Lok Li, Lai Man Po
Title: From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) is highly dependent on high-quality labeled data, which is often scarce and prone to substantial annotation noise in real-world scenarios. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels and limit the crucial reward ranking signal for Group-Relative Policy Optimization (GRPO). To address these challenges and enhance noise tolerance, we propose a novel two-stage, token-level entropy optimization method for RLVR. This approach dynamically guides the model from exploration to exploitation during training. In the initial exploration phase, token-level entropy maximization promotes diverse and stochastic output generation, serving as a strong regularizer that prevents premature convergence to noisy labels and ensures sufficient intra-group variation, enabling more reliable reward gradient estimation in GRPO. As training progresses, the method transitions into the exploitation phase, where token-level entropy minimization encourages the model to produce confident and deterministic outputs, thereby consolidating acquired knowledge and refining prediction accuracy. Empirically, across three MLLM backbones (Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2.5-VL-3B), diverse noise settings, and multiple tasks, our phased strategy consistently outperforms prior approaches by unifying and enhancing external, internal, and entropy-based methods, delivering robust and superior performance across the board.
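The exploration-to-exploitation schedule can be sketched as a sign flip on a token-level entropy term added to the training loss. This is an illustration under stated assumptions, not the paper's objective: the hard switch at `switch_step`, the coefficient `beta`, and the function names are all hypothetical, and the actual transition may well be gradual.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_loss_term(token_distributions, step, switch_step, beta=0.01):
    """Entropy term to add to a loss that is being minimized.

    Before `switch_step` (exploration) the term is -beta * H, so minimizing
    the loss maximizes entropy. Afterwards (exploitation) it is +beta * H,
    so minimizing the loss minimizes entropy.
    """
    mean_entropy = sum(token_entropy(p) for p in token_distributions) / len(
        token_distributions
    )
    sign = -1.0 if step < switch_step else 1.0
    return sign * beta * mean_entropy


# Two token positions over a three-token vocabulary (made-up numbers).
distributions = [[0.5, 0.25, 0.25], [0.9, 0.05, 0.05]]
early = entropy_loss_term(distributions, step=10, switch_step=100)
late = entropy_loss_term(distributions, step=500, switch_step=100)
```

The same batch yields a negative term early in training and a positive term of equal magnitude later, which is the whole mechanism in this simplified form.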
PaperID: 2271,   Poster  https://arxiv.org/pdf/2512.04643    
Authors: Chang-Hsun Wu, Kai-Po Chang, Yu-Yang Sheng, Hung-Kai Chung, Kuei-Chun Wang, Yu-Chiang Frank Wang
Title: SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
Abstract: Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g., object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
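Contrastive decoding against temporal and spatial negatives can be sketched as a per-token adjustment of next-token logits. The form below is an assumption borrowed from generic contrastive decoding: the original logits are pushed away from logits obtained on a corrupted input. How SEASON builds its negatives and diagnoses the per-token weights is not stated in the abstract, so `alpha_temporal`, `alpha_spatial`, and the example numbers are hypothetical.

```python
def contrastive_logits(original, temporal_negative, spatial_negative,
                       alpha_temporal, alpha_spatial):
    """Amplify what the original logits prefer over each negative's logits."""
    return [
        o + alpha_temporal * (o - t) + alpha_spatial * (o - s)
        for o, t, s in zip(original, temporal_negative, spatial_negative)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits over a three-token vocabulary. Token 0 is favoured even more
# when the frame order is destroyed (temporal negative), which suggests it
# comes from a prior rather than from the actual temporal evidence.
original = [2.0, 1.9, 0.1]
temporal_negative = [2.5, 0.5, 0.1]
spatial_negative = [2.0, 1.9, 0.1]

adjusted = contrastive_logits(
    original, temporal_negative, spatial_negative,
    alpha_temporal=1.0, alpha_spatial=0.0,
)
```

With both weights at zero the logits are unchanged; with a non-zero temporal weight, the greedy choice in this toy example moves from token 0 to token 1, the token whose score depends on correct frame order.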
PaperID: 2272,   Poster  https://arxiv.org/pdf/2602.24144    
Authors: Muquan Li, Hang Gou, Yingyi Ma, Rongzheng Wang, Ke Qin, Tao He
Title: Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation
Abstract: Decoupled dataset distillation (DD) compresses large corpora into a few synthetic images by matching a frozen teacher’s statistics. However, current residual-matching pipelines rely on static real patches, creating a fit-complexity gap and a pull-to-anchor effect that reduce intra-class diversity and hurt generalization. To address these issues, we introduce RETA, a Retrieval and Topology Alignment framework for decoupled DD. First, Dynamic Retrieval Connection (DRC) selects a real patch from a prebuilt pool by minimizing a fit-complexity score in teacher feature space; the chosen patch is injected via a residual connection to tighten feature fit while controlling injected complexity. Second, Persistent Topology Alignment (PTA) regularizes synthesis with persistent homology: we build a mutual k-NN feature graph, compute persistence images of components and loops, and penalize topology discrepancies between real and synthetic sets, mitigating the pull-to-anchor effect. Across CIFAR-100, Tiny-ImageNet, ImageNet-1K, and multiple ImageNet subsets, RETA consistently outperforms various baselines under comparable time and memory, notably reaching 64.3% top-1 accuracy on ImageNet-1K with ResNet-18 at 50 images per class, +3.1% over the best prior method.
PaperID: 2273,   Poster  https://arxiv.org/pdf/2512.21058    
Authors: Minghao Han, Yichen Liu, Yizhou Liu, Zizhi Chen, Jingqun Tang, Xuecheng Wu, Dingkang Yang, Lihua Zhang
Title: Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control
Abstract: In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels. Progress remains hindered by three coupled factors: the scarcity of large, high-quality image–text corpora; the lack of precise, fine-grained semantic control, which forces reliance on non-semantic cues; and terminological heterogeneity, where diverse phrasings for the same diagnostic concept impede reliable text conditioning. We introduce UniPath, a semantics-driven pathology image generation framework that leverages mature diagnostic understanding to enable controllable generation. UniPath implements Multi-Stream Control: a Raw-Text stream; a High-Level Semantics stream that uses learnable queries to a frozen pathology MLLM to distill paraphrase-robust Diagnostic Semantic Tokens and to expand prompts into diagnosis-aware attribute bundles; and a Prototype stream that affords component-level morphological control via a prototype bank. On the data front, we curate a 2.65M image–text corpus and a finely annotated, high-quality 68K subset to alleviate data scarcity. For a comprehensive assessment, we establish a four-tier evaluation hierarchy tailored to pathology. Extensive experiments demonstrate UniPath's SOTA performance, including a Patho-FID of 80.9 (51% better than the second-best) and fine-grained semantic control, achieving 98.7% of the real-image. Our curated datasets, code, and model weights will be made publicly available.
PaperID: 2274,   Poster  https://arxiv.org/pdf/2509.18891    
Authors: Xueyu Liu, Xiaoyi Zhang, Meilin Liu, Guangze Shi, Jia Shen, Yujie Wang, Cai Zhao, Ziyuan He, Yongfei Wu, Mingqiang Wei, Yongle Chen
Title: Attack for Defense: Adversarial Agents for Point Prompt Optimization Empowering Segment Anything Model
Abstract: Prompt quality plays a critical role in the performance of the Segment Anything Model (SAM), yet existing approaches often rely on heuristic or manually crafted prompts, limiting scalability and generalization. In this paper, we propose Point Prompt Defender, an adversarial reinforcement learning framework that adopts an attack-for-defense paradigm to automatically optimize point prompts. We construct a task-agnostic point prompt environment by representing image patches as nodes in a dual-space graph, where edges encode both physical and semantic distances. Within this environment, an attacker agent learns to activate a subset of prompts that maximally degrade SAM’s segmentation performance, while a defender agent learns to suppress these disruptive prompts and restore accuracy. Both agents are trained using Deep Q-Networks with a reward signal based on segmentation quality variation. During inference, only the defender is deployed to refine arbitrary coarse prompt sets, enabling enhanced SAM segmentation performance across diverse tasks without retraining. Extensive experiments show that Point Prompt Defender effectively improves SAM’s robustness and generalization, establishing a flexible, interpretable, and plug-and-play framework for prompt-based segmentation.
PaperID: 2275,   Poster  https://arxiv.org/pdf/2601.04068    
Authors: Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo
Title: Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
Abstract: Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline that generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and inpainting only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence, and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.
PaperID: 2276,   Poster  https://arxiv.org/pdf/2602.23120    
Authors: Arian Sabaghi, Jose Oramas
Title: TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement
Abstract: Weakly supervised object localization (WSOL) aims to localize target objects in images using only image-level labels. Despite recent progress, many approaches still rely on multi-stage pipelines or full fine-tuning of large backbones, which increases training cost, while the broader WSOL community continues to face the challenge of partial object coverage. We present TriLite, a single-stage WSOL framework that leverages a frozen Vision Transformer with DINOv2 pre-training in a self-supervised manner, and introduces only a minimal number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. At its core is the proposed TriHead module, which decomposes patch features into foreground, background, and ambiguous regions, thereby improving object coverage while suppressing spurious activations. By disentangling classification and localization objectives, TriLite effectively exploits the universal representations learned by self-supervised ViTs without requiring expensive end-to-end training. Extensive experiments on CUB-200-2011, ImageNet-1K, and OpenImages demonstrate that TriLite sets a new state of the art, while remaining significantly more parameter-efficient and easier to train than prior methods. The code will be available upon paper acceptance at https://anonymousRepoURL.com.
PaperID: 2277,   Poster  https://arxiv.org/pdf/2602.19117    
Authors: Jaeyun Jang, Seunghui Shin, Taeho Park, Hyoseok Hwang
Title: Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models
Abstract: Perspective-aware spatial reasoning involves understanding spatial relationships from specific viewpoints, either egocentric (observer-centered) or allocentric (object-centered). While vision–language models (VLMs) perform well in egocentric settings, their performance deteriorates when reasoning from allocentric viewpoints, where spatial relations must be inferred from the perspective of objects within the scene. In this study, we address this underexplored challenge by introducing Symbolic Projective Layout (SymPL), a framework that reformulates allocentric reasoning into symbolic-layout forms that VLMs inherently handle well. By leveraging four key factors (projection, abstraction, bipartition, and localization), SymPL converts allocentric questions into structured symbolic-layout representations. Extensive experiments demonstrate that this reformulation substantially improves performance in both allocentric and egocentric tasks, enhances robustness under visual illusions and multi-view scenarios, and that each component contributes critically to these gains. These results show that SymPL provides an effective and principled approach for addressing complex perspective-aware spatial reasoning.
PaperID: 2278,   Poster  https://arxiv.org/pdf/2603.04989    
Authors: Jiaxiong Liu, Zhen Tan, Jinpu Zhang, Yi Zhou, Hui Shen, Xieyuanli Chen, Dewen Hu
Title: TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events
Abstract: Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous temporal-consistent fusion of frames and events for robust and high-frequency arbitrary point tracking. Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light. To evaluate our approach under realistic conditions, we construct a novel real-world frame-event TAP dataset under diverse illumination and motion conditions. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. We will release the code and dataset upon acceptance to support future research.
PaperID: 2279,   Poster  https://arxiv.org/pdf/2604.04425    
Authors: Green Rosh, Prateek Kukreja, Vishakha SR, Pawan Prasad B H
Title: HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance
Abstract: The emergence of virtual reality has necessitated the generation of detailed and customizable 3D hand models for interaction in the virtual world. However, the current methods for 3D hand model generation are both expensive and cumbersome, offering very little customizability to the users. While recent advancements in zero-shot text-to-3D synthesis have enabled the generation of diverse and customizable 3D models using Score Distillation Sampling (SDS), they do not generalize very well to 3D hand model generation, resulting in unnatural hand structures, view inconsistencies, and loss of details. To address these limitations, we introduce HandDreamer, the first method for zero-shot 3D hand model generation from text prompts. Our findings suggest that view inconsistencies in SDS are primarily caused by the ambiguity in the probability landscape described by the text prompt, resulting in similar views converging to different modes of the distribution. This is particularly aggravated for hands due to the large variations in articulations and poses. To alleviate this, we propose to use MANO hand model based initialization and a hand skeleton guided diffusion process to provide a strong prior for the hand structure and to ensure view and pose consistency. Further, we propose a novel corrective hand shape guidance loss to ensure that all the views of the 3D hand model converge to view-consistent modes, without leading to geometric distortions. Extensive evaluations demonstrate the superiority of our method over the state-of-the-art methods, paving a new way forward in 3D hand model generation.
PaperID: 2280,   Poster  https://arxiv.org/pdf/2511.14072    
Authors: Jingyu Lei, Gaoang Wang, Der-Horng Lee
Title: CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs
Abstract: Large Vision-Language Models (LVLMs) usually suffer from prohibitive computational and memory costs due to the quadratic growth of visual tokens with image resolution. Existing token compression methods, while varied, often lack a high-level semantic understanding, leading to suboptimal merges, information redundancy, or context loss. To address these limitations, we introduce CORE (Compact Object-centric REpresentations), a new paradigm for visual token compression. CORE leverages an efficient segmentation decoder to generate object masks, which serve as a high-level semantic prior to guide the merging of visual tokens into a compact set of object-centric representations. Furthermore, a novel centroid-guided sorting mechanism restores a coherent spatial order to the merged tokens, preserving vital positional information. Extensive experiments show that CORE not only establishes a new state-of-the-art on six authoritative multimodal benchmarks for fixed-rate compression, but also achieves dramatic efficiency gains in adaptive-rate settings. Even under extreme compression, retaining only 2.2% of all visual tokens, CORE still maintains 97.4% of baseline performance. Our work demonstrates the superiority of object-centric representations for efficient and effective LVLM processing.
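Illustrative sketch (not from the paper): one simple way to picture mask-guided token merging with centroid-based ordering, as described in the abstract above. Mean pooling per object label, row-major sorting of centroids, and the function name `merge_tokens_by_mask` are all assumptions chosen for illustration; the abstract does not state how CORE pools tokens or defines its sort key.

```python
def merge_tokens_by_mask(tokens, labels, grid_w):
    """Merge patch tokens that share an object label (hypothetical helper).

    tokens: list of feature vectors, one per patch, in row-major grid order.
    labels: object id per patch, e.g. taken from a segmentation mask.
    grid_w: number of patches per row in the grid.
    Returns one mean-pooled token per object, ordered by the object's
    centroid (top-to-bottom, then left-to-right).
    """
    groups = {}
    for idx, (tok, lab) in enumerate(zip(tokens, labels)):
        groups.setdefault(lab, []).append((idx, tok))

    merged = []
    for members in groups.values():
        n = len(members)
        dim = len(members[0][1])
        # Mean-pool the features of all patches in this object.
        mean_tok = [sum(t[d] for _, t in members) / n for d in range(dim)]
        # Centroid of the object in patch-grid coordinates.
        cy = sum(i // grid_w for i, _ in members) / n
        cx = sum(i % grid_w for i, _ in members) / n
        merged.append(((cy, cx), mean_tok))

    merged.sort(key=lambda item: item[0])
    return [tok for _, tok in merged]
```

For a 2x2 grid whose top row belongs to one object and bottom row to another, the four patch tokens collapse to two tokens, emitted top object first.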
PaperID: 2281,   Poster  https://arxiv.org/pdf/2512.04000    
Authors: Jialuo Li, Bin Li, Jiahao Li, Yan Lu
Title: Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
Abstract: The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, methods that often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global and localized queries. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically, DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.
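Illustrative sketch (not from the paper): the query-type routing described in the abstract above, reduced to its simplest form. The query-type string, the per-frame relevance `scores`, and the top-k rule for localized queries are hypothetical stand-ins; the abstract does not describe how DIG classifies queries or what its localized pipeline computes.

```python
def uniform_indices(num_frames, k):
    """Pick up to k frame indices evenly spaced over [0, num_frames)."""
    k = min(k, num_frames)
    return [int((i + 0.5) * num_frames / k) for i in range(k)]

def top_relevance_indices(scores, k):
    """Pick the k highest-scoring frames, returned in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def select_frames(query_type, scores, k):
    """Route by query type: uniform for global, relevance-based otherwise."""
    if query_type == "global":
        return uniform_indices(len(scores), k)
    return top_relevance_indices(scores, k)
```

The global branch never looks at `scores`, which is the source of the efficiency argument: no query-aware computation is spent on queries that uniform sampling already serves well.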
PaperID: 2282,   Poster  https://arxiv.org/pdf/2603.11521    
Authors: Jiang Shuo, Gaojia Zhang, Min Tan, Yufei Yin, Gang Pan
Title: EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection
Abstract: Unsupervised Camouflaged Object Detection (UCOD) remains a challenging task due to the high intrinsic similarity between target objects and their surroundings, as well as the reliance on noisy pseudo-labels that hinder fine-grained texture learning. While existing refinement strategies aim to alleviate label noise, they often overlook intrinsic perceptual cues, leading to boundary overflow and structural ambiguity. In contrast, learning without pseudo-label guidance yields coarse features with significant detail loss. To address these issues, we propose a unified UCOD framework that enhances both the reliability of pseudo-labels and the fidelity of features. Our approach introduces the Multi-Cue Native Perception module, which extracts intrinsic visual priors by integrating low-level texture cues with mid-level semantics, enabling precise alignment between masks and native object information. Additionally, Pseudo-Label Evolution Fusion intelligently refines labels through teacher-student interaction and utilizes depthwise separable convolution for efficient semantic denoising. It also incorporates Spectral Tensor Attention Fusion to effectively balance semantic and structural information through compact spectral aggregation across multi-layer attention maps. Finally, Local Pseudo-Label Refinement plays a pivotal role in local detail optimization by leveraging attention diversity to restore fine textures and enhance boundary fidelity. Extensive experiments on multiple UCOD datasets demonstrate that our method achieves state-of-the-art performance, characterized by superior detail perception, robust boundary alignment, and strong generalization under complex camouflage scenarios.
PaperID: 2283,   Poster  https://arxiv.org/pdf/2601.07298    
Authors: Jianghao Yin, Qingbin Li, Kun Sun, Cheng Ding, Jie Wang, Qin Chen, Jie Zhou, Nan Wang, Changqing Li, wupei wupei, Jian Xu, Zheming Yang, Liang He
Title: Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding
Abstract: While Multimodal Large Language Models (MLLMs) excel at single-image understanding, they exhibit significantly degraded performance in multi-image reasoning scenarios. Multi-image reasoning presents fundamental challenges including complex inter-relationships between images and scattered critical information across image sets. Inspired by human cognitive processes, we propose the Cognition-Inspired Meta-Action Framework (CINEMA), a novel approach that decomposes multi-image reasoning into five structured meta-actions: Global, Focus, Hint, Think, and Answer, which explicitly model the sequential cognitive steps humans naturally employ. For cold-start training, we introduce a Retrieval-Based Tree Sampling strategy that generates high-quality meta-action trajectories to bootstrap the model with reasoning patterns. During reinforcement learning, we adopt a two-stage paradigm: an exploration phase with Diversity-Preserving Strategy to avoid entropy collapse, followed by an annealed exploitation phase with DAPO to gradually strengthen exploitation. To train our model, we construct a dataset of 57k cold-start and 58k reinforcement learning instances spanning multi-image, multi-frame, and single-image tasks. We conduct extensive evaluations on multi-image reasoning benchmarks, video understanding benchmarks, and single-image benchmarks, achieving competitive state-of-the-art performance on several key benchmarks. Our model surpasses GPT-4o on the MUIR and MVMath benchmarks and notably outperforms specialized video reasoning models on video understanding benchmarks, demonstrating the effectiveness and generalizability of our human cognition-inspired reasoning framework.
PaperID: 2284,   Poster  https://arxiv.org/pdf/2511.18287    
Authors: Rui Peng, Ziru Liu, Lingyuan Ye, Yuxing Lu, Boxin Shi, Jinzhuo Wang
Title: TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis
Abstract: Accurately modeling the relationship between perturbations, transcriptional responses, and phenotypic changes is essential for building an AI Virtual Cell (AIVC). However, existing methods are typically constrained to modeling direct associations, such as Perturbation → RNA or Perturbation → Morphology, and overlook the crucial causal link from RNA to morphology. To bridge this gap, we propose TRIDENT, a cascade generative framework that synthesizes realistic cellular morphology by conditioning on both the perturbation and the corresponding gene expression profile. To train and evaluate this task, we construct MorphoGene, a new dataset pairing L1000 gene expression with Cell Painting images for 98 compounds. TRIDENT significantly outperforms state-of-the-art approaches, achieving up to 7-fold improvement with strong generalization to unseen compounds. In a case study on docetaxel, we validate that RNA-guided synthesis accurately produces the corresponding phenotype. An ablation study further confirms that this RNA conditioning is essential for the model's high fidelity. By explicitly modeling transcriptome–phenome mapping, TRIDENT provides a powerful in silico tool and moves us closer to a predictive virtual cell.
PaperID: 2285,   Poster  https://arxiv.org/pdf/2601.03362    
Authors: Xiang Zhang, Yang Zhang, Lukas Mehl, Markus Gross, Christopher Schroers
Title: Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views
Abstract: Soft boundaries, like thin hairs, are commonly observed in natural and computer-generated imagery, but they remain challenging for 3D vision due to the ambiguous mixing of foreground and background cues. This paper introduces Guardians of the Hair (HairGuard), a framework designed to recover fine-grained soft boundary details in 3D vision tasks. Specifically, we first propose a novel data curation pipeline that leverages image matting datasets for training and design a depth fixer network to automatically identify soft boundary regions. With a gated residual module, the depth fixer refines depth precisely around soft boundaries while maintaining global depth quality, allowing plug-and-play integration with state-of-the-art depth models. For view synthesis, we perform depth-based forward warping to retain high-fidelity textures, followed by a generative scene painter that fills disoccluded regions and eliminates redundant background artifacts within soft boundaries. Finally, a color fuser adaptively combines warped and inpainted results to produce novel views with consistent geometry and fine-grained details. Extensive experiments demonstrate that HairGuard achieves state-of-the-art performance across monocular depth estimation, stereo image/video conversion, and novel view synthesis, with significant improvements in soft boundary regions.
PaperID: 2286,   Poster  https://arxiv.org/pdf/2512.17514    
Authors: Sairam Rebbapragada, Rishabh Lalla, Aveen Dayal, Tejal Kulkarni, Anuj Lalla, Vineeth Balasubramanian, Muhammad Haris Khan
Title: Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection
Abstract: Current state-of-the-art approaches in Source-Free Object Detection (SFOD) typically rely on Mean-Teacher self-labeling. However, domain shift often reduces the detector’s ability to maintain strong object-focused representations, causing high-confidence activations over background clutter. This weak object focus results in unreliable pseudo-labels from the detection head. While prior works mainly refine these pseudo-labels, they overlook the underlying need to strengthen the feature space itself. We propose FALCON-SFOD (Foundation-Aligned Learning with Clutter suppression and Noise robustness), a framework designed to enhance object-focused adaptation under domain shift. It consists of two complementary components. SPAR (Spatial Prior-Aware Regularization) leverages the generalization strength of vision foundation models to regularize the detector’s feature space. Using class-agnostic binary masks derived from OV-SAM, SPAR promotes structured and foreground-focused activations by guiding the network toward object regions. IRPL (Imbalance-aware Noise Robust Pseudo-Labeling) complements SPAR by promoting balanced and noise-tolerant learning under severe foreground-background imbalance. Guided by a theoretical analysis that connects these designs to tighter localization and classification error bounds, FALCON-SFOD achieves competitive performance across SFOD benchmarks.
PaperID: 2287,   Poster  https://arxiv.org/pdf/2512.02487    
Authors: Yerim Jeon, Miso Lee, WonJun Moon, Jae-Pil Heo
Title: Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding
Abstract: Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. However, existing methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This design introduces two fundamental conflicts in 3D scene understanding: sequential bias among order-agnostic 3D objects and restricted object-instruction attention, hindering task-specific reasoning. To overcome these limitations, we propose 3D Spatial Language Instruction Mask (3D-SLIM), an effective masking strategy that replaces the causal mask with an adaptive attention mask tailored to the spatial structure of 3D scenes. Our 3D-SLIM introduces two key components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an Instruction-aware Mask that enables object tokens to directly access instruction context. This design allows the model to process objects based on their spatial relationships while being guided by the user's task. 3D-SLIM is simple, requires no architectural modifications, and adds no extra parameters, yet it yields substantial performance improvements across diverse 3D scene-language tasks. Extensive experiments across multiple benchmarks and LLM baselines validate its effectiveness and underscore the critical role of decoder design in 3D multi-modal reasoning.
PaperID: 2288,   Poster  https://arxiv.org/pdf/2603.02522    
Authors: Liang Zeng, Valerio Marsocci, Wufan Zhao, Andrea Nascetti, Maarten Vergauwen
Title: NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining
Abstract: Masked Image Modeling has been one of the most popular self-supervised learning paradigms to learn representations from large-scale, unlabeled Earth Observation images. While incorporating multi-modal and multi-temporal Earth Observation data into Masked Image Modeling has been widely explored, the spatial dependencies between images captured from neighboring areas remain largely overlooked. Since the Earth's surface is continuous, neighboring images are highly related and offer rich contextual information for self-supervised learning. To close this gap, we propose NeighborMAE, which learns spatial dependencies by joint reconstruction of neighboring Earth Observation images. To ensure that the reconstruction remains challenging, we leverage a heuristic strategy to dynamically adjust the mask ratio and the pixel-level loss weight. Since we focus here on geometric (spatial) innovation rather than spectral, we conduct experiments in an RGB setting following previous works. Experimental results across various pretraining datasets and downstream tasks show that NeighborMAE significantly outperforms existing baselines, underscoring the value of neighboring images in Masked Image Modeling for Earth Observation and the efficacy of our designs.
PaperID: 2289,   Poster  https://arxiv.org/pdf/2512.01550    
Authors: Fei Liu, Shichao Xie, Minghua Luo, Zedong Chu, Junjun Hu, Xiaolong Wu, Mu Xu
Title: NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction
Abstract: Embodied navigation for long-horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long-term planning about unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination within a single framework. Our approach empowers a single VLM to concurrently perform planning and predictive foresight. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by decomposing the task, tracking its progress, and formulating the subsequent sub-goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting short-term environmental dynamics and long-term navigation milestones. The VLM's structured plan guides its targeted prediction, while the imagined future provides rich context to inform the navigation actions, creating a powerful internal feedback loop of perception-planning/prediction-action. We demonstrate through extensive experiments on the R2R-CE and RxR-CE benchmarks that NavForesee achieves highly competitive performance in complex scenarios. Our work highlights the immense potential of fusing explicit language planning with implicit spatiotemporal prediction, paving the way for more intelligent and capable embodied agents.
PaperID: 2290,   Poster  https://arxiv.org/pdf/2604.03738    
Authors: Binyuan Huang, Yuning Lu, Weinan Jia, Hualiang Wang, Mu Liu, Daiqin Yang
Title: Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation
Abstract: Recent proprietary models such as Sora-2 demonstrate promising progress in generating multi-shot videos conditioned on multiple reference characters. However, academic research on this problem remains limited. We study this task and identify a core challenge: when reference images exhibit highly similar appearances, the model often suffers from reference confusion, where semantically similar tokens degrade the model’s ability to retrieve the correct context. To address this, we introduce PoCo (Position Embedding as a Context Controller), which incorporates position encoding as additional context control beyond semantic retrieval. By employing side information of tokens, PoCo enables precise token-level matching while preserving implicit semantic consistency modeling. Building on PoCo, we develop a multi-reference and multi-shot video generation model capable of accurately controlling characters with extremely similar visual traits. Extensive experiments demonstrate that PoCo improves cross-shot consistency and reference fidelity compared with various baselines.
PaperID: 2291,   Poster  https://arxiv.org/pdf/2603.16966    
Authors: Liangbin Huang, Xiaohua Liao, Chaoqun Cui, Shijing Wang, Zhaolong Huang, Yanlong Du, Wenji Mao
Title: CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization
Abstract: Traditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration & Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers and then integrates an audio language model for speaker turn detection, refining annotations and supplementing unregistered off-screen speakers. Furthermore, we construct and release a dedicated speaker diarization benchmark for visual media that includes Chinese and English programs. Experimental results demonstrate that CineSRD achieves superior performance on the proposed benchmark and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings.
PaperID: 2292,   Poster  https://arxiv.org/pdf/2511.20549    
Authors: Guanjie Chen, Shirui Huang, Yifu Sun, Kai Liu, Jianchen Zhu, Xiaoye Qu, Yu Cheng, Peng Chen
Title: Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning
Abstract: Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 with only 2.1% of its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Code is attached in the supplementary material.
PaperID: 2293,   Poster  https://arxiv.org/pdf/2604.14622    
Authors: Junfeng Li, Wenyang Zhou, Xueheng Li, Xuanhua He, Jianhou Gan, Wenqi Ren
Title: Multigrain-aware Semantic Prototype Scanning and Tri-token Prompt Learning embraced High-order RWKV for Pan-sharpening
Abstract: In this work, we propose a Multigrain-aware Semantic Prototype Scanning paradigm for pan-sharpening, built upon a KV-sharing RWKV architecture for efficient global modeling and coupled with a novel tri-token prompting mechanism derived from semantic clustering to steer the fusion process. The design adheres to the following principles: 1) Multigrain-aware Semantic Prototype Scanning. While the RWKV model offers an efficient linear alternative, its recurrent scanning mechanism often introduces positional bias and lacks semantic guidance. To address this, we introduce a semantic-driven scanning strategy. Local hashing is first employed to generate semantic prototypes via clustering, segmenting the image into coherent regions. Our scanning mechanism is then explicitly aware of multi-grain semantic structures, allowing the model to focus on contextually relevant regions during fusion, thereby enhancing spectral integrity and spatial coherence beyond sequence-agnostic approaches. 2) Tri-token Prompt Learning. The core of our framework is a tri-token prompting mechanism: (i) a globally sourced token to encapsulate the holistic image context, (ii) cluster-derived prototype tokens to represent distinct semantic regions, and (iii) a learnable token register that acts as a dynamic buffer to explicitly identify and eliminate the noisy feature artifacts that commonly arise from standard global modeling. The global and prototype tokens are broadcast as semantic prompts to guide RWKV's processing, while the register continuously refines the intermediate features. 3) Invertible Q-Shift. To counteract the loss of spatial detail, we tailor two key designs: we apply a center difference convolution on the value pathway within the RWKV block, actively injecting high-frequency information to preserve fine textures, and we move beyond parameter-heavy receptive field expansion via a multi-scale Q-shift operation empowered by an invertible neural network. This module performs efficient, lossless feature transformation and shifting across split channels, significantly enriching feature representation. Experimental results demonstrate the superiority of our method.
PaperID: 2294,   Poster  https://arxiv.org/pdf/2512.02700    
Authors: Zhenkai Wu, Xiaowen Ma, ZHENLIANG NI, Dengming Zhang, Han Shu, Xin Jiang, Xinghao Chen
Title: VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm
Abstract: Vision–language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a Centrifugal Token Pruning Paradigm (CTPP) to prioritize the preservation of fine-grained object details. Leveraging CTPP, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9% pruning rate, while also delivering an end-to-end inference speedup. The code is available in the Supplementary Material.
PaperID: 2295,   Poster  https://arxiv.org/pdf/2511.13132    
Authors: Chenyang LI, Wenbing Tang, Yihao Huang, Simon Sinong Zhan, Ming Hu, Xiaojun Jia, Yang Liu
Title: Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack
Abstract: Vision-and-Language Navigation (VLN) agents have made remarkable progress, but their robustness remains insufficiently studied. Existing adversarial evaluations often rely on perturbations that manifest as unusual textures rarely encountered in everyday indoor environments. Errors under such contrived conditions have limited practical relevance, as real-world agents are unlikely to encounter such artificial patterns. In this work, we focus on indoor lighting, an intrinsic yet largely overlooked scene attribute that strongly influences navigation. We propose Indoor Lighting-based Adversarial Attack (ILA), a black-box framework that manipulates global illumination to disrupt VLN agents. Motivated by typical household lighting usage, we design two attack modes: Static Indoor Lighting-based Attack (SILA), where the lighting intensity remains constant throughout an episode, and Dynamic Indoor Lighting-based Attack (DILA), where lights are switched on or off at critical moments to induce abrupt illumination changes. We evaluate ILA on two state-of-the-art VLN models across three navigation tasks. Results show that ILA significantly increases failure rates while reducing trajectory efficiency, revealing previously unrecognized vulnerabilities of VLN agents to realistic indoor lighting variations.
PaperID: 2296,   Poster  https://arxiv.org/pdf/2603.25527    
Authors: Xiangyang Luo, Qingyu Li, Yuming Li, Guanbo Huang, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan, Shao-Lun Huang
Title: Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training
Abstract: Recent advances in video generation models have achieved impressive results. However, these models rely heavily on high-quality data that combines both high visual quality and high motion quality. In this paper, we identify a key challenge in video data curation: the Motion-Vision Quality Dilemma. We find that visual quality and motion quality inherently exhibit a negative correlation, making it hard to obtain golden data that excels in both aspects. To address this challenge, we first examine the hierarchical learning dynamics of video diffusion models and conduct gradient-based analysis on quality-degraded samples. We discover that quality-imbalanced data can produce gradients similar to golden data at appropriate timesteps. Based on this, we introduce the novel concept of Timestep selection in Training Process. We propose Timestep-aware Quality Decoupling (TQD), which modifies the data sampling distribution to better match the model’s learning process. Specifically, the sampling distribution is skewed toward higher timesteps for high motion quality data, while high visual quality data is more likely to be sampled at lower timesteps. Through extensive experiments, we demonstrate that TQD enables training exclusively on separated imbalanced data to achieve performance surpassing conventional training with better data, challenging the necessity of perfect data in video generation. Moreover, our method also boosts model performance when trained on high-quality data, showcasing its effectiveness across different data scenarios.
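Illustrative sketch (not from the paper): the abstract above says TQD skews the training-timestep distribution toward higher timesteps for high motion quality data and toward lower timesteps for high visual quality data. One simple way to picture such a skew is to draw timesteps from a Beta distribution whose shape depends on a per-sample quality tag. The tag names, the Beta shape parameters, and the schedule length of 1000 are all assumptions made for this example; the paper's actual sampling rule may differ.

```python
import random

NUM_TIMESTEPS = 1000  # assumed length of the diffusion schedule

# Hypothetical Beta(alpha, beta) shapes per data type (not from the paper).
BETA_PARAMS = {
    "high_motion": (4.0, 2.0),  # mass skewed toward larger (noisier) timesteps
    "high_visual": (2.0, 4.0),  # mass skewed toward smaller timesteps
    "balanced": (1.0, 1.0),     # uniform over the schedule
}


def sample_timestep(quality_tag, rng):
    """Draw one integer timestep in [0, NUM_TIMESTEPS) for a training sample."""
    alpha, beta = BETA_PARAMS[quality_tag]
    u = rng.betavariate(alpha, beta)  # real value in [0, 1]
    return min(int(u * NUM_TIMESTEPS), NUM_TIMESTEPS - 1)


def mean_timestep(quality_tag, n=5000, seed=0):
    """Empirical mean timestep, used only to inspect the skew."""
    rng = random.Random(seed)
    return sum(sample_timestep(quality_tag, rng) for _ in range(n)) / n


if __name__ == "__main__":
    for tag in BETA_PARAMS:
        print(tag, round(mean_timestep(tag)))
```

With these assumed shapes, motion-rich clips are seen mostly at noisy timesteps and visually clean clips mostly at low-noise timesteps, which is the direction of the skew the abstract describes.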
PaperID: 2297,   Poster  https://arxiv.org/pdf/2508.01873    
Authors: Siran Peng, Haoyuan Zhang, Li Gao, Tianshuo Zhang, Xiangyu Zhu, Bao Li, Weisong Zhao, Zhen Lei
Title: DiffusionFF: A Diffusion-based Framework for Joint Face Forgery Detection and Fine-Grained Artifact Localization
Abstract: The rapid evolution of deepfake technologies demands robust and reliable face forgery detection algorithms. While determining whether an image has been manipulated remains essential, the ability to precisely localize forgery clues is also important for enhancing model explainability and building user trust. To address this dual challenge, we introduce DiffusionFF, a diffusion-based framework that simultaneously performs face forgery detection and fine-grained artifact localization. Our key idea is to establish a novel encoder–decoder architecture: a pretrained forgery detector serves as a powerful "artifact encoder", and a denoising diffusion model is repurposed as an "artifact decoder". Conditioned on multi-scale forgery-related features extracted by the encoder, the decoder progressively synthesizes a detailed artifact localization map. We then fuse this fine-grained localization map with high-level semantic features from the forgery detector, leading to substantial improvements in detection capability. Extensive experiments demonstrate that DiffusionFF achieves state-of-the-art (SOTA) performance across multiple benchmarks, underscoring its superior effectiveness, reliability, and explainability. The code will be released upon acceptance.
PaperID: 2298,   Poster  https://arxiv.org/pdf/2511.10376    
Authors: Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang, Rongsheng Qu, Weixin Li, Yunhong Wang, Chenglu Wen
Title: MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
Abstract: Embodied navigation is a fundamental capability of robotic agents. Real-world deployment requires open-vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open-vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. We further identify the “last mile” problem in zero-shot navigation, namely determining a feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on the GOAT-Bench and HM3D-OVON datasets. The open-source code will be publicly available.
PaperID: 2299,   Poster  https://arxiv.org/pdf/2511.18281    
Authors: Yara Bahram, Mélodie Desbos, Mohammadhadi Shateri, Eric Granger
Title: Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation
Abstract: Diffusion models (DMs) produce high-quality images, yet their sampling remains costly when adapted to new domains. Distilled DMs are faster but typically remain confined within their teacher’s domain. Thus, fast and high-quality generation for novel domains relies on two-stage training pipelines: Adapt-then-Distill or Distill-then-Adapt. However, both add design complexity and suffer from degraded quality or diversity. We introduce Uni-DAD, a single-stage pipeline that unifies distillation and adaptation of DMs. It couples two signals during training: (i) a dual-domain distribution-matching distillation objective that guides the student toward the distributions of the source teacher and a target teacher, and (ii) a multi-head generative adversarial network (GAN) loss that encourages target realism across multiple feature scales. The source domain distillation preserves diverse source knowledge, while the multi-head GAN stabilizes training and reduces overfitting, especially in few-shot regimes. The inclusion of a target teacher facilitates adaptation to more structurally distant domains. We perform evaluations on a variety of datasets for few-shot image generation (FSIG). Uni-DAD delivers higher quality than state-of-the-art (SoTA) adaptation methods even with at most 4 sampling steps, and outperforms two-stage training pipelines in both quality and diversity.
PaperID: 2300,   Poster  https://arxiv.org/pdf/2603.16538    
Authors: ManGyu Kong, Jaewon Lee, Seongwon Lee, Euntai Kim
Title: Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty
Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to both the initial camera pose and the reconstructed geometry. In this work, we take a closer look at these limitations and identify two major sources of uncertainty: (i) pose prior uncertainty, which often arises from regression or retrieval models that output a single deterministic estimate, and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into PnP solvers. Such uncertainties can distort reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible. To address these uncertainties, we introduce a relocalization framework that combines Monte Carlo pose sampling with Fisher Information–based PnP optimization. Our method explicitly accounts for both pose and geometric uncertainty and requires no retraining or additional supervision. Across diverse indoor and outdoor benchmarks, our approach consistently improves localization accuracy and significantly increases stability under pose and depth noise.
PaperID: 2301,   Poster  https://arxiv.org/pdf/2603.01038    
Authors: Haoyuan Zhang, Keyao Wang, Guosheng Zhang, Haixiao Yue, Zhiwen Tan, Siran Peng, Tianshuo Zhang, Xiao Tan, Kunbin Chen, Wei He, Jingdong Wang, Ajian Liu, Xiangyu Zhu, Zhen Lei
Title: From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing
Abstract: Face recognition remains vulnerable to presentation attacks, calling for robust face anti-spoofing (FAS) solutions. Recent Multimodal Large Language Model (MLLM)-based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross-domain generalization. However, their generalizability is still limited, as such descriptions mainly capture intuitive semantic cues (e.g., screen borders or mask contours) while struggling to perceive fine-grained visual patterns. To address this limitation, we incorporate external visual tools into MLLMs to encourage deeper investigation of subtle spoof clues. Specifically, we propose the Tool-Augmented Reasoning FAS (TAR-FAS) framework, which reformulates the FAS task as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm, allowing MLLMs to begin with intuitive observations and adaptively invoke external visual tools for fine-grained investigation. To this end, we design a tool-augmented data annotation pipeline and construct the ToolFAS-16K dataset, which contains multi-turn tool-use reasoning trajectories. Furthermore, we introduce a tool-aware FAS training pipeline, where Diverse-Tool Group Relative Policy Optimization (DT-GRPO) enables the model to autonomously learn efficient tool use. Extensive experiments under a challenging one-to-eleven cross-domain protocol demonstrate that TAR-FAS achieves state-of-the-art performance while providing fine-grained visual investigation for trustworthy spoof detection.
PaperID: 2302,   Poster  https://arxiv.org/pdf/2512.02536    
Authors: Jian Yang, Dacheng Yin, Xiaoxuan He, Yong Li, Fengyun Rao, Jing LYU, Wei Zhai, Yang Cao, Zheng-Jun Zha
Title: WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
Abstract: Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pretrained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
PaperID: 2303,   Poster  https://arxiv.org/pdf/2503.13543    
Authors: Xinghao Wu, Jianwei Niu, Xuefeng Liu, Guogang Zhu, Jiayuan Zhang, Shaojie Tang, Wei Chen
Title: Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning
Abstract: Federated Prototype Learning (FedCL) has emerged as an effective strategy for handling data heterogeneity in Federated Learning (FL). In FedCL, clients collaboratively construct a set of global feature centers (prototypes), and let local features align with these prototypes to mitigate the effects of data heterogeneity. The performance of FedCL highly depends on the quality of prototypes. Existing methods assume that larger inter-class distances among prototypes yield better performance, and thus design different methods to increase these distances. However, we observe that while these methods increase prototype distances to enhance class discrimination, they inevitably disrupt essential semantic relationships among classes, which are crucial for model generalization. This raises an important question: how to construct prototypes that inherently preserve semantic relationships among classes? Directly learning these relationships from limited and heterogeneous client data can be problematic in FL. Recently, the success of pre-trained language models (PLMs) demonstrates their ability to capture semantic relationships from vast textual corpora. Motivated by this, we propose FedTSP, a novel method that leverages PLMs to construct semantically enriched prototypes from the textual modality, enabling more effective collaboration in heterogeneous data settings. We first use a large language model (LLM) to generate fine-grained textual descriptions for each class, which are then processed by a PLM on the server to form textual prototypes. To address the modality gap between client image models and the PLM, we introduce trainable prompts, allowing prototypes to adapt better to client tasks. Extensive experiments demonstrate that FedTSP mitigates data heterogeneity while significantly accelerating convergence.
PaperID: 2304,   Poster  https://arxiv.org/pdf/2602.21864    
Authors: Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu, James Kwok, Yu Zhang
Title: DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
Abstract: Vision-Language Models (VLMs) have emerged as versatile solutions for zero-shot question answering (QA) across various domains. However, enabling VLMs to effectively comprehend structured graphs and perform accurate, efficient QA remains challenging. Existing approaches typically rely on a single type of graph topology representation (GTR), such as fixed-style visual images or unified text descriptions. This "one-size-fits-all" strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or overly lengthy responses to graph-related queries. To address this, we propose the DynamicGTR framework, which dynamically selects the optimal GTR for each query during inference, thereby enhancing the zero-shot graph QA capabilities of VLMs with a customizable trade-off between accuracy and brevity. Extensive experiments show that DynamicGTR not only improves VLM-based graph algorithm QA performance but also successfully transfers the experience gained from synthetic graph algorithm tasks to real-world applications such as link prediction and node classification, without any additional training. Additionally, DynamicGTR demonstrates strong transferability across tasks, domains, and models, suggesting its potential as a flexible solution for broad graph scenarios.
PaperID: 2305,   Poster  https://arxiv.org/pdf/2511.21565    
Authors: Kang DU, 雪 廖, Junpeng Xia, Chaozheng Guo, Yi Gu, Yirui Guan, Duotun Wang, Sheng Huang, Zeyu Wang
Title: UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes
Abstract: Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable. However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness. We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. Each scene is captured along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation under consistent geometry, calibration, and viewpoints. With standardized evaluation protocols across lighting conditions, UAVLight provides a reliable foundation for developing and benchmarking reconstruction methods that are consistent, faithful, and relightable in real outdoor environments.
PaperID: 2306,   Poster  https://arxiv.org/pdf/2603.00207    
Authors: Soumya Suvra Ghosal, Youngeun Kim, Zhuowei Li, Ritwick Chaudhry, Linghan Xu, Hongjing Zhang, Jakub Zablocki, Yifan Xing, Qin ZHANG
Title: VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models
Abstract: Advances in large reasoning models have shown strong performance on complex reasoning tasks by scaling test-time compute through extended inference-time thinking. However, recent studies observe that in vision-dependent tasks, extended textual reasoning at inference time can often degrade performance as models progressively lose attention to visual tokens, increasingly relying on textual priors alone. To address this, prior works used reinforcement learning (RL)-based fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of inference-time compute without additional RL fine-tuning, we propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process by re-injecting a coreset of visual tokens that are semantically relevant to the reasoning context yet diverse and globally representative of the image, for more grounded multi-modal reasoning. Experiments on three visual reasoning benchmarks with state-of-the-art multi-modal large reasoning models demonstrate that under fixed inference-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.
PaperID: 2307,   Poster  https://arxiv.org/pdf/2512.02902    
Authors: weiqi li, Quande Zhang, ruifeng zhai, Liang Lin, Guangrun Wang
Title: VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling
Abstract: Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters, matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.
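Illustrative sketch (not from the paper): the abstract above describes Feature Token Modulation as a global affine transformation applied to visual tokens. A minimal reading of that idea is one learnable scale vector and one learnable shift vector shared by every token. The per-channel form, the function names, and the feature width of 2048 used in the parameter count are assumptions made for this example; the abstract does not state how its 4K parameters are laid out.

```python
def feature_token_modulation(tokens, gamma, beta):
    """Apply the same per-channel scale and shift to every visual token.

    tokens: list of tokens, each a list of D floats
    gamma:  list of D scale factors (shared by all tokens)
    beta:   list of D shift offsets (shared by all tokens)
    """
    return [
        [g * x + b for x, g, b in zip(token, gamma, beta)]
        for token in tokens
    ]


def num_ftm_parameters(dim):
    """Learnable parameter count of the sketch: one scale and one shift per channel."""
    return 2 * dim


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0]]
    # Identity initialisation leaves the pretrained features unchanged.
    print(feature_token_modulation(tokens, [1.0, 1.0], [0.0, 0.0]))
    # Assumed width of 2048 gives 4096 parameters, i.e. on the order of 4K.
    print(num_ftm_parameters(2048))
```

Because the transform is shared across tokens, its size depends only on the feature width and not on the number of visual tokens, which is why such an adapter can stay in the thousands of parameters.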
PaperID: 2308,   Poster  https://arxiv.org/pdf/2603.20448    
Authors: M. Kerem Aydin, Vishwanath Saragadam, Emma Alexander
Title: Thermal is Always Wild: Characterizing and Addressing Challenges in Thermal-Only Novel View Synthesis
Abstract: Thermal cameras provide reliable visibility in darkness and adverse conditions, but thermal imagery remains significantly harder to use for novel view synthesis (NVS) than visible-light images. This difficulty stems primarily from two characteristics of affordable thermal sensors. First, thermal images have extremely low dynamic range, which weakens appearance cues and limits the gradients available for optimization. Second, thermal data exhibit rapid frame-to-frame photometric fluctuations together with slow radiometric drift, both of which destabilize correspondence estimation and create high-frequency floater artifacts during view synthesis, particularly when no RGB guidance is available. Guided by these observations, we introduce a lightweight preprocessing and splatting pipeline that expands usable dynamic range and stabilizes per-frame photometry. Our approach achieves state-of-the-art performance across thermal-only NVS benchmarks, without requiring any dataset-specific tuning.
PaperID: 2309,   Poster  https://arxiv.org/pdf/2603.25165    
Authors: Bin Yang, Mohamed Abdelsamad, Miao Zhang, Alexandru Paul Condurache
Title: Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds
Abstract: Recent advances in self-supervised learning (SSL) for point clouds have substantially improved 3D scene understanding without human annotations. Existing approaches emphasize semantic awareness by enforcing feature consistency across augmented views or by masked scene modeling. However, the resulting representations transfer poorly to localization, and often require full finetuning for strong performance. Accurate localization is a fundamental component of 3D perception, so bridging this gap is crucial for progressing toward true 3D foundation models that support all downstream tasks on 3D data. In this work, we introduce PointINS, a localization-oriented self-supervised framework that enriches point cloud representations through geometry-aware learning. PointINS employs an orthogonal localization branch to jointly learn high-level semantic understanding and geometric reasoning, yielding localization awareness. We identify two consistent properties essential for robust localization and formulate them as complementary regularization strategies: Offset Distribution Regularization (ODR), which aligns predicted offsets with empirically observed geometric priors, and Spatial Clustering Regularization (SCR), which enforces local coherence by regularizing offsets with pseudo instance masks. Through extensive experiments across five datasets, PointINS achieves on average a +3.5% mAP improvement for indoor instance segmentation and a +4.1% PQ gain for outdoor panoptic segmentation, paving the way for scalable 3D foundation models. Code will be released upon acceptance.
PaperID: 2310,   Poster  https://arxiv.org/pdf/2602.20200    
Authors: Zaijing Li, Bing Hu, Rui Shao, Gongwei Chen, Dongmei Jiang, Pengwei Xie, Jianye Hao, Liqiang Nie
Title: Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
Abstract: Hierarchical Vision–Language–Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. They typically comprise a Vision–Language backbone for perception and understanding, together with a generative policy for action generation. However, their performance is increasingly bottlenecked by the action generation process. (i) Low inference efficiency: a pronounced distributional gap between isotropic noise priors and target action distributions increases the number of denoising steps and the incidence of infeasible samples. (ii) Poor robustness: existing policies condition solely on the current observation, neglecting the constraint of the history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the number of function evaluations (NFE). LCM dynamically models the executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and trajectory smoothness. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves a 98.6% average success rate on LIBERO, improves over π0 by 13.5% on CALVIN, and attains a 38% average success rate on RoboTwin 2.0 Hard. In real-world evaluation, OptimusVLA ranks best on the Generalization and Long-horizon suites, surpassing π0 by 42.9% and 52.4%, respectively, while delivering a 2.9× inference speedup.
PaperID: 2311,   Poster  https://arxiv.org/pdf/2508.21113    
Authors: Qi Yang, Bolin Ni, Shiming Xiang, Houwen Peng
Title: R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
Abstract: Multimodal Large Language Models (MLLMs) with explicit step-by-step reasoning have achieved strong performance on complex tasks. However, such reasoning is unnecessary for many simple queries and introduces substantial computational overhead. To address this inefficiency, we present R-4B, an auto-thinking MLLM that dynamically determines whether to invoke the reasoning process based on input complexity. Our key idea is to equip a single model with both thinking and non-thinking capabilities and train it to select the appropriate mode. We first introduce bi-mode annealing, a unified training paradigm that constructs a model competent in both reasoning-intensive and direct-answer settings without requiring explicit complexity annotations. Building on this foundation, we propose Bi-mode Policy Optimization (BPO), a lightweight reinforcement learning algorithm that employs a dual-rollout mechanism: for each input, the model generates both thinking and non-thinking responses. This prevents mode collapse and enables robust learning of an adaptive reasoning policy using only simple, rule-based rewards. Extensive experiments across 25 benchmarks show that R-4B achieves state-of-the-art performance among models of similar scale. It consistently surpasses Qwen2.5-VL-7B and matches or exceeds larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive tasks, while reducing computational cost by avoiding redundant reasoning. Our results demonstrate that adaptive auto-thinking offers an effective and scalable pathway toward more efficient multimodal reasoning models.
PaperID: 2312,   Poster  https://arxiv.org/pdf/2604.12625    
Authors: jianhui wu, Jian Zhou, Zhi Zhou, Zhangjin Huang, Chao Li
Title: Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments
Abstract: High-quality global illumination (GI) in real-time rendering is commonly achieved using precomputed lighting techniques, with lightmaps as the standard choice. To support GI for static objects in dynamic lighting environments, multiple lightmaps at different lighting conditions need to be precomputed, which incurs substantial storage and memory overhead. To overcome this limitation, we propose Neural Dynamic GI (NDGI), a novel compression technique specifically designed for temporal lightmap sets. Our method utilizes multi-dimensional feature maps and lightweight neural networks to integrate the temporal information instead of storing multiple sets explicitly, which significantly reduces the storage size of lightmaps. Additionally, we introduce a block compression (BC) simulation strategy during the training process, which enables BC compression on the final generated feature maps and further improves the compression ratio. To enable efficient real-time decompression, we also integrate a virtual texturing (VT) system with our neural representation. Compared with prior methods, our approach achieves high-quality dynamic GI while maintaining remarkably low storage and memory requirements, with only modest real-time decompression overhead. To facilitate further research in this direction, we will release our temporal lightmap dataset precomputed in multiple scenes featuring diverse temporal variations.
PaperID: 2313,   Poster  https://arxiv.org/pdf/2603.17538    
Authors: Jaein Kim, Hee Bin Yoo, Dong-Sig Han, Byoung-Tak Zhang
Title: Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis
Abstract: Symmetry under rigid motion is one of the salient factors in efficient learning on 3D point cloud problems. Group convolution has been a representative method to extract equivariant features, but its realizations have struggled to retain both rigorous symmetry and scalability simultaneously. We advocate utilizing the intertwiner framework to resolve this tradeoff, but previous works on it did not achieve complete SE(3) symmetry or scalability to large-scale problems, which necessitates a more advanced kernel architecture. We present Equivariant Coordinate-based Kernel Convolution, or ECKConv. It acquires SE(3) equivariance from the kernel domain defined in a double coset space, and its explicit kernel design using coordinate-based networks enhances its learning capability and memory efficiency. The experiments on diverse point cloud tasks, e.g., classification, pose registration, part segmentation, and large-scale semantic segmentation, validate the rigid equivariance, memory scalability, and outstanding performance of ECKConv compared to state-of-the-art equivariant methods.
PaperID: 2314,   Poster  https://arxiv.org/pdf/2604.08503    
Authors: Ying Shen, Jerry Xiong, Tianjiao Yu, Ismini Lourentzou
Title: PHANTOM: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
Abstract: Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose PHANTOM, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, PHANTOM jointly predicts latent physical dynamics and generates future video frames. PHANTOM leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of the physics-aware video representation directly into the video generation process, PHANTOM produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that PHANTOM not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.
PaperID: 2315,   Poster  https://arxiv.org/pdf/2602.16412    
Authors: Daichi Yashima, Shuhei Kurita, Yusuke Oda, Komei Sugiura
Title: ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding
Abstract: While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. In this study, we focus on video understanding by MLLMs. This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention has quadratic complexity in sequence length. In this paper, we propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To address the noise and low fidelity of block-based motions, we introduce a module to denoise and generate a fine-grained motion representation. Furthermore, our model compresses these features in a way that scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks. ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU. Our project page and code are provided in the supplementary materials.
PaperID: 2316,   Poster  https://arxiv.org/pdf/2603.14772    
Authors: Joohyun Kwon, Geonhee Sim, Gyeongsik Moon
Title: Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image
Abstract: Existing single-image 3D human avatar methods primarily rely on rigid joint transformations, limiting their ability to model realistic cloth dynamics. We present DynaAvatar, a zero-shot framework that reconstructs animatable 3D human avatars with motion-dependent cloth dynamics from a single image. Trained on large-scale multi-person motion datasets, DynaAvatar employs a Transformer-based feed-forward architecture that directly predicts dynamic 3D Gaussian deformations without subject-specific optimization. To overcome the scarcity of dynamic captures, we introduce a static-to-dynamic knowledge transfer strategy: a Transformer pretrained on large-scale static captures provides strong geometric and appearance priors, which are efficiently adapted to motion-dependent deformations through lightweight LoRA fine-tuning on dynamic captures. We further propose the DynaFlow loss, an optical flow–guided objective that provides reliable motion-direction geometric cues for cloth dynamics in rendered space. Finally, we reannotate the missing or noisy SMPL-X fittings in existing dynamic capture datasets, as most public dynamic capture datasets contain incomplete or unreliable fittings that are unsuitable for training high-quality 3D avatar reconstruction models. Experiments demonstrate that DynaAvatar produces visually rich and generalizable animations, outperforming prior methods. Code, pretrained models, and reannotations will be released.
PaperID: 2317,   Poster  https://arxiv.org/pdf/2505.20270    
Authors: Jinsheng Quan, Qiaowei Miao, Yichao Xu, Zizhuo Lin, Ying Li, Wei Yang, Zhihui Li, Yawei Luo
Title: ParticleGS: Learning Neural Gaussian Particle Dynamics from Videos for Prior-free Physical Motion Extrapolation
Abstract: The ability to extrapolate dynamic 3D scenes beyond the observed timeframe is fundamental to advancing physical world understanding and predictive modeling. Existing dynamic 3D reconstruction methods have achieved high-fidelity rendering of temporal interpolation, but typically lack physical consistency in predicting the future. To overcome this issue, we propose ParticleGS, a physics-based framework that reformulates dynamic 3D scenes as physically grounded systems. ParticleGS comprises three key components: 1) an encoder that decomposes the scene into static properties and initial dynamic physical fields; 2) an evolver based on Neural Ordinary Differential Equations (Neural ODEs) that learns continuous-time dynamics for motion extrapolation; and 3) a decoder that reconstructs 3D Gaussians from evolved particle states for rendering. Through this design, ParticleGS integrates physical reasoning into dynamic 3D representations, enabling accurate and consistent prediction of the future. Experiments show that ParticleGS achieves state-of-the-art performance in extrapolation while maintaining rendering quality comparable to leading dynamic 3D reconstruction methods.
PaperID: 2318,   Poster  https://arxiv.org/pdf/2602.19190    
Authors: Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun Baiyun, Xiaorong Guo, Qingchen Fang, Ry Zhang, Xinpeng Zhou, Haipeng Wang
Title: FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery
Abstract: Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To systematically address this issue, we constructed the inaugural SAR Image-Text-AlphaEarth feature triplet dataset and developed FUSAR-GPT, a VLM specifically for SAR. FUSAR-GPT innovatively introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', enabling dynamic compensation for the sparse representation of targets in SAR images. Furthermore, we designed a two-stage SFT strategy to decouple the knowledge injection and task execution of large models. The spatiotemporal feature embedding and the two-stage decoupling paradigm enable FUSAR-GPT to achieve state-of-the-art performance across several typical remote sensing visual-language benchmark tests, significantly outperforming mainstream baseline models by over 12%.
PaperID: 2319,   Poster  https://arxiv.org/pdf/2603.06366    
Authors: Yuxuan Fan, JING HAO, Hong Chen, Jiahao Bao, Yihua Shao, Yuci Liang, Kuo Hung, Hao Tang
Title: OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis
Abstract: Panoramic dental radiographs require fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, yet existing vision–language models operate under a static single-pass paradigm that limits their clinical reliability. In this paper, we introduce OralGPT-Plus, an agentic vision–language model designed to perform iterative and symmetry-aware diagnostic reasoning for panoramic dental radiograph analysis. To support this paradigm, we construct DentalProbe, a five-thousand–image dataset with expert-curated diagnostic trajectories that provide structured supervision for localized inspection and contralateral comparison. We further develop a Reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning with rubric-based reward and conditioned diagnostic-driven reward. In parallel, we present MMOral-X, the first benchmark for holistic panoramic diagnosis, containing 300 open-ended questions and region-level annotations across multiple difficulty levels. OralGPT-Plus demonstrates consistent and reliable improvements over strong baselines on MMOral-X and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning. Our work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis. Code, benchmark, and models will be made publicly available.
PaperID: 2320,   Poster  https://arxiv.org/pdf/2511.11502    
Authors: Nhat Hoang, Minh Vu, My T. Thai, Manish Bhattarai
Title: PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models
Abstract: Large vision–language models (LVLMs) are powerful, yet they remain unreliable due to object hallucinations. In this work, we show that in many hallucinatory predictions the LVLM effectively ignores the image and instead relies on previously generated output (“prelim”) tokens to infer new objects. We quantify this behavior via the mutual information between the image and the predicted object conditioned on the prelim, demonstrating that weak image dependence strongly correlates with hallucination. Building on this finding, we introduce the Prelim Attention Score (PAS), a lightweight, training-free signal computed from attention weights over prelim tokens. PAS requires no additional forward passes and can be computed on the fly during inference. Exploiting this previously overlooked signal, PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.
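Illustrative note (not from the paper): the abstract describes PAS only as a signal computed from attention weights over prelim tokens. The sketch below shows one plausible reading, assuming the score is the attention mass that the token being generated places on prelim positions, averaged over heads. The function name `prelim_attention_score`, the head-averaging rule, the toy attention values, and the 0.5 threshold are all assumptions made for illustration; the paper's actual choice of layers, heads, and aggregation may differ.

```python
from typing import Sequence


def prelim_attention_score(
    attn_per_head: Sequence[Sequence[float]],
    prelim_positions: Sequence[int],
) -> float:
    """Mean over heads of the attention mass placed on prelim-token positions.

    attn_per_head[h][i] is the (already softmax-normalized) attention weight
    that the token being generated assigns to position i in head h.
    """
    per_head = [sum(head[i] for i in prelim_positions) for head in attn_per_head]
    return sum(per_head) / len(per_head)


# Toy sequence (hypothetical layout): positions 0-2 are image tokens,
# position 3 is an instruction token, positions 4-5 are prelim tokens.
attn = [
    [0.10, 0.10, 0.10, 0.10, 0.30, 0.30],  # head 0
    [0.05, 0.05, 0.10, 0.20, 0.20, 0.40],  # head 1
]
score = prelim_attention_score(attn, prelim_positions=[4, 5])

# Illustrative decision rule: a high score means the prediction leans on
# earlier generated text rather than on the image.
flagged = score > 0.5
```

Because the score reuses attention weights that the forward pass already produces, a check of this kind adds no extra forward passes, which matches the abstract's claim that PAS can be computed on the fly.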
PaperID: 2321,   Poster  https://arxiv.org/pdf/2603.11676    
Authors: Yongqi Ding, Kunshan Yang, Linze Li, Yiyang Zhang, Mengmeng Jing, Lin Zuo
Title: Stable Spike: Dual Consistency Optimization via Bitwise AND Operations for Spiking Neural Networks
Abstract: Although the temporal spike dynamics of spiking neural networks (SNNs) enable low-power temporal capture capabilities, they also incur inherent inconsistencies that severely compromise representation. In this paper, we perform dual consistency optimization via Stable Spike to mitigate this problem, thereby improving the recognition performance of SNNs. With the hardware-friendly “AND” bit operation, we efficiently decouple the stable spike skeleton from the multi-timestep spike maps, which captures critical semantics and reduces the inconsistency from variable noise spikes. Enforcing the unstable spike maps to converge to the stable spike skeleton significantly improves the inherent consistency across timesteps. Furthermore, we inject amplitude-aware spike noise into the stable spike skeleton to diversify the representations while preserving consistent semantics. The SNN is encouraged to produce perturbation-consistent predictions, thereby contributing to generalization. Extensive experiments across multiple architectures and datasets validate the effectiveness and versatility of our method. In particular, our method significantly advances neuromorphic object recognition under ultra-low latency, improving accuracy by up to 8.33%. This will help unlock the full power consumption and speed potential of SNNs.
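Illustrative note (not from the paper): the bitwise AND step described in the abstract can be sketched on toy binary spike maps. The sketch assumes the stable spike skeleton is the element-wise AND of the spike maps across all timesteps, so a position survives only if it fires at every timestep. The helper names `stable_skeleton` and `unstable_residual` and the toy data are hypothetical, and the paper's consistency losses and noise injection are not modeled here.

```python
from typing import List

SpikeMap = List[List[int]]  # one binary spike map: rows x cols of 0/1


def stable_skeleton(spike_maps: List[SpikeMap]) -> SpikeMap:
    """Element-wise AND over timesteps: 1 only where every timestep spiked."""
    skeleton = [row[:] for row in spike_maps[0]]
    for spikes in spike_maps[1:]:
        for r, row in enumerate(spikes):
            for c, value in enumerate(row):
                skeleton[r][c] &= value
    return skeleton


def unstable_residual(spike_maps: List[SpikeMap], skeleton: SpikeMap) -> List[SpikeMap]:
    """Per-timestep spikes that are not part of the skeleton."""
    return [
        [[v & (1 - s) for v, s in zip(row, srow)] for row, srow in zip(spikes, skeleton)]
        for spikes in spike_maps
    ]


# Three timesteps of a 2x3 spike map (toy data).
maps = [
    [[1, 0, 1], [0, 1, 1]],
    [[1, 1, 1], [0, 0, 1]],
    [[1, 0, 0], [0, 1, 1]],
]
skeleton = stable_skeleton(maps)
residual = unstable_residual(maps, skeleton)
```

On this toy input only the top-left and bottom-right positions fire at every timestep, so they form the skeleton, and everything else lands in the per-timestep residual.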
PaperID: 2322,   Poster  https://arxiv.org/pdf/2604.08883    
Authors: Chengjie Fan, Cong Pan, Zijian Liu, Ningzhong Liu, Jie Qin
Title: HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation
Abstract: Inspired by the general Vision-and-Language Navigation (VLN) task, aerial VLN has drawn widespread attention, owing to its significant application value in areas such as logistics delivery and urban inspection. However, existing methods in complex urban environments face several challenges, including insufficient generalization to unknown scenes, suboptimal performance in long-distance path planning, and inadequate understanding of spatial continuity. To address these challenges, we propose HTNav, a new collaborative navigation framework that blends Imitation Learning (IL) and Reinforcement Learning (RL) into a hybrid IL-RL paradigm. This framework adopts a staged training mechanism to ensure the stability of the basic navigation strategy while enhancing its environmental exploration capability. By integrating a tiered decision-making mechanism, it achieves collaborative interaction between macro-level path planning and fine-grained action control. Furthermore, a map representation learning module is introduced to deepen its understanding of spatial continuity in open domains. On the CityNav benchmark, our method achieves state-of-the-art performance at all levels of scenes and task difficulties. Experimental results demonstrate that this framework significantly improves navigation precision and robustness in complex urban environments.
PaperID: 2323,   Poster  https://arxiv.org/pdf/2511.20302    
Authors: Shilei Cao, Ziyang Gong, Hehai Lin, Yang Liu, Jiashun Cheng, Xiaoxing Hu, Haoyuan Liang, Guowen Li, Chengwei Qin, Hong Cheng, Xue Yang, Juepeng Zheng, Haohuan Fu
Title: CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation
Abstract: In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module's importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance across 16 cross-domain benchmarks for RS semantic segmentation. The code of the work is provided in the supplementary material.
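Illustrative note (not from the paper): a Fisher-guided selection step of the kind the abstract describes can be sketched with a common diagonal empirical-Fisher proxy, namely the mean squared gradient of the task loss with respect to each module's parameters, followed by keeping the top-k modules. The abstract does not specify the scoring or selection rule, so the proxy, the top-k rule, the module names such as `spatial@layer3`, and the gradient values are all assumptions made for illustration.

```python
from typing import Dict, List


def fisher_importance(grads_per_batch: List[Dict[str, List[float]]]) -> Dict[str, float]:
    """Diagonal empirical-Fisher proxy per module.

    For each module, average the mean squared gradient over all batches.
    """
    totals: Dict[str, float] = {}
    for grads in grads_per_batch:
        for name, g in grads.items():
            totals[name] = totals.get(name, 0.0) + sum(x * x for x in g) / len(g)
    n_batches = len(grads_per_batch)
    return {name: total / n_batches for name, total in totals.items()}


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Names of the k highest-scoring modules, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy gradients for three candidate modules at one layer over two batches.
grads = [
    {"spatial@layer3": [0.5, -0.5], "semantic@layer3": [0.1, 0.1], "frequency@layer3": [1.0, 0.0]},
    {"spatial@layer3": [0.5, 0.5], "semantic@layer3": [-0.1, 0.1], "frequency@layer3": [0.2, 0.0]},
]
scores = fisher_importance(grads)
active = select_top_k(scores, k=2)
```

In this toy case the frequency and spatial modules receive the largest scores and would be activated, while the semantic module stays off. In a real training loop the gradients would come from backpropagating the task loss, and the selection could be refreshed periodically.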
PaperID: 2324,   Poster  https://arxiv.org/pdf/2508.04728    
Authors: Shuo Chen, Yijin Li, Xi Zheng, Guofeng Zhang
Title: Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy
Abstract: The 3D characterization of microstructures is crucial for understanding and designing functional materials. However, the scanning electron microscope (SEM), widely used in scientific research, captures only 2D electron intensity distributions. Existing SEM 3D reconstruction methods struggle with textureless regions, shadowing artifacts, and calibration dependencies, whereas advanced learning-based approaches fail to generalize to microscopic SEM domains due to the lack of physical priors and domain-specific data. To address these challenges, we introduce NFH-SEM, a neural field-based hybrid reconstruction framework that recovers high-fidelity 3D surfaces from multi-view, multi-detector SEM images. NFH-SEM integrates coarse multi-view geometry with photometric stereo cues from detector signals through a continuous neural field, incorporating a learnable forward model that embeds SEM imaging physics for self-calibrated, shadow-robust reconstruction. NFH-SEM achieves precise recovery across diverse specimens, revealing 478 nm layered features in two-photon lithography samples, 782 nm surface textures on pollen grains, and 1.559 μm fracture steps on silicon carbide particles, demonstrating its accuracy and broad applicability.
PaperID: 2325,   Poster  https://arxiv.org/pdf/2603.27558    
Authors: Baoheng Zhang, Jiahui Liu, Zhao Gui, Zhang Weizhou, YIXUAN MA, Jun Jiang, Yingxian Chen, Wilton.W.T. Fok, Xiaojuan Qi, Hayden Kwok-Hay So
Title: Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models
Abstract: Multimodal Large Language Models (MLLMs) perform strong vision–language reasoning under standard conditions but fail in extreme illumination, where RGB inputs irrevocably lose structure and semantics. We propose Event-MLLM, an event-enhanced model that performs all-light visual reasoning by dynamically fusing event streams with RGB frames. Two key components drive our approach: an Illumination Indicator -- a learnable signal derived from a DINOv2 branch that represents exposure degradation and adaptively modulates event-RGB fusion -- and an Illumination Correction Loss that aligns fused features with non-degraded (normal-light) semantics in the latent space, compensating for information lost in extreme lighting. We curate the first multi-illumination event-instruction corpus for MLLMs, with 2,241 event-RGB samples (around 6 QA pairs each) across diverse scenes and 17 brightness rates (0.05× – 20×), plus an instruction-following benchmark for reasoning, counting, and fine-grained recognition under extreme lighting. Experiments show that Event-MLLM markedly outperforms general-purpose, illumination-adaptive, and event-only baselines, setting a new state of the art in robust multimodal perception and reasoning under challenging illumination.
PaperID: 2326,   Poster  https://arxiv.org/pdf/2511.22787    
Authors: Eunsu Kim, Junyeong Park, Na Min An, Jun Kim, Hitesh Patel, Jiho Jin, Julia Kruk, Amit Agarwal, Srikant Panda, Fenal Ilasariya, Hyunjung Shim, Alice Oh
Title: World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
Abstract: In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find that supervised fine-tuning on a diverse culture mixing dataset substantially improves model consistency and reduces background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.
PaperID: 2327,   Poster  https://arxiv.org/pdf/2508.19786    
Authors: Han Jiao, Jiakai Sun, Yexing Xu, Lei Zhao, Wei Xing, Huaizhong Lin
Title: MAPo: Motion-Aware Partitioning of Deformable 3D Gaussian Splatting for High-Fidelity Dynamic Scene Reconstruction
Abstract: 3D Gaussian Splatting, known for enabling high-quality static scene reconstruction with fast rendering, is increasingly being applied to multi-view dynamic scene reconstruction. A common strategy involves learning a deformation field to model the temporal changes of a canonical set of 3D Gaussians. However, these deformation-based methods often produce blurred renderings and lose fine motion details in highly dynamic regions due to the inherent limitations of a single, unified model in representing diverse motion patterns. To address these challenges, we introduce Motion-Aware Partitioning of Deformable 3D Gaussian Splatting (MAPo), a novel framework for high-fidelity dynamic scene reconstruction. Its core is a dynamic score-based partitioning strategy that distinguishes between high- and low-dynamic 3D Gaussians. For high-dynamic 3D Gaussians, we recursively partition them temporally and duplicate their deformation networks for each new temporal segment, enabling specialized modeling to capture intricate motion details. Concurrently, low-dynamic 3DGs are treated as static to reduce computational costs. However, this temporal partitioning strategy for high-dynamic 3DGs can introduce visual discontinuities across frames at the partition boundaries. To address this, we introduce a cross-frame consistency loss, which not only ensures visual continuity but also further enhances rendering quality. Extensive experiments demonstrate that MAPo achieves superior rendering quality compared to baselines while maintaining comparable computational costs, particularly in regions with complex or rapid motions.
PaperID: 2328,   Poster  https://arxiv.org/pdf/2512.00075    
Authors: Jun Jia, Hongyi Miao, Yingjie Zhou, Wangqiu Zhou, Jianbo Zhang, Linhan Cao, Dandan Zhu, Hua Yang, Xiongkuo Min, Wei Sun, Guangtao Zhai
Title: Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation
Abstract: With the rapid progress in diffusion models, image synthesis has advanced to the stage of zero-shot image-to-image generation, where high-fidelity replication of facial identities or artistic styles can be achieved using just one portrait or artwork, without modifying any model weights. Although these techniques significantly enhance creative possibilities, they also pose substantial risks related to intellectual property violations, including unauthorized identity cloning and stylistic imitation. To counter such threats, this work presents Adapter Shield, the first universal and authentication-integrated solution aimed at defending personal images from misuse in zero-shot generation scenarios. We first investigate how current zero-shot methods employ image encoders to extract embeddings from input images, which are subsequently fed into the UNet of diffusion models through cross-attention layers. Inspired by this mechanism, we construct a reversible encryption system that maps original embeddings into distinct encrypted representations according to different secret keys. The authorized users can restore the authentic embeddings via a decryption module and the correct key, enabling normal usage for authorized generation tasks. For protection purposes, we design a multi-target adversarial perturbation method that actively shifts the original embeddings toward designated encrypted patterns. Consequently, protected images are embedded with a defensive layer that ensures unauthorized users can only produce distorted or encrypted outputs. Extensive evaluations demonstrate that our method surpasses existing state-of-the-art defenses in blocking unauthorized zero-shot image synthesis, while supporting flexible and secure access control for verified users.
PaperID: 2329,   Poster  https://arxiv.org/pdf/2505.19459    
Authors: kaichao jiang, He Wang, Xiaoshuai Hao, Xiulong Yang, Ajian Liu, Qi Chu, Yunfeng Diao, Richang Hong
Title: Your Classifier Can Do More: Towards Bridging the Gaps in Classification, Robustness, and Generation
Abstract: Joint Energy-based Models (JEMs) are well known for their ability to unify classification and generation within a single framework. Despite their promising generative and discriminative performance, their robustness remains far inferior to adversarial training (AT), which, conversely, achieves strong robustness but sacrifices clean accuracy and lacks generative ability. This inherent trilemma (balancing classification accuracy, robustness, and generative capability) raises a fundamental question: Can a single model achieve all three simultaneously? To answer this, we conduct a systematic energy landscape analysis of clean, adversarial, and generated samples across various JEM and AT variants. We observe that AT reduces the energy gap between clean and adversarial samples, while JEMs narrow the gap between clean and synthetic ones. This observation suggests a key insight: if the energy distributions of all three data types can be aligned, we might bridge their performance disparities. Building on this idea, we propose Energy-based Joint Distribution Adversarial Training (EB-JDAT), a unified generative-discriminative-robust framework that maximizes the joint probability of clean and adversarial distributions. EB-JDAT introduces a novel min–max energy optimization to explicitly align energies across clean, adversarial, and generated samples. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet subsets demonstrate that EB-JDAT achieves state-of-the-art robustness while maintaining near-original accuracy and generation quality of JEMs, effectively resolving the triple trade-off between accuracy, robustness, and generation.
PaperID: 2330,   Poster  https://arxiv.org/pdf/2604.16680    
Authors: Yuval Haitman, Amit Efraim, Joseph M. Francos
Title: C-GenReg: Training-Free 3D Point Cloud Registration by Multi-View-Consistent Geometry-to-Image Generation with Probabilistic Modalities Fusion
Abstract: We introduce C-GenReg, a training-free framework for 3D point cloud registration that leverages the complementary strengths of world-scale generative priors and registration-oriented Vision Foundation Models (VFMs). Current learning-based 3D point cloud registration methods struggle to generalize across sensing modalities, sampling differences, and environments. Hence, C-GenReg augments the geometric point cloud registration branch by transferring the matching problem into an auxiliary image domain, where VFMs excel, using a World Foundation Model to synthesize multi-view-consistent RGB representations from the input geometry. This generative transfer preserves spatial coherence across source and target views without any fine-tuning. From these generated views, a VFM pretrained for finding dense correspondences extracts matches. The resulting pixel correspondences are lifted back to 3D via the original depth maps. To further enhance robustness, we introduce a “Match-then-Fuse” probabilistic cold-fusion scheme that combines two independent correspondence posteriors, that of the generated-RGB branch with that of the raw geometric branch. This principled fusion preserves each modality's inductive bias and provides calibrated confidence without any additional learning. C-GenReg is zero-shot and plug-and-play: all modules are pretrained and operate without fine-tuning. Extensive experiments on indoor (3DMatch, ScanNet) and outdoor (Waymo) benchmarks demonstrate strong zero-shot performance and superior cross-domain generalization. For the first time, we demonstrate a generative registration framework that operates successfully on real outdoor LiDAR data, where no imagery data is available.
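The abstract does not spell out how the two correspondence posteriors are combined. As a minimal sketch, assuming each branch yields a posterior over the same candidate target points for one source point and that the branches are treated as independent, one common choice is a normalized product. The function name `fuse_posteriors` and the toy numbers are hypothetical and are not taken from the paper.

```python
def fuse_posteriors(p_rgb, p_geo, eps=1e-12):
    """Fuse two posteriors over the same candidate matches by a normalized product.

    Assumption: the branches are treated as independent, so their
    probabilities multiply. This is an illustrative rule, not the paper's.
    """
    if not p_rgb or len(p_rgb) != len(p_geo):
        raise ValueError("both posteriors must cover the same non-empty candidate set")
    joint = [a * b for a, b in zip(p_rgb, p_geo)]
    total = sum(joint)
    if total < eps:
        # The branches share no support: fall back to a uniform posterior.
        return [1.0 / len(joint)] * len(joint)
    return [j / total for j in joint]


# Toy posteriors for one source point over three candidate target points.
p_rgb = [0.6, 0.3, 0.1]  # hypothetical generated-RGB branch
p_geo = [0.5, 0.1, 0.4]  # hypothetical geometric branch
fused = fuse_posteriors(p_rgb, p_geo)
best = max(range(len(fused)), key=lambda i: fused[i])
print(best, [round(p, 3) for p in fused])
```

Under this assumed rule, a candidate that both branches support gains confidence, while a candidate favored by only one branch is down-weighted rather than discarded.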
PaperID: 2331,   Poster  https://arxiv.org/pdf/2604.16481    
Authors: Hoigi Seo, Byung Hyun Lee, Jaehyun Cho, Sungjin Lim, Se Young Chun
Title: Erasing Thousands of Concepts: Towards Scalable and Practical Concept Erasure for Text-to-Image Diffusion Models
Abstract: Large-scale text-to-image (T2I) diffusion models deliver remarkable visual fidelity but pose safety risks due to their capacity to reproduce undesirable content, such as copyrighted material. Concept erasure has emerged as a mitigation strategy, yet existing approaches struggle to balance scalability, precision, and robustness, which restricts their applicability to erasing only a few hundred concepts. To address these limitations, we present Erasing Thousands of Concepts (ETC), a scalable framework capable of erasing thousands of concepts while preserving generation quality. Our method first models low-rank concept distributions via a Student’s t-distribution Mixture Model (tMM). It enables pin-point erasure of target concepts via affine optimal transport while preserving others by anchoring the boundaries of target concept distributions without pre-defined anchor concepts. We then train a Mixture-of-Experts (MoE)-based module, termed MoEraser, which removes target embeddings while preserving the anchor embeddings. By injecting noise into the text embedding projector and fine-tuning MoEraser for recovery, our framework achieves robustness to white-box attacks such as module removal. Extensive experiments on over 2,000 concepts across heterogeneous domains and diffusion models demonstrate state-of-the-art scalability and precision in large-scale concept erasure.
PaperID: 2332,   Poster  https://arxiv.org/pdf/2603.18757    
Authors: Haochen Li, Rui Zhang, Hantao Yao, Xin Zhang, Yifan Hao, Shaohui Peng, Yongwei Zhao, Ling Li
Title: DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection
Abstract: Domain Adaptive Object Detection (DAOD) aims to transfer detectors from a labeled source domain to an unlabeled target domain. Existing DAOD methods employ multi-granularity feature alignment to learn domain-invariant representations. However, the local connectivity of their CNN-based backbone and detection head restricts alignment to local regions, failing to extract global domain-invariant features. Although transformer-based DAOD methods capture global dependencies via attention mechanisms, their quadratic computational cost hinders practical deployment. To solve this, we propose DA-Mamba, a hybrid CNN-State Space Model (SSM) architecture that combines the efficiency of CNNs with the linear-time long-range modeling capability of SSMs to capture both global and local domain-invariant features. Specifically, we introduce two novel modules: Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM). IA-SSM is integrated into the backbone to enhance global domain awareness, enabling image-level global and local alignment. OA-SSM is inserted into the detection head to model spatial and semantic dependencies among objects, enhancing instance-level alignment. Comprehensive experiments demonstrate that the proposed method can efficiently improve the cross-domain performance of the object detector.
PaperID: 2333,   Poster  https://arxiv.org/pdf/2508.03404    
Authors: Xinlei Yu, Chengming Xu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Cheng Yang, Jiangning Zhang, Shuicheng Yan, Xiaobin Hu
Title: Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling
Abstract: The dominant paradigm of monolithic scaling in Vision-Language Models (VLMs) is failing for understanding and reasoning in documents, yielding diminishing returns as it struggles with the inherent need of this domain for document-based procedural reasoning, cognitive complexity, and factual accuracy. To this end, we introduce MACT, a Multi-Agent Collaboration framework with agent-wise adaptive Test-time scaling that pioneers a paradigm shift to procedural scaling, adapting dynamically to the functional entities of visual document understanding and reasoning. MACT decomposes the visual document processing flow into four specialized agents, i.e., planning, execution, judgment, and answer, to resolve cognitive overload and introduce a critical self-correction loop for factual grounding. This collaborative architecture is amplified by an agent-wise adaptive test-time scaling strategy that intelligently allocates computational resources based on the complexity and redundancy of each functionality. Evaluated on multiple visual document understanding benchmarks, MACT achieves superior performance with a smaller parameter scale, adapting effectively to various document scenarios without compromising its general or mathematical reasoning capabilities. The three variants of MACT consistently attain top-three average performance rankings, with average performance enhancements of 9.9–11.5% over the base models. The source code will be released publicly.
PaperID: 2334,   Poster  https://arxiv.org/pdf/2508.15778    
Authors: YIFAN LIAO, Yuxin Cao, Yedi Zhang, Wentao He, Yan XIAO, Xianglong Du, Zhiyong Huang, Jin Song Dong
Title: Towards Stealthy and Effective Backdoor Attacks on Lane Detection: A Naturalistic Data Poisoning Approach
Abstract: Deep learning-based lane detection (LD) plays a critical role in autonomous driving and advanced driver assistance systems. However, its vulnerability to backdoor attacks presents a significant security concern. Existing backdoor attack methods on LD often exhibit limited practical utility due to the artificial and conspicuous nature of their triggers. To address this limitation and investigate the impact of more ecologically valid backdoor attacks on lane detection models, we examine the common data poisoning attack and introduce DBALD, a novel diffusion-based data poisoning framework for generating naturalistic backdoor triggers. DBALD comprises two key components: optimal trigger position finding and stealthy trigger generation. Given the insight that attack performance varies depending on the trigger position, we propose a heatmap-based method to identify the optimal trigger location, with gradient analysis to generate attack-specific heatmaps. A region-based editing diffusion process is then applied to synthesize visually plausible triggers within the most susceptible regions identified previously. Furthermore, to ensure scene integrity and a stealthy attack, we introduce two loss strategies: one for preserving lane structure and another for maintaining the consistency of the driving scene. Consequently, compared to existing attack methods, DBALD achieves both a high attack success rate and superior stealthiness. Extensive experiments on 4 mainstream lane detection models show that DBALD exceeds state-of-the-art methods, with an average success rate improvement of +10.87% and significantly enhanced stealthiness. The experimental results highlight significant practical challenges in ensuring model robustness against real-world backdoor threats in lane detection. Our data and demos are available at https://sites.google.com/view/dbald.
PaperID: 2335,   Poster  https://arxiv.org/pdf/2511.21428    
Authors: Jiajie Zhang, Sören Schwertfeger, Alexander Kleiner
Title: From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
Abstract: We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.
PaperID: 2336,   Poster  https://arxiv.org/pdf/2603.13352    
Authors: Xi Chen, Maojun Zhang, Yu Liu, Shen Yan
Title: Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts
Abstract: Domain Generalization Semantic Segmentation (DGSS) in spectral remote sensing is severely challenged by spectral shifts across diverse acquisition conditions, which cause significant performance degradation for models deployed in unseen domains. While Parameter-Efficient Fine-Tuning (PEFT) on foundation models is a promising direction, existing methods employ global, homogeneous adjustments. This "one-size-fits-all" tuning struggles with the spatial heterogeneity of land cover, causing semantic confusion. We argue that the key to robust DGSS lies not in a single global adaptation, but in performing fine-grained, spatially-adaptive refinement of a foundation model's features. To achieve this, we propose SpectralMoE, a novel PEFT framework for DGSS. It operationalizes this principle by utilizing a Mixture-of-Experts (MoE) architecture to perform local precise refinement on the foundation model's features, incorporating depth features estimated from selected RGB bands of the spectral remote sensing imagery to guide the fine-tuning process. Specifically, SpectralMoE employs a dual-gated MoE architecture that independently routes visual and depth features to top-k selected experts for specialized refinement, enabling modality-specific adjustments. A subsequent cross-attention mechanism then judiciously fuses the refined structural cues into the visual stream, mitigating semantic ambiguities caused by spectral variations. Extensive experiments show that SpectralMoE sets a new state-of-the-art on multiple DGSS benchmarks across hyperspectral, multispectral, and RGB remote sensing imagery.
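The routing idea in this abstract, with two gates that independently send visual and depth features to their top-k experts, can be illustrated with a small sketch. It is a generic top-k softmax gate rather than SpectralMoE's implementation: the gate logits are passed in directly instead of being learned, and `make_scaler`, `top_k_gate`, and `moe_refine` are hypothetical names introduced only for illustration.

```python
import math


def top_k_gate(logits, k):
    """Return {expert_index: weight}: softmax over the k largest gate logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}  # shift for stability
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}


def moe_refine(feature, gate_logits, experts, k=2):
    """Weighted sum of the outputs of the top-k experts for one feature vector."""
    out = [0.0] * len(feature)
    for idx, weight in top_k_gate(gate_logits, k).items():
        expert_out = experts[idx](feature)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


def make_scaler(scale):
    """Toy expert that multiplies every channel by a constant (hypothetical)."""
    return lambda x: [scale * v for v in x]


experts = [make_scaler(1.0), make_scaler(2.0), make_scaler(3.0)]

# "Dual-gated": each modality is routed by its own gate logits.
visual_out = moe_refine([1.0, 2.0], gate_logits=[0.0, 1.0, 1.0], experts=experts, k=2)
depth_out = moe_refine([4.0, 4.0], gate_logits=[2.0, 0.0, -1.0], experts=experts, k=1)
print(visual_out, depth_out)
```

Keeping the two gates separate lets each modality select different experts for the same spatial location, which is the property the abstract describes as modality-specific adjustment.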
PaperID: 2337,   Poster  https://arxiv.org/pdf/2604.10127    
Authors: Longteng Jiang, DanDan Zheng, Qianqian Qiao, Heng Huang, Huaye Wang, Yihang Bo, Bao Peng, Jingdong Chen, JUN ZHOU, Xin Jin
Title: VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
Abstract: The rapid advancement of AIGC-based video generation has underscored the critical need for comprehensive evaluation frameworks that go beyond traditional generation quality metrics to encompass aesthetic appeal. However, existing benchmarks remain largely focused on technical fidelity, leaving a significant gap in holistic assessment, particularly with respect to perceptual and artistic qualities. To address this limitation, we introduce VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality. VGA-Bench is built upon a principled three-tier taxonomy: Aesthetic Quality, Aesthetic Tagging, and Generation Quality, each decomposed into multiple fine-grained sub-dimensions to enable systematic assessment. Guided by this taxonomy, we design 1,016 diverse prompts and generate a large-scale dataset of over 60,000 videos using 12 video generation models, ensuring broad coverage across content, style, and artifacts. To enable scalable and automated evaluation, we annotate a subset of the dataset via human labeling and develop three dedicated multi-task neural assessors: VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation and basic quality attributes. Extensive experiments demonstrate that our models achieve reliable alignment with human judgments, offering both accuracy and efficiency. We release VGA-Bench as a public benchmark to foster research in AIGC evaluation, with applications in content moderation, model debugging, and generative model optimization.
PaperID: 2338,   Poster  https://arxiv.org/pdf/2603.13682    
Authors: Sungrae Hong, Jiwon Jeong, Jisu Shin, Donghee Han, Sol Lee, Kyungeun Kim, Mun Yi
Title: Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning
Abstract: Multiple Instance Learning (MIL) has emerged as a promising paradigm for Whole Slide Image (WSI) diagnosis, offering effective learning with limited annotations. However, existing MIL frameworks overlook diagnostic priorities and fail to differentiate the severity of misclassifications in multiclass settings, leaving clinically critical errors unaddressed. We propose a mistake-severity-aware training strategy that organizes diagnostic classes into a hierarchical structure, with each level optimized using a severity-weighted cross-entropy loss that penalizes high-severity misclassifications more strongly. Additionally, hierarchical consistency is enforced through probabilistic alignment, a semantic feature remix applied to the instance bag to robustly train class priority and accommodate clinical cases involving multiple symptoms. An asymmetric Mikel’s Wheel-based metric is also introduced to quantify the severity of errors specific to medical fields. Experiments on challenging public and real-world in-house datasets demonstrate that our approach significantly mitigates critical errors in MIL diagnosis compared to existing methods. We present additional experimental results on natural domain data to demonstrate the generalizability of our proposed method beyond medical contexts.
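The abstract does not give the form of the severity-weighted cross-entropy. The sketch below is one possible reading, assuming the loss on the true class is scaled by an asymmetric severity matrix indexed as `severity[true][predicted]`. The three class labels and all severity values are made up for illustration and do not come from the paper.

```python
import math


def severity_weighted_ce(probs, target, severity):
    """Cross-entropy on the true class, scaled by how severe the current mistake is.

    Assumption: severity[t][p] is the cost of predicting class p when the
    truth is t; correct predictions keep a weight of 1.0.
    """
    predicted = max(range(len(probs)), key=lambda k: probs[k])
    weight = 1.0 if predicted == target else severity[target][predicted]
    return -weight * math.log(max(probs[target], 1e-12))


# Hypothetical classes: 0 = benign, 1 = low-grade, 2 = malignant.
# The matrix is asymmetric: missing a malignant case (row 2, column 0)
# costs more than over-calling a benign one (row 0, column 2).
severity = [
    [1.0, 1.5, 2.0],
    [1.5, 1.0, 1.5],
    [5.0, 3.0, 1.0],
]

missed_malignant = severity_weighted_ce([0.7, 0.2, 0.1], target=2, severity=severity)
overcalled_benign = severity_weighted_ce([0.1, 0.2, 0.7], target=0, severity=severity)
print(round(missed_malignant, 3), round(overcalled_benign, 3))
```

Both toy predictions assign the same probability (0.1) to the true class, so plain cross-entropy would score them equally; only the severity weight separates them.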
PaperID: 2339,   Poster  https://arxiv.org/pdf/2603.24260    
Authors: Tianyi Liu, Ye Lu, Linfeng Zhang, Chen Cai, Jianjun Gao, Yi Wang, Kim-Hui Yap, Lap-Pui Chau
Title: Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep
Abstract: Diffusion-based video editing has emerged as an important paradigm for high-quality and flexible content generation. However, despite their generality and strong modeling capacity, diffusion transformers remain computationally expensive due to the iterative denoising process, posing challenges for practical deployment. Existing video diffusion acceleration methods primarily exploit denoising timestep-level feature reuse, which mitigates the redundancy in the denoising process but overlooks the architectural redundancy within the Diffusion Transformer (DiT) itself, where many attention operations over spatio-temporal tokens are redundantly executed, offering little to no incremental contribution to the model’s output. This work introduces HetCache, a training-free diffusion acceleration framework designed to exploit the inherent heterogeneity in diffusion transformers and video editing tasks. Instead of uniformly reusing or randomly sampling tokens, HetCache assesses the contextual relevance and interaction strength among various types of tokens in designated computing steps. Guided by spatial priors, it divides the spatial-temporal tokens in the DiT model into context and generative tokens, and selectively caches the context tokens that exhibit the strongest correlation and most representative semantics with generative ones. This strategy effectively reduces redundant attention operations while maintaining editing consistency and fidelity. Experiments show that HetCache achieves noticeable acceleration, including a 2.67× latency speedup and a noticeable FLOPs reduction, over commonly used foundation models with negligible degradation in editing quality.
PaperID: 2340,   Poster  https://arxiv.org/pdf/2603.11640    
Authors: Sizhong Qin, Ramon Weber, Xinzheng Lu
Title: Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans
Abstract: Architectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show that the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.
PaperID: 2341,   Poster  https://arxiv.org/pdf/2512.01223    
Authors: Beining Xu, Siting Zhu, Zhao Jin, Junxian Li, Hesheng Wang
Title: S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
Abstract: 3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multimodal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. However, MLLMs primarily process 2D visual inputs and struggle with understanding the 3D spatial structure of scenes solely from these limited perspectives. Existing methods mainly utilize viewpoint-dependent rendering of reconstructed point clouds to provide explicit structural guidance for MLLMs in 3DVG tasks, leading to inefficiency and limited spatial reasoning. To address this issue, we propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning. We introduce a spatial guidance strategy that leverages the structure awareness of feed-forward 3D reconstruction. By acquiring 3D structural understanding during training, our model can implicitly reason about 3D scenes without relying on inefficient point cloud reconstruction. Moreover, we propose a structure-enhanced module (SE), which first employs intra-view and inter-view attention mechanisms to capture dependencies within views and correspondences across views. The module further integrates multi-level position encoding to associate visual representations with spatial positions and viewpoint information, enabling more accurate structural understanding. Extensive experiments demonstrate that S$^2$-MLLM unifies superior performance, generalization, and efficiency, achieving significant performance gains over existing methods across the ScanRefer, Nr3D, and Sr3D datasets. Code will be available upon acceptance.
PaperID: 2342,   Poster  https://arxiv.org/pdf/2603.22153    
Authors: Liu Kejia, Haoyang Zhou, Ruoyu Xu, Peicheng Wang, Mingli Song, Haofei Zhang
Title: Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation
Abstract: Recent advances in cross-view geo-localization (CVGL) methods have shown strong potential for supporting unmanned aerial vehicle (UAV) navigation in GNSS-denied environments. However, existing work predominantly focuses on matching UAV views to onboard map tiles, which introduces an inherent trade-off between accuracy and storage overhead, and overlooks the importance of the UAV’s heading during navigation. Moreover, the substantial discrepancies and varying overlaps in cross-view scenarios have been insufficiently considered, limiting their generalization to real-world scenarios. In this paper, we present Bearing-UAV, a purely vision-driven cross-view navigation method that jointly predicts UAV absolute location and heading from neighboring features, enabling accurate, lightweight, and robust navigation in the wild. Our method leverages global and local structural features and explicitly encodes relative spatial relationships, making it robust to cross-view variations, misalignment, and feature-sparse conditions. We also present Bearing-UAV-90k, a multi-city benchmark for evaluating cross-view localization and navigation. Extensive experiments show encouraging results: Bearing-UAV yields lower localization error than the previous matching/retrieval paradigm across diverse terrains. Our code and dataset will be made publicly available.
PaperID: 2343,   Poster  https://arxiv.org/pdf/2601.22150    
Authors: Xiaoxiao Sun, Mingyang Li, Kun yuan, Min Woo Sun, Mark Endo, Shengguang Wu, Changlin Li, Yuhui Zhang, Zeyu Wang, Serena Yeung
Title: Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
Abstract: Large Vision-Language Models (VLMs) often answer classic visual illusions "correctly" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change.
PaperID: 2344,   Poster  https://arxiv.org/pdf/2512.03463    
Authors: Shojiro Yamabe, Futa Kai Waseda, Daiki Shiono, Tsubasa Takahashi
Title: Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
Abstract: Recent large vision–language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image–text modality gap. To address this issue, we propose a Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into arbitrary existing LVLM training pipelines at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do so. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.
PaperID: 2345,   Poster  https://arxiv.org/pdf/2603.19766    
Authors: Donghai Fang, Yongheng Li, Zhen WANG, Yuansong Zeng, Wenwen Min
Title: Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images
Abstract: Spatial transcriptomics (ST) enables spot-level in situ expression profiling, but its high cost and limited throughput motivate predicting expression directly from H&E-stained histology. Recent advances explore using score- or flow-based generative models to estimate the conditional distribution of gene expression from histology, offering a flexible alternative to deterministic regression approaches. However, most existing generative approaches omit explicit modeling of gene–gene dependencies, undermining biological coherence. Single-cell foundation models (sc-FMs), pre-trained across diverse cell populations, capture these critical gene relationships that histology alone cannot reveal. Yet, applying expression-only sc-FMs to histology-conditioned expression modeling is nontrivial due to the absence of a visual pathway, a mismatch between their pre-training and conditional ST objectives, and the scarcity of mixed-cell ST supervision. To address these challenges, we propose HINGE (HIstology-coNditioned GEneration), which retrofits a pre-trained sc-FM into a conditional expression generator while mostly preserving its learned gene relationships. We achieve this by introducing SoftAdaLN, a lightweight, identity-initialized modulation that injects layer-wise visual context into the backbone, coupled with an expression-space masked diffusion objective and a warm-start curriculum to ensure objective alignment and training stability. Evaluated on three ST datasets, HINGE outperforms state-of-the-art baselines on mean Pearson correlation and yields more accurate spatial marker expression patterns and higher pairwise co-expression consistency, establishing a practical route to adapt pre-trained sc-FMs for histology-conditioned spatial expression generation.
PaperID: 2346,   Poster  https://arxiv.org/pdf/2602.21637    
Authors: Di Zhang, Zhangpeng Gong, Xiaobo Pang, Jiashuai Liu, Junbo Lu, Hao Cui, Jiusong Ge, Zhi Zeng, Kai Yi, Yinghua Li, Si Liu, Tingsong Yu, Haoran Wang, Mireia Crispin-Ortuzar, Weimiao Yu, Chen Li, Zeyu Gao
Title: CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis
Abstract: Foundation models have recently achieved impressive success in computational pathology, demonstrating strong generalization across diverse histopathology tasks. However, existing models overlook the heterogeneous and non-uniform organization of pathological regions of interest (ROIs) because they rely on natural image backbones not tailored for tissue morphology. Consequently, they often fail to capture the coherent tissue architecture beyond isolated patches, limiting interpretability and clinical relevance. To address these challenges, we present Cross-modal Adaptive Region Encoder (CARE), a foundation model for pathology that automatically partitions WSIs into several morphologically relevant regions. Specifically, CARE employs a two-stage pretraining strategy: (1) a self-supervised unimodal pretraining stage that learns morphological representations from 34,277 whole-slide images (WSIs) without segmentation annotations, and (2) a cross-modal alignment stage that leverages RNA and protein profiles to refine the construction and representation of adaptive regions. This molecular guidance enables CARE to identify biologically relevant patterns and generate irregular yet coherent tissue regions, selecting the most representative area as ROI. CARE supports a broad range of pathology-related tasks, using either the ROI feature or the slide-level feature obtained by aggregating adaptive regions. Based on only one-tenth of the pretraining data typically used by mainstream foundation models, CARE achieves superior average performance across 33 downstream benchmarks, including morphological classification, molecular prediction, and survival analysis, and outperforms other foundation model baselines overall.
PaperID: 2347,   Poster  https://arxiv.org/pdf/2603.19516    
Authors: Yuanzhe Li, Hao Chen, Rui Yin, Juyan Ba, Yu Zhang, Sheng Lu
Title: Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
Abstract: Recent vision language models (VLMs) have shown strong generalization and multimodal reasoning abilities in natural domains. However, their application to medical diagnosis remains limited by the lack of comprehensive and structured datasets that capture real clinical workflows. To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis. Each case in Gastric-X includes paired resting and dynamic CT scans, a set of structured biochemical indicators, expert-authored diagnostic notes, and bounding box annotations of tumor regions, reflecting realistic clinical conditions. We systematically examine the capability of recent VLMs on five core tasks: Visual Question Answering (VQA), report generation, cross-modal retrieval, disease classification, and lesion localization. These tasks simulate critical stages of clinical workflow, from visual understanding and reasoning to multimodal decision support. Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: Can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports? We envision Gastric-X as a step toward aligning machine intelligence with the cognitive and evidential reasoning processes of physicians, and as a resource to inspire the development of next-generation medical VLMs.
PaperID: 2348,   Poster  https://arxiv.org/pdf/2603.01163    
Authors: Jiachen Yang, Xianhui Lin, Yi Dong, Zebiao Zheng, Xing Liu, Hong Gu, Yanmei Fang
Title: BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling
Abstract: Face retouching requires removing subtle imperfections while preserving unique facial identity features, in order to enhance overall aesthetic appeal. However, existing methods suffer from a fundamental tradeoff. Supervised learning on labeled data is constrained to pixel-level label mimicry, failing to capture complex subjective human aesthetic preferences. Conversely, while online reinforcement learning (RL) excels at preference alignment, its stochastic exploration paradigm conflicts with the high-fidelity demands of face retouching and often introduces noticeable noise artifacts due to accumulated stochastic drift. To address these limitations, we propose BeautyGRPO, a reinforcement learning framework that aligns face retouching with human aesthetic preferences. We construct FRPref-10K, a fine-grained preference dataset covering five key retouching dimensions, and train a specialized reward model capable of evaluating subtle perceptual differences. To reconcile exploration and fidelity, we introduce Dynamic Path Guidance (DPG). DPG stabilizes the stochastic sampling trajectory by dynamically computing an anchor-based ODE path and replanning a guided trajectory at each sampling timestep, effectively correcting stochastic drift while maintaining controlled exploration. Extensive experiments show that BeautyGRPO outperforms both specialized face retouching methods and general image editing models, achieving superior texture quality, more accurate blemish removal, and overall results that better align with human aesthetic preferences. Our code will be made publicly available.
PaperID: 2349,   Poster  https://arxiv.org/pdf/2602.03595    
Authors: Haichao Jiang, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu
Title: Refer-Agent: A Collaborative Multi-Agent System for Referring Video Object Segmentation with Reasoning and Reflection
Abstract: Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large-scale supervised fine-tuning (SFT) of Multi-modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability against the rapid evolution of MLLMs. Although recent zero-shot approaches offer a flexible alternative, their performance remains significantly behind SFT-based methods, due to their straightforward workflow designs. To address these limitations, we propose Refer-Agent, a collaborative multi-agent system with alternating reasoning-reflection mechanisms. This system decomposes RVOS into a step-by-step reasoning process. During reasoning, we introduce a Coarse-to-Fine frame selection strategy to ensure frame diversity and textual relevance, along with a Dynamic Focus Layout that adaptively adjusts the agent’s visual focus. Furthermore, we propose a Chain-of-Reflection mechanism, which employs a Questioner-Responder pair to generate a self-reflection chain, enabling the system to verify intermediate results and generate feedback for next-round reasoning refinement. Extensive experiments on five challenging benchmarks demonstrate that Refer-Agent significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches. Moreover, Refer-Agent is flexible and enables fast integration of new MLLMs without any additional fine-tuning costs. Code will be released.
PaperID: 2350,   Poster  https://arxiv.org/pdf/2511.02779    
Authors: Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Chaorui Deng, Shen Yan, Haoqi Fan, Yejin Choi, James Zou, Cihang Xie, Huaxiu Yao, Qinghao Ye
Title: When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
Abstract: We propose MIRA (Multimodal Imagination for Reasoning Assessment), a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional Chain-of-Thought (CoT) methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images, such as sketches, structural diagrams, or path drawings, to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To this end, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone (e.g., tracking a die’s movement on a board and summing the face-down values after each roll). To ensure that our evaluation data is of high quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT (Text-CoT) input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models (MLLMs), including the strongest private models (e.g., GPT-5, o3, Gemini 2.5 Pro) as well as strong open-weight models (e.g., Qwen2.5-VL, GLM 4.5V), perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.
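The pass@k and majority-voting accuracies mentioned in the MIRA abstract can be illustrated with a minimal sketch. It assumes the commonly used unbiased pass@k estimator (n samples drawn per problem, c of them correct); the abstract does not state which estimator the authors use, and the function names below are hypothetical.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generated samples (c of them correct), is correct."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    # comb(n - c, k) counts the k-subsets that contain no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]
```

Under these assumptions, per-problem pass@k values are averaged over the benchmark, while majority voting scores a problem as correct only if the most frequent sampled answer matches the reference.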
PaperID: 2351,   Poster  https://arxiv.org/pdf/2602.22549    
Authors: Zhechao Wang, Yiming Zeng, Lufan Ma, Zeqing Fu, Chen Bai, Dongshuo Yin, Ziyao Lin, Cheng Lu
Title: DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation
Abstract: Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model's sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes. Notably, DrivePTS successfully generates rare scenes where existing methods fail, highlighting its strong generalization ability.
PaperID: 2352,   Poster  https://arxiv.org/pdf/2603.29252    
Authors: Tao Chen, Kun Zhang, Qiong Wu, Xiao Chen, Chao Chang, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji
Title: Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
Abstract: Long video understanding is a key challenge that plagues the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of visual memory mechanism, and propose a novel and training-free approach, termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic human behavior of video watching, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs understand videos of unlimited length, unlike previous methods that process all video information at once and have an upper limit on input length. Concretely, FlexMem first considers the visual KV caches as the memory sources, and realizes effective memory transfer and writing via a dual-pathway compression design. Afterwards, FlexMem also explores different memory reading strategies for diverse video understanding tasks, including the popular streaming one. To validate FlexMem, we apply it to two popular video-MLLMs, and conduct extensive experiments on five long-video tasks and one streaming-video task. The experimental results show that, on a single 3090 GPU, FlexMem achieves clear improvements over existing efficient video understanding methods and processes more than 1k frames, which also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs (e.g., GPT-4o and Gemini-1.5 Pro) on some benchmarks. Our code is provided in the supplementary materials.
PaperID: 2353,   Poster  https://arxiv.org/pdf/2512.00395    
Authors: Jiazhen Liu, Mingkuan Feng, Long Chen
Title: Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction
Abstract: Integrating segmentation into Multimodal Large Language Models (MLLMs) presents a core trilemma: simultaneously preserving dialogue ability, achieving high segmentation performance, and ensuring fast inference. Prevailing paradigms are forced into a compromise. Embedding prediction methods introduce a conflicting pixel-level objective that degrades the MLLM's general dialogue abilities. The alternative, next-token prediction, reframes segmentation as an autoregressive task, which preserves dialogue but forces a trade-off between poor segmentation performance with sparse outputs or prohibitively slow inference with rich ones. We resolve this trilemma with all-mask prediction, a novel paradigm that decouples autoregressive dialogue generation from non-autoregressive mask prediction. We present STAMP: Simultaneous Textual All-Mask Prediction, an MLLM that embodies this paradigm. After generating a textual response, STAMP predicts an entire segmentation mask in a single forward pass by treating it as a parallel "fill-in-the-blank" task over image patches. This design maintains the MLLM's dialogue ability by avoiding conflicting objectives, enables high segmentation performance by leveraging rich, bidirectional spatial context for all mask tokens, and achieves exceptional speed. Extensive experiments show that STAMP significantly outperforms state-of-the-art methods across multiple segmentation benchmarks, providing a solution that excels in dialogue, segmentation, and speed without compromise.
PaperID: 2354,   Poster  https://arxiv.org/pdf/2501.14894    
Authors: Qiaojie Zheng, Jiucai Zhang, Amy Zhang, Xiaoli Zhang
Title: Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration
Abstract: Accurate uncertainty estimation is essential for reliable appearance-based gaze tracking. However, domain shifts between training and testing often lead to incorrect uncertainty estimates, which is a problem overlooked in existing uncertainty-aware gaze tracking models. To overcome this problem efficiently, we formulate uncertainty estimation as a conditional distribution problem and treat the correction process as an output-level conditional distribution matching task. We therefore introduce a data-efficient post-hoc calibration method to align the predicted, high-error conditional distribution with the empirically observed distribution extracted from a small set of calibration samples. To more faithfully assess the accuracy of the resulting uncertainty estimates, we further introduce a new metric, Coverage Probability Error (CPE), to quantify the distribution-level mismatch between prediction and observation. We validate the calibration procedure across four domain shift scenarios to demonstrate improved uncertainty accuracy and its practical benefits.
PaperID: 2355,   Poster  https://arxiv.org/pdf/2603.10360    
Authors: Zhan Fa, Yue Duan, Jian Zhang, Lei Qi, Yinghuan Shi
Title: One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination
Abstract: Current training-free methods tackle MLLM hallucination with separate strategies: either enhancing visual signals or suppressing text inertia. However, these separate methods are insufficient due to critical trade-offs: simply enhancing vision often fails against strong language priors, while suppressing language can introduce extra image-irrelevant noise. Moreover, we find their naive combination is also ineffective, necessitating a unified framework. We propose such a framework by focusing on the core asset: the vision token. Our design leverages two key insights: (1) augmented images offer complementary visual semantics, and (2) removing vision tokens (information-gap) isolates hallucination tendencies more precisely than distorting images (modality-gap). Based on these, our framework uses vision tokens in two distinct ways, both operating on latent representations: our Synergistic Visual Calibration (SVC) module incorporates augmented tokens to strengthen visual representations, while our Causal Representation Calibration (CRC) module uses pruned tokens to create latent-space negative samples for correcting internal model biases. By harmonizing these two roles, our framework effectively restores the vision-language balance, significantly reducing object hallucinations, improving POPE accuracy by an average of 2% absolute on LLaVA-1.5 across multiple benchmarks with only a 1.06x inference latency overhead. Code is available in the supplementary materials.
PaperID: 2356,   Poster  https://arxiv.org/pdf/2503.19740    
Authors: Chengan Che, Chao Wang, Tom Vercauteren, Sophia Tsoka, Luis Carlos Garcia Peraza Herrera
Title: LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings
Abstract: Traditional open-access datasets focusing on surgical procedures are often limited by their small size, typically consisting of fewer than 100 videos and less than 30 hours of footage, which leads to poor model generalization. To address this data limitation, a new dataset called LEMON has been compiled using a novel aggregation pipeline that collects high-resolution videos from online sources. Featuring an extensive collection of over 4K surgical videos totaling 938 hours (85 million frames) of high-quality footage across multiple procedure types, LEMON offers a comprehensive resource surpassing existing alternatives in size and scope, including two novel downstream tasks. To demonstrate the effectiveness of this diverse dataset, we introduce LemonFM, a foundation model pretrained on LEMON using a novel self-supervised augmented knowledge distillation approach. LemonFM consistently outperforms existing surgical foundation models across four downstream tasks and six datasets, achieving significant gains in surgical phase recognition (+9.5pp, +9.4pp, and +8.4pp of Jaccard in AutoLaparo, M2CAI16, and Cholec80), surgical action recognition (+4.4pp of mAP in CholecT50), surgical tool presence detection (+5.3pp and +10.2pp of mAP in Cholec80 and GraSP), and surgical semantic segmentation (+8.3pp of mDice in CholecSeg8k). LEMON and LemonFM will serve as foundational resources for the research community and industry, accelerating progress in developing autonomous robotic surgery systems and ultimately contributing to safer and more accessible surgical care worldwide.
PaperID: 2357,   Poster  https://arxiv.org/pdf/2603.03615    
Authors: Haotian Zhang, Feiyue Long, Yixin Yu, Jian Xue, Haocheng Tang, Tongda Xu, Zhenning Shi, Yan Wang, Siwei Ma, Jiaqi Zhang
Title: Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression
Abstract: Multi-view image compression (MIC) aims to achieve high compression efficiency by exploiting inter-image correlations, playing a crucial role in 3D applications. As a subfield of MIC, distributed multi-view image compression (DMIC) offers performance comparable to MIC while eliminating the need for inter-view information at the encoder side. However, existing methods in DMIC typically treat all images equally, overlooking the varying degrees of correlation between different views during decoding, which leads to suboptimal coding performance. To address this limitation, we propose a novel OmniParallax Attention Mechanism (OPAM), which is a general mechanism for explicitly modeling correlations and aligning features between arbitrary pairs of information sources. Building upon OPAM, we propose a Parallax Multi Information Fusion Module (PMIFM) to adaptively integrate information from different sources. PMIFM is incorporated into both the joint decoder and the entropy module to construct our end-to-end DMIC framework, ParaHydra. Extensive experiments demonstrate that ParaHydra is the first DMIC method to significantly surpass state-of-the-art MIC codecs, while maintaining low computational overhead. Performance gains become more pronounced as the number of input views increases. Compared with LDMIC, ParaHydra achieves bitrate savings of 19.72% on WildTrack(3) and up to 24.18% on WildTrack(6), while significantly improving coding efficiency (as much as 65× in decoding and 34× in encoding).
PaperID: 2358,   Poster  https://arxiv.org/pdf/2603.12845    
Authors: Fei Wang, Xinye Zheng, Kun Li, Yanyan Wei, Yuxin Liu, Ganpeng Hu, Tong Bao, Jingwen Yang
Title: Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation
Abstract: Enzyme kinetic parameters quantify how efficiently an enzyme converts a specific substrate under defined biochemical conditions. Canonical parameters such as the turnover number ($k_\text{cat}$), Michaelis constant ($K_\text{m}$), and inhibition constant ($K_\text{i}$) depend jointly on the enzyme sequence, the substrate chemistry, and the conformational adaptation of the active site during binding. Many learning pipelines simplify this process as a static compatibility problem between enzyme and substrate, fusing their representations through shallow operations and regressing a single value. Such formulations overlook the staged nature of catalysis, which involves both substrate recognition and conformational adaptation. In this regard, we reformulate kinetic prediction as a staged multimodal conditional modeling problem and introduce the Enzyme-Reaction Bridging Adapter (ERBA), which injects cross-modal information via fine-tuning into Protein Language Models (PLMs) while preserving their biochemical priors. ERBA performs conditioning in two stages: Molecular Recognition Cross-Attention (MRCA) first injects substrate information into the enzyme representation to capture specificity; Geometry-aware Mixture-of-Experts (G-MoE) then integrates active-site structure and routes samples to pocket-specialized experts to reflect induced fit. To maintain semantic fidelity, Enzyme-Substrate Distribution Alignment (ESDA) enforces distributional consistency within the PLM manifold in a reproducing kernel Hilbert space. Across three kinetic endpoints and multiple PLM backbones, ERBA delivers consistent gains and stronger out-of-distribution performance compared with sequence-only and shallow-fusion baselines, offering a biologically grounded route to scalable kinetic prediction and a foundation for adding cofactors, mutations, and time-resolved structural cues.
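For readers outside enzymology, the three parameters named in this abstract enter the reaction rate through the standard Michaelis-Menten relations below. This is textbook background rather than anything taken from the paper; $[E]_0$, $[S]$, and $[I]$ denote total enzyme, substrate, and inhibitor concentrations, and the second line assumes a competitive inhibitor.

```latex
\begin{align}
  v &= \frac{k_\text{cat}\,[E]_0\,[S]}{K_\text{m} + [S]}
  && \text{(initial rate; } v \to k_\text{cat}[E]_0 \text{ as } [S] \to \infty\text{)} \\
  K_\text{m}^{\text{app}} &= K_\text{m}\left(1 + \frac{[I]}{K_\text{i}}\right)
  && \text{(competitive inhibition raises the apparent } K_\text{m}\text{)}
\end{align}
```

At $[S] = K_\text{m}$ the rate is half its maximum, and the ratio $k_\text{cat}/K_\text{m}$ is the usual measure of catalytic efficiency at low substrate concentration.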
PaperID: 2359,   Poster  https://arxiv.org/pdf/2512.22170    
Authors: Jiesong Lian, Ruizhe Zhong, Zixiang Zhou, Xiaoyue Mi, Long Hu, Yuan Zhou, Qinglin Lu, Yixue Hao, Junchi Yan
Title: SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
Abstract: Post-training alignment of video generation models with human preferences is a critical goal. Developing effective Reward Models (RMs) for this process faces significant methodological hurdles. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. Concurrently, the architectural design of VLM-based RMs, particularly their output mechanisms, remains underexplored. Furthermore, RMs are susceptible to reward hacking in post-training. To mitigate these limitations, we propose SoliReward, a systematic framework for video RM training. Our framework first sources high-quality, cost-efficient data via single-item binary annotations, then constructs preference pairs using a cross-prompt pairing strategy. Architecturally, we employ a Hierarchical Progressive Query Attention mechanism to enhance feature aggregation. Finally, we introduce a modified BT loss that explicitly accommodates win-tie scenarios. This approach regularizes the RM's score distribution for positive samples, providing more nuanced preference signals to alleviate over-focus on a small number of top-scoring samples. Our approach is validated on benchmarks evaluating physical plausibility, subject deformity, and semantic alignment, demonstrating improvements in direct RM evaluation metrics and in the efficacy of post-training on video generation models. Code and benchmark will be publicly available.
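As background for the BT (Bradley-Terry) loss mentioned in this abstract, the sketch below shows the standard pairwise form together with one simple way to score a tie. The tie term is an illustrative assumption (a 50/50 soft label) and is not the modified loss proposed in the paper, whose exact form the abstract does not give; both function names are hypothetical.

```python
import math


def bt_loss(score_win: float, score_lose: float) -> float:
    """Standard Bradley-Terry negative log-likelihood:
    -log(sigmoid(score_win - score_lose)), computed stably."""
    margin = score_win - score_lose
    # softplus(-margin) rewritten so exp() never receives a large positive value
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


def bt_tie_loss(score_a: float, score_b: float) -> float:
    """Illustrative tie handling (an assumption, not the paper's loss):
    cross-entropy against a 0.5/0.5 preference label, which is
    smallest when the two scores are equal."""
    return 0.5 * bt_loss(score_a, score_b) + 0.5 * bt_loss(score_b, score_a)
```

With equal scores both functions return log 2; the win loss then falls as the preferred sample's score rises, whereas the tie loss grows with any score gap.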
PaperID: 2360,   Poster  https://arxiv.org/pdf/2603.12998    
Authors: Tangzheng Lian, Guanyu Hu, Yijing Ren, Dimitrios Kollias, Oya Celiktutan
Title: A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks
Abstract: While Vision-Language Models (VLMs) have achieved remarkable performance across diverse downstream tasks, recent studies have shown that they can inherit social biases from the training data and further propagate them into downstream applications. To address this issue, various debiasing approaches have been proposed, yet most of them aim to improve fairness without having a theoretical guarantee that the utility of the model is preserved. In this paper, we introduce a debiasing method that yields a closed-form solution in the cross-modal space, achieving Pareto-optimal fairness with bounded utility losses. Our method is training-free, requires no annotated data, and can jointly debias both visual and textual modalities across downstream tasks. Extensive experiments show that our method outperforms existing methods in debiasing VLMs across diverse fairness metrics and datasets for both group and intersectional fairness in downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation while preserving task performance. Code will be made available upon acceptance.
PaperID: 2361,   Poster  https://arxiv.org/pdf/2602.24136    
Authors: Haoran Wang, Guoxi Huang, Fan Zhang, David Bull, Nantheera Anantrasirichai
Title: Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives
Abstract: Recent significant advances in 3D scene representation have been driven by 3D Gaussian Splatting (3DGS), which has enabled real-time rendering with photorealistic quality. However, 3DGS often requires a large number of primitives to achieve high fidelity, leading to redundant representations and high resource consumption, thereby limiting its scalability for complex or large-scale scenes. Consequently, effective pruning strategies and more expressive primitives that can reduce redundancy while preserving visual quality are crucial for practical deployment. We propose an efficient, integrated reconstruction-aware pruning strategy that adaptively determines pruning timing and refining intervals based on reconstruction quality, thus reducing model size while enhancing rendering quality. Moreover, we introduce a 3D Difference-of-Gaussians primitive that jointly models both positive and negative densities in a single primitive, improving the expressiveness of Gaussians under compact configurations. Our method significantly improves model compactness, achieving up to 90% reduction in Gaussian count while delivering visual quality that is similar to, or in some cases better than, that produced by state-of-the-art methods. Code will be made publicly available.
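The idea of a single primitive carrying both positive and negative density can be illustrated with a one-dimensional radial difference of two Gaussians. This is only a toy sketch: the weights, widths, and the function name are hypothetical, and the paper's actual 3D primitive and its parameterization are not described in the abstract.

```python
import math


def dog_density(r: float, w_pos: float = 1.0, sigma_pos: float = 1.0,
                w_neg: float = 0.5, sigma_neg: float = 2.0) -> float:
    """Difference of two isotropic Gaussians sharing one centre,
    evaluated at distance r from that centre."""
    positive_lobe = w_pos * math.exp(-r * r / (2.0 * sigma_pos ** 2))
    negative_lobe = w_neg * math.exp(-r * r / (2.0 * sigma_neg ** 2))
    return positive_lobe - negative_lobe
```

With these default values the density is positive at the centre and turns negative further out, where the wider negative lobe dominates; a single ordinary Gaussian cannot produce that sign change.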
PaperID: 2362,   Poster  https://arxiv.org/pdf/2603.05970    
Authors: Jingtao Ye, Zhang Kexin, Xunchi Ma, Johann Li, Guangming Zhu, Peiyi Shen, Linhua Jiang, Xiangdong Zhang, Liang Zhang
Title: Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions
Abstract: The rapid movements and agile maneuvers of unmanned aerial vehicles (UAVs) induce significant observational challenges for multi-object tracking (MOT). However, existing UAV-perspective MOT benchmarks often lack these complexities, featuring predominantly predictable camera dynamics and linear motion patterns. To address this gap, we introduce DynUAV, a new benchmark for dynamic UAV-perspective MOT, characterized by intense ego-motion and the resulting complex apparent trajectories. The benchmark comprises 42 video sequences with over 1.7 million bounding box annotations, covering vehicles, pedestrians, and specialized industrial categories such as excavators, bulldozers and cranes. Compared to existing benchmarks, DynUAV introduces substantial challenges arising from ego-motion, including drastic scale changes and viewpoint changes, as well as motion blur. Comprehensive evaluations of state-of-the-art trackers on DynUAV reveal their limitations, particularly in managing the intertwined challenges of detection and association under such dynamic conditions, thereby establishing DynUAV as a rigorous benchmark. We anticipate that DynUAV will serve as a demanding testbed to spur progress in real-world UAV-perspective MOT, and we will make all resources available at [link].
PaperID: 2363,   Poster  https://arxiv.org/pdf/2601.03054    
Authors: Yankai Jiang, Qiaoru Li, BinLu Xu, Haoran Sun, Chao Ding, Junting Dong, Yuxiang Cai, Xuhong Zhang, Jianwei Yin
Title: IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
Abstract: Recent research on medical MLLMs has gradually shifted its focus from image-level understanding to fine-grained, pixel-level comprehension. Although segmentation serves as the foundation for pixel-level understanding, existing approaches face two major challenges. First, they introduce implicit segmentation tokens and require simultaneous fine-tuning of both the MLLM and external pixel decoders, which increases the risk of catastrophic forgetting and limits generalization to out-of-domain scenarios. Second, most methods rely on single-pass reasoning and lack the capability to iteratively refine segmentation results, leading to suboptimal performance. To overcome these limitations, we propose IBISAgent, a novel agentic MLLM that reformulates segmentation as a vision-centric, multi-step decision-making process. IBISAgent enables MLLMs to generate interleaved reasoning and text-based click actions, invoke segmentation tools, and produce high-quality masks without architectural modifications. By iteratively performing multi-step visual reasoning on masked image features, IBISAgent naturally supports mask refinement and promotes the development of pixel-level visual reasoning capabilities. We further design a two-stage training framework consisting of cold-start supervised fine-tuning and agentic reinforcement learning with tailored, fine-grained rewards, enhancing the model’s robustness in complex medical referring and reasoning segmentation tasks. Extensive experiments demonstrate that IBISAgent consistently outperforms both closed-source and open-source SOTA methods. All datasets, code, and trained models will be released publicly.
PaperID: 2364,   Poster  https://arxiv.org/pdf/2511.18380    
Authors: Timing Yang, Feng Wang, Guoyizhe Wei
Title: RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models
Abstract: Mamba, originally introduced for language modeling, has recently garnered attention as an effective backbone for vision tasks. However, its underlying mechanism in visual domains remains poorly understood. In this work, we systematically investigate Mamba’s representational properties and make three primary contributions. First, we theoretically analyze Mamba’s relationship to Softmax and Linear Attention, confirming that it can be viewed as a low-rank approximation of Softmax Attention and thereby bridging the representational gap between Softmax and Linear forms. Second, we introduce a novel binary segmentation metric for activation map evaluation, extending qualitative assessments to a quantitative measure that demonstrates Mamba’s capacity to model long-range dependencies. Third, by leveraging DINO for self-supervised pretraining, we obtain clearer activation maps than those produced by standard supervised approaches, highlighting Mamba’s potential for interpretability. Notably, our model also achieves a 78.5% linear probing accuracy on ImageNet, underscoring its strong performance. We hope this work can provide valuable insights for future investigations of Mamba-based vision architectures.
PaperID: 2365,   Poster  https://arxiv.org/pdf/2512.20340    
Authors: Qingdong He, Xueqin Chen, Yanjie Pan, Peng Tang, Pengcheng Xu, Zhenye Gan, Chengjie Wang, Xiaobin Hu, Jiangning Zhang, Yabiao Wang
Title: The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
Abstract: Although diffusion transformer (DiT)-based video virtual try-on (VVT) has made significant progress in synthesizing realistic videos, existing methods still struggle to capture fine-grained garment dynamics and preserve background integrity across video frames. They also incur high computational costs due to additional interaction modules introduced into DiTs, while the limited scale and quality of existing public datasets also restrict model generalization and effective training. To address these challenges, we propose a novel framework, KeyTailor, along with a large-scale, high-definition dataset, ViT-HD. The core idea of KeyTailor is a keyframe-driven details injection strategy, motivated by the fact that keyframes inherently contain both foreground dynamics and background consistency. Specifically, KeyTailor adopts an instruction-guided keyframe sampling strategy to filter informative frames from the input video. Subsequently, two tailored keyframe-driven modules (the garment details enhancement module and the collaborative background optimization module) are employed to distill garment dynamics into garment-related latents and to optimize the integrity of background latents, both guided by keyframes. These enriched details are then injected into standard DiT blocks together with pose, mask, and noise latents, enabling efficient and realistic try-on video synthesis. This design ensures consistency without explicitly modifying the DiT architecture, while simultaneously avoiding additional complexity. In addition, our dataset ViT-HD comprises 15,070 high-quality video samples at a resolution of 810 × 1080, covering diverse garments. Extensive experiments demonstrate that KeyTailor outperforms state-of-the-art baselines in terms of garment fidelity and background integrity across both dynamic and static scenarios. The dataset and code will be publicly released.
PaperID: 2366,   Poster  https://arxiv.org/pdf/2603.04598    
Authors: Rohan Mahadev, Joyce Yuan, Patrick Poirson, David Xue, Hao-Yu Wu, Dmitry Kislyuk
Title: PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing
Abstract: Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers and lack the annotations needed to evaluate false positive avoidance, robustness, and multi-image reasoning. We present PinPoint, a comprehensive real-world benchmark with 7,846 queries and 329K relevance judgments across 23 query categories. PinPoint advances the field by providing: (1) multiple correct answers (averaging 9.1 per query), (2) explicit hard negatives, (3) six instruction paraphrases per query for robustness testing, (4) multi-image composition support (13.4% of queries), and (5) demographic metadata for fairness evaluation. Based on our analysis of 20+ methods across 4 major paradigms, we uncover three significant drawbacks. The best methods, while achieving an mAP@10 of 28.5%, still retrieve irrelevant results (hard negatives) 9% of the time. The best models also exhibit 25.1% performance variation across paraphrases, indicating significant potential for enhancing current CIR techniques. Multi-image queries perform 40 to 70% worse across different methods. To overcome these new issues uncovered by our evaluation framework, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to bridge the gap. We release the complete dataset, including all images, queries, annotations, retrieval index, and benchmarking code.
PaperID: 2367,   Poster  https://arxiv.org/pdf/2511.20256    
Authors: Weijia Mao, Hao Chen, Zhenheng Yang, Mike Zheng Shou
Title: The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
Abstract: A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pretrained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models will be released.
PaperID: 2368,   Poster  https://arxiv.org/pdf/2604.09227    
Authors: Wongi Jeong, Hoigi Seo, Se Young Chun
Title: Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models
Abstract: Image generative models have become indispensable tools to yield exquisite high-resolution (HR) images for everyone, ranging from general users to professional designers. However, a desired outcome often requires generating a large number of HR images with different prompts and seeds, resulting in high computational cost for both users and service providers. Generating low-resolution (LR) images first could alleviate computational burden, but it is not straightforward how to generate LR images that are perceptually consistent with their HR counterparts. Here, we consider the task of generating high-fidelity LR images, called Previews, that preserve perceptual similarity to their HR counterparts for an efficient workflow, allowing users to identify promising candidates before generating the final HR image. We propose the commutator-zero condition to ensure the LR-HR perceptual consistency for flow matching models, leading to the proposed training-free solution with downsampling matrix selection and commutator-zero guidance. Extensive experiments show that our method can generate LR images with up to 33% computation reduction while maintaining HR perceptual consistency. When combined with existing acceleration techniques, our method achieves up to 3× speedup. Moreover, our formulation can be extended to image manipulations, such as warping and translation, demonstrating its generalizability.
PaperID: 2369,   Poster  https://arxiv.org/pdf/2603.22915    
Authors: Yihuan Huang, Jun Xue, Liu Jiajun, Daixian Li, Tong Zhang, Zhuolin Yi, Yanzhen Ren, Kai Li
Title: When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse
Abstract: Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct MLD-VC, the first multimodal dataset tailored for VC, comprising 31 speakers, 22.79 hours of audio-visual data, and explicit use of the Lombard effect to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift, which alters the first and second formants of audio. Interestingly, we find that the distribution shift induced by the Lombard effect closely resembles that introduced by speech enhancement, which explains why models trained on Lombard data exhibit greater robustness in VC. Fine-tuning AVSR models on MLD-VC mitigates this issue, achieving an average 17.5% reduction in CER across several VC platforms. Our findings and dataset provide a foundation for developing more robust and generalizable AVSR systems in real-world video conferencing.
PaperID: 2370,   Poster  https://arxiv.org/pdf/2603.26777    
Authors: Renbo Tu, Ali SaraerToosi, Nicholas Conroy, Gennady Pekhimenko, Aviad Levis
Title: BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting
Abstract: The Event Horizon Telescope (EHT) delivered the first image of a black hole by capturing the light from its surrounding accretion flow, revealing structure but not dynamics. Simulations of black hole accretion dynamics are essential for interpreting EHT images, though they are costly to generate and impractical for inference, as exploring many physical configurations remains computationally intractable. Consequently, EHT analyses often resort to comparing observations with libraries of precomputed models. Motivated by this bottleneck, BHCast presents a framework for forecasting black hole plasma dynamics from a single, blurry image, such as those captured by the EHT. At its core, BHCast is a neural model that transforms a static image into forecasted future frames, revealing the underlying dynamics hidden within one snapshot. With a multiscale pyramid loss, we demonstrate how autoregressive prediction can simultaneously super-resolve and evolve a blurry frame into a coherent, high-resolution movie that remains stable over long time horizons. By forecasting dynamics as a first step, we can then extract interpretable spatio-temporal features, such as pattern speed (rotation rate) and pitch angle. This two-step approach makes BHCast more versatile and interpretable than direct inference of such features. Finally, BHCast uses gradient-boosting trees to recover black hole properties from these plasma features, including the spin and viewing inclination angle. We demonstrate the effectiveness of BHCast on simulations of two distinct black hole accretion systems, Sagittarius A* and M87*, by testing on simulated frames blurred to EHT resolution. In addition, we show an application of our forecaster on real EHT images of M87*.
PaperID: 2371,   Poster  https://arxiv.org/pdf/2604.17087    
Authors: Jiafei Song, Fengwei Zhou, Jin Qu, Wenjin (Jason) Li, Tong Wu, Gengjian Xue, Zhikang Zhao, Daomin Wei, Yichao Lu, Bailin Na
Title: EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
Abstract: Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language understanding tasks, yet their inference efficiency is often hampered by the large number of visual tokens, particularly in high-resolution or multi-image scenarios. To address this issue, we propose EvoComp, a visual token compression framework that significantly reduces token count while preserving task accuracy. EvoComp introduces a lightweight encoder-only transformer-based compressor that selects the most informative and non-redundant visual tokens by jointly considering visual and textual contexts. A core challenge lies in providing effective supervision for training the compressor. To this end, we design an evolutionary labeling strategy that searches for token subsets minimizing the MLLM's output loss, while enforcing semantic diversity through vocabulary-based token grouping. We further train the compressor using a tailored loss function combining the GHM loss to mitigate class and difficulty imbalance, and a cosine similarity regularization to encourage semantic separation between retained and discarded tokens. Extensive experiments across multiple vision-language benchmarks show that EvoComp outperforms existing methods based on attention or similarity heuristics. Notably, it retains 99.3% of the original accuracy under 3x token compression and delivers up to 1.6x speedup on mobile devices.
PaperID: 2372,   Poster  https://arxiv.org/pdf/2603.02629    
Authors: Kaifang Long, Lianbo Ma, Jiaqi Liu, liming liu, Guoyang Xie
Title: Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective
Abstract: The quest for incremental unified multimodal anomaly detection seeks to empower a single model with the ability to systematically detect anomalies across all categories and support incremental learning to accommodate emerging objects/categories. Central to this pursuit is resolving the catastrophic forgetting dilemma, which involves acquiring new knowledge while preserving previously learned knowledge. Despite some efforts to address this dilemma, a key oversight persists: ignoring the potential impact of spurious and redundant features on catastrophic forgetting. In this paper, we delve into the negative effect of spurious and redundant features on this dilemma in incremental unified frameworks, and reveal that under similar conditions, the multimodal framework developed by naive aggregation of unimodal architectures is more prone to forgetting. To address this issue, we introduce a novel denoising framework called IB-IUMAD, which exploits the complementary benefits of the Mamba decoder and information bottleneck fusion module: the former is dedicated to disentangling inter-object feature coupling, preventing spurious feature interference between objects; the latter filters out redundant features from the fused features, thus explicitly preserving discriminative information. A series of theoretical analyses and experiments on MVTec 3D-AD and Eyecandies datasets demonstrates the effectiveness and competitive performance of IB-IUMAD. Code will be released.
PaperID: 2373,   Poster  https://arxiv.org/pdf/2512.02664    
Authors: Derui Shan, Qian Qiao, Hao Lu, Tao Du, Peng Lu
Title: PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes
Abstract: Polarization-aware Neural Radiance Fields (NeRF) enable novel view synthesis of specular-reflection scenes but face challenges in slow training, inefficient rendering, and strong dependencies on material/viewpoint assumptions. In contrast, 3D Gaussian Splatting (3DGS) enables real-time rendering yet struggles with accurate reflection reconstruction because of reflection-geometry entanglement, and adding a deferred reflection module introduces environment map dependence. We address these limitations by proposing PolarGuide-GSDR, a polarization-forward-guided paradigm establishing a bidirectional coupling mechanism between polarization and 3DGS: first, 3DGS’s geometric priors are leveraged to resolve polarization ambiguity, and then the refined polarization cues are used to guide 3DGS’s normal and spherical harmonic representation. This process achieves high-fidelity reflection separation and full-scene reconstruction without requiring environment maps or restrictive material assumptions. We demonstrate on public and self-collected datasets that PolarGuide-GSDR achieves state-of-the-art performance in specular reconstruction, normal estimation, and novel view synthesis, all while maintaining real-time rendering capabilities. To our knowledge, this is the first framework embedding polarization priors directly into 3DGS optimization, yielding superior interpretability and real-time performance for modeling complex reflective scenes.
PaperID: 2374,   Poster  https://arxiv.org/pdf/2511.19859    
Authors: Xiangkai Ma, Lekai Xing, Han Zhang, Wenzhong Li, Sanglu Lu
Title: Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation
Abstract: Vision-Language-Action (VLA) models built upon Chain-of-Thought (CoT) have achieved remarkable success in advancing general-purpose robotic agents, owing to their strong perceptual comprehension. Recently, since text-only CoT struggles to adequately capture scene details in complex spatial environments, a highly promising strategy involves leveraging visual priors to guide robotic action generation. Nevertheless, these strategies face two inherent challenges: (i) a modality gap between visual observations and low-level actions, and (ii) unstable training due to competing objectives between visual prediction and action generation. To address these challenges, we propose a Vision-Integrated Trajectory Alignment (VITA) framework that learns a shared discrete latent space for vision and action, enabling joint modeling of perception and motor control. VITA introduces an implicit visual CoT: autoregressively generated tokens are simultaneously decoded into future-frame predictions and robot actions, thereby internalizing visual dynamics as an inductive bias for motion planning. Extensive experiments on simulated and real-world environments demonstrate state-of-the-art performance. VITA improves by 14.5%, 9.6%, and 12.1% over existing baselines on CALVIN, LIBERO, and SimplerEnv, respectively. Furthermore, VITA attains an average success rate of 80.5% across six real-world tasks, demonstrating its potential as a generalist robotic manipulation model.
PaperID: 2375,   Poster  https://arxiv.org/pdf/2510.21590    
Authors: Minxing Luo, Linlong Fan, Qiushi Wang, Ge Wu, Yiyan Luo, Yuhang Yu, Jinwei Chen, Yaxing Wang, Qingnan Fan, Jian Yang
Title: Restore Text First, Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance
Abstract: Current image super-resolution methods show strong performance on natural images but distort text, creating a fundamental trade-off between image quality and textual readability. To address this, we introduce TIGER (Text–Image Guided supEr-Resolution), a novel two-stage framework that breaks this trade-off through a "text-first, image-later" paradigm. TIGER explicitly decouples glyph restoration from image enhancement: it first reconstructs precise text structures and uses them to guide full-image super-resolution. This ensures high fidelity and readability. To support comprehensive training and evaluation, we present the UZ-ST (UltraZoom-ST) dataset, the first Chinese scene text dataset with extreme zoom. Extensive experiments show TIGER achieves state-of-the-art performance, enhancing readability and image quality.
PaperID: 2376,   Poster  https://arxiv.org/pdf/2512.00818    
Authors: Haozhen Gong, Xiaozhong Ji, Yuansen Liu, Wenbin Wu, Xiaoxiao Yan, jingjing liu, Kai WU, Jiazhen Pan, Bailiang Jian, Jiangning Zhang, Xiaobin Hu, Hongwei Li
Title: Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning
Abstract: MLLMs are beginning to appear in clinical workflows, but their ability to perform complex medical reasoning remains unclear. We present Med-CMR, a fine-grained Medical Complex Multimodal Reasoning benchmark. Med-CMR is distinguished from existing counterparts by three core features: 1) Systematic capability decomposition, splitting medical multimodal reasoning into fine-grained visual understanding and multi-step reasoning to enable targeted evaluation; 2) Challenging task design, with visual understanding across three key dimensions (small-object detection, fine-detail discrimination, spatial understanding) and reasoning covering four clinically relevant scenarios (temporal prediction, causal reasoning, long-tail generalization, multi-source integration); 3) Broad, high-quality data coverage, comprising 20,653 Visual Question Answering (VQA) pairs spanning 11 organ systems and 12 imaging modalities, validated via a rigorous two-stage (human expert + model-assisted) review to ensure clinical authenticity. We evaluate 18 state-of-the-art MLLMs with Med-CMR, revealing GPT-5 as the top-performing commercial model: 57.81 accuracy on multiple-choice questions (MCQs) and a 48.70 open-ended score, outperforming Gemini 2.5 Pro (49.87 MCQ accuracy, 45.98 open-ended score) and leading open-source model Qwen3-VL-235B-A22B (49.34 MCQ accuracy, 42.62 open-ended score). However, specialized medical MLLMs do not reliably outperform strong general models, and long-tail generalization emerges as the dominant failure mode. Med-CMR thus provides a stress test for visual–reasoning integration and rare-case robustness in medical MLLMs, and a rigorous yardstick for future clinical systems.
PaperID: 2377,   Poster  https://arxiv.org/pdf/2602.21497    
Authors: Yongchang Zhang, Xianzheng Ma, Tianyi Liu, Guangquan Zhou, Yang Chen
Title: See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs
Abstract: Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps, even if logically valid, can still lead to incorrect final answers. Existing solutions attempt to mitigate this issue by training models to “think with images” via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. In contrast, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model’s reasoning generation. When existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer. Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. Our method achieves 16.5%–29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without any additional training.
PaperID: 2378,   Poster  https://arxiv.org/pdf/2512.02790    
Authors: Keming Ye, Zhipeng Huang, Canmiao Fu, Qingyang Liu, Jiani Cai, Zheqi Lv, Chen Li, Jing LYU, Zhou Zhao, Shengyu Zhang
Title: UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits
Abstract: With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in image editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research. The dataset, benchmark, and code will be released.
PaperID: 2379,   Poster  https://arxiv.org/pdf/2603.27259    
Authors: Seng Chen, Hao Chen, Chenglam Ho, Xinyu Mao, Jinping Wang, Yu Zhang, Chao Li
Title: Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
Abstract: Long video understanding (LVU) remains a core challenge in multimodal learning. Although recent vision–language models (VLMs) have made notable progress, existing benchmarks mainly focus on either fine-grained perception or coarse summarization, offering limited insight into temporal understanding over long contexts. In this work, we define a scene as a coherent segment of a video in which both visual and semantic contexts remain consistent, aligning with human perception. This leads us to a key question: Can current VLMs reason effectively over long, scene-level contexts? To answer this, we introduce a new benchmark, SceneBench, designed to provide scene-level challenges. Our evaluation reveals a sharp drop in accuracy when VLMs attempt to answer scene-level questions, indicating significant forgetting of long-range context. To further validate these findings, we propose Scene Retrieval-Augmented Generation (Scene-RAG), which constructs a dynamic scene memory by retrieving and integrating relevant context across scenes. Scene-RAG improves VLM performance by +7.11%, confirming that current models still struggle with long-context retention. We hope SceneBench will encourage future research toward VLMs with more robust, human-like video comprehension.
PaperID: 2380,   Poster  https://arxiv.org/pdf/2512.14423    
Authors: Zhuo Chen, Fanyue Wei, Runze Xu, Jingjing Li, Lixin Duan, Angela Yao, Wen Li
Title: The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy
Abstract: Training-free image editing with large diffusion models has become practical, yet faithfully performing complex non-rigid edits (e.g., pose or shape changes) remains highly challenging. We identify a key underlying cause: attention collapse in existing attention sharing mechanisms, where either positional embeddings or semantic features dominate visual content retrieval, leading to over-editing or under-editing. To address this issue, we introduce SynPS, a method that Synergistically leverages Positional embeddings and Semantic information for faithful non-rigid image editing. We first propose an editing measurement that quantifies the required editing magnitude at each denoising step. Based on this measurement, we design an attention synergy pipeline that dynamically modulates the influence of positional embeddings, enabling SynPS to balance semantic modifications and fidelity preservation. By adaptively integrating positional and semantic cues, SynPS effectively avoids both over- and under-editing. Extensive experiments on public and newly curated benchmarks demonstrate the superior performance and faithfulness of our approach. Our code will be publicly released.
PaperID: 2381,   Poster  https://arxiv.org/pdf/2603.23345    
Authors: Yujie Sun, Zhuoqiang CAI, Chaoyue Niu, Jianchuan Chen, Zhiwen Chen, Chengfei Lv, Fan Wu
Title: FHAvatar: Fast and High-Fidelity Reconstruction of Face-and-Hair Composable 3D Head Avatar from Few Casual Captures
Abstract: We present FHAvatar, a novel framework for reconstructing 3D Gaussian avatars with composable face and hair components from an arbitrary number of views. Unlike previous approaches that couple facial and hair representations within a unified modeling process, we explicitly decouple the two components in texture space by representing the face with planar Gaussians and the hair with strand-based Gaussians. To overcome the limitations of existing methods that rely on dense multi-view captures or costly per-identity optimization, we propose an aggregated transformer backbone to learn geometry-aware cross-view priors and head-hair structural coherence from multi-view datasets, enabling effective and efficient feature extraction and fusion from few casual captures. Extensive quantitative and qualitative experiments demonstrate that FHAvatar achieves state-of-the-art reconstruction quality from only a few observations of new identities within minutes, while supporting real-time animation, convenient hairstyle transfer, and stylized editing, broadening the accessibility and applicability of digital avatar creation.
PaperID: 2382,   Poster  https://arxiv.org/pdf/2601.08151    
Authors: Shezheng Song, Shasha Li, Shan Zhao, Xiaopeng Li, Qian Wan, Chengyu Wang, Tianwei Yan, Ma Jun, Jie Yu
Title: Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision–language understanding, yet how they internally integrate visual and textual information remains poorly understood. To bridge this gap, we perform a systematic layer-wise masking analysis across multiple architectures, revealing how visual–text fusion evolves within MLLMs. The results show that fusion emerges at several specific layers rather than being uniformly distributed across the network, and certain models exhibit a late-stage “review” phenomenon where visual signals are reactivated before output generation. We further analyze layer-wise attention evolution and observe persistent high-attention noise on irrelevant regions, along with gradually increasing attention on text-aligned areas. Guided by these insights, we introduce a training-free contrastive attention framework that models the transformation between early fusion and final layers to highlight meaningful attention shifts. Extensive experiments across various MLLMs and benchmarks validate our analysis and demonstrate that the proposed approach improves multimodal reasoning performance. Code will be released.
PaperID: 2383,   Poster  https://arxiv.org/pdf/2603.22821    
Authors: Zhiceng Shi, Changmiao Wang, Jun Wan, Wenwen Min
Title: Cross-Slice Knowledge Transfer via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference
Abstract: While spatial transcriptomics (ST) has advanced understanding of gene expression within tissue context, its high experimental cost limits large-scale application. Predicting ST from pathology images offers a promising, cost-effective alternative, yet existing methods often struggle to capture the complex spatial relationships across slides. To address this challenge, we propose SpaHGC, a multi-modal heterogeneous graph-based model that captures both intra-slice and inter-slice spot-spot relationships from histology images. It integrates local spatial context within the target slide and cross-slide similarities computed from image embeddings extracted by a pathology foundation model. These embeddings enable inter-slice knowledge transfer across slides. Additionally, SpaHGC incorporates Masked Graph Contrastive Learning to enhance feature representation and transfer spatial gene expression knowledge from reference to target slides, enabling it to model complex spatial dependencies and significantly improve prediction accuracy. We conducted comprehensive benchmarking on seven matched histology-ST datasets from different platforms, tissues, and cancer subtypes. The results demonstrate that SpaHGC significantly outperforms nine existing state-of-the-art methods across all evaluation metrics. Moreover, the model’s predicted ST profiles closely align with the ground truth data and accurately correspond to tumor regions. Additionally, the predictions are significantly enriched in multiple cancer-related pathways, thereby highlighting its strong biological relevance and application potential. Code availability and reproducibility details are in the Supplementary Materials.
PaperID: 2384,   Poster  https://arxiv.org/pdf/2601.15475    
Authors: Yunshan Qi, Lin Zhu, Nan Bao, Yifan Zhao, Jia Li
Title: Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events
Abstract: Novel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We utilize NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A pixel-wise RGB mapping field is introduced to align the above NeRF-rendered HDR pixel values with the sensor-recorded LDR pixel values of the input images. A novel event mapping field is also designed to bridge the physical scene dynamics and actual event sensor output. The two mapping fields are jointly optimized with the NeRF network, leveraging the spatial and temporal dynamic information in events to enhance the sharp HDR 3D representation learning. Experiments on the collected and public datasets demonstrate that our method can achieve state-of-the-art deblurring HDR novel view synthesis results from single-exposure blurry LDR images and corresponding events.
PaperID: 2385,   Poster  https://arxiv.org/pdf/2602.24059    
Authors: Chenwei Jia, Baoting Li, Xuchong Zhang, Mingzhuo Wei, Bochen Lin, Hongbin Sun
Title: Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization
Abstract: Post-Training Quantization (PTQ) has emerged as an effective technique for alleviating the substantial computational and memory overheads of Vision-Language Models (VLMs) by compressing both weights and activations without retraining the full model. Existing PTQ methods primarily rely on static identification and global compensation of sensitive or outlier channels, yet they often overlook the distributional differences of these important channels across inputs, leading to unsatisfactory quantization. In this work, we observe that the distributions and occurrence frequencies of important channels vary significantly both across modalities and among tokens, even within the same modality. Accordingly, we propose Quant Experts (QE), a token-aware adaptive error compensation method with mixture-of-experts for VLM quantization. QE divides the important channels into token-independent and token-dependent groups. For the former, a shared expert is designed for most tokens to compensate for global quantization error using a low-rank adapter. For the latter, routed experts comprising multiple routed low-rank adapters are devised to compensate for local quantization error related to specific tokens. Extensive experiments demonstrate that QE consistently enhances task accuracy across various quantization settings and model scales, ranging from 2B to 70B parameters, while maintaining performance comparable to full-precision models.
PaperID: 2386,   Poster  https://arxiv.org/pdf/2512.00539    
Authors: Yongkang Hu, Yu Cheng, YuShuo Zhang, Yuan Xie, Zhaoxia Yin
Title: SAIDO: Generalizable Detection of AI-Generated Images via Scene-Aware and Importance-Guided Dynamic Optimization in Continual Learning
Abstract: The widespread misuse of image generation technologies has raised security concerns, driving the development of AI-generated image detection methods. However, generalization has become a key challenge and open problem: existing approaches struggle to adapt to emerging generative methods and content types in real-world scenarios. To address this issue, we propose a Scene-Aware and Importance-Guided Dynamic Optimization detection framework with continual learning (SAIDO). Specifically, we design a Scene-Awareness-Based Expert Module (SAEM) that dynamically identifies and incorporates new scenes using VLLMs. For each scene, independent expert modules are dynamically allocated, enabling the framework to better capture scene-specific forgery features and enhance cross-scene generalization. To mitigate catastrophic forgetting when learning from multiple image generative methods, we introduce an Importance-Guided Dynamic Optimization Mechanism (IDOM), which optimizes each neuron through an importance-guided gradient projection strategy, thereby achieving an effective balance between model plasticity and stability. Extensive experiments on continual learning tasks demonstrate that our method outperforms the current SOTA method in both stability and plasticity, achieving 44.22% and 40.57% relative reductions in average detection error rate and forgetting rate, respectively. On open-world datasets, it improves the average detection accuracy by 9.47% compared to the current SOTA method.
PaperID: 2387,   Poster  https://arxiv.org/pdf/2603.24079    
Authors: Ye Leng, Junjie Chu, Mingjie Li, Chenhao Lin, Chao Shen, Michael Backes, Yun Shen, Yang Zhang
Title: When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm
Abstract: Recently, multimodal large language models (MLLMs) have emerged as a unified paradigm for language and image generation. Our work shows that MLLMs pair usability with higher risks, highlighting the need for adaptive safeguards to mitigate real-world harms. Compared with diffusion models, MLLMs possess a much stronger capability for semantic understanding, enabling them to process more complex textual inputs and comprehend richer contextual meanings. However, this enhanced semantic ability may also introduce new and potentially greater safety risks. Taking diffusion models as a reference point, we systematically analyze and compare the safety risks of emerging MLLMs along two dimensions: unsafe content generation and fake image synthesis. Across multiple unsafe generation benchmark datasets, we observe that MLLMs tend to generate more unsafe images than diffusion models. This difference partly arises because diffusion models often fail to interpret abstract prompts, producing corrupted outputs, whereas MLLMs can comprehend these prompts and generate unsafe content. For current advanced fake image detectors, MLLM-generated images are also notably harder to identify. Even when detectors are retrained with MLLM-specific data, they can still be bypassed by simply providing MLLMs with longer and more descriptive inputs. Our measurements indicate that the emerging safety risks of the cutting-edge generative paradigm, MLLMs, have not been sufficiently recognized, posing new challenges to real-world safety.
PaperID: 2388,   Poster  https://arxiv.org/pdf/2602.22159    
Authors: Wenhao Guo, Zhaoran Zhao, Peng Lu, Sheng Li, Qian Qiao, Derui LI
Title: CASR: A Robust Cyclic Framework for Arbitrary Large-Scale Super-Resolution with Distribution Alignment and Self-Similarity Awareness
Abstract: Arbitrary-Scale SR (ASISR) remains fundamentally limited by cross-scale distribution shift: once the inference scale leaves the training range, noise, blur, and artifacts accumulate sharply. We revisit this challenge from a cross-scale distribution transition perspective and propose CASR, a simple yet highly efficient cyclic SR framework that reformulates ultra-magnification as a sequence of in-distribution scale transitions. This design ensures stable inference at arbitrary scales while requiring only a single model. CASR tackles two major bottlenecks: distribution drift across iterations and patch-wise diffusion inconsistencies. The proposed SDAM module aligns structural distributions via superpixel aggregation, preventing error accumulation, while the SARM module restores high-frequency textures by enforcing autocorrelation and embedding LR self-similarity priors. Despite using only a single model, our approach significantly reduces distribution drift, preserves long-range texture consistency, and achieves superior generalization even at extreme magnification.
PaperID: 2389,   Poster  https://arxiv.org/pdf/2604.08922    
Authors: Yu Shi, Yu Liu, Zhong-Cheng Wu, Juan Cheng, Huafeng Li, Xun Chen
Title: Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios
Abstract: Complex degradations like noise, blur, and low resolution are typical challenges in real-world image fusion tasks, limiting the performance and practicality of existing methods. End-to-end neural network–based approaches are generally simple to design and highly efficient in inference, but their black-box nature leads to limited interpretability. Diffusion-based methods alleviate this to some extent by providing powerful generative priors and a more structured inference process. However, they are trained to learn a single-domain target distribution, whereas fusion lacks natural fused data and relies on modeling complementary information from multiple sources, making diffusion hard to apply directly in practice. To address these challenges, this paper proposes an efficient degradation-aware diffusion framework for image fusion under arbitrary degradation scenarios. Specifically, instead of explicitly predicting noise as in conventional diffusion models, our method performs implicit denoising by directly regressing the fused image, enabling flexible adaptation to diverse fusion tasks under complex degradations with limited steps. Moreover, we design a joint observation model correction mechanism that simultaneously imposes degradation and fusion constraints during sampling to ensure high reconstruction accuracy. Experiments on diverse fusion tasks and degradation configurations demonstrate the superiority of the proposed method under complex degradation scenarios.
PaperID: 2390,   Poster  https://arxiv.org/pdf/2512.24074    
Authors: Jingzhou Chen, Dexin Chen, Fengchao Xiong, Yuntao Qian, Liang Xiao
Title: Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images
Abstract: Fine-grained remote sensing datasets often use hierarchical label structures to differentiate objects in a coarse-to-fine manner, with each object annotated across multiple levels. However, embedding this semantic hierarchy into the representation learning space to improve fine-grained detection performance remains challenging. Previous studies have applied supervised contrastive learning at different hierarchical levels to group objects under the same parent class while distinguishing sibling subcategories. Nevertheless, they overlook two critical issues: (1) imbalanced data distribution across the label hierarchy causes high-frequency classes to dominate the learning process, and (2) learning semantic relationships among categories interferes with class-agnostic localization. To address these issues, we propose a balanced hierarchical contrastive loss combined with a decoupled learning strategy within the detection transformer (DETR) framework. The proposed loss introduces learnable class prototypes and equilibrates gradients contributed by different classes at each hierarchical level, ensuring that each hierarchical class contributes equally to the loss computation in every mini-batch. The decoupled strategy separates DETR's object queries into classification and localization sets, enabling task-specific feature extraction and optimization. Experiments on three fine-grained datasets with hierarchical annotations demonstrate that our method outperforms state-of-the-art approaches.
PaperID: 2391,   Poster  https://arxiv.org/pdf/2602.22091    
Authors: Matthew Strong, Wei-Jer Chang, Quentin HERAU, Jiezhi Yang, Yihan Hu, Chensheng Peng, Wei Zhan
Title: Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
Abstract: Egocentric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals that guide the model to jointly predict current and future point maps, camera poses, semantic layouts, and motion masks. Multi-modal teachers provide sequence-level pseudo-supervision, enabling LFG to learn a unified pseudo-4D representation from raw YouTube videos without poses, labels, or LiDAR. The resulting encoder not only transfers effectively to downstream autonomous driving planning on the NAVSIM benchmark, surpassing multi-camera and LiDAR baselines with only a single monocular camera, but also yields strong performance when evaluated on a range of semantic, geometric, and motion prediction tasks. These geometry- and motion-aware features position LFG as a compelling video-centric foundation model for autonomous driving.
PaperID: 2392,   Poster  https://arxiv.org/pdf/2508.06878    
Authors: Maoxun Yuan, Duanni Meng, Ziteng Xi, Tianyi Zhao, Shiji Zhao, Yimian Dai, Xingxing Wei
Title: Seeing Through the Noise: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective
Abstract: Infrared small target detection and segmentation (IRSTDS) is a critical yet challenging task in defense and civilian applications, owing to the dim, shapeless appearance of targets and severe background clutter. Recent CNN-based methods have achieved promising target perception results, but they focus only on enhancing feature representation to offset the impact of noise, which increases false alarms. In this paper, by analyzing the problem in the frequency domain, we pioneer improving performance from a noise-suppression perspective and propose a novel noise-suppression feature pyramid network (NS-FPN), which integrates a low-frequency guided feature purification (LFP) module and a spiral-aware feature sampling (SFS) module into the original FPN structure. The LFP module suppresses the noise features by purifying high-frequency components to achieve feature enhancement devoid of noise interference, while the SFS module further adopts spiral sampling to fuse target-relevant features in the feature fusion process. Our NS-FPN is designed to be lightweight yet effective and can be easily plugged into existing IRSTDS frameworks. Extensive experiments on the IRSTD-1k and NUAA-SIRST datasets demonstrate that our method significantly reduces false alarms and achieves superior performance on the IRSTDS task.
PaperID: 2393,   Poster  https://arxiv.org/pdf/2603.05095    
Authors: Xiaodong Zhu, Yuanming Zheng, Suting Wang, Junqi Yang, Yuhong Yang, Weiping Tu, Zhongyuan Wang
Title: GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement
Abstract: Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within videos or audio streams, providing interpretable evidence for multimedia forensics and security. While most existing TFL methods rely on dense frame-level labels in a fully supervised manner, Weakly Supervised TFL (WS-TFL) reduces labeling cost by learning only from binary video-level labels. However, current WS-TFL approaches suffer from mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by non-differentiable top-k aggregation, and the absence of explicit modeling of inter-proposal relationships. To address these issues, we propose GEM-TFL (Graph-based EM-powered Temporal Forgery Localization), a two-phase classification–regression framework that effectively bridges the supervision gap between training and inference. Built upon this foundation, (1) we enhance weak supervision by reformulating binary labels into multi-dimensional latent attributes through an EM-based optimization process; (2) we introduce a training-free temporal consistency refinement that realigns frame-level predictions for smoother temporal dynamics; and (3) we design a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation. Extensive experiments on benchmark datasets demonstrate that GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.
PaperID: 2394,   Poster  https://arxiv.org/pdf/2604.10971    
Authors: Xincheng Yao, Zefeng Qian, Chao Shi, Jiayang Song, Chongyang Zhang
Title: MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
Abstract: In the progress of industrial anomaly detection, general anomaly detection (GAD) is an emerging trend and also the ultimate goal. Unlike conventional single- and multi-class AD, general AD aims to train a general AD model that can directly detect anomalies in diverse novel classes without any retraining or fine-tuning on the target data. Recently, Multimodal Large Language Models (MLLMs) have shown great promise in achieving general anomaly detection due to their revolutionary visual understanding and language reasoning capabilities. However, the general AD ability of MLLMs remains underexplored for two reasons: (1) MLLMs are pretrained on large amounts of data sourced from the Web, but these data still have significant gaps with the data in AD scenarios. Moreover, the image-text pairs used during pretraining are not specifically designed for AD tasks. (2) The current mainstream AD datasets are image-based and not yet suitable for post-training MLLMs. To facilitate MLLM-based general AD research, we present MMR-AD, which is a comprehensive benchmark for both training and evaluating MLLM-based AD models. With MMR-AD, we reveal that the AD performance of current SOTA generalist MLLMs still falls far behind the industrial requirements. Based on MMR-AD, we also propose a baseline model, Anomaly-R1, which is a reasoning-based AD model that learns from the CoT data in MMR-AD and is further enhanced by reinforcement learning. Extensive experiments show that our Anomaly-R1 achieves remarkable improvements over generalist MLLMs in both anomaly detection and localization.
PaperID: 2395,   Poster  https://arxiv.org/pdf/2603.00882    
Authors: Zhangxing Bian, Shuwen Wei, Samuel Remedios, Junyu Chen, Aaron Carass, Blake Dewey, Jerry L Prince
Title: Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors
Abstract: Tagged MRI enables tracking internal tissue motion noninvasively. It encodes motion by modulating anatomy with periodic tags, which deform along with tissue. However, the entanglement between anatomy, tags, and motion poses significant challenges for post-processing. The presence of tags and imaging blur hinders downstream tasks such as segmenting anatomy. Tag fading, due to T1-relaxation, disrupts the brightness constancy assumption for motion tracking. For decades, these challenges have been handled in isolation and sub-optimally. In contrast, we introduce a blind and nonlinear inverse framework for tagged MRI that, for the first time, unifies these tasks: anatomical image recovery, high-resolution cine image synthesis, and motion estimation. At its core, the synergy of MR physics and generative priors enables us to blindly estimate the unknown forward imaging models and the high-resolution underlying anatomy, while simultaneously tracking 3D diffeomorphic Lagrangian motion over time. Experiments on tagged brain MRI demonstrate that our approach yields high-resolution anatomy images, cine images, and more accurate motion than specialized methods.
PaperID: 2396,   Poster  https://arxiv.org/pdf/2603.11439    
Authors: Seung Hyup Baek, Jimin Lee, Hyeongkeun Lee, Jae Won Cho
Title: Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning
Abstract: Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them with natural language. While query-based frameworks enable the simultaneous, end-to-end processing of localization and captioning, their reliance on shared queries often leads to significant multi-task interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose utilizing role-specific queries that separate localization and captioning into independent components, allowing each to exclusively learn its role. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism in which mutual temporal overlaps across queries are penalized to tackle temporal redundancy, supervising the model to learn distinct, non-overlapping event regions for more precise localization. Additionally, we introduce a lightweight module that captures core event concepts to further enhance semantic richness in captions through concept-level representations. We demonstrate the effectiveness of our method through extensive experiments on major DVC benchmarks YouCook2 and ActivityNet Captions.
PaperID: 2397,   Poster  https://arxiv.org/pdf/2603.26400    
Authors: Le Ma, Thiago Santos, Nadia Thalmann, Katarzyna Wac
Title: SHands: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training
Abstract: In surgical training for medical students, proficiency development relies on expert-led skill assessment, which is costly, time-limited, and difficult to scale, and whose expertise remains confined to institutions with available specialists. Automated AI-based assessment offers a viable alternative, but progress is constrained by the lack of datasets containing realistic trainee errors and the multi-view variability needed to train robust computer vision approaches. To address this gap, we present Surgical-Hands (SHands), a large-scale multi-view video dataset for surgical hand-gesture and error recognition for medical training. SHands captures linear incision and suturing using five RGB cameras from complementary viewpoints, performed by 52 participants (20 experts and 32 trainees), each completing three standardized trials per procedure. The videos are annotated at the frame level with 15 gesture primitives and include a validated taxonomy of 8 trainee error types, enabling both gesture recognition and error detection. We further define standardized evaluation protocols for single-view, multi-view, and cross-view generalization, and benchmark state-of-the-art deep learning models on the dataset. SHands will be publicly released to support the development of robust and scalable AI systems for surgical training grounded in clinically curated domain knowledge.
PaperID: 2398,   Poster  https://arxiv.org/pdf/2603.16233    
Authors: Ryosuke Hori, Jyun-Ting Song, Zhengyi Luo, Jinkun Cao, Soyong Shin, HIDEO SAITO, Kris Kitani
Title: Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors
Abstract: We propose Ground Reaction Inertial Poser (GRIP), a method that reconstructs physically plausible human motion using four wearable devices. Unlike conventional IMU-only approaches, GRIP combines IMU signals with foot pressure data to capture both body dynamics and ground interactions. Furthermore, rather than relying solely on kinematic estimation, GRIP uses a digital twin of a person, in the form of a synthetic humanoid in a physics simulator, to reconstruct realistic and physically plausible motion. At its core, GRIP consists of two modules: KinematicsNet, which estimates body poses and velocities from sensor data, and DynamicsNet, which controls the humanoid in the simulator using the residual between the KinematicsNet prediction and the simulated humanoid state. To enable robust training and fair evaluation, we introduce a large-scale dataset, Pressure and Inertial Sensing for Human Motion and Interaction (PRISM), that captures diverse human motions with synchronized IMUs and insole pressure sensors. Experimental results show that GRIP outperforms existing IMU-only and IMU–pressure fusion methods across all evaluated datasets, achieving higher global pose accuracy and improved physical consistency. Code and data will be released upon acceptance.
PaperID: 2399,   Poster  https://arxiv.org/pdf/2602.23339    
Authors: Tilemachos Aravanis, Vladan Stojnić, Vasileios Psomas, Nikos Komodakis, Giorgos Tolias
Title: Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
Abstract: Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision–language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and adapts seamlessly to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
PaperID: 2400,   Poster  https://arxiv.org/pdf/2603.27240    
Authors: Jinhu Fu, Yihang Lou, Qingyi Si, Shudong Zhang, Sen Su
Title: Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection
Abstract: Large Vision-Language Models (LVLMs) have achieved impressive performance across multimodal understanding and reasoning tasks, yet their internal safety mechanisms remain opaque and poorly controlled. In this work, we present a comprehensive framework for diagnosing and repairing unsafe channels within LVLMs. We first perform causal mediation analysis to identify neurons and layers that are causally responsible for unsafe behaviors. Based on these findings, we introduce a dual-modal safety subspace projection method that learns generalized safety subspaces for both visual and textual modalities through generalized eigen-decomposition between benign and malicious activations. During inference, activations are dynamically projected toward these safety subspaces via a hybrid fusion mechanism that adaptively balances visual and textual corrections, effectively suppressing unsafe features while preserving semantic fidelity. Extensive experiments on multiple LVLM safety benchmarks demonstrate that our causal–subspace repair framework significantly enhances safety robustness without degrading general multimodal capabilities, outperforming prior activation steering and alignment-based baselines. Moreover, our approach demonstrates robust transferability, effectively defending against unseen and adaptive attacks.
PaperID: 2401,   Poster  https://arxiv.org/pdf/2604.03685    
Authors: Hoonhee Cho, Jae-Young Kang, Yuhwan Jeong, Yunseo Yang, Wonyoung Lee, Youngho Kim, Kuk-Jin Yoon
Title: DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR
Abstract: In this paper, we present DSERT-RoLL, a driving dataset that incorporates stereo event, RGB, and thermal cameras together with 4D radar and dual LiDAR, collected across diverse weather and illumination conditions. The dataset provides precise 2D and 3D bounding boxes with track IDs and ego vehicle odometry, enabling fair comparisons within and across sensor combinations. It is designed to alleviate data scarcity for novel sensors such as event cameras and 4D radar and to support systematic studies of their behavior. We establish unified 3D and 2D benchmarks that enable direct comparison of characteristics and strengths across sensor families and within each family. We report baselines for representative single-modality and multimodal methods and provide protocols that encourage research on different fusion strategies and sensor combinations. In addition, we propose a fusion framework that integrates sensor-specific cues into a unified feature space and improves 3D detection robustness under varied weather and lighting. We will make our code and dataset publicly available.
PaperID: 2402,   Poster  https://arxiv.org/pdf/2512.04513    
Authors: Yu-Wei Zhan, Xin Wang, Pengzhe Mao, Tongtong Feng, Ren Wang, Wenwu Zhu
Title: BiTAgent: A Task-Aware Modular Framework for Bidirectional Coupling between Multimodal Large Language Models and World Models
Abstract: Building generalist embodied agents requires a unified system that can interpret multimodal goals, model environment dynamics, and execute reliable actions across diverse real-world tasks. Multimodal large language models (MLLMs) offer strong semantic priors and cross-modal generalization, while world models (WMs) provide actionable latent dynamics for prediction and control. Their combination holds promise for open-ended embodied intelligence, yet introduces two key challenges: (1) establishing a tight coupling between the semantic intent from MLLMs and the dynamic state representations within the WM’s latent space, and (2) achieving task-aware adaptability that supports multi-task learning and cross-environment generalization. To address these limitations, we propose BiTAgent, a task-aware dynamic joint framework that enables bidirectional coupling between MLLMs and WMs. BiTAgent establishes two complementary pathways: a forward path that injects MLLM representations into the WM’s latent space for semantically guided imagination, and a backward path where WM-generated feedback refines the MLLM’s semantic space via dense text-conditioned rewards. This bidirectional interaction is realized through three synergistic components: Task-Aware Dynamic Joint Learning, Task-Aware Behavior Modeling, and MLLM-WM Joint Optimization, which together harmonize semantic reasoning and dynamic prediction. Extensive experiments across multi-task and cross-environment settings demonstrate superior stability and generalization over state-of-the-art baselines, marking a step toward open-ended embodied learning.
PaperID: 2403,   Poster  https://arxiv.org/pdf/2604.00452    
Authors: Halima Bouzidi, Haoyu Liu, Yonatan Achamyeleh, Praneetsai Iddamsetty, Mohammad Al Faruque
Title: Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation
Abstract: Recent Tracking-by-Query-Propagation (TBP) methods have advanced Multi-Object Tracking (MOT) by enabling end-to-end (E2E) pipelines with long-range temporal modeling. However, this reliance on query propagation introduces unexplored architectural vulnerabilities to adversarial attacks. We present FADE, a novel attack framework designed to exploit these specific vulnerabilities. FADE employs two attack strategies targeting core TBP mechanisms: (i) Temporal Query Flooding: Generates spurious temporally-consistent track queries to exhaust the tracker's limited query budget, forcing it to terminate valid tracks. (ii) Temporal Memory Corruption: Directly attacks the query updater's memory by severing temporal links via state de-correlation and erasing the learned feature identity of matched tracks. Furthermore, we introduce a differentiable pipeline to optimize these attacks for physical-world realizability by leveraging differentiable simulations of advanced perception sensor spoofing methods. Experiments on MOT17 and MOT20 demonstrate that FADE is highly effective against state-of-the-art TBP trackers, causing significant identity switches and track terminations.
PaperID: 2404,   Poster  https://arxiv.org/pdf/2604.08617    
Authors: Zhuang Qi, Yingpeng Tang, Lei Meng, Guoqing Chao, Lei Wu, Han Yu, Xiangxu Meng
Title: From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity
Abstract: Exemplar replay has become an effective strategy for mitigating catastrophic forgetting in federated continual learning (FCL) by retaining representative samples from past tasks. Existing studies focus on designing sample-importance estimation mechanisms to identify information-rich samples. However, they typically overlook strategies for effectively utilizing the selected exemplars, which limits their performance under continual dynamic heterogeneity across clients and tasks. To address this issue, this paper proposes a federated geometry-aware correction method, termed FEAT, which alleviates imbalance-induced representation collapse that drags rare-class features toward frequent classes across clients. Specifically, it consists of two key modules: 1) the Geometric Structure Alignment module performs structural knowledge distillation by aligning the pairwise angular similarities between feature representations and their corresponding Equiangular Tight Frame prototypes, which are fixed and shared across clients to serve as a class-discriminative reference structure. This encourages geometric consistency across tasks and helps mitigate representation drift; 2) the Energy-based Geometric Correction module removes task-irrelevant directional components from feature embeddings, which reduces prediction bias toward majority classes. This improves sensitivity to minority classes and enhances the model's robustness under class-imbalanced data distributions. Extensive experiments on three benchmark datasets demonstrate that FEAT achieves a 4%–8% improvement in Top-1 accuracy compared to nine state-of-the-art methods.
PaperID: 2405,   Poster  https://arxiv.org/pdf/2512.21516    
Authors: Hongqing He, Jie Xu, Wenyuan Yang, Yonghua Zhu, Guoqiu Wen, Xiaofeng Zhu
Title: Global-Graph Guided and Local-Graph Weighted Contrastive Learning for Unified Clustering on Incomplete and Noise Multi-View Data
Abstract: Recently, contrastive learning (CL) has played an important role in exploring complementary information for multi-view clustering (MVC) and has attracted increasing attention. Nevertheless, real-world multi-view data suffer from incompleteness or noise, resulting in rare-paired or mis-paired samples, which significantly challenge the effectiveness of CL-based MVC. That is, the rare-paired issue prevents MVC from extracting sufficient multi-view complementary information, and the mis-paired issue causes contrastive learning to optimize the model in the wrong direction. To address these issues, we propose a unified CL-based MVC framework for enhancing clustering effectiveness on incomplete and noisy multi-view data. First, to overcome the rare-paired issue, we design a global-graph guided contrastive learning scheme, where all view samples construct a global-view affinity graph to form new sample pairs for fully exploring complementary information. Second, to mitigate the mis-paired issue, we propose a local-graph weighted contrastive learning scheme, which leverages local neighbors to generate pair-wise weights that adaptively strengthen or weaken the pair-wise contrastive learning. Our method is imputation-free and can be integrated into a unified global-local graph-guided contrastive learning framework. Extensive experiments on both incomplete and noisy settings of multi-view data demonstrate that our method achieves superior performance compared with state-of-the-art approaches.
PaperID: 2406,   Poster  https://arxiv.org/pdf/2603.02618    
Authors: Zhikang Xu, Qianqian Xu, Zitai Wang, Cong Hua, Sicong Li, Zhiyong Yang, Qingming Huang
Title: Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs
Abstract: Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm creates an inherent inconsistency against the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives. From the textual perspective, we devise an inter-modal criterion for selecting negative texts. From the visual perspective, we dynamically identify high-confidence OOD images and invert them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Extensive experiments across multiple benchmarks demonstrate the superiority of our approach. Notably, our InterNeg achieves state-of-the-art performance compared to existing works, with a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% improvement in AUROC on the challenging Near-OOD benchmark.
PaperID: 2407,   Poster  https://arxiv.org/pdf/2508.02443    
Authors: Thomas Gottwald, Edgar Heinert, Peter Stehr, Chamuditha Jayanga Galappaththige, Matthias Rottmann
Title: PRIMU: Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Coverage
Abstract: We introduce Primitive-based Representations of Uncertainty (PRIMU), a post-hoc uncertainty estimation (UE) framework for Gaussian Splatting (GS). Reliable UE is essential for deploying GS in safety-critical domains such as robotics and medicine. Existing approaches typically estimate Gaussian-primitive variances and rely on the rendering process to obtain pixel-wise uncertainties. In contrast, we construct primitive-level representations of error and visibility/coverage from training views, capturing interpretable uncertainty information. These representations are obtained by projecting view-dependent training errors and coverage statistics onto the primitives. Uncertainties for novel views are inferred by rendering these primitive-level representations, producing uncertainty feature maps, which are aggregated through pixel-wise regression on holdout data. We analyze combinations of uncertainty feature maps and regression models to understand how their interactions affect prediction accuracy and generalization. PRIMU also enables an effective active view selection strategy by directly leveraging these uncertainty feature maps. Additionally, we study the effect of separating splatting into foreground and background regions. Our estimates show strong correlations with true errors, outperforming state-of-the-art methods, especially for depth UE and foreground objects. Finally, our regression models show generalization capabilities to unseen scenes, enabling UE without additional holdout data.
PaperID: 2408,   Poster  https://arxiv.org/pdf/2603.14184    
Authors: Ruiying Peng, Xueyu Wu, Jing Lei, Lu Hou, Yuanzheng Ma, Xiao-Hui Li
Title: Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models
Abstract: Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, the model's visual attention becomes scattered and drifts away from question-relevant regions, effectively “losing focus” on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between the model's overall attention on image tokens and the spatial dispersiveness of its attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy–focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning. Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy, while offering interpretable insights into how MLLMs process visual information.
PaperID: 2409,   Poster  https://arxiv.org/pdf/2604.05296    
Authors: Daniel George, Charles Yeh, Daniel Lee, Yifei Zhang
Title: From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal
Abstract: Frozen visual embeddings (e.g., CLIP, DINOv2/v3, SSCD) power retrieval and integrity systems, yet their use on face-containing data is constrained by unmeasured identity leakage and a lack of deployable mitigations. We take an attacker-aware view and contribute: (i) a benchmark of visual embeddings that reports open-set verification at low false-accept rates, a calibrated diffusion-based template inversion check, and face–context attribution with equal-area perturbations; and (ii) a one-shot linear projector that removes an estimated identity subspace while preserving the complementary space needed for utility, which for brevity we denote as the identity sanitization projection (ISP). Across CelebA-20 and VGGFace2, we show that these encoders are robust under open-set linear probes (with CLIP exhibiting relatively higher leakage than DINOv2/v3 and SSCD), are robust to template inversion, and are context-dominant. In addition, we show that ISP drives linear access to near-chance while retaining high non-biometric utility, and transfers across datasets with minor degradation. Our results establish the first attacker-calibrated facial privacy audit of non-FR encoders and demonstrate that linear subspace removal achieves strong privacy guarantees while preserving utility for visual search and retrieval.
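The "linear subspace removal" idea in the abstract above can be illustrated generically as projecting an embedding onto the orthogonal complement of a set of directions, i.e. applying (I - U U^T) for an orthonormal basis U. The sketch below is only that textbook projection, not the paper's ISP: the abstract does not say how the identity subspace is estimated, so the `directions` list, the function names, and the 4-dimensional toy embedding are all hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: return an orthonormal basis spanning the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis


def remove_subspace(x, basis):
    """Apply (I - U U^T) to x, where the rows of `basis` are orthonormal."""
    out = list(x)
    for b in basis:
        c = dot(x, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Hypothetical "identity" directions and a toy embedding (illustration only).
directions = [[1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
basis = orthonormalize(directions)
embedding = [1.0, 2.0, 3.0, 4.0]
cleaned = remove_subspace(embedding, basis)
```

After the projection, `cleaned` has no component along either direction, while components outside their span (here the fourth coordinate) are left unchanged.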
PaperID: 2410,   Poster  https://arxiv.org/pdf/2603.21484    
Authors: Hyundong Jin, Dongyoon Han, Eunwoo Kim
Title: Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models
Abstract: Continual unlearning poses the challenge of enabling large vision-language models to selectively refuse specific image-instruction pairs in response to sequential deletion requests, while preserving general utility. However, sequential unlearning updates distort shared representations, creating spurious associations between vision-language pairs and refusal behaviors that hinder precise identification of refusal targets, resulting in inappropriate refusals. To address this challenge, we propose a novel continual unlearning framework that grounds refusal behavior in fine-grained descriptions of visual and textual concepts decomposed from deletion targets. We first identify which visual-linguistic concept combinations characterize each forget category through a concept modulator, then determine how to generate appropriate refusal responses via a mixture of refusal experts, termed refusers, each specialized for concept-aligned refusal generation. To generate concept-specific refusal responses across sequential tasks, we introduce a multimodal, concept-driven routing scheme that reuses refusers for tasks sharing similar concepts and adapts underutilized ones for novel concepts. Extensive experiments on vision-language benchmarks demonstrate that the proposed framework outperforms existing methods by generating concept-grounded refusal responses and preserving the general utility across unlearning sequences.
PaperID: 2411,   Poster  https://arxiv.org/pdf/2604.18623    
Authors: Xin Hu, Ke Qin, Wen Yin, Yuan-Fang Li, Ming Li, Tao He
Title: Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching
Abstract: Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting boxes and subject–predicate–object triples. Yet most pipelines treat SGG as a one-shot, deterministic classification instead of a genuine progressive, generative task. We propose FlowSG, which recasts SGG as continuous-time transport on a hybrid discrete–continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a scene graph (e.g., the continuous visual features) into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors for categorical tokens (object features and predicate labels), coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors/segmenters. Extensive experiments on VG and PSG under closed- and open-vocabulary protocols show consistent gains in predicate R/mR and graph-level metrics, validating the mixed discrete–continuous generative formulation over one-shot classification baselines, e.g., an average improvement of about 3 points over the SOTA USG-Par.
PaperID: 2412,   Poster  https://arxiv.org/pdf/2511.18385    
Authors: PengChuang PengChuang, Renshuai Tao, Zhongwei Ren, Xianglong Liu, Yunchao Wei
Title: Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection
Abstract: Automatic X-ray prohibited item detection is vital for security inspection and has been widely studied. Traditional methods rely on the visual modality alone and often struggle with complex threats. While recent studies incorporate language to guide single-view images, human inspectors typically use dual-view images in practice. This raises the question: can the second view provide constraints similar to a language modality? In this work, we introduce DualXrayBench, the first comprehensive benchmark for X-ray inspection that includes multiple views and modalities. It supports eight tasks designed to test cross-view reasoning. In DualXrayBench, we introduce a dual-view caption corpus consisting of 45,613 dual-view images across 12 categories with corresponding captions. Building upon these data, we propose the Geometric (cross-view)-Semantic (cross-modality) Reasoner (GSR), a multimodal model that jointly learns correspondences between cross-view geometry and cross-modal semantics, treating the second-view images as a "language-like modality". To enable this, we construct the GSXray dataset, with structured Chain-of-Thought sequences: top, side, conclusion. Comprehensive evaluations on DualXrayBench demonstrate that GSR achieves significant improvements across all X-ray tasks, offering new insight for real-world X-ray inspection.
PaperID: 2413,   Poster  https://arxiv.org/pdf/2603.27139    
Authors: Shivang Chopra, Shaunak Halbe, Chengyue Huang, Brisa Maneechotesuwan, Zsolt Kira
Title: The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models
Abstract: Fine-tuning approaches for Vision-Language Models (VLMs) face a critical three-way trade-off between In-Distribution (ID) accuracy, Out-of-Distribution (OOD) generalization, and adversarial robustness. Existing robust fine-tuning strategies resolve at most two axes of this trade-off. Generalization-preserving methods retain ID/OOD performance but leave models vulnerable to adversarial attacks, while adversarial training improves robustness to targeted attacks but degrades ID/OOD accuracy. Our key insight is that the robustness trade-off stems from two geometric failures: sharp, anisotropic minima in parameter space and unstable feature representations that deform under perturbation. To address this, we propose GRACE (Gram-aligned Robustness via Adaptive Curvature Estimation), a unified fine-tuning framework that jointly regularizes the parameter-space curvature and feature-space invariance for VLMs. Grounded in Robust PAC-Bayes theory, GRACE employs adaptive weight perturbations scaled by local curvature to promote flatter minima, combined with a feature alignment loss that maintains representation consistency across clean, adversarial, and OOD inputs. On ImageNet fine-tuning of CLIP models, GRACE simultaneously improves ID accuracy by 10.8% and adversarial accuracy by 8.9% while maintaining 57.0% OOD accuracy (vs. 57.4% zero-shot baseline). Geometric analysis confirms that GRACE converges to flatter minima without feature distortion across distribution shifts, providing a principled step toward generalized robustness in foundation VLMs.
PaperID: 2414,   Poster  https://arxiv.org/pdf/2602.20951    
Authors: Jaehyun Park, Minyoung Ahn, Minkyu Kim, Jonghyun Lee, Jae-Gil Lee, Dongmin Park
Title: See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
Abstract: Despite recent advances in diffusion models, AI-generated images still often contain visual artifacts that compromise realism. Although more thorough pretraining and bigger models might reduce artifacts, there is no assurance that they can be completely eliminated, which makes artifact mitigation a crucial area of study. Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach to reliably acquire artifact-annotated datasets. In this paper, we propose ArtiAgent, which efficiently creates pairs of real and artifact-injected images. It comprises three agents: a perception agent that recognizes and grounds entities and subentities from real images, a synthesis agent that introduces artifacts via artifact injection tools through novel patch-wise embedding manipulation within a diffusion transformer, and a curation agent that filters the synthesized artifacts and generates both local and global explanations for each instance. Using our pipeline, we collect 100K pairwise instances and demonstrate both efficacy and versatility across diverse applications. Code is available at https://anonymous.4open.science/r/ArtiAgent.
PaperID: 2415,   Poster  https://arxiv.org/pdf/2604.03454    
Authors: Ganlin Feng, Yuxi Long, Hafsa Ali, Erin Lou, Fahad Butt, Qian Liu, Yang Wang, Pingzhao Hu
Title: RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation
Abstract: Rare diseases often manifest with distinctive facial phenotypes in children, offering valuable diagnostic cues for clinicians and AI-assisted screening systems. However, progress in this field is severely limited by the scarcity of curated, ethically sourced facial data and the high similarity among phenotypes across different conditions. To address these challenges, we introduce RDFace, a curated benchmark dataset comprising 456 pediatric facial images spanning 103 rare genetic conditions (average 4.4 samples per condition). Each ethically verified image is paired with standardized metadata. RDFace enables the development and evaluation of data-efficient AI models for rare disease diagnosis under real-world low-data constraints. We benchmark multiple pretrained vision backbones using cross-validation and explore synthetic augmentation with DreamBooth and FastGAN. Generated images are filtered via facial landmark similarity to maintain phenotype fidelity and merged with real data, improving diagnostic accuracy by up to 13.7% in ultra-low-data regimes. To assess semantic validity, phenotype descriptions generated by a vision–language model from real and synthetic images achieve a report similarity score of 0.84. RDFace establishes a transparent, benchmark-ready dataset for equitable rare disease AI research and presents a scalable framework for evaluating both diagnostic performance and the integrity of synthetic medical imagery.
PaperID: 2416,   Poster  https://arxiv.org/pdf/2603.13787    
Authors: Junjie Zhou, Bao Xue, Meiling Wang, WEI SHAO, Daoqiang Zhang
Title: Advancing Cancer Prognosis with Hierarchical Fusion of Genomic, Proteomic and Pathology Imaging Data from a Systems Biology Perspective
Abstract: To enhance the precision of cancer prognosis, recent research has increasingly focused on multimodal survival methods by integrating genomic data and histology images. However, current approaches overlook the fact that the proteome serves as an intermediate layer bridging genomic alterations and histopathological features while providing complementary biological information essential for survival prediction. This biological reality exposes another architectural limitation: existing integrative analysis studies fuse these heterogeneous data sources in a flat manner that fails to capture their inherent biological hierarchy. To address these limitations, we propose HFGPI, a hierarchical fusion framework that models the biological progression from genes to proteins to histology images from a systems biology perspective. Specifically, we introduce Molecular Tokenizer, a molecular encoding strategy that integrates identity embeddings with expression profiles to construct biologically informed representations for genes and proteins. We then develop Gene-Regulated Protein Fusion (GRPF), which employs graph-aware cross-attention with structure-preserving alignment to explicitly model gene-protein regulatory relationships and generate gene-regulated protein representations. Additionally, we propose Protein-Guided Hypergraph Learning (PGHL), which establishes associations between proteins and image patches, leveraging hypergraph convolution to capture higher-order protein-morphology relationships. The final features are progressively fused across hierarchical layers to achieve precise survival outcome prediction. Extensive experiments on five benchmark datasets demonstrate the superiority of HFGPI over state-of-the-art methods.
PaperID: 2417,   Poster  https://arxiv.org/pdf/2506.05428    
Authors: Zhihao Tang, Chaozhuo Li, Litian Zhang, Xi Zhang
Title: Diffusion with a Linguistic Compass: Steering the Generation of Clinically Plausible Future sMRI Representations for Early MCI Conversion Prediction
Abstract: Early prediction of Mild Cognitive Impairment (MCI) conversion is hampered by a trade-off between immediacy (making fast predictions from a single baseline sMRI) and accuracy (leveraging longitudinal scans to capture disease progression). We propose MCI-Diff, a diffusion-based framework that synthesizes clinically plausible future sMRI representations directly from baseline data, achieving both real-time risk assessment and high predictive performance. First, a multi-task sequence reconstruction strategy trains a shared denoising network on interpolation and extrapolation tasks to handle irregular follow-up sampling and learn robust latent trajectories. Second, an LLM-driven “linguistic compass” is introduced for clinical plausibility sampling: generated feature candidates are quantized, tokenized, and scored by a fine-tuned language model conditioned on expected structural biomarkers, guiding autoregressive generation toward realistic disease patterns. Experiments on ADNI and AIBL cohorts show that MCI-Diff outperforms state-of-the-art baselines, improving early conversion accuracy by 5–12%.
PaperID: 2418,   Poster  https://arxiv.org/pdf/2511.22442    
Authors: Sébastien Piérard, Adrien Deliege, Marc Van Droogenbroeck
Title: What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely $F_1$
Abstract: Ranking methods or models based on their performance is of prime importance but is tricky because performance is fundamentally multidimensional. In the case of classification, precision and recall are scores with probabilistic interpretations that are both important to consider and complementary. The rankings induced by these two scores are often in partial contradiction. In practice, therefore, it is extremely useful to establish a compromise between the two views to obtain a single, global ranking. Over the last fifty years or so, it has been proposed to take a weighted harmonic mean, known as the F-score, F-measure, or $F_\beta$. Generally speaking, by averaging basic scores, we obtain a score that is intermediate in terms of values. However, there is no guarantee that these scores lead to meaningful rankings and no guarantee that the rankings are good tradeoffs between these base scores. Given the ubiquity of $F_\beta$ scores in the literature, some clarification is in order. Concretely: (1) We establish that $F_\beta$-induced rankings are meaningful and define a shortest path between precision- and recall-induced rankings. (2) We frame the problem of finding a tradeoff between two scores as an optimization problem expressed with Kendall rank correlations. We show that $F_1$ and its skew-insensitive version are far from being optimal in that regard. (3) We provide theoretical tools and a closed-form expression to find the optimal value for $\beta$ for any distribution or set of performances, and we illustrate their use on six case studies.
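As a small illustration of the abstract above, the standard definition $F_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)$ already shows how the induced ranking shifts with $\beta$. The sketch below uses three made-up (precision, recall) pairs and a plain Kendall tau-a; it does not reproduce the paper's closed-form optimal $\beta$, and all names and numbers are hypothetical.

```python
from itertools import combinations


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(xs)), 2))
    total = 0
    for i, j in pairs:
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        total += (prod > 0) - (prod < 0)  # tied pairs contribute zero
    return total / len(pairs)


def ranking(methods, beta):
    """Method names sorted from best to worst F_beta."""
    return sorted(methods, key=lambda m: f_beta(*methods[m], beta), reverse=True)


# Hypothetical (precision, recall) pairs, for illustration only.
methods = {"A": (0.9, 0.5), "B": (0.6, 0.8), "C": (0.7, 0.7)}
names = list(methods)
precisions = [methods[m][0] for m in names]
recalls = [methods[m][1] for m in names]
f1_scores = [f_beta(*methods[m], 1.0) for m in names]

tau_with_precision = kendall_tau(f1_scores, precisions)
tau_with_recall = kendall_tau(f1_scores, recalls)
```

With these toy values, a small $\beta$ reproduces the precision-induced ranking and a large $\beta$ the recall-induced one, while $\beta = 1$ gives a third ordering that agrees only partially with each.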
PaperID: 2419,   Poster
Authors: Andrew Xie, Dongyu Du, Sotiris Nousias, David B. Lindell, Kiriakos Kutulakos
Title: Inter-Photon-Limited Videography
Abstract: We consider the problem of imaging a dynamic scene when scene appearance variations can outpace photon arrivals. Under such conditions, a pixel is effectively "blind" to changes in appearance that occur within the timespan separating the photons it detects, and so the inter-photon interval presents a significant speed barrier to video acquisition systems. To analyze and advance imaging capabilities at the inter-photon limit, we introduce a novel reparameterization of time-varying flux that reveals the intrinsic difficulty of signal reconstruction by relating the Fourier decomposition of a flux function to the number of photons arriving within each oscillation period. We find that inter-photon-limited videography of general scenes is underexplored and beyond the reach of existing reconstruction techniques. To this end, we introduce Neural Flux Fields, a technique that combines statistical modeling of photon arrival with intrinsic priors of a neural network to achieve robust videography at the inter-photon limit. Using this approach, we demonstrate never-before-seen capabilities in video reconstruction across a range of captured single-photon video datasets spanning the inter-photon-limited regime.
PaperID: 2420,   Poster
Authors: Xiaopei Zhu, Zeyuan Li, Jun Zhu, Xiaolin Hu
Title: Mirror Illusion Art
Abstract: Mirror Illusion Art is a novel reflection-conditioned 3D illusion where one object yields two target appearances (front and mirror). The task is formulated as inverse design from two target 2D images (front and mirror) to a printable 3D object with geometry and texture. Prior topology-driven and shadow-based approaches demand substantial manual effort, optimize shape only, and often yield non-smooth or incomplete geometry. To address these challenges, we propose AutoMIA, an automated Mirror Illusion Art design pipeline that jointly optimizes shape and color. To stabilize optimization and suppress artifacts, four mechanisms are introduced: (1) projection-alignment component (PAC) selection to reduce surface noise, (2) position-weighted adaptive (PWA) suppression for background noise, (3) internal voxel preservation (IVP) to prevent internal fractures, and (4) shape-color decoupled (SCD) optimization that balances shape and color optimization. AutoMIA successfully generates diverse, smooth Mirror Illusion artworks in both the digital and physical worlds, with only around 76s design time and 2.6 GB memory on average using a single RTX 3090, advancing inverse graphics and computational design.
PaperID: 2421,   Poster
Authors: Gaku Nakano
Title: Affine Perspective-Three-Point Problem
Abstract: This paper addresses the Perspective-Three-Point (P3P) problem under affine camera models. We derive direct closed-form solvers for weak perspective and paraperspective, which are representative affine camera models. The affine P3P solution reduces to a bi-quadratic equation. Unlike exact P3P solvers that require a cubic or quartic equation, it allows for the simple and stable calculation of real solutions using the quadratic formula. Since affine approximations are valid only when scene depth variation is small, we further propose an iterative correction that upgrades the affine solution to the exact P3P solution. Through extensive comparisons using synthetic data and public datasets, we demonstrate that affine P3P solvers with two upgrade iterations achieve performance comparable to that of the state-of-the-art P3P solver.
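The remark above that a bi-quadratic equation can be solved with the quadratic formula can be made concrete: substituting $y = x^2$ turns $a x^4 + b x^2 + c = 0$ into a quadratic in $y$, and each non-negative root $y$ yields the real solutions $x = \pm\sqrt{y}$. The sketch below is a generic solver with made-up coefficients; how the affine P3P coefficients are derived is not given in the abstract, so none of this should be read as the paper's solver.

```python
import math


def real_roots_biquadratic(a, b, c):
    """Real roots of a*x**4 + b*x**2 + c = 0, found via the substitution y = x**2."""
    if a == 0:
        raise ValueError("leading coefficient must be non-zero")
    disc = b * b - 4 * a * c
    if disc < 0:
        return []  # no real y, hence no real x
    sq = math.sqrt(disc)
    ys = {(-b + sq) / (2 * a), (-b - sq) / (2 * a)}
    roots = set()
    for y in ys:
        if y >= 0:  # only non-negative y gives real x
            r = math.sqrt(y)
            roots.update((r, -r))
    return sorted(roots)


# Hypothetical coefficients: x^4 - 5x^2 + 4 = (x^2 - 1)(x^2 - 4)
example_roots = real_roots_biquadratic(1.0, -5.0, 4.0)
```

Only square roots are needed, so no cubic or quartic root-finding is involved, which is the simplicity the abstract points to.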
PaperID: 2422,   Poster
Authors: Hyeon-Jin Jung, Han-Jin Lee, Seok-Hwan Choi
Title: Streamlined Knowledge Distillation
Abstract: Logit-based Knowledge Distillation (KD) has emerged as a lightweight alternative to feature-based KD. Recent logit-based methods often rely on multi-knowledge alignment and relational modeling. These methods are often inefficient due to redundant objectives, suboptimal transformations, and poorly designed loss functions. Motivated by these issues, we propose Streamlined Knowledge Distillation (SKD), a simple yet effective logit-based method that transfers only two essential forms of knowledge without requiring additional alignment or relational modeling. Specifically, SKD transfers instance-wise knowledge via Kullback-Leibler divergence and direction-wise knowledge by aligning the Gramian matrix of normalized logits. For the latter, we introduce a Mahalanobis distance-based direction-wise loss stabilized through Tikhonov regularization and Cholesky decomposition. This direction-wise loss accounts for variance and correlation in the output space and, as we formally show, is equivalent to the L2-norm in a covariance-whitened space. Extensive experiments demonstrate that SKD consistently outperforms existing logit-based methods and even surpasses feature-based methods, despite its simpler design. Code is available at https://anonymous.4open.science/r/StreamLined-DF23/.
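The equivalence mentioned in the abstract above follows from a standard identity, sketched here under assumed notation: $\Sigma$ is a covariance matrix, $\lambda > 0$ a Tikhonov regularizer, and $u, v$ two vectors being compared. How SKD applies this to Gramian matrices of normalized logits is not specified in the abstract, so the symbols below are illustrative rather than the paper's own.

```latex
\begin{align}
\Sigma_\lambda &= \Sigma + \lambda I = L L^\top
  && \text{(Cholesky factor $L$ exists because $\Sigma_\lambda$ is positive definite)} \\
d_M(u, v)^2 &= (u - v)^\top \Sigma_\lambda^{-1} (u - v) \\
            &= (u - v)^\top L^{-\top} L^{-1} (u - v) \\
            &= \bigl\lVert L^{-1} (u - v) \bigr\rVert_2^2
\end{align}
```

In words, the Mahalanobis distance under $\Sigma_\lambda$ is the ordinary L2 distance after mapping both vectors through the whitening transform $L^{-1}$, and adding $\lambda I$ keeps the factorization well defined even when $\Sigma$ is singular.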
PaperID: 2423,   Poster
Authors: Soumik Mukhopadhyay, Prateksha Udhayanan, Abhinav Shrivastava
Title: Scale Space Diffusion
Abstract: Diffusion models degrade images through noise, and reversing this process reveals an information hierarchy across timesteps. Scale-space theory exhibits a similar hierarchy via low-pass filtering. We formalize this connection and show that highly noisy diffusion states contain no more information than small, downsampled images, raising the question of why they must be processed at full resolution. To address this, we fuse scale spaces into the diffusion process by formulating a family of diffusion models with generalized linear degradations and practical implementations. Using downsampling as the degradation yields our proposed Scale Space Diffusion method. To support Scale Space Diffusion, we introduce FlexiUNet, a UNet variant that performs resolution-preserving and resolution-increasing denoising using only the necessary parts of the network. We evaluate our framework on CelebA and ImageNet and analyze its scaling behavior across resolutions and network depths.
PaperID: 2424,   Poster
Authors: Nakul Sharma, Aayush Bansal, Minh Vo
Title: Clothe and Pose
Abstract: We present Clothe-and-Pose, an image generation and editing method that enables a user to try on different clothes and pose as they desire. Our method takes as input a single user image, a garment image, and an example pose, and outputs the user wearing the target garment in the desired pose. In this work, we also introduce an evaluation setup for clothing and posing tasks. Our study spans a wide variety of garments, including athletic wear, bottoms, dresses, innerwear and swimwear, and a diverse set of poses. Finally, we demonstrate the capability of our model to do general human editing in real-world captures, as well as artificially generated images.
PaperID: 2425,   Poster
Authors: Caiyang Yu, Chen Huang, Yun Liu, Chenwei Tang, Wei Ju, Jiancheng Lv
Title: Progressive Neural Architecture Generation
Abstract: As a representative technique in neural architecture search, neural architecture generation aims to construct high-performance architectures for a given task directly. It is poised to replace the inefficient random exploration components of some search strategies, such as the acquisition strategies in Bayesian optimization. Despite significant research, current architecture generation techniques face problems such as low generation efficiency and insufficient constraints, leading to invalid generated architectures. To this end, we propose Progressive Neural Architecture Generation (PNAG), which constructs architectures incrementally through coarse-to-fine evolution, enhancing generation efficiency, and incorporates step-wise refinements to ensure the validity of the generated architecture. To achieve this, PNAG involves two modules, multi-scale sub-architecture quantization (MSQ) and step-wise consistency constraint (SCC). Specifically, MSQ constructs sub-architectures using quantization decoding and progressively expands them, transitioning from simple to complex forms. This operation bypasses network inference to enhance efficiency. Complementing MSQ, SCC, implemented through a tailored regularization mechanism, introduces penalties for deviations during sub-architecture generation, guiding the process towards valid target architectures. As such, PNAG establishes a clear generation path, laying the groundwork for generating suitable architectures in downstream tasks. Extensive experiments demonstrate that PNAG not only generates superior architectures for various downstream tasks (+8.43%/+5.07%, on average) but also significantly improves generation efficiency, reducing the architecture generation time by 1300×. Furthermore, PNAG demonstrates strong extensibility by successfully generating Transformer-based architectures.
Paperid: 2426,   Poster  
Authors: Mohamed Abdelfattah, Bugra Tekin, Fadime Sener, Necati Cihan Camgoz, Eric Sauser, Shugao Ma, Alex Alahi, Edoardo Remelli
Title: OSMO: Open-vocabulary Self-eMOtion Tracking
Abstract: We introduce the novel task of egocentric self-emotion tracking, which aims to infer an individual's evolving emotions from egocentric multimodal streams such as voice, visual surroundings, semantic subtext, and eye-tracking signals. To establish this research direction, we present: (1) OSMO dataset, a large-scale annotation effort on 110 hours of existing bilingual smart-glasses recordings, establishing the largest egocentric emotion dataset and the first with subject-wise emotion timelines; (2) OSMO benchmark, a suite of five tasks (emotion recognition, sentiment, intensity, localization, and reasoning) that redefine emotion understanding as a continuous, context-aware process rather than discrete classification of trimmed videos; (3) OSIRIS, a large multimodal model that tracks emotions over time by reasoning over the user's personal emotion history, current expressions, and egocentric observations. Extensive evaluations show that OSIRIS achieves state-of-the-art performance, delivering, for the first time, coherent emotion timelines from egocentric data. Dataset, model, and code will be fully open-sourced upon publication.
Paperid: 2427,   Poster  
Authors: Shengxi Wu, Sophia Yang, Dorian Chan, Matthew O’Toole
Title: Computational Speckle Pattern Interferometry
Abstract: Visually imperceptible surface deformations encode rich information, from the mechanical properties of an object to the acoustic vibrations present in the surrounding environment. Existing optical techniques reveal these subtle motions by employing coherent illumination and capturing multiple measurements over time. In this paper, we introduce Computational Speckle Pattern Interferometry (CSPI), a novel single-shot approach that estimates per-pixel displacement and motion by leveraging a phasor-based image formation model and an optical-flow-inspired reconstruction algorithm. Our key insight is that the image formation process can be factorized to jointly recover spatial coefficients and temporal dynamics. Unlike traditional interferometric methods, CSPI requires no precision instrumentation to perform phase stepping. We demonstrate its effectiveness by measuring per-pixel displacements and motions at sub-micrometer scales, visualizing high-frequency vibrations of a tuning fork and a Chladni plate, and recovering sound indirectly from these vibrations.
Paperid: 2428,   Poster  
Authors: Jiahao Li, Wenxuan Xie, Zhaoyang Jia, Bin Li, Zongyu Guo, Xiaoyi Zhang, Yan Lu
Title: Ultra-Fast Neural Video Compression
Abstract: While neural video codecs (NVCs) have demonstrated superior compression ratios, their prohibitive computational complexity remains a critical barrier to real-world deployment. This paper introduces a chunk-based coding framework designed to significantly improve the rate-distortion-complexity trade-off. Instead of processing frames sequentially, our approach encodes a chunk of multiple frames into a single compact latent representation and decodes them simultaneously. This is enabled by cross-frame interaction modules for joint spatial-temporal modeling and frame-specific decoders for parallel reconstruction. This paradigm not only dramatically enhances coding throughput but also facilitates more effective modeling of long-term temporal correlations. To further boost speed, we propose a streamlined entropy coding mechanism that consolidates bit-stream interactions into a single step, substantially reducing decoding overhead. Building on these innovations, we present DCVC-UF (Ultra-Fast), a new NVC that sets a new SOTA in performance. Our experiments show that DCVC-UF can achieve ultra-fast encoding and decoding speeds, significantly outperforming previous leading codecs. DCVC-UF serves as a notable landmark in the journey of NVC evolution. Both training and testing code will be released.
Paperid: 2429,   Poster  
Authors: Lixiong Chen, Bohan Yu, Victor Adrian Prisacariu, Imari Sato
Title: Variational Graph-based Normal Integration
Abstract: We present a general optimization-based framework for depth-preserving normal integration. Unlike existing methods that operate on surface orientations defined over regular grids, our approach introduces a unified graph-based formulation capable of integrating semi-differentiable surfaces on unstructured domains. Given a set of points uniformly sampled from a surface, we construct a directed, weighted graph that jointly parameterizes the surface geometry and pairwise point correlations. Surface depth is recovered by minimizing projected point-to-plane distances across the graph, and this objective is optimized through variational inference. In our formulation, estimated surface normals serve as latent variables that encode local geometry via the posterior probabilities of a two-component Gaussian mixture, allowing depth discontinuities to be inferred from sampled triplet configurations. The unknowns are estimated in an alternating fashion, and we provide a geometric interpretation of this inference process by relating it to shape deformation. Experimental results show that the proposed method not only outperforms state-of-the-art techniques on regularly gridded data, but also generalizes effectively to scattered points, which existing approaches do not directly support.
Paperid: 2430,   Poster  
Authors: Jun Myeong Choi, Jae Shin Yoon, Luchao Qi, Roni Sengupta, Joon-Young Lee
Title: Relightful Video Portrait Harmonization
Abstract: We present a method for harmonizing the lighting of a foreground video to match a target background scene, adjusting shadows, color tone, and illumination intensity (relightful harmonization). Unlike images, acquiring labeled data for videos, where identical motions are recorded under different lighting conditions, is practically infeasible and non-scalable. While one way to create such paired data is to apply existing image-based harmonization models frame by frame to a video, the resulting outputs often suffer from significant temporal jitter. We overcome this problem by introducing a novel lighting deflickering model that can stabilize the global and local lighting flickering artifacts. Our video diffusion model learns from these upgraded deflickered data with a volume of real and synthetic videos to generate high-quality video harmonization results. We further propose an asymmetric alpha mask conditioning technique to learn the clean boundaries from real videos. Experiments demonstrate that our model achieves strong temporal coherence, naturalness, cleaner boundaries, and physically meaningful lighting behavior, while maintaining strong relighting expressiveness compared to prior image-based and video-based harmonization methods.
Paperid: 2431,   Poster  
Authors: Jaesung Rim, Woohyeok Kim, Haeyun Lee, Heemin Yang, Ke Wang, Sunghyun Cho
Title: Gyro-based Deep Video Deblurring
Abstract: Modern cameras, such as smartphone cameras and DSLRs, are equipped with gyro sensors that measure motion of the camera. While the motion information is valuable for deblurring, gyro-based deblurring has not been widely studied, particularly for video. A few gyro-based video deblurring methods have been proposed, but they exhibit inherent limitations. First, gyro sensors capture only rotational motion, leading these methods to ignore translational motion. Second, their dependence on simplified blur models and deconvolution-based solutions restricts overall performance. To address these limitations, we introduce GyroDVD, the first learning-based framework for gyro-based video deblurring. We propose a novel blur kernel construction scheme that jointly accounts for rotational and translational motion. A video deblurring network then restores sharp videos by exploiting the constructed kernels together with the video frames. For training and evaluation, we introduce the GyroVD dataset, a large-scale and realistic dataset specifically designed for gyro-based deblurring. Extensive experiments demonstrate that our method significantly outperforms prior gyro-based image and video deblurring methods. Code and dataset will be made publicly available on our project page.
Paperid: 2432,   Poster  
Authors: Yuan Li, Yuanbo Xiangli, Hadar Averbuch-Elor, Noah Snavely, Ruojin Cai
Title: Long-Tail Internet Photo Reconstruction
Abstract: Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed, while most real-world sites contain only sparse, noisy, and uneven imagery that defeats classical and learned 3D methods. Existing 3D foundation models generalize well to curated datasets but collapse under the sparsity, ambiguity, and irregularity of Internet photos. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large-scale, clean, and depth-refined dataset, together with a sparse-aware sampling strategy that mimics camera distributions in long-tail scenes. Finetuning 3D foundation models with these components yields robust reconstructions under extreme sparsity, demonstrating emergent symmetry disambiguation while preserving generalization to standard 3D benchmarks.
Paperid: 2433,   Poster  
Authors: Sen Jia, Huayu Wang, Hsiang-Wei Huang, Zhaochong An, Jenq-Neng Hwang, Zhang Huaping, Lei Li
Title: CLEP: Contrastive Language-Pose Pretraining
Abstract: Aligning natural language descriptions with precise 3D human poses remains a major challenge due to the scarcity of effective pose representation mechanisms and large-scale, semantically rich datasets. To overcome these limitations, we first introduce CLEP-2M, the largest 3D pose-language dataset to date, comprising two million high-quality 3D pose-language pairs. This dataset provides a 20-fold increase in scale and far richer semantic diversity than existing benchmarks. Second, we propose CLEP, a novel contrastive pretraining framework. The core of CLEP is HierFormer, a hierarchical pose encoder specifically designed for language alignment. Its key innovation is a Cross-Scale Attention Fusion (CSAF) mechanism that dynamically integrates features from the joint, limb, and body levels. This enables CLEP to precisely align complex, multi-scale text descriptions with the pose representation. Extensive experimental evaluations on CLEP-2M and PoseScript demonstrate that our method consistently outperforms existing approaches across a range of downstream tasks. CLEP shows exceptional zero-shot generalization, achieving 34.8 mRecall on the human-annotated PoseScript-H benchmark, a nearly 6-fold improvement over the baseline. Furthermore, CLEP demonstrates superior performance on pose generation and fine-grained pose editing. These results establish CLEP as a strong multimodal foundation model for human-centric understanding and generation tasks.
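Editor's illustration: the abstract above names contrastive pretraining between pose and text embeddings but does not state the loss. The sketch below shows a generic symmetric InfoNCE objective of the kind commonly used for paired-modality alignment; it is not the CLEP implementation. The function names, the temperature of 0.07, and the toy embeddings are all assumptions made for illustration.

```python
import math

def _normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def _cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(pose_emb, text_emb, temperature=0.07):
    # Hypothetical helper: pose_emb[i] and text_emb[i] are assumed to be a matched pair.
    p = [_normalize(v) for v in pose_emb]
    t = [_normalize(v) for v in text_emb]
    logits = [[sum(a * b for a, b in zip(pi, tj)) / temperature for tj in t] for pi in p]
    transposed = [list(col) for col in zip(*logits)]
    # Average the pose-to-text and text-to-pose directions.
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))

# Toy data: two pose embeddings with matched and deliberately swapped text embeddings.
pose = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[2.0, 0.0], [0.0, 3.0]]
text_swapped = [[0.0, 3.0], [2.0, 0.0]]
loss_matched = symmetric_contrastive_loss(pose, text_matched)
loss_swapped = symmetric_contrastive_loss(pose, text_swapped)
```

In this toy case the loss is near zero when each pose is paired with its own description and large when the pairs are swapped, which is the behavior a contrastive objective is meant to reward.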
Paperid: 2434,   Poster  
Authors: Yi Ding, Qi Tao, Xingxing Liang, Longfei Zhang, Yiqin Lv, Weitao Song, Fangjie Yang, Qi Wang, Guangquan Cheng
Title: Neural Mixture Density Processes
Abstract: The neural process (NP) is a probabilistic meta-learning model that learns distributions over functions via a global latent variable. It enables fast adaptation in few-shot scenarios by leveraging past experience. However, the design of latent variable structures and conditioning mechanisms in NPs remains underexplored, despite their importance in capturing diverse functional distributions. This paper proposes a new variant of NPs via mixture density modeling, referred to as the neural mixture density process (NMDP). The NMDP decomposes model parameters into task-agnostic and task-specific components to represent function distributions more flexibly. We train the model via the Expectation–Maximization algorithm to construct expressive functional priors. Compared with existing work, our method offers several advantages: (i) less overfitting by updating a small part of the network parameters, (ii) compact task representation via distributions in the simplex, and (iii) an improvement guarantee of generative likelihoods over iterations. Experimental results show that our method can achieve competitive performance with adequate explainability.
Paperid: 2435,   Poster  
Authors: Dongyu Du, Mingkun Zhao, Yutong Yang, Dominik Scheuble, Xiaolong Huang, Zijian Shao, Mario Bijelic, Kaushik Sengupta, Felix Heide
Title: X-band Radar Non-Line-of-Sight Imaging
Abstract: Conventional imaging systems capture objects visible in the direct line-of-sight (LOS). A decade of research on non-line-of-sight (NLOS) imaging approaches has made it possible to reconstruct hidden geometry outside the line of sight by analyzing indirect light transport. However, most existing methods operate in the optical visible or IR range. Relying on diffuse inter-reflections, every bounce incurs a quadratic intensity falloff. As such, with illumination power limited by eye-safety limits, existing methods are fundamentally restricted to short ranges on the order of a few meters. We propose an X-band radar-based NLOS imaging method that leverages the long wavelength to convert diffuse reflections into predominantly specular ones, allowing for large-scale hidden-scene perception. We develop a neural reconstruction method that combines a learned dense prediction module and a geometry-aware NLOS reconstruction module, tackling the inherently low spatial resolution of long-wavelength radar. We assess our method using a prototype system and in simulation. Synthetic validation shows that, under the same transmit power, X-band radar achieves 10× longer NLOS reconstruction range than optical systems, while experimental results further demonstrate accurate hidden-object reconstructions up to 40 m, establishing a practical pathway toward real-world long-range NLOS sensing.
Paperid: 2436,   Poster  
Authors: Bingxuan Dai, Hongsong Wang, Jie Gui
Title: Property-Informed Diffusion-Based Text-to-Microstructure Generation
Abstract: Designing 3D metamaterial microstructures that meet the intended functions remains a major challenge, as it typically requires domain expertise, iterative simulations, and extensive manual tuning. Existing work on inverse design that automatically generates microstructures based on desired target properties often suffers from limited design diversity and faces challenges in ensuring the physical feasibility of the generated structures. To address this issue, a property-informed diffusion-based network is proposed that enables the generation of 3D microstructures directly from textual descriptions. Unlike traditional property conditioning methods, our approach leverages rich guidance in terms of semantics and physical properties in the text input to support diverse structure synthesis. To enforce consistency between the generated structures and the target textual prompts, a dual alignment strategy is adopted, including contrastive text-structure alignment and test-time reward-guided alignment. Experimental results show that the model is capable of generating semantically meaningful and physically plausible structures across a wide range of material categories. The proposed framework has good potential for interactive microstructure design and opens up new directions for combining language-based interfaces with inverse material discovery.
Paperid: 2437,   Poster  
Authors: Agastya Kalra, Tim Salzmann, Guy Stoppi, Dmitrii Marin, Rishav Agarwal, Vage Taamazyan, Martin Bokeloh, Stefan Hinterstoisser, Anton Boykov, Alberto Dall'Olio, Pravin Dangol, Kartik Venkataraman, Huaijin Chen
Title: 3D-Object Perception Transformer (3PT)
Abstract: Current approaches to zero-shot 3D-object perception typically rely on ensembles of frozen foundation models. This limits deep object understanding and cross-domain generalization, making performance inadequate for real-world deployment. The 3D-Object Perception Transformer (3PT) addresses this limitation by unifying detection, segmentation, and 6DoF pose estimation in a single framework, directly trained for 3D-object perception. Based on two large-scale trained Transformers that specialize in 2D and 3D object-centric scene understanding respectively, 3PT continuously refines its object representations without depth input, enhancing 3D understanding by incorporating multi-view information. 3PT surpasses task-specialized models for detection and pose estimation, often achieving double-digit percentage improvements on the diverse BOP benchmarks. Achieving high accuracy and robustness, 3PT is well-suited for practical industrial robotics applications such as bin picking and precise insertion.
Paperid: 2438,   Poster  
Authors: Xingfeng Li, Hao Pan, Honglin Yuan, Yuan Sun, Xujian Zhao, Jia-Qi Lin, Zhenwen Ren
Title: Anti-Degradation Lifelong Multi-View Clustering
Abstract: In real-world scenarios, new views are continuously collected over time, forming a dynamic view stream. To handle such evolving data, a lifelong multi-view clustering framework is needed instead of a static model. However, large discrepancies across views make it challenging to learn new knowledge while preserving previously acquired information. A few methods use consistency alignment or knowledge distillation to align new knowledge with old knowledge. However, these strategies cannot fundamentally prevent knowledge degradation, since new knowledge inevitably interferes with the learned representation space. To overcome this limitation, we propose a new Anti-degradation Lifelong Multi-view Clustering (ALMC) framework. Specifically, we propose a null-space-projection knowledge base anti-degradation technique, which ensures that new knowledge updates to the model only occur in directions orthogonal to the retained knowledge, thus preventing catastrophic forgetting of knowledge and degradation of clustering performance; we also provide a theoretical proof of this property. Extensive experiments on multiple multi-view benchmark datasets demonstrate superior performance in multi-view clustering.
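Editor's illustration: the abstract above describes restricting updates to directions orthogonal to retained knowledge. A minimal sketch of that general idea follows, assuming the retained knowledge can be represented as a small set of row vectors; it uses plain Gram-Schmidt and is not the ALMC algorithm. The names `orthonormal_basis`, `project_to_null_space`, `knowledge`, and `update` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(rows, tol=1e-12):
    # Gram-Schmidt over the rows; rows that are (nearly) linearly dependent are dropped.
    basis = []
    for r in rows:
        v = list(r)
        for b in basis:
            c = dot(v, b)
            v = [x - c * y for x, y in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > tol:
            basis.append([x / norm for x in v])
    return basis

def project_to_null_space(update, knowledge_rows):
    # Remove from `update` every component lying in the span of knowledge_rows,
    # leaving only the part orthogonal to the retained directions.
    out = list(update)
    for b in orthonormal_basis(knowledge_rows):
        c = dot(out, b)
        out = [x - c * y for x, y in zip(out, b)]
    return out

# Toy example: two retained directions in a 4-dimensional parameter space (assumed data).
knowledge = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
update = [0.5, -2.0, 3.0, 1.0]
projected = project_to_null_space(update, knowledge)
```

Because the projected update has zero inner product with every retained direction, applying it leaves the response along those directions unchanged, which is the intuition behind using null-space projection against forgetting.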
Paperid: 2439,   Poster  
Authors: Junfu Tan, Jiang Yuan
Title: Low-Rank Residual Diffusion Models
Abstract: Diffusion models have achieved remarkable progress in image generation and restoration. However, most frameworks assume a full-rank residual space, neglecting its inherent low-dimensional structure in near-domain transformations such as deraining and deblurring. We propose the Low-Rank Residual Diffusion Model (LRDM), which performs diffusion within a compact low-rank residual subspace for efficient and structure-preserving restoration. We establish the Low-Rank Residual Assumption, showing that the variational lower bound becomes tighter when residuals lie in a low-rank space. LRDM further introduces an Asymmetric Residual Diffusion Process, constraining the forward process in the low-rank domain while maintaining full-rank flexibility in the reverse direction. An Adaptive Rank Selection mechanism dynamically adjusts the rank across timesteps to capture varying residual complexity. Experiments on deraining, deblurring, and deshading benchmarks show that LRDM surpasses full-rank diffusion baselines and achieves state-of-the-art performance, validating the advantage of modeling diffusion in a low-rank residual space.
Paperid: 2440,   Poster  
Authors: Xiaojie Li, Yang Zhao, Ming Li, Yancheng Zhang, Zonglin Lyu, Yunpeng Chen, Rui Wang, Daquan Zhou
Title: Rethinking the Semantic-based Autoencoder
Abstract: Latent generative modeling has emerged as the dominant paradigm for Diffusion Transformers (DiT), where a pretrained autoencoder compresses image pixels into a latent space to facilitate the diffusion process. Recently, the use of semantic encoders within autoencoders (AEs) has gained attention, yet their influence on image reconstruction and diffusion model training remains insufficiently explored. In this study, we perform an in-depth examination of how semantic encoders shape latent representation learning for the autoencoders. Our findings reveal a fundamental trade-off: while semantic encoders generate latent spaces enriched with visual semantics, their high level of abstraction makes it challenging to capture fine-grained geometric relationships, thereby requiring larger models and longer training for convergence. To address this issue, we build upon recent advances in representation learning that enable the joint modeling of both semantic abstraction and geometric detail. This leads to a Semantic Auto-Encoder (S-AE) that achieves state-of-the-art performance, combining superior reconstruction quality and discriminative capability. Specifically, with S-AE, we are able to provide a unified latent space that achieves 0.06 FID for image reconstruction and 81.9% classification accuracy on ImageNet, setting a new state of the art. Code and model weights will be made publicly available.
Paperid: 2441,   Poster  
Authors: Zilong Deng, Federico Tombari, Marc Pollefeys, Johanna Wald, Daniel Barath
Title: OVI-MAP: Open-Vocabulary Instance-Semantic Mapping
Abstract: Incremental open-vocabulary 3D instance-semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real-time processing, and flexible open-set reasoning. Existing methods often rely on the closed-set assumption or dense per-pixel language fusion, which limits scalability and temporal consistency. We introduce OVI-MAP, which decouples instance reconstruction from semantic inference. We propose to build a class-agnostic 3D instance map that is incrementally constructed from RGB-D input, while semantic features are extracted only from a small set of automatically selected views using vision-language models. This design enables stable instance tracking and zero-shot semantic labeling throughout online exploration. Our system operates in real time and outperforms state-of-the-art open-vocabulary mapping baselines on standard benchmarks. The source code will be made publicly available.
Paperid: 2442,   Poster  
Authors: Yutong Chen, Yiming Wang, Xucong Zhang, Sergey Prokudin, Siyu Tang
Title: GGPT: Geometry-Grounded Point Transformer
Abstract: Recent feed-forward networks have achieved remarkable progress in sparse-view 3D reconstruction by predicting dense point maps directly from RGB images. However, they often suffer from geometric inconsistencies and limited fine-grained accuracy due to the absence of explicit multi-view constraints. We introduce the Geometry-Grounded Point Transformer (GGPT), a framework that augments feed-forward reconstruction with reliable sparse geometric guidance. We first propose an improved Structure-from-Motion pipeline based on dense feature matching and lightweight geometric optimisation to efficiently estimate accurate camera poses and partial 3D point clouds from sparse input views. Building on this foundation, we propose a geometry-guided 3D point transformer that refines dense point maps under explicit sparse-geometry supervision using an optimised guidance encoding. Extensive experiments demonstrate that our method provides a principled mechanism for integrating geometric priors with dense feed-forward predictions, producing reconstructions that are both geometrically consistent and spatially complete, recovering fine structures and filling gaps in textureless areas. Trained solely on ScanNet++ with VGGT predictions, GGPT generalises across architectures and datasets, substantially outperforming state-of-the-art feed-forward 3D reconstruction models in both in-domain and out-of-domain settings. All code and pre-trained models will be released publicly.
Paperid: 2443,   Poster  
Authors: Haoru Tan, Wang Wang, WU Sitong, Xiuzhe Wu, Yangtian Sun, Chirui Chang, Shaofeng Zhang, Xiaojuan Qi
Title: Dataset Distillation via Influence Matching
Abstract: We revisit dataset distillation from an outcome-centric perspective. Rather than aligning process surrogates (per-step gradients or training trajectories), Influence Matching (Inf-Match) aligns the final outcome of training: it learns a compact synthetic set whose effect on the converged parameters matches that of the full dataset. Concretely, we introduce a fully differentiable, sample-level influence estimator that quantifies parameter shifts from adding or removing data, without time-consuming inverse-Hessian products or convexity assumptions. The estimator runs in linear time by unrolling the optimization dynamics and applying a first-order Taylor approximation. We then learn the synthetic set by minimizing the mismatch between its influence and that of the real dataset, yielding outcome alignment rather than heuristic process imitation. Inf-Match delivers the best accuracy across standard classification benchmarks. For instance, on Tiny-ImageNet (IPC=10), Inf-Match attains 31.5%, a +4.7% improvement over NCFM. Beyond classification, Inf-Match scales to vision-language distillation on Flickr30K, outperforming strong process-matching baselines. For instance, with 200 to 1000 synthetic samples, our method achieves the leading average on image/text retrieval tasks, 2.5% higher than NCFM.
Paperid: 2444,   Poster  
Authors: Beining Han, Yu-Wei Chao, Erwin Coumans, Clemens Eppner, Jia Deng, Stan Birchfield, Adithya Murali
Title: GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping
Abstract: We study cross-embodiment 6-DOF robot grasping. Unlike prior works, we require the model not only to generalize to novel objects and scenes but also to novel gripper morphologies and physical grasping processes. Our method extends diffusion-model-based generative 6-DOF grasping models to condition on an additional gripper representation. We propose a swept-volume heuristic for encoding the gripper. We train our cross-embodiment model with procedural grippers and a large-scale dataset of 395 million grasps. In simulation experiments, our model has the best zero-shot generalization to novel real-world grippers and objects over baseline methods. Our model also serves as a good initialization for fine-tuning to adapt to novel grippers. In ablations, we demonstrate the efficiency of our swept-volume gripper representation and our procedural gripper training dataset. Finally, we show zero-shot generalization to real-world novel grippers for 6-DOF grasping, surpassing baselines in cross-embodiment generalization.
Paperid: 2445,   Poster  
Authors: Ashish Kumar, A. N. Rajagopalan
Title: $L^{2}DGS$: Low-Light Dynamic Gaussian Splatting
Abstract: Synthesizing novel spatiotemporal views of dynamic scenes is inherently challenging due to both object and camera motion, as well as sparsity of observations. Recent advances in Neural Radiance Fields (NeRFs) and Gaussian Splatting (GS) have enabled 4D dynamic scene reconstruction, but predominantly from well-lit images or videos. Some works address the problem of reconstructing a well-lit scene from low-light input, but these are limited to static scenes. Moreover, prior methods primarily emphasize improving illumination, while overlooking the underlying scene characteristics. Reconstructing well-lit dynamic scenes from inputs captured under low-light conditions is particularly challenging due to shadows, occlusions, and disocclusions caused by object motion, which makes the problem highly ambiguous and ill-posed. We propose L^2DGS (Low-Light Dynamic Gaussian Splatting), a self-supervised 4D GS framework for directly reconstructing well-lit dynamic scenes from low-light inputs. The proposed method decomposes each scene into two complementary components: illumination, which varies across both view and time, and reflectance, which remains invariant to these factors. To achieve this, we introduce several key innovations. First, the proposed Occlusion-Disocclusion Network (OCD-Net) models time-varying intensity across frames. Next, we propose Brightness Attenuation Features (BAFs), which, when complemented by the BAF Enhancement Network (BAFE-Net), enable geometry- and photometry-aware transformation between well-lit and low-light scenes for self-supervision. Together, these components allow L^2DGS to maximize signal strength and suppress noise inherent in low-light inputs, leading to enhanced spatial fidelity and temporal consistency under challenging illumination conditions. Our method operates on standard sRGB inputs without requiring camera metadata (e.g., exposure settings), ensuring compatibility with consumer-grade imaging devices. We evaluate L^2DGS on both simulated and real-world Low-Light Dynamic Video (L^2DyV) datasets, demonstrating superior qualitative and quantitative performance.
Paperid: 2446,   Poster  
Authors: ZiHao Xu, Dawei Xu, Zihan Li, Xixi Zheng, Chuan Zhang
Title: GVIS: Generative Vector Image Steganography
Abstract: Vector images have attracted increasing attention in the field of information hiding in recent years due to their scalability and structural properties. However, existing steganographic methods for vector images often introduce noticeable modifications to the files themselves, resulting in potential security risks and limited embedding capacity. Motivated by recent advances in diffusion models and image generative steganography, we propose GVIS, a novel Generative Vector Image Steganography framework. GVIS deterministically generates bitmap images using diffusion models, which are subsequently vectorized into scalable vector images. On the sender side, we design a lightweight overlap detection algorithm to identify Bézier curve control points suitable for data embedding, which enables the secret information to be encoded into the polar coordinate parameters of these control points. Then, the receiver can use the pre-shared conditional inputs to reconstruct the generation process and accurately extract the message via vector differences. Extensive theoretical analysis and experimental results demonstrate that GVIS achieves high-capacity, high-accuracy, secure, and training-free steganography. To the best of our knowledge, this is the first work to introduce generative models into the domain of vector image steganography.
Paperid: 2447,   Poster  
Authors: Christoph Reich, Oliver Hahn, Nikita Araslanov, Laura Leal-Taixe, Christian Rupprecht, Daniel Cremers, Stefan Roth
Title: Scene-Centric Unsupervised Video Panoptic Segmentation
Abstract: Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, omitting any human supervision. Existing unsupervised scene understanding works have mainly focused on image segmentation tasks; the video domain remains underexplored. We propose CUViPS, the first unsupervised VPS approach. CUViPS generates temporally consistent panoptic video pseudo-labels from monocular scene-centric videos by exploiting unsupervised depth, motion, and visual cues. Training on these pseudo-labels using a novel Video DropLoss yields an accurate and unsupervised VPS model. To benchmark progress, we introduce a comprehensive evaluation protocol and four competitive baselines, extending state-of-the-art unsupervised panoptic image and instance video segmentation models to VPS. CUViPS consistently outperforms all baselines and demonstrates strong label-efficient learning. With CUViPS, our evaluation protocol, and baselines, we provide a strong foundation for future research on unsupervised VPS.
Paperid: 2448,   Poster  
Authors: Kaiyuan Ji, Yixuan Gao, Lu Sun, Yushuo Zheng, Zijian Chen, Jianbo Zhang, Xiangyang Zhu, Yuan Tian, Zicheng Zhang, Guangtao Zhai
Title: A³: Towards Advertising Aesthetic Assessment
Abstract: Advertising images significantly impact commercial conversion rates and brand equity, yet current evaluation methods rely on subjective judgments, lacking scalability, standardized criteria, and interpretability. To address these challenges, we present A³ (Advertising Aesthetic Assessment), a comprehensive framework encompassing four components: a paradigm (A³-Law), a dataset (A³-Dataset), a multimodal large language model (A³-Align), and a benchmark (A³-Bench). Central to A³ is a theory-driven paradigm, A³-Law, comprising three hierarchical stages: (1) Perceptual Attention, evaluating perceptual image signals for their ability to attract attention; (2) Formal Interest, assessing formal composition of image color and spatial layout in evoking interest; and (3) Desire Impact, measuring desire evocation from images and their persuasive impact. Building on A³-Law, we construct A³-Dataset with 120K instruction-response pairs from 30K advertising images, each richly annotated with multi-dimensional labels and Chain-of-Thought (CoT) rationales. We further develop A³-Align, trained under A³-Law with CoT-guided learning on A³-Dataset. Extensive experiments on A³-Bench demonstrate that A³-Align achieves superior alignment with A³-Law compared to existing models, and this alignment generalizes well to quality advertisement selection and prescriptive advertisement critique, indicating its potential for broader deployment.
Paperid: 2449,   Poster  
Authors: Chenfan Qu, Yiwu Zhong, Xuekang Zhu, Junchi Li, Changjiang Jiang, Jian Liu, Lianwen Jin
Title: Detect Any AI-Counterfeited Text Image
Abstract: The rapid advancement of generative AI enables the creation of highly realistic text images, posing significant security risks from fraud and disinformation. However, research into robust detection is critically hampered by existing datasets that lack scale, diversity, and updated counterfeit techniques, as well as by models that fail to generalize. To address these deficiencies, we introduce DanceText, a large-scale, comprehensive dataset for AI-counterfeited text image detection. Constructed using our novel Creative Proposer pipeline, which automates the generation of diverse and realistic counterfeits, DanceText surpasses previous works by over 100-fold in multiple dimensions. It is the first to include counterfeits from multimodal large models, commercial software, and mobile apps, covering all major paradigms, including full-image generation, regional removal, and editing. Building on this dataset, we propose DS-Net, a novel and effective detection model. It features two key components: a Forensic Decoupling Encoder to extract generator-agnostic artifact features, and a Synergy Denoising Decoder that synergizes image-level classification with instance-level localization. Extensive experiments demonstrate that DS-Net achieves state-of-the-art performance, advancing the field toward robust and unified detection of AI-counterfeited text images in real-world scenarios. Both our code and dataset will be released publicly.
Paperid: 2450,   Poster  
Authors: Sohyun Lee, Yeho Gwon, Lukas Hoyer, Konrad Schindler, Christos Sakaridis, Suha Kwak
Title: Robust Promptable Video Object Segmentation
Abstract: The performance of promptable video object segmentation (PVOS) models substantially degrades under input corruptions, which prevents PVOS deployment in safety-critical domains. This paper offers the first comprehensive study on robust PVOS (RobustPVOS). We first construct a new, comprehensive benchmark with two real-world evaluation datasets of 351 video clips and more than 2,500 object masks under real-world adverse conditions. At the same time, we generate synthetic training data by applying diverse and temporally varying corruptions to existing VOS datasets. Moreover, we present a new RobustPVOS method, dubbed Memory-object-conditioned Gated-rank Adaptation (MoGA). The key to successfully performing RobustPVOS is two-fold: effectively handling object-specific degradation and ensuring temporal consistency in predictions. MoGA leverages object-specific representations maintained in memory across frames to condition the robustification process, which allows the model to handle each tracked object differently in a temporally consistent way. Extensive experiments on our benchmark validate MoGA's efficacy, showing consistent and significant improvements across diverse corruption types on both synthetic and real-world datasets, establishing a strong baseline for future RobustPVOS research. Our benchmark will be made publicly available.
Paperid: 2451,   Poster  
Authors: Weiqi Huang, Shuangyi Dong, Jiaxin Li, Yifei Guo, Zan Wang, Wei Liang
Title: FloVerse: Floor Plan-Guided Multi-Modal Navigation
Abstract: Floor plans encapsulate compact spatial priors, enabling agents to navigate unseen scenes more efficiently. While prior work has explored floor plan–guided navigation, it has focused mainly on PointNav and a limited set of environments. To bridge this gap, we introduce FloVerse, a new task for floor plan–guided embodied navigation that unifies PointNav, ObjectNav, and ImageNav. To support FloVerse, we assemble FloVerse-1.6K, a large-scale dataset of 1.6K scenes from HM3D and Gibson 4+, paired with corresponding floor plans, comprising 240K expert trajectories and 12M RGBD frames. We further propose ThreeDiff, a two-stage imitation learning policy consisting of a planner (a diffusion-based multimodal goal-reasoning module trained via masked-modality modeling) and a refiner (a depth-based trajectory refinement module for safe execution). Extensive experiments show that (1) floor plan priors consistently improve navigation performance across all goal modalities, and (2) ThreeDiff implicitly learns to infer goal locations from diverse goal representations through spatial reasoning. These results highlight the effectiveness of structured spatial priors and our unified approach for floor plan–guided embodied navigation.
Paperid: 2452,   Poster  
Authors: Chen Yin, Xingbo Dong, Xuelin Shen, Zhe Jin
Title: Spectral Mixture-of-Experts for Continual Learning
Abstract: While Parameter-Efficient Fine-Tuning using Mixture-of-Experts (MoE) is a promising solution for continual learning (CL), it suffers from two critical failure modes: structural interference, where expert updates interfere, and compositional forgetting, where the model’s routing policy drifts. To address these issues, we introduce Spectral MoE, a novel framework built for CL from three core components. First, Spectral Experts are parameterized using unique, disjoint spectral masks to confine their learnable parameters to distinct frequency subspaces, ensuring a priori orthogonal updates that prevent structural interference. Second, a Dual-Router mechanism decouples online routing that learns new tasks from an offline memory that archives historical expert importance. Finally, this offline memory enables a Dynamic Consistency Projection, a geometric constraint that suppresses router drift and adaptively shields experts based on their past contributions, mitigating compositional forgetting. Validated on a strict cross-domain CL benchmark, our framework significantly outperforms existing methods, demonstrating superior knowledge retention and plasticity for new tasks. Code will be released upon acceptance.
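Illustrative sketch (not the authors' code): the claim that disjoint spectral masks give orthogonal updates can be checked with Parseval's identity. The abstract does not state which transform or mask layout is used; the snippet below assumes a unitary discrete Fourier transform and two hypothetical masks, `mask_a` and `mask_b`, over eight frequency bins. Complex coefficients are used for simplicity; real-valued weights would need conjugate-symmetric spectra.

```python
import cmath

def inverse_dft(spectrum):
    """Unitary inverse DFT (scaled by 1/sqrt(n)), so inner products are preserved."""
    n = len(spectrum)
    scale = 1.0 / n ** 0.5
    return [
        scale * sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n))
        for t in range(n)
    ]

def inner(u, v):
    """Complex inner product <u, v>."""
    return sum(a * b.conjugate() for a, b in zip(u, v))

def masked_update(coeffs, mask):
    """Zero every frequency coefficient outside the expert's mask."""
    return [c if m else 0j for c, m in zip(coeffs, mask)]

# Hypothetical disjoint masks over 8 frequency bins (assumption, for illustration).
mask_a = [1, 1, 1, 0, 0, 0, 0, 0]
mask_b = [0, 0, 0, 1, 1, 0, 0, 0]

# Arbitrary coefficients standing in for two experts' learned updates.
coeffs_a = [complex(k + 1, -k) for k in range(8)]
coeffs_b = [complex(2 * k, k + 3) for k in range(8)]

update_a = inverse_dft(masked_update(coeffs_a, mask_a))
update_b = inverse_dft(masked_update(coeffs_b, mask_b))

# Disjoint frequency support implies a (numerically) zero inner product.
overlap = abs(inner(update_a, update_b))
```

Because the transform is unitary, the inner product of the two updates equals the inner product of their masked spectra, which vanishes whenever the masks share no bin.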
Paperid: 2453,   Poster  
Authors: Miaoge Li, Yang Chen, Zhijie Rao, Can Jiang, Kang Wei, Jingcai Guo
Title: SG-LoRA: Semantic-guided LoRA Parameters Generation
Abstract: Generating new Low-Rank Adaptation (LoRA) weights from pre-trained LoRAs has demonstrated strong generalization capabilities across various tasks, enabling the efficient transfer of AI models, particularly on resource-constrained edges. However, previous studies either merge base LoRAs via weighting coefficients or train a generative model under the closed-world assumption, limiting their efficiency and flexibility in complex edge user cases. This challenge may further increase when there are significant domain shifts between training and deployment. To this end, we propose Semantic-Guided LoRA Parameter Generation (SG-LoRA), a tuning-free generative framework to efficiently produce task-specific parameters for unseen tasks in a semantic-to-LoRA pipeline. Concretely, SG-LoRA uses task descriptions as the semantic bridge, measuring their proximity to a set of known expert tasks in a shared embedding space. Based on this semantic guidance, it models the target task's LoRA parameter distribution to generate high-performing parameters for novel tasks. SG-LoRA enables the real-time construction of LoRA models aligned with individual intents by distilling knowledge from prominent LoRA experts, while also offering a privacy-preserving solution for personalized model adaptation in a novel zero-shot open-world setting proposed in this work. Extensive experiments on multiple challenging tasks confirm the superior performance and remarkable adaptability of SG-LoRA. The code is attached in the supplementary material.
Paperid: 2454,   Poster  
Authors: Yunbei Zhang, Chengyi Cai, Feng Liu, Jihun Hamm
Title: Alternative Reprogramming for Service Models
Abstract: Adapting closed-box service models (i.e., APIs) for target tasks typically relies on reprogramming via Zeroth-Order Optimization (ZOO). However, this standard strategy is known for extensive, costly API calls and often suffers from slow, unstable optimization. Furthermore, we observe that this paradigm faces new challenges with modern APIs (e.g., GPT-4o). These models can be less sensitive to the input perturbations ZOO relies on, thereby hindering performance gains. To address these limitations, we propose an Alternative efficient Reprogramming approach for Service models (AReS). Instead of direct, continuous closed-box optimization, AReS initiates a single-pass interaction with the service API to prime an amenable local pre-trained encoder. This priming stage trains only a lightweight layer on top of the local encoder, making it highly receptive to the subsequent glass-box (white-box) reprogramming stage performed directly on the local model. Consequently, all subsequent adaptation and inference rely solely on this local proxy, eliminating all further API costs. Experiments demonstrate AReS's effectiveness where prior ZOO-based methods struggle: on GPT-4o, AReS achieves a +27.8% gain over the zero-shot baseline, a task where ZOO-based methods provide little to no improvement. Broadly, across ten diverse datasets, AReS outperforms state-of-the-art methods (+2.5% for VLMs, +15.6% for standard VMs) while reducing API calls by over 99.99%. AReS thus provides a robust and practical solution for adapting modern closed-box models.
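Illustrative sketch (generic ZOO baseline, not AReS itself): the API-call cost mentioned above comes from the fact that a zeroth-order gradient estimate needs model queries for every probing direction. The snippet shows a standard two-point random-direction estimator; `toy_loss` is a hypothetical stand-in for a closed-box service, and `num_directions` and `mu` are illustrative values.

```python
import random

def zoo_gradient(loss_fn, x, num_directions=20, mu=1e-3, seed=0):
    """Two-point random-direction gradient estimate.

    Returns (estimated_gradient, number_of_loss_queries).
    """
    rng = random.Random(seed)
    dim = len(x)
    grad = [0.0] * dim
    queries = 0
    for _ in range(num_directions):
        # Random Gaussian probing direction.
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Each direction costs two black-box queries.
        diff = (loss_fn(x_plus) - loss_fn(x_minus)) / (2.0 * mu)
        queries += 2
        for i in range(dim):
            grad[i] += diff * u[i] / num_directions
    return grad, queries

def toy_loss(x):
    """Hypothetical black-box objective standing in for a service API."""
    return sum((xi - 1.0) ** 2 for xi in x)

grad, queries = zoo_gradient(toy_loss, [0.0, 0.0, 0.0], num_directions=200)
```

A single estimate here already uses 400 queries for a three-dimensional input, and an optimizer repeats this at every step, which is why query counts grow quickly for closed-box reprogramming.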
Paperid: 2455,   Poster  
Authors: Tong Lin, Yifan Bai, Shiyi Liang, Ruigang Niu, Xing Wei
Title: Adaptive Capacity Autoregressive Visual Tracking
Abstract: We present ARTrack-AC, a new step in the autoregressive tracking paradigm that introduces adaptive capacity inference to achieve both temporal consistency and dynamic efficiency. While existing autoregressive trackers predict object states sequentially with fixed inference capacity, they fail to accommodate the fluctuating temporal difficulty of real videos. ARTrack-AC addresses this limitation by equipping the tracker with the ability to modulate its inference capacity over time. A diffusion-based difficulty estimator anticipates the stability of upcoming segments, guiding a controller to switch between an accurate (high-capacity) and an efficient (low-capacity) mode while maintaining autoregressive consistency. This system-level autoregression extends conventional sequence modeling beyond “what to predict” toward “how to predict,” forming a self-regulated tracking process that aligns inference cost with temporal complexity. Despite its simplicity, ARTrack-AC achieves a state-of-the-art accuracy–speed trade-off on major benchmarks (66.7% AUC on LaSOT and 47.5% AUC on LaSOT_ext) while running 2.9× faster than its predecessor.
Paperid: 2456,   Poster  
Authors: Hani Alomari, Ali Asgarov, Chris Thomas
Title: Lenses: Toward Polysemous Vision–Language Understanding
Abstract: Most vision-language models assume images have a single literal meaning, even though images are polysemous. We propose a retrieval paradigm that models many-to-many relationships between images and text using interpretive lenses and introduce Lenses, a multi-prompt embedding model and dataset for polysemous image-text retrieval. The Lenses dataset contains 105,669 images and 732,405 captions, with each image paired with multiple captions and image-side prompts annotated across five categories: Literal, Figurative, Emotional, Abstract, and Background. Building on a multimodal large language model, the Lenses model uses learned lens tokens to extract lens-specific embeddings for every image and caption and compares these using a lens-masking similarity function with a global fallback that prioritizes same-lens matches while retaining a global pathway. Training uses a category-aware multi-positive contrastive loss and intra-set diversity regularization to align corresponding perspectives while preventing semantic collapse across lenses. We further propose lens-aware evaluation protocols, including category-aware ranking, that better reflect how humans match images and text. Experiments on the Lenses dataset and public benchmarks show that our model outperforms baselines on literal and non-literal retrieval and reduces over-reliance on literal cues.
Paperid: 2457,   Poster  
Authors: Linfei Pan, Johannes Schönberger, Marc Pollefeys
Title: Global Structure-from-Motion Meets Feedforward Reconstruction
Abstract: Structure-from-Motion -- the process of simultaneously estimating camera poses and 3D scene structure from a collection of images -- remains a central challenge in computer vision, with many open problems yet to be solved. Recent advances in feedforward 3D reconstruction have made significant strides in overcoming persistent failure cases of classical SfM methods, particularly in scenarios characterized by low texture, limited image overlap, and symmetries. However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, and robustness, and typically fall short of classical methods in standard reconstruction settings. In this work, we systematically analyze these limitations and propose a new state-of-the-art Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods. Extensive experiments over a wide range of reconstruction scenarios demonstrate the benefits of our approach by achieving state-of-the-art results across the board. The implementation of our pipeline will be shared as open-source software.
Paperid: 2458,   Poster  
Authors: Xu Cao, Houze Yang, Vipin Gunda, Zhongyi Zhou, Tianyu Xu, Adarsh Kowdle, Inki Kim, James Rehg
Title: Gaze Target Estimation with Concepts
Abstract: Estimating human gaze targets from images in-the-wild is an important and formidable task. Existing approaches primarily employ brittle, multi-stage pipelines that require explicit inputs, like head bounding boxes and human pose, in order to identify the subject of gaze analysis. As a result, detection errors can cascade and lead to failure. Moreover, these prior works lack the flexibility of specifying the gaze analysis task via natural language prompting, an approach which has been shown to have significant benefits in convenience and scalability for other image analysis tasks. To overcome these limitations, we introduce the Promptable Gaze Target Estimation (PGE) task, a new end-to-end, concept-driven paradigm for gaze analysis. PGE conditions gaze prediction on flexible user text or visual prompts (e.g., "the boy in the red shirt" or "person in point [0.52, 0.48]") to identify a specific subject for gaze analysis. This approach integrates subject localization with gaze estimation, and eliminates the rigid dependency on intermediate analysis stages. We develop a scalable data engine to generate Gaze-Co (Gaze Estimation with Concepts), a dataset and benchmark of 120K high-quality, prompt-annotated image pairs. We also propose AnyGaze, the first model designed for PGE. AnyGaze uses a transformer-based detector to fuse features from frozen encoders and simultaneously solves subject localization, in/out-of-frame presence, and gaze target heatmap estimation. AnyGaze achieves state-of-the-art performance on multiple PGE benchmarks, setting a strong baseline for this new problem even on a difficult out-of-domain, real-world clinical dataset. We will open-source the AnyGaze model and the Gaze-Co benchmark.
Paperid: 2459,   Poster  
Authors: gao ya, Shihao Li, ZhaoJun Liu, AIHUA ZHENG, Chenglong Li, Jin Tang
Title: Chain-of-Thought Guided Multi-Modal Object Re-Identification
Abstract: With the rise of visual-language models, multi-modal ReID retrieves specific targets by integrating different spectra and textual descriptions. Existing methods merely adopt descriptive representation learning for image-text, ignoring the relationships among the intrinsic logical hierarchies of semantic features. Since Chain-of-Thought (CoT) can provide textual logical context and enhance semantic perception in large-model reasoning, we propose CoT-ReID, a CoT-guided framework that injects Multi-modal Large Language Model (MLLM) reasoning into multi-modal ReID. Specifically, we simulate the joint visual-textual logical decision-making of human reasoning, leveraging CoT textual logical reasoning to guide visual feature learning at the early, late, and decision-making levels. At the early level, we embed the semantic reversion of CoT hierarchical reasoning into visual features to calibrate bottom-level features and emphasize visual hierarchical reasoning. Next, we take CoT hierarchical reasoning text as an anchor condition to constrain the consistency of visual cross-modal semantics. Finally, through the hierarchical reasoning process of CoT, we embed logically reasoned text attribute features into multi-modal decision-making, providing logical support for selecting discriminative identity features. By constructing CoT textual benchmarks and our proposed modules, our framework generates more robust multi-modal features in complex scenarios. Comprehensive experiments on four datasets (RGBNT100, MSVR310, WMVeID863, RGBNT201) demonstrate that our method outperforms existing approaches. Code will be released upon acceptance.
Paperid: 2460,   Poster  
Authors: Yipeng Gao, Yunhao Ge, Peilin Cai, Daniel Seita, Laurent Itti
Title: LAM: Language Articulated Object Modelers
Abstract: We introduce LAM, a system that explores the collaboration of large language models and vision-language models to generate articulated objects from text prompts. Our approach differs from previous methods that either rely on input visual structure (e.g., an image) or assemble articulated models from pre-built assets. In contrast, we formulate articulated object generation as a unified code generation task, where geometry and articulations can be co-designed from scratch. Given an input text, LAM coordinates a team of specialized modules to generate code to represent the desired articulated object procedurally. LAM first reasons about the hierarchical structure of parts (links) with Link Designer, then writes code, compiles it, and debugs it with Geometry & Articulation Coders and self-corrects with Geometry & Articulation Checkers. The code serves as a structured and interpretable bridge between individual links, ensuring correct relationships among them. Representing everything with code allows the system to determine appropriate joint types and calculate their exact placements more reliably. Experiments demonstrate the power of leveraging code as a generative medium within an agentic system, showcasing its effectiveness in automatically constructing complex articulated objects.
Paperid: 2461,   Poster  
Authors: Shihua Zhang, Qiuhong Shen, Xinchao Wang
Title: Align Images Before You Generate
Abstract: Multi-image diffusion models can generate images like multi-views or videos to describe static or dynamic scenes, yet texture and structure drift persist, severely undermining the spatiotemporal consistency. Addressing this issue remains challenging, especially without any external geometric or semantic priors during the pure generative inference. In this paper, we introduce CorrAdapter, a plug-and-play adapter that discovers and exploits an innate property of the multi-image diffusion itself, aligning all output images before they are in fact generated. Specifically, CorrAdapter designs a bypass branch for transformer blocks in the multi-image diffusion model, encompassing a native correspondence constructor that builds reliable correspondences from the diffusion model's intermediate features, and an aligned area aggregator that integrates messages from only matching regions to avoid ambiguous information interactions. Given the native correspondences as guidance, CorrAdapter can enhance spatiotemporal consistency without any auxiliary inputs, and remains training-free and baseline-agnostic, which enables it to generalize seamlessly to various generation tasks. Additionally, we provide an optional training scheme to explore further-improved possibilities. Experiments on both static multi-view generation and dynamic video generation show that CorrAdapter consistently improves spatiotemporal consistency and perceptual quality over strong baselines, offering a simple yet versatile drop-in approach to geometrically faithful multi-image diffusion.
Paperid: 2462,   Poster  
Authors: Zhengling Wu, Rongfeng Lu, Quan Chen, Longjian Zeng, Ming Lu, Yaoqi Sun, Yahong Chen, Baofeng Ji, Chenggang Yan
Title: GS-ASM: 2DGS-Supervised Active Stereo Matching
Abstract: Due to the lack of ground truth, existing methods of active stereo matching generally employ fully self-supervised learning to produce precise depth estimates. Although they can achieve promising results, their performance still has a noticeable gap compared with supervised models. To fill this gap, we propose a novel framework that synthesizes proxy labels to enable supervised training of deep active stereo networks without requiring any ground-truth depth. To expand the training data and generate disparity proxy labels, we develop an active 2D Gaussian Splatting (2DGS)-based synthesis method that explicitly models the scene geometry and the projected active pattern. Furthermore, to balance the varying contributions of different supervisions during training, we design a hybrid supervision regularization strategy that dynamically adjusts the loss weights to achieve stable optimization. We also contribute a real-world dataset captured by a handheld RealSense camera, along with our active 2DGS model, which facilitates future research on active stereo matching. Extensive qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art performance on the active stereo matching task. The code and dataset will be publicly released.
Paperid: 2463,   Poster  
Authors: Rui Zhao, Mike Zheng Shou
Title: P-Flow: Prompting Visual Effects Generation
Abstract: Recent advancements in video generation models have significantly improved their ability to follow text prompts. However, the customization of dynamic visual effects, defined as temporally evolving and appearance-driven visual phenomena like object crushing or explosion, remains underexplored. Prior works on motion customization or control mainly focus on low-level motions of the subject or camera, which can be guided using explicit control signals such as motion trajectories. In contrast, dynamic visual effects involve higher-level semantics that are more naturally suited for control via text prompts. However, it is hard and time-consuming for humans to craft a single prompt that accurately specifies these effects, as they require complex temporal reasoning and iterative refinement over time. To address this challenge, we propose P-Flow, a novel training-free framework for customizing dynamic visual effects in video generation without modifying the underlying model. By leveraging the semantic and temporal reasoning capabilities of vision-language models, P-Flow performs test-time prompt optimization, refining prompts based on the discrepancy between the visual effects of the reference video and the generated output. Through iterative refinement, the prompts evolve to better induce the desired dynamic effect in novel scenes. Experiments demonstrate that P-Flow achieves high-fidelity and diverse visual effect customization and outperforms other models on both text-to-video and image-to-video generation tasks.
Paperid: 2464,   Poster  
Authors: SUWAN LEE, Jo Ryeong Yim, Kibaek Park, Dong Kim, Eunhyeuk Kim, Minsup Jeong, Chae Sim, Seokju Lee
Title: LNEM: Lunar Neural Elevation Model
Abstract: High-resolution and high-precision digital elevation models (DEMs) of the lunar surface are essential for landing site selection and lunar geological research. However, traditional stereo matching techniques provide limited representation of 3D scenes, struggling with non-textured regions and extreme light variations. Furthermore, recent lunar neural rendering methods are ill-suited for 3D reconstruction due to their reliance on simple pinhole approximations for pushbroom sensors. These challenges are compounded by inconsistencies introduced during satellite image processing, including geometric misalignment, distributional bias, and labor-intensive handcrafted operations. To address these issues, we introduce the Lunar Neural Elevation Model (LNEM), a volumetric reconstruction method that explicitly incorporates the pushbroom imaging model. A core component of our approach is the Lunar Studio dataset, processed using Rigorous Sensor Models (RSMs) to ensure geometric consistency of multi-orbit Lunar Reconnaissance Orbiter Camera (LROC) Narrow Angle Camera (NAC) and Korea Pathfinder Lunar Orbiter (KPLO) Lunar Terrain Imager (LUTI) images. LNEM integrates this RSM-based pushbroom camera formulation with learned shadow modeling, enabling physically grounded volumetric rendering under challenging lunar illumination. Extensive experiments demonstrate that LNEM achieves geometrically consistent reconstruction and cross-sensor generalization under diverse viewing and lighting conditions, providing a scalable and physically informed alternative to conventional DEM pipelines. To facilitate reproducibility and future lunar research, we release Lunar Studio, its multi-orbit dataset, and the LNEM reconstruction pipeline.
Paperid: 2465,   Poster  
Authors: Martin Nicolas Everaert, Xiruo Liu, Hiroyuki Takeda, Raja Bala, Vivek Yadav, Vidya Narayanan
Title: Visual Grounding for Object Questions
Abstract: Current visual grounding research remains limited for practical applications, because existing techniques primarily focus on direct visual queries (e.g., "find the red car") or reading visible text (e.g., "what is the title of this book?"), rather than supporting general questions about objects (e.g., "how comfortable are these earbuds?"). We introduce the novel problem of Visual Grounding for Object Questions (VGOQ). Unlike previous work that grounds only what is directly visible in images, VGOQ handles open-ended general questions about objects, including concepts such as ease and comfort of use, and aims to identify visual evidence or context that would support an answer. This unexplored problem has immediate practical value, particularly in designing and optimizing product imagery in e-commerce stores. As initial steps toward this challenging task, we develop two automated data generation techniques combining existing models and data, and create two new datasets: ABO-VGOQ and VizWiz-VGOQ. We show that the data can be used to train a lightweight visual grounding model, and evaluate it against state-of-the-art approaches. Our results provide initial evidence that VGOQ represents a meaningful research direction: current SOTA visual grounding performance decreases from 29.2%-52.2% gIoU to 22.6%-37.2% gIoU when questions are rephrased from visual questions (segmentation of the answer) to general object questions (VizWiz-VGOQ, segmentation of visual evidence). On our new ABO-VGOQ dataset, our lightweight model achieves 39.5% gIoU, while current SOTA visual grounding approaches achieve only 12.4%-19.3%.
Paperid: 2466,   Poster  
Authors: Zhiguo Yang, Dongsheng Xu, Ruizhi Zhong, Jiacheng Pi, Xingxing Huang, Wenjie Ruan
Title: Logit-Margin Repulsion for Backdoor Defense
Abstract: Backdoor attacks are an increasingly significant threat to deep neural networks. Recent studies have revealed that model compression, such as quantization and pruning, can be exploited to implant conditional backdoors. These backdoors remain dormant in full-precision models but are activated during the compression, making them highly stealthy and difficult to detect. Traditional defense methods are generally ineffective against such attacks, and defenses designed for conditional backdoors struggle to handle traditional ones. Moreover, most existing approaches fail to generalize to Transformer architectures. To address these challenges, we propose Logit Margin Repulsion (LMR), a universal and architecture-agnostic defense method. LMR uses a small set of clean samples, combining selective cross-entropy with a logit-margin constraint to enlarge the gap between the backdoor class and benign classes. It then applies selective pruning to remove channels associated with backdoor behavior, achieving strong defense without changing the model architecture. Extensive experiments on a wide range of CNNs and Vision Transformers demonstrate that LMR, even with a minimal amount of clean data (0.1%), can effectively mitigate both traditional and conditional backdoor attacks across diverse model architectures.
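Illustrative sketch (an assumed form, not the LMR loss): the abstract does not give the logit-margin constraint, so the snippet below uses a generic hinge penalty that is zero once the best benign logit exceeds the suspected backdoor-class logit by at least `margin`. The function name, the `suspect_class` argument, and the margin value are all hypothetical.

```python
def logit_margin_penalty(logits, suspect_class, margin=5.0):
    """Hinge penalty on one clean sample (assumed form, for illustration).

    The penalty is positive while the suspected class logit is within
    `margin` of, or above, the largest benign-class logit.
    """
    benign_best = max(v for i, v in enumerate(logits) if i != suspect_class)
    gap = benign_best - logits[suspect_class]
    return max(0.0, margin - gap)

# Suspected class 1 dominates: the gap is -7, so the penalty is 5 - (-7) = 12.
penalty_dominant = logit_margin_penalty([2.0, 9.0, 1.0], suspect_class=1)

# A benign class leads by 7, which exceeds the margin, so the penalty is 0.
penalty_suppressed = logit_margin_penalty([8.0, 1.0, 0.5], suspect_class=1)
```

In a training loop such a term would be added to a cross-entropy loss on the clean samples; how LMR selects samples and weights the two terms is not stated in the abstract.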
Paperid: 2467,   Poster  
Authors: Peter Kulits, Cordelia Schmid
Title: BrickNet: Graph-Backed Generative Brick Assembly
Abstract: We train a language model to generate LEGO® brick build sequences. While prior work has been restricted to discrete, voxel-like towers, we consider a much broader set of pieces, encompassing thousands of part types with diverse connection semantics. To enable this, we first collect a large-scale dataset of over 100,000 human-designed LDraw brick objects and scenes. The complexity of our setting makes it challenging to autoregressively assemble structures that satisfy physical constraints. When predicting block pose directly, build sequences quickly become invalid after a small number of steps. Although pieces are placed in 3D space, it is the spatial relationships of the parts which define the whole. With this in mind, we design a graph-based program representation that parametrizes structure through connectivity, improving the physical grounding of generated sequences. To enable future applications, we make our dataset and models available for research purposes. https://kulits.github.io/BrickNet
Paperid: 2468,   Poster  
Authors: An-Chieh Cheng, Yang Fu, Yatai Ji, Ligeng Zhu, Guanqi Zhan, Zhuoyang Zhang, Zhaojing Yang, Song Han, Yao Lu, Pavlo Molchanov, Vidya Nariyambut Murali, Jan Kautz, Xiaolong Wang, Danny Yin, Sifei Liu
Title: Grounded 3D-Aware Spatial Vision-Language Modeling
Abstract: We present GR3D, a spatial vision language model equipped with three complementary grounding capabilities (explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding) within a single framework. GR3D introduces an implicit grounding mechanism that identifies entity mentions during generation and inserts the corresponding region tokens into the text stream, allowing the model to reference visual evidence on the fly when producing spatial chain-of-thought responses. In parallel, a region-prompted monocular 3D grounding design predicts 3D bounding boxes in the camera view from grounded region queries, supported by intrinsic-aware normalization and dense geometric supervision. Together, these grounding capabilities enable GR3D to decompose complex spatial understanding problems into grounded 2D perception followed by 3D inference. GR3D achieves consistent improvements across grounded and non-grounded spatial benchmarks, demonstrating grounding as an effective inductive bias for strengthening spatial understanding in VLMs. These grounding capabilities collectively enhance general spatial understanding beyond the grounding task itself.
Paperid: 2469,   Poster  
Authors: Tingle Li, Siddharth Gururani, Kevin Shih, Gantavya Bhatt, Sang-gil Lee, Zhifeng Kong, Arushi Goel, Gopala Anumanchipalli, Ming-Yu Liu
Title: Benchmarking Single-Factor Physical Video-to-Audio Generation
Abstract: Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this paper, we introduce FlatSounds, a benchmark that audits the physical reasoning of V2A models through: 1) controlled counterfactual pairs in which a single physical factor is varied, and 2) single-video pattern tests that probe internal consistency and directional trends. These settings test whether generated audio correctly reflects specific physical properties and timings. Our evaluation of state-of-the-art models reveals a consistent trade-off: models rely more on text captions than the visual stream to infer physics and semantics. Captions improve physical and semantic accuracy, but paradoxically degrade temporal alignment. Our results highlight the need to move beyond audio quality toward learning physical processes directly from pixels. Finally, we find that our physics-based metrics correlate strongly with human preference tests on our own data.
Paperid: 2470,   Poster  
Authors: Seungeun Lee, SeungJun Moon, Hah Min Lew, Ji-Su Kang, Gyeong-Moon Park
Title: Personalized Audio-driven Whole-body Talking Avatars
Abstract: Prior conversational 3D avatar systems map audio to parametric poses and then render, creating a lossy bottleneck where quantization, retargeting, and tracking errors accumulate. This degrades audio–motion synchronization and suppresses micro-articulations critical for realism (such as bilabial closures, cheek inflation, nasolabial motion, blinks, and fine hand gestures), especially under single-image personalization. We propose an end-to-end framework that builds a full-body, photorealistic 3D conversational avatar from a single image and drives it directly from audio, bypassing intermediate pose prediction. The avatar is modeled as a particle-based deformation field of 3D Gaussian primitives in a canonical space, with an audio-conditioned dynamics module that outputs per-particle trajectories for face, hands, and body, enabling localized high-frequency control with globally coherent motion. A splat-based differentiable renderer preserves identity, texture, and multi-view realism, while feature-level distillation from a large audio-driven video diffusion model and weak supervision from synthetic audio-conditioned clips further improve synchronization and natural expressivity. Joint photometric and temporal objectives shape the audio-conditioned deformation and rendering. Experiments across diverse speakers show improved lip–audio sync, fine facial detail, and conversational gesture naturalness over pose-driven baselines, while preserving identity from a single photo and supporting photorealistic novel-view synthesis.
Paperid: 2471,   Poster  
Authors: Jintae Kim, Keunsoo Ko, Chang-Su Kim
Title: Frequency-domain Manipulation for Face Obfuscation
Abstract: Facial image datasets have become essential resources for various face analysis tasks, but their use raises significant privacy concerns. To address this issue, face obfuscation has emerged as a practical approach to hide identity from humans while retaining cues decipherable by machines. However, existing methods often leave exploitable visual traces, making them vulnerable to reconstruction attacks that restore hidden identity. To overcome this vulnerability, we propose a frequency-domain manipulation framework, called FreM, which adjusts frequency subbands differently to hide identity, retain machine-decipherable cues, and improve robustness against reconstruction attacks. Specifically, FreM first decomposes a facial image into frequency subbands and applies subband-adaptive modulation that regulates information according to the characteristics of each subband. The modulation parameters are then refined to yield a reliable obfuscated result. Extensive experiments across multiple face analysis benchmarks demonstrate that FreM achieves superior obfuscation quality and strong robustness against reconstruction attacks. The source code will be made publicly available.
Paperid: 2472,   Poster  
Authors: Hong Li, Chongjie Ye, Houyuan Chen, Weiqing Xiao, Ziyang Yan, Lixing Xiao, Zhaoxi Chen, Jianfeng Xiang, Shaocong Xu, Xuhui Liu, Yikai Wang, Baochang Zhang, Xiaoguang Han, Jiaolong Yang, Hao Zhao
Title: NeAR: Coupled Neural Asset–Renderer Stack
Abstract: Neural asset authoring and neural rendering have emerged as largely disjoint threads: one generates digital assets using neural networks for traditional graphics pipelines, while the other develops neural renderers that map conventional assets to images. However, the joint design of the asset representation and renderer remains largely unexplored. We argue that coupling them can unlock an end-to-end learnable graphics stack with benefits in fidelity, consistency, and efficiency. In this paper, we explore this possibility with NeAR: a Coupled Neural Asset–Renderer Stack. On the asset side, we build on Trellis-style Structured 3D Latents and introduce a lighting-homogenized neural asset: from a casually lit input, a rectified-flow backbone predicts a Lighting-Homogenized SLAT that encodes geometry and intrinsic material cues in a compact, view-agnostic latent. On the renderer side, we design a lighting-aware neural renderer that uses this neural asset, along with explicit view embeddings and HDR environment maps, to produce lighting-aware renderings in real time. We validate NeAR on four tasks: (1) G-buffer–based forward rendering, (2) random-lit single-image reconstruction, (3) unknown-lit single-image relighting, and (4) novel-view relighting, where our coupled stack surpasses state-of-the-art baselines in quantitative metrics and perceptual quality. We hope this coupled asset-renderer perspective inspires new graphics stacks that view neural assets and renderers as co-designed components instead of independent ones.
Paperid: 2473,   Poster  
Authors: Runduo Han, Hongchen Tan
Title: Cross-Subject EEG-to-Video Reconstruction and Beyond
Abstract: Reconstructing video content from EEG (electroencephalogram) is a research task of significant scientific importance. However, due to inter-subject differences in physiological states and variations in signal acquisition configurations, this task faces the challenge of inconsistent cross-subject generation. To address this, we propose a Subject Adversarial and Mapping Network (SAM-Net). In SAM-Net, we first introduce a Hybrid Region-Temporal (HRT) Encoder to conduct inter-channel semantic interactions guided by brain regions and aggregate temporal semantics across different time scales. Secondly, we propose a Centered-progressive Subject Adversarial (C-SA) Mechanism to gradually narrow the metric distance between different subjects, thereby obtaining a unified and stable semantic representation. Thirdly, we design a New2Source Mapper to align the EEG distribution of new subjects with that of multiple known subjects. Finally, we adopt a keyframe-guided continuous semantic generation paradigm to drive the production of coherent and high-quality videos. Extensive experiments validate the competitive performance of our SAM-Net in cross-subject EEG-to-Video generation tasks, as well as its excellent performance in generation tasks involving new subjects.
Paperid: 2474,   Poster  
Authors: Stefan Spiss, Joey Hieronimy, Marcel Ritter, Matthias Harders
Title: ODGS-SLAM: Omnidirectional Gaussian Splatting SLAM
Abstract: This work presents ODGS-SLAM, an omnidirectional simultaneous localization and mapping (SLAM) system utilizing 3D Gaussian Splatting (3DGS) as the unified representation for tracking and mapping. Thus, it reconstructs scene geometry from panoramic image sequences (RGB or RGBD) via splats while also detecting the camera poses. Such a framework is important to understand the full surroundings, e.g., for augmented reality applications or autonomous systems. We extended existing 3DGS-SLAM methods to handle omnidirectional input by including closed-form gradients for mapping and camera pose estimation, utilizing an equirectangular projection model. To lower the memory footprint, a key frame removal procedure based on graph analysis is proposed, enabling the application to handle larger input sizes. For evaluation, we provide a data set of controlled real-world and synthetic test scenes (indoor and outdoor), employing a custom-developed virtual camera lens. An extensive evaluation shows that, for camera tracking, the proposed method achieves statistically significantly lower ATE RMSE scores compared to a recent omnidirectional SLAM system, as well as other 3DGS-SLAM frameworks, while reaching a similar mapping performance.
Paperid: 2475,   Poster  
Authors: Hongyu Liu, Xuan Wang, Zijian Wu, Yating Wang, Ziyu Wan, Yue Ma, Runtao Liu, Boyao Zhou, Yujun Shen, Qifeng Chen
Title: AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
Abstract: We introduce AvatarPointillist, a novel framework for generating dynamic 4D Gaussian avatars from a single portrait image. At the core of our method is a decoder-only Transformer that autoregressively generates a point cloud for 3D Gaussian Splatting. This sequential approach allows for precise, adaptive construction, dynamically adjusting point density and the total number of points based on the subject's complexity. During point generation, the AR model also jointly predicts per-point binding information, enabling realistic animation. After generation, a dedicated Gaussian decoder converts the points into complete, renderable Gaussian attributes. We demonstrate that conditioning the decoder on the latent features from the AR generator enables effective interaction between stages and markedly improves fidelity. Extensive experiments validate that AvatarPointillist produces high-quality, photorealistic, and controllable avatars. We believe this autoregressive formulation represents a new paradigm for avatar generation, and we will release our code and data to inspire future research.
Paperid: 2476,   Poster  
Authors: Gaoxiang Luo, Frank Cole, Sihang Zhang, Yuxiang Wan, Yulong Lu, Ju Sun
Title: Flow Matching for Multimodal Distributions
Abstract: Visual foundation models play an increasingly important role in the training efficiency of flow-based models by inducing a structured latent space through alignments, distillations, adapters, and even replacements of visual encoders. When a structured latent space improves training efficiency by lowering the complexity of the target (latent) distribution, the efficiency can be further boosted by a data-adaptive multimodal source (noise) distribution that globally shortens the distance to the target (latent) distribution, and a mode-dependent coupling between source and target samples to move probability mass locally. To this end, we propose an efficient source and coupling co-design algorithm termed Mixture-Modeling Flow Matching (MM-FM). Under a linear conditional flow objective and a multimodal target assumption, our theoretical results reveal straighter and shorter sampling trajectories and a smaller Lipschitz constant for learning complexity relative to an isotropic Gaussian with independent coupling. In our ImageNet 256x256 experiments with multimodal DINOv2-B latents, we observe superior convergence and state-of-the-art unconditional generation FID=2.74 with autoguidance in only 80 epochs.
Paperid: 2477,   Poster  
Authors: Chenting Wang, Yuhan Zhu, Yicheng Xu, Jiange Yang, Ziang Yan, Yali Wang, Yi Wang, Limin Wang
Title: InternVideo-Next: Towards World-Understanding Video Models
Abstract: Large-scale video–text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these, we disentangle the traditional encoder–decoder design into an Encoder–Predictor–Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for this world model. First, the conventional linear decoder in pixel MVM forces the predictor’s output latent to be linearly projected to, and thus separable in, pixel space, which conflicts with semantic abstraction. Our Stage 1 proposes a conditional diffusion decoder and injects clean image-level semantic priors to enhance semantics and convergence, thus bridging pixel-level fidelity with high-level semantic abstraction. Stage 2 further learns world knowledge by predicting frozen Stage 1 targets within this space, mitigating shortcut learning. Trained on public, unlabeled videos, InternVideo-Next achieves state-of-the-art results across benchmarks and provides a scalable path toward general video representation learning.
Paperid: 2478,   Poster  
Authors: Fatimaelzahraa Ahmed, Zhihe Lu, Gianni Di Caro, Diram Tabaa, Mohamed Hamdy, Muraam Abdel-Ghani, Abdulaziz Al-Ali, Muhammad Arsalan, Shidin Balakrishnan
Title: Boundary-Responsive Differentiable Gating for Superpixel-Based Segmentation
Abstract: We present BRDG, a boundary-responsive differentiable gating superpixel framework designed to resolve the trade-off between computational efficiency and segmentation precision in surgical scenes. At its core, the architecture is organized into three cooperative agents within a fully differentiable backbone. The Region Creator agent converts dense features into learnable superpixel tokens, jointly learning region descriptors and dense context. The Boundary Detector agent acts as a gating mechanism, estimating boundary confidence from region features to predict where refinement is needed. The refinement agent uses this gate to selectively fuse efficient coarse predictions with a high-fidelity refinement path that restores pixel-level details. To further enhance distinctiveness, an adjacency-boosted contrastive loss mines hard negatives across neighboring regions to improve boundary separation. We evaluate BRDG on three surgical tasks requiring high precision (EndoVis18-parts, EndoVis18-tools, and EndoVis17-tools), as well as general domain benchmarks. Our model improves mIoU by substantial margins (4.5-7.0) over strong pixel-wise baselines while raising Boundary-F1 scores by over 10 points. Under the same hardware (RTX-A6000 Pro), it reaches 150.25 FPS with only 24M parameters. This makes it 10 times faster and 3.5 times smaller than current state-of-the-art models, effectively resolving the critical accuracy–efficiency trade-off in real-time segmentation.
Paperid: 2479,   Poster  
Authors: Sihao Li, Baixi Baixi, Shuohong Xia, Yunyun Yang
Title: MapRoute: Precise-Concept Erasing Mappers via Semantic Routing
Abstract: Contemporary commercial and open-source diffusion models have demonstrated remarkable performance in text-to-image generation, enabling widespread applications in creative design and content creation. However, legitimate requirements (such as copyright protection, privacy compliance, or personalized customization) often necessitate the removal of specific semantic concepts from pretrained models. Existing concept erasure methods suffer from two critical limitations: (1) Incomplete suppression, where the model still occasionally generates images containing the target concept; (2) Poor semantic selectivity, which degrades the generation quality of unrelated concepts and compromises overall model utility. To address these challenges, we propose MapRoute, a lightweight, semantics-aware concept erasure framework based on dynamic routing. Our approach introduces a set of modular components, termed Mappers, placed after a frozen pretrained text encoder. Each Mapper learns a linear mapping from a target concept to a surrogate concept. During inference, the system dynamically activates the top-K Mappers most relevant to the input prompt, based on cosine similarity between the text embedding and all the target concept embeddings, and applies their transformations sequentially. This input-driven, modular intervention enables precise, on-demand erasure while avoiding unnecessary interference with irrelevant semantics. Extensive experiments demonstrate that MapRoute effectively suppresses specified concepts while significantly reducing collateral damage to unrelated concepts. By operating without full-model fine-tuning, our method entirely avoids parameter drift and concept erosion. Moreover, MapRoute outperforms state-of-the-art baselines in terms of generation fidelity, semantic consistency, and scalability to multi-concept erasure scenarios.
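The routing step described in this abstract (rank Mappers by cosine similarity to the prompt embedding, keep the top K, apply their linear maps in sequence) can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the function names, the toy 2-D embeddings, and the choice to apply the selected maps in descending-similarity order are all assumptions, and any gating that leaves unrelated prompts untouched is not specified in the abstract and is omitted here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def matvec(matrix, vec):
    """Apply a linear map (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def route_and_map(text_emb, concept_embs, mappers, k=1):
    """Select the k mappers whose target concepts are most similar to the
    prompt embedding, then apply their linear maps one after another.
    (Hypothetical helper; ordering by similarity is an assumption.)"""
    ranked = sorted(
        range(len(concept_embs)),
        key=lambda i: cosine(text_emb, concept_embs[i]),
        reverse=True,
    )
    chosen = ranked[:k]
    out = list(text_emb)
    for i in chosen:
        out = matvec(mappers[i], out)
    return chosen, out

# Toy example: concept 0 lies along the x-axis, concept 1 along the y-axis.
concepts = [[1.0, 0.0], [0.0, 1.0]]
swap_axes = [[0.0, 1.0], [1.0, 0.0]]   # stand-in for "target -> surrogate"
identity = [[1.0, 0.0], [0.0, 1.0]]
chosen, mapped = route_and_map([0.9, 0.1], concepts, [swap_axes, identity], k=1)
print(chosen, mapped)  # -> [0] [0.1, 0.9]
```

Because the text encoder stays frozen and only the selected maps touch the embedding, a prompt routed to no relevant Mapper would in principle pass through unchanged; how that case is handled is left open by the abstract.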
Paperid: 2480,   Poster  
Authors: Hongxiang Huang, Hongwei Ren, Xiaopeng Lin, Yulong Huang, Zeke Xie, Bojun Cheng
Title: DIMOS: Disentangling Instance-level Moving Object Segmentation
Abstract: Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. Experimental results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.
Paperid: 2481,   Poster  
Authors: Yankuan Chi, Xiang Li, Zixuan Huang, James Rehg
Title: Vinedresser3D: Towards Agentic Text-guided 3D Editing
Abstract: Text-guided 3D editing aims to modify existing 3D assets using natural-language instructions. Current methods struggle to jointly understand complex prompts, automatically localize edits in 3D, and preserve unedited content. We introduce Vinedresser3D, an agentic framework for high-quality text-guided 3D editing that operates directly in the latent space of a native 3D generative model. Given a 3D asset and an editing prompt, Vinedresser3D uses a multimodal large language model to infer rich descriptions of the original asset, identify the parts to edit and the edit type (addition, modification, deletion), and generate decomposed structural and appearance-level text guidance. The agent then selects an informative view and applies an image editing model to obtain visual guidance. Finally, an inversion-based rectified-flow inpainting pipeline with an interleaved sampling module performs editing in the 3D latent space, enforcing prompt alignment while maintaining 3D coherence and unedited regions. Experiments on diverse 3D edits demonstrate that Vinedresser3D outperforms prior baselines in both automatic metrics and human preference studies, while enabling precise, coherent, and mask-free 3D editing.
Paperid: 2482,   Poster  
Authors: Chen Geng, Guangzhao He, Yue Gao, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu
Title: NeuROK: Generative 4D Neural Object Kinematics
Abstract: Data-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics (realistic temporal deformations of static objects under various physical conditions) remains challenging and often ad hoc despite being critical for building comprehensive 3D world models. Most existing methods assume a predefined physical model and use system identification to estimate parameters, restricting these methods to specific categories and small-scale datasets. We propose that these restrictions can be overcome by learning a data-driven kinematic state parameterization for object-centric physical systems. Specifically, we learn both a latent space of all possible states of the object and a decoder that maps any sampled latent to a plausibly deformed shape of the object. We refer to this parameterization as Neural Object Kinematics (NeuROK), and learn a transformer-based encoder-decoder model on a curated large-scale 4D dataset. This formulation and the learned model significantly simplify the generation of simulative dynamics, since, from the perspective of Lagrangian mechanics in classical physics, we only need to consider the dynamics within a low-dimensional latent space. We demonstrate the effectiveness and generality of this framework across diverse dynamic object types, showing clear advantages over prior works.
Paperid: 2483,   Poster  
Authors: Jinsong Zhang, Ying Qu, Yuan Liao, Hairong Qi, Zhenzhou Shao
Title: Semantic-Adaptive Diffusion for Dynamic Spatiotemporal Fusion
Abstract: Frequent and precise land surface monitoring is critical for numerous applications, yet existing satellites struggle to achieve both simultaneously. Spatiotemporal fusion (STF) tackles this challenge by integrating multiple satellite images to generate data with improved temporal and spatial resolution, enabling more frequent and precise land surface observations. However, current methods often fail to recover dynamic landscape changes due to significant scale discrepancies among multi-source images. To address these challenges, we propose a semantic-adaptive diffusion framework for dynamic spatiotemporal fusion (SA-STF), which constrains the solution space using low-resolution and high-frequency features decoupled via a Taylor-inspired decoder. By incorporating temporal feature alignment and semantic-adaptive fusion modules, SA-STF projects multimodal images with temporal dynamics into a unified latent space, and adaptively enhances spatial details while maintaining the spectral consistency of the reconstructed images. Experiments on benchmark datasets demonstrate that SA-STF outperforms existing methods in both quantitative and qualitative evaluations, particularly in complex and dynamic scenes.
Paperid: 2484,   Poster  
Authors: Saiyang Na, Feng Jiang, Qifeng Zhou, Wenliang Zhong, Thao Dang, Yuzhi Guo, Hehuan Ma, Chunyuan Li, Weizhi An, Junzhou Huang
Title: Hyperbolic Gramian Volumes for Multimodal Alignment
Abstract: Multimodal contrastive learning typically relies on pairwise similarities for alignment, but recent work has shown that Gramian volumes can capture higher-order correlations across modalities. However, Euclidean Gramian volumes suffer from volume collapse under L2 normalization, concentrating near unity with minimal discriminative variance. Hyperbolic geometry's exponential volume growth naturally addresses this via variance preservation, motivating us to extend Gramian alignment to hyperbolic space. Yet preliminary experiments reveal that pure hyperbolic geometry alone is insufficient: while it preserves variance, it underperforms Euclidean baselines on cross-category discrimination. We introduce HyperGRAM, a hybrid geometry framework that combines Euclidean discriminative stability with hyperbolic semantic variance through learnable mixing. Using the numerically stable Lorentz model, HyperGRAM enables volumes to serve dual roles: discriminating matched from mismatched triplets while preserving semantic sensitivity within matched pairs that reflects interpretation spaces (the set of valid multimodal realizations). Evaluation across four video-text benchmarks demonstrates that hybrid geometry consistently outperforms both pure Euclidean and pure hyperbolic variants, achieving significant zero-shot improvements with cross-dataset semantic sensitivity exhibiting contrasting correlation patterns.
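For intuition about the Euclidean quantity this abstract starts from: the Gramian volume of a set of embeddings is the volume of the parallelotope they span, which equals the square root of the determinant of their Gram matrix. The sketch below computes only this Euclidean volume for three L2-normalized vectors. It does not reproduce the hyperbolic (Lorentz-model) variant or the learnable mixing in HyperGRAM, and the helper names are hypothetical.

```python
import math

def normalize(v):
    """L2-normalize a vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def gram_matrix(vectors):
    """Matrix of pairwise inner products G[i][j] = <v_i, v_j>."""
    return [[sum(a * b for a, b in zip(u, w)) for w in vectors] for u in vectors]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def gram_volume(vectors):
    """Volume spanned by three vectors after L2 normalization.
    The determinant is clamped at 0 to guard against rounding error."""
    g = gram_matrix([normalize(v) for v in vectors])
    return math.sqrt(max(det3(g), 0.0))

# Mutually orthogonal embeddings span the largest possible volume (1.0);
# parallel embeddings span none (0.0).
print(gram_volume([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))  # -> 1.0
print(gram_volume([[1, 2, 3], [1, 2, 3], [1, 2, 3]]))  # -> 0.0
```

After normalization the volume is bounded between 0 and 1, so the quantity carries information only through how close the embeddings are to being linearly dependent.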
Paperid: 2485,   Poster  
Authors: Itai Lang, Dongwei Lyu, Dale Decatur, Rana Hanocka
Title: Best Segmentation Buddies for Image-Shape Correspondence
Abstract: Finding correspondences is a fundamental and extensively researched problem in computer vision and graphics. In this work, we examine the underexplored problem of estimating segmentation-to-segmentation correspondence between images in the wild and untextured 3D shapes. This task is highly challenging due to substantial differences in appearance, geometry, and viewpoint. Our approach bridges the cross-modality gap by linking pixels in the image segment to vertices in the corresponding semantic part of the 3D shape. To achieve this, we first distill deep visual features from a 2D vision model onto the 3D shape surface, allowing for the computation of feature similarity between image pixels and shape vertices. We then identify Best Segmentation Buddies, vertices whose most similar image pixel lies within the image segmentation region, enabling the reliable discovery of vertices in semantically corresponding shape parts. Finally, we leverage distilled 3D features from the 2D segmentation model of the image to segment the shape directly in 3D, bootstrapping the correspondence process. We demonstrate the generality and robustness of our approach across a wide range of image-shape pairs, showcasing accurate and semantically meaningful correspondences. Our code will be made publicly available.
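The Best Segmentation Buddies criterion stated in this abstract (a vertex qualifies when its most similar image pixel falls inside the image segment) can be illustrated with a small sketch. It assumes per-vertex and per-pixel feature vectors are already available and uses a plain dot product as the similarity measure; both the similarity choice and the function names are assumptions rather than details from the paper.

```python
def dot(a, b):
    """Inner product used here as a stand-in similarity measure."""
    return sum(x * y for x, y in zip(a, b))

def best_segmentation_buddies(vertex_feats, pixel_feats, in_segment):
    """Return indices of vertices whose most similar pixel is in the segment.

    vertex_feats: list of feature vectors, one per shape vertex
    pixel_feats:  list of feature vectors, one per image pixel
    in_segment:   list of booleans, True if the pixel is inside the 2D mask
    """
    buddies = []
    for v_idx, v_feat in enumerate(vertex_feats):
        # Nearest pixel in feature space (argmax of similarity).
        best_pixel = max(
            range(len(pixel_feats)),
            key=lambda p: dot(v_feat, pixel_feats[p]),
        )
        if in_segment[best_pixel]:
            buddies.append(v_idx)
    return buddies

# Toy example: pixel 0 is inside the mask, pixel 1 is outside.
vertices = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
pixels = [[1.0, 0.0], [0.0, 1.0]]
mask = [True, False]
print(best_segmentation_buddies(vertices, pixels, mask))  # -> [0, 2]
```

Note that the test is one-directional (vertex to pixel), so several vertices may share the same best pixel.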
Paperid: 2486,   Poster  
Authors: Yuze Cai, Jiahao Lu, Hongxiang Shi, Yichao Zhou, Hong Lu
Title: Prototype-Guided Concept Erasure in Diffusion Models
Abstract: Concept erasure is extensively utilized in image generation to prevent text-to-image models from generating undesired content. Existing methods can effectively erase narrow concepts that are specific and concrete, such as distinct intellectual properties (e.g. Pikachu) or recognizable characters (e.g. Elon Musk). However, their performance degrades on broad concepts such as "sexual" or "violent", whose wide scope and multi-faceted nature make them difficult to erase reliably. To overcome this limitation, we exploit the model's intrinsic embedding geometry to identify latent embeddings that encode a given concept. By clustering these embeddings, we derive a set of concept prototypes that summarize the model's internal representations of the concept, and employ them as negative conditioning signals during inference to achieve precise and reliable erasure. Extensive experiments across multiple benchmarks show that our approach achieves substantially more reliable removal of broad concepts while preserving overall image quality, marking a step towards safer and more controllable image generation.
Paperid: 2487,   Poster  
Authors: Junhoo Lee, Mijin Koo, Nojun Kwak
Title: Fingerprinting Diffusion models in the wild
Abstract: Text-to-image models have become commercially valuable assets distributed under restrictive licenses to prevent unauthorized fine-tuning and redistribution, yet violations are only enforceable when detectable. Existing methods require pre-deployment injection, white-box access to model weights, or gray-box access to intermediate activations, capabilities that are unavailable in commercial API deployments. We present Compositional Semantic Fingerprinting (CSF), the first black-box fingerprinting method that attributes fine-tuned models to their base families using only query access to text-to-image generation APIs. CSF abstracts models as semantic category generators, probing them with compositional underspecified prompts that combine individually common components into exponentially rare compositions. Unlike traditional watermarking, this creates an asymmetric advantage: IP owners can cheaply generate novel prompt compositions at any time post-deployment, while attackers face the intractable challenge of anticipating and removing all possible fingerprints during training. We demonstrate this across 6 model families (FLUX, Kandinsky, SD1.5/2.1/3.0/XL) and 13 variants spanning a comprehensive set of scenarios. Our Bayesian attribution framework achieves >50% posterior mean accuracy with 95% credible intervals for all models.
Paperid: 2488,   Poster  
Authors: Chenzhuo Liao, Xin Chen, Bingchen Li, Yu Meng, Tao Yue, Xuemei Hu
Title: LRHDR: Learning Representation-enhanced HDR Video Reconstruction
Abstract: Reconstructing High Dynamic Range (HDR) video from alternately exposed Low Dynamic Range (LDR) frames is challenged by large motion, exposure-induced photometric inconsistency, and information loss in saturated or under-exposed regions. Prior HDR video pipelines typically follow an alignment–reconstruction paradigm, which is limited by the precision of alignment and the performance of the fusion module. We propose a new reconstruction framework called Learning Representation-enhanced HDR Video Reconstruction (LRHDR), which is built around two novel components: an Amalgamated Cross-exposure Consistent Representation (ACCR) network and an Adaptive Pixel-wise Sparse Weighted Fusion (APSWF). The ACCR includes an Exposure-aware Interleaved Context (EIC) encoder and a Representation Mapper (RM). The EIC couples a large-field path with a high-fidelity sub-pixel path and an exposure gate to produce exposure-aware features. The RM avoids geometric warping by mapping features from different exposures into a unified representation via per-pixel, per-channel linear modulation and decoding them into a calibrated linear HDR domain. The APSWF treats fusion as pixel-wise candidate selection, producing sparse weighted masks to form a normalized fusion in the linear HDR domain, thereby suppressing artifacts. Extensive experiments on standard benchmarks demonstrate that our LRHDR outperforms previous methods.
Paperid: 2489,   Poster  
Authors: Chenchen Lin, Wenhao Yuan, Zhengji Xu, Xuehe Wang
Title: Domain-Aware Federated Learning via Fisher-Guided Pruning
Abstract: Federated Learning (FL) serves as a prominent distributed machine learning paradigm, enabling clients to collaboratively train a shared model. However, clients generally possess data from multiple domains, posing significant challenges to model efficiency and generalization. In this paper, we propose FedFIP, a domain-sensitive federated pruning framework that preserves domain-invariant structures and domain-specific representations. First, we design the Domain-Sensitive Fisher Pruning (DSFP) module to estimate channel importance per domain via Fisher information, and upload it to the server to obtain a globally shared pruning mask. Due to domain heterogeneity, each client reuses its Fisher information to selectively reactivate domain-specific channels, yielding personalized sparse models that remain structurally aligned yet adapt to local heterogeneity. To further enhance performance, we adopt a Domain-Sensitive Regularization (DSR) module, in which the server builds domain prototypes from important signals and broadcasts them back. Guided by the prototypes, we introduce a structure-contrastive loss to strengthen intra-domain consistency and inter-domain discriminability. Finally, we propose a structure-aware aggregation algorithm to fuse heterogeneous personalized architectures into a domain-generalized global model. Extensive experiments on multi-domain benchmarks demonstrate that FedFIP surpasses state-of-the-art FL baselines while substantially shrinking model size.
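The pruning flow outlined in this abstract (per-client Fisher-based channel importance, a shared mask built on the server, then client-side reactivation of locally important channels) can be sketched in a few lines. The sketch uses the common empirical Fisher estimate, the mean squared per-sample gradient, and makes several assumptions the abstract does not confirm: the server averages client scores, masks are chosen by top-k, and each client revives a fixed number of pruned channels. All function names are hypothetical.

```python
def fisher_scores(grads_per_sample):
    """Empirical Fisher importance per channel: mean of squared gradients.
    grads_per_sample is a list of per-sample gradients, one value per channel."""
    n = len(grads_per_sample)
    channels = len(grads_per_sample[0])
    return [sum(g[c] ** 2 for g in grads_per_sample) / n for c in range(channels)]

def top_k_mask(scores, k):
    """Boolean mask keeping the k highest-scoring channels."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(scores))]

def global_mask(client_scores, k):
    """Server side: average client scores (an assumed rule), keep the top k."""
    channels = len(client_scores[0])
    mean = [sum(s[c] for s in client_scores) / len(client_scores)
            for c in range(channels)]
    return top_k_mask(mean, k)

def personalize(shared_mask, local_scores, extra):
    """Client side: reactivate up to `extra` pruned channels that matter most
    under the client's own Fisher scores."""
    pruned = [i for i, kept in enumerate(shared_mask) if not kept]
    revive = set(sorted(pruned, key=lambda i: local_scores[i], reverse=True)[:extra])
    return [kept or (i in revive) for i, kept in enumerate(shared_mask)]

client_a = [4.0, 1.0, 0.0, 0.5]
client_b = [4.0, 0.0, 3.0, 0.5]
shared = global_mask([client_a, client_b], k=2)
print(shared)                            # -> [True, False, True, False]
print(personalize(shared, client_a, 1))  # -> [True, True, True, False]
```

In this toy run the clients agree on channel 0, the averaged scores keep channel 2 as well, and client A then restores channel 1 because it matters locally even though it was pruned globally.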
Paperid: 2490,   Poster  
Authors: Zi-Hao Bo, Yaqian Li, Anzhou Hou, rinyoichi takezoe, Ertao Zhao, Tianxiang Pan, Jiale Yan, Mo Guang, Kaiwen Long
Title: Soft Modality-Guided Expert Specialization in MoE-VLMs
Abstract: Mixture-of-Experts (MoE) has become a prevalent backbone for large vision-language models (VLMs), yet how modality-specific signals should guide expert routing remains under-explored. Existing routing strategies are either hand-crafted or modality-agnostic, relying on idealized priors that ignore the layer-dependent modality fusion patterns in MoE-VLMs and provide little guidance for expert specialization. We propose Soft Modality-guided Expert Specialization (SMoES), which consists of dynamic soft modality scores that capture layer-dependent fusion patterns, an expert binning mechanism aligned with expert-parallel deployment, and an inter-bin mutual information regularization that encourages coherent modality specialization. Our method leverages attention-based or Gaussian-statistics modality scores to optimize mutual information regularization. Experiments across four MoE-based VLMs and 16 benchmarks demonstrate improvements in both effectiveness and efficiency: 0.9% and 4.2% average gain on multimodal and language-only tasks, 56.1% reduction in EP communication overhead, and 12.3% throughput improvement under realistic deployment. These results validate that aligning routing with modality-aware expert specialization unlocks MoE-VLM capacity and efficiency.
Paperid: 2491,   Poster  
Authors: Norio Kosaka, Timothy Duff, Tomas Pajdla
Title: Minimal Constraint Relaxation for Multiview Autocalibration
Abstract: Polynomial systems in multiview geometry are often highly overconstrained, and naïve subsampling or elimination can lead to unstable or inconsistent estimation. We revisit this issue through the lens of constraint relaxation: the selective removal of equations to recover a finite and well-conditioned solution space. Focusing on the Kruppa equations for camera autocalibration, we introduce the notion of minimal relaxation, a principled framework for identifying constraint subsets that preserve geometric validity while restoring solvability. Through symbolic analysis of the full three-view Kruppa system, we enumerate and classify all relaxation patterns, revealing algebraically minimal families that yield finite, well-conditioned problems. Comprehensive experiments validate this analysis across symbolic and numerical settings. Using homotopy continuation and synthetic perturbations, we show that specific relaxations remain stable under noise and permutation. Experiments with synthetic and real images further demonstrate that these relaxations consistently outperform the classical SVD-based Kruppa formulation in both robustness and calibration accuracy, establishing algebraic relaxation as a powerful paradigm for stable multiview autocalibration.
Paperid: 2492,   Poster  
Authors: jiacheng yang, Ruichi Zhang, Chikai Shang, Mengke Li, Xinyi Shang, Junlong Gao, Yonggang Zhang, Yang Lu
Title: Decision Boundary-aware Generation for Long-tailed Learning
Abstract: Long-tailed data bias decision boundaries toward head classes and degrade tail class accuracy. Diffusion-based generative augmentation addresses this problem by generating additional data, while head-to-tail transfer further mitigates the generator bias inherited from the long-tailed dataset. However, we show that while head-to-tail transfer helps balance the decision space of the classifier, it also induces latent non-local feature mixing that entangles inter-class features, causing decision boundary overlap and tail class distribution shift. To address this, we first identify the problem of boundary ambiguity and then propose the Decision Boundary-aware Generation (DBG) framework, which promotes near-boundary representation learning by generating informative near-boundary samples. Overall, DBG rebalances the long-tailed dataset while yielding a more separable decision space for long-tailed learning. Across standard long-tailed benchmarks, DBG consistently improves tail class and overall accuracy with less inter-class overlap. The code is available in the supplementary materials.
Paperid: 2493,   Poster  
Authors: zundong Ke, Junlin Chen, Jiayi Zhu, Kuanhao Xia, Jiayuan Gu, boyi zhao
Title: Explicit Recovery Behavior for Diffusion Policies
Abstract: Diffusion policies have emerged as a powerful paradigm for robot learning, but their inherent multimodality can lead to a diverse set of plausible (though not always optimal) actions from a single observation. We posit that for a given task, an optimal action exists within this distribution. Inspired by negative prompting in generative models, we introduce a novel method that leverages an error detector to identify out-of-distribution (OOD) execution histories and uses them to construct negative action prompts. This allows our policy to steer away from suboptimal behaviors and converge towards higher-performance actions. We present a comprehensive ablation study demonstrating the effectiveness of positive and negative prompts, and validate our approach on a suite of simulated benchmarks and real-world robotic tasks. Our results show that the proposed Negative-Prompt-guided Diffusion Policy achieves significant improvement in task performance by effectively filtering undesirable action modes.
Paperid: 2494,   Poster  
Authors: Nikesh Subedi, Loris Bazzani, Ziad Al-Halah
Title: Interactive Episodic Memory with User Feedback
Abstract: Human memory is often unreliable. We forget where we placed objects, overlook small details, and struggle to recall past events accurately. Episodic Memory with Natural Language Query (EM-NLQ) seeks to overcome these limitations by allowing users to search their past visual experiences, captured through egocentric videos, using natural language questions. While recent models focus on addressing challenges in EM-NLQ like noisy input videos and efficiency, they overlook a key aspect of this task: interactivity. In real scenarios, users have the ability to refine their queries and provide feedback when a model's response is off-target, yet current EM-NLQ methods cannot incorporate or benefit from such feedback. To address this gap, we introduce the first interactive EM-NLQ framework, featuring a plug-and-play Feedback ALignment Module (FALM) that empowers existing models to efficiently incorporate user feedback and refine their predictions. Additionally, we introduce the Episodic Memory with Questions and Feedback task (EM-QnF) and new datasets tailored for feedback-based interaction and a lightweight training scheme that eliminates the need for expensive sequential optimization. Our approach, dubbed ReFocus, combines FALM with state-of-the-art EM-NLQ methods to achieve state-of-the-art results on three challenging benchmarks and demonstrates significant improvements in human-based feedback evaluation, bringing EM-NLQ closer to truly interactive and adaptive visual memory systems.
Paperid: 2495,   Poster  
Authors: Yu-Cheng Chou, Xingrui Wang, Yitong Li, Jiahao Wang, Hanting Liu, Cihang Xie, Alan L. Yuille, Junfei Xiao
Title: Captain Safari: A Real-time World Engine
Abstract: World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dynamic drone videos with verified camera trajectories, constructed through a multi-stage geometric and kinematic validation pipeline. Across video quality, 3D consistency, and trajectory following, Captain Safari substantially outperforms state-of-the-art camera-controlled generators. It reduces MEt3R from 0.3703 to 0.3690, improves AUC@30 from 0.181 to 0.200, and yields substantially lower FVD than all camera-controlled baselines. In a 50-participant human study, 67.6% of preferences favor our method across all axes. Our results demonstrate that pose-conditioned world memory is a powerful mechanism for long-horizon, controllable video generation and provide OpenSafari as a challenging new benchmark for future world-engine research.
Paperid: 2496,   Poster  
Authors: yukun Song, Changwei Wang, xingtianPei xingtianPei, Shibiao Xu, Wenhao Xu, Shunpeng Chen, Yu Zhang, Ke Zhang, Rongtao Xu, Xuxiang Feng, Pengyang Wang
Title: DialogueVPR: Towards Conversational Visual Place Recognition
Abstract: Inspired by how humans communicate spatial information, language-guided geo-localization has gained significant traction for its intuitive and practical value. Despite this progress, most methods still rely on a static, one-shot retrieval paradigm, which fails to handle the ambiguity and incompleteness inherent in real-world natural language descriptions. We propose a paradigm shift to reasoning retrieval and introduce Dialogue Place Recognition (DlgPR), which casts localization as an interactive, dialogue-driven reasoning process. To support this new task, we present DlgQuest-Cities, the first large-scale dialogue-based benchmark for place recognition, and a unified reasoning framework that couples a cross-modal multi-level retriever with an intelligent questioner, DQ-pilot. DQ-pilot is trained in a curriculum: supervised fine-tuning on a curated DQ-cities-20k subset followed by reinforcement refinement on a harder DQ-cities-10k split via GRPO. Two task-aligned metrics guide learning: a Discriminative Difficulty Index (DDI) for curriculum sampling and a Positional Retrieval Gain (PRG) reward that directly measures retrieval improvement induced by a question. Experiments show this reasoning-based approach significantly outperforms baselines. The code will be made publicly available upon acceptance.
Paperid: 2497,   Poster  
Authors: Anh Quan Cao, TUANHUNG VU
Title: OccAny: Generalized Unconstrained Urban 3D Occupancy
Abstract: Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework with (ii) Segmentation Forcing that improves occupancy quality while enabling mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on the 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets.
Paperid: 2498,   Poster  
Authors: Zhiqiang Wu, Yitong Dong, Xian Wei
Title: TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution
Abstract: Diffusion-based generative models have achieved remarkable success in real-world image super-resolution (SR). With tiled diffusion techniques, these models can produce high-resolution images that exceed their native-supported resolution. However, the quality of such high-resolution (e.g., 2048^2) outputs often remains extremely poor, primarily due to two factors: the image upsampling ratio (e.g., ×8) exceeding the model's native-supported upsampling ratio (e.g., ×4), and the model's native-supported resolution. In practice, training a native high-resolution model requires larger architectures, which incur significant computational overhead and GPU memory costs, making it impractical on resource-limited hardware. Thus, we present TUDSR, a Twice Upsampling–Diffusion framework for higher SR. The TUDSR framework mainly consists of two stages: the first involves training at R-resolution, and the second introduces a looped chunk-based training strategy at NR-resolution. Each stage adopts a one-step GAN architecture comprising a generator and a discriminator. Based on SD2.1-base, we develop TUDSR-S, which achieves state-of-the-art performance across multiple benchmarks. Extensive experiments further demonstrate that TUDSR-S generates high-quality images at the resolutions of 1024^2 and even 2048^2, significantly outperforming existing approaches.
Paperid: 2499,   Poster  
Authors: Hao Zhou, Tiru Wu, yan jiang, Wanqi Zhou, Junxing Hu, Ai Han
Title: Hierarchical Attacks for Multi‑Modal Multi‑Agent Reasoning
Abstract: Multi-modal multi-agent systems (MM-MAS) have gained increasing attention for their capacity to enable complex reasoning and coordination across diverse modalities. As these systems continue to expand in scale and functionality, investigating their potential vulnerabilities has become increasingly important. However, existing studies on adversarial attacks in multi-agent systems primarily focus on isolated agents or unimodal settings, leaving the vulnerabilities of MM-MAS largely underexplored. To bridge this gap, we introduce HAM^3, a Hierarchical Attack framework for multimodal multi-agent systems that decomposes attacks into three interconnected layers. Specifically, at the perception layer, HAM^3 mounts attacks by perturbing visual inputs, textual inputs, and their fused visual–textual representations. At the communication layer, it performs communication-level attacks that corrupt message content and interaction topology, such as manipulating shared context or communication links to distort collective information flow. At the reasoning layer, it conducts reasoning-level attacks that interfere with each agent’s cognitive pipeline, biasing reasoning trajectories and ultimately compromising final decisions. We evaluate HAM^3 on the GQA benchmark through multi-agent systems built on distinct reasoning paradigms including ReAct, Plan-and-Solve, and Reflexion. Experiments demonstrate that our framework achieves an Attack Success Rate of up to 78.3%, with reasoning-layer attacks being the most effective. More than half of the successful attacks lead multiple agents to produce consistent errors. These findings offer valuable insights for building more robust and interpretable multi-agent intelligence.
Paperid: 2500,   Poster  
Authors: Zhihao Sun, Zhiying Du, Xitong Yang, Zuxuan Wu
Title: HandWorld: Hand-Centric Unified Video Action Generation
Abstract: Hand-object interaction forms the foundation of how humans interact with the world. Understanding the connection between hand action and egocentric video is essential for enabling embodied agents to perceive, simulate, and plan like humans. However, it is challenging to learn and predict across hand actions and egocentric videos due to their non-linear relationship. In this work, we introduce HandWorld, a unified generative framework that focuses on hand-object interaction and jointly models egocentric videos and hand actions. HandWorld learns shared cross-domain conditions through a dual-branch condition network that integrates information from both video and action domains. MANO-rendered hand representation is incorporated as an intermediate input to further enhance cross-domain coherence. Conditioned on the shared representation, two decoupled diffusion transformers are trained to predict in their respective domains. A flexible training strategy enables the model to learn across diverse task configurations, including action forecasting and controllable video generation. Experiments on large-scale egocentric HOI datasets demonstrate that HandWorld achieves high-fidelity video synthesis and accurate action prediction, outperforming existing baselines across diverse scenarios.
Paperid: 2501,   Poster  
Authors: Heng Li, Xiaotong Lin, Ling-An Zeng, Yulei Kang, Shuai Li, Jian-Fang Hu
Title: MotionHiFlow: Text-to-Motion via Hierarchical Flow Matching
Abstract: Text-to-motion generation aims to generate 3D human motions that are tightly aligned with the input text while remaining physically plausible and rich in fine-grained detail. Although recent approaches can produce complex and natural movements, they usually operate at only one temporal scale, which limits both semantic alignment and temporal coherence. Inspired by the fact that complex motions are conceptualized hierarchically rather than at a single temporal scale in the human cognitive system, we propose MotionHiFlow, a hierarchical flow matching framework to generate motion progressively by constructing flow paths from low to high temporal scales. The flows at lower scales capture high-level semantics and coarse motion structures, while flows at higher scales refine temporal details. To link the flows across scales, we introduce a novel cross-scale transition process, ensuring continuity and preserving noise consistency. Furthermore, by integrating a Text-Motion Diffusion Transformer and a topology-aware Motion VAE, MotionHiFlow explicitly models structural dependencies among joints via joint-aware positional encoding and skeletal topology, enabling precise semantic alignment alongside fine-grained motion details. Extensive experiments on HumanML3D and KIT-ML benchmarks demonstrate state-of-the-art performance, with ablation studies confirming the effectiveness of the hierarchical design and key components.
Paperid: 2502,   Poster  
Authors: Wenchao Guan, Chuan Lin, Sihan Huang, Xiongzhen Wang, Xintao Pang
Title: BDNet: Bio-Inspired dual-backbone Small Object Detection Network
Abstract: In remote sensing images, small objects often suffer from low color contrast and blurred edges, resulting in suboptimal feature extraction performance. Physiological studies indicate that the LGN/V1–V2–V4 pathway offers color opponency sensitivity and hierarchical enhancement advantages for the extraction of color information, while the V1–V4 pathway shows strong orientation selectivity in edge information extraction. The integration of these two types of information in the V4 region significantly improves target discrimination. Inspired by this, this paper proposes a dual-backbone network (BDNet) to enhance small object feature extraction. BDNet adopts a dual-backbone parallel structure to capture fine-grained features from color and edge dimensions: the color extraction backbone simulates the color antagonistic mechanism in the LGN/V1 region by designing a Color Antagonism Module (CAM) to amplify color differences, and further mimics the chromatic processing hierarchy in the V2 region with a Visual Cortex Hue-enhancement Module (VCHM) to enrich hue representations. These two components work collaboratively to address the issue of low color contrast. The edge extraction backbone simulates the orientation selectivity of receptive fields in the V1 region by designing an Orientation Selective Module (OrSM) to select and enhance salient edges, thereby mitigating the issue of edge blurring caused by dispersed edge information. Finally, the two types of extracted features are interactively integrated through a Feature Fusion Module (FFM) that emulates the integration mechanism in the V4 region, generating a comprehensive feature representation. Experiments demonstrate that BDNet outperforms state-of-the-art (SOTA) methods on the VisDrone2019, NWPU VHR-10, and AI-TODv2 datasets, thus providing a bio-inspired solution for small object detection in remote sensing images.
Paperid: 2503,   Poster  
Authors: Xu Cao
Title: Enhancing Video VLM with Visual-Audio Supersensing
Abstract: Current video vision language models (VLMs) process information passively, lacking the ability to dynamically plan their analysis or perform joint reasoning across crucial modalities such as video and audio. To address this, we introduce Visual-Audio Supersensing (VAS), a learning paradigm that shifts the focus from temporal predictive sensing (e.g., Cambrian-S) to cross-modal prediction. The core objective of VAS is to train the model to anticipate audio-caption summarizations from video and vice versa. We present VA-R1, a VLM that operationalizes this paradigm. Instead of passively ingesting all data, VA-R1 actively reasons about its information needs using Chain-of-Thought (CoT). Our training process is twofold: we first finetune VA-R1 with VAS, and then apply a novel contrastive Reinforcement Learning (RL) algorithm, Video-Audio Negative-aware Optimization (VANAO), to optimize this selective co-reasoning process. This approach proves highly effective: despite their significantly smaller size, our VA-R1-7B and VA-R1-8B models achieve performance competitive with massive MLLMs like GPT-4o and Gemini 1.5 Pro on multiple video VQA benchmarks.
Paperid: 2504,   Poster  
Authors: Surabhi Gupta, Dinesh Prabhu Muthumariappan, Biplab Das, Anoop Kolar Rajagopal, Kiran Iyer, Donghwan Seo
Title: PrivateEyes: Gaze-Preserving Anonymization for Data Sharing
Abstract: Eye images captured from wearable devices such as head-mounted displays (HMDs) contain identifiable biometric cues, posing significant challenges for safe data sharing. Existing eye anonymization techniques often degrade downstream performance, particularly gaze estimation, while still retaining iris-recognizable features. Although these methods aim to anonymize the iris, they introduce noticeable visual artifacts that reduce image fidelity. To address these limitations, we propose PrivateEyes, a privacy-preserving framework that synthesizes anonymized yet gaze-consistent eye images. Our approach employs a three-stage pipeline: (1) a deep segmentation network that isolates semantic eye regions and provides structural control signals for synthesis, (2) a pose estimation network (PEN) trained on anatomically accurate synthetic eye renders to infer precise eye pose, and (3) a conditional diffusion model that reconstructs realistic, anonymized eye images conditioned on segmentation and pose. Extensive experiments across multiple benchmark datasets show that PrivateEyes achieves superior gaze-estimation accuracy compared to state-of-the-art anonymization baselines, improving performance by over 10% while reducing iris-recognition accuracy by ~50%. Our method also produces higher-fidelity images compared to other existing approaches. By enabling task-preserving and privacy-secure sharing of eye images, PrivateEyes supports responsible research and development in AR/VR and other gaze-driven applications.
Paperid: 2505,   Poster  
Authors: Rongbin Zheng, Wensheng Li, Lingzhe Zeng, Dongwang Dongwang, Chengying Gao
Title: Illumination-Consistent Human-Scene Reconstruction from Monocular Video
Abstract: Reconstructing 3D humans and scenes from monocular videos is a challenging task, particularly due to human motion, varying illumination, and dynamic scene shadows. While recent works have explored scene disentanglement by jointly modeling humans and their surrounding scenes, they often overlook illumination and shadow effects, resulting in inconsistent human appearance and degraded scene realism. To address this gap, we propose a photometrically consistent integration of human and scene reconstruction based on 3D Gaussian Splatting, with a key focus on modeling spatially-varying illumination and shadows. Central to our method is a learnable light volume that provides localized lighting cues to human Gaussians, enabling more realistic and consistent appearance synthesis. To further ensure accurate human geometry and alignment, we adopt a two-stage reconstruction strategy: we first optimize a human mesh and then anchor Gaussians to the refined surface. In addition, we introduce an implicit shadow estimation module that disentangles cast shadows from the scene, thus supporting plausible human shadow synthesis. Our framework also facilitates human relighting and compositing into novel scenes with contextually appropriate lighting. Quantitative and qualitative results demonstrate that our method achieves state-of-the-art performance, producing consistent appearances, realistic illumination, and enhanced overall scene realism.
Paperid: 2506,   Poster  
Authors: Jonas Ernst, Wolfgang Boettcher, Lukas Hoyer, Jan Lenssen, Bernt Schiele
Title: Rewis3d: Reconstruction for Weakly-Supervised Semantic Segmentation
Abstract: We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Alleviating this issue, sparse annotations offer an efficient weakly-supervised alternative. However, they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student–teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geometric supervision. Extensive experiments demonstrate that Rewis3d achieves state-of-the-art performance in sparse supervision, outperforming existing approaches by 2-7% without requiring additional labels or inference overhead. Our code will be released upon acceptance of the paper.
Paperid: 2507,   Poster  
Authors: Haojun Qiu, Kiriakos Kutulakos, David B. Lindell
Title: Efficient and Training-Free Single-Image Diffusion Models
Abstract: We consider the problem of generating images whose internal structure, defined by the distribution of patches across multiple scales, matches that of a single reference image. Recent approaches address this problem by training a diffusion model on a single image. But even in this setting, training is computationally expensive and requires hours of optimization. Instead, we model the image using a dataset of its patches at different scales. As this dataset is finite and the dimensionality of its patches is small, the score function for a noisy patch can be computed tractably using an optimal, closed-form denoiser, eliminating the need for neural network training. We integrate this patch-based denoiser into an efficient, training-free image diffusion model, and we describe how our method connects to classical patch-based image restoration techniques. Our approach achieves state-of-the-art generation quality and diversity compared to trained single-image diffusion models, and we demonstrate applications, including unconditional image generation, text-guided stylization, image symmetrization, and retargeting. Further, we show that our approach is compatible with latent space diffusion, and we show multiple additional acceleration techniques to achieve megapixel single-image generation in one second, and gigapixel generation in minutes.
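Note: the closed-form denoiser mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a uniform empirical distribution over a finite patch set and additive Gaussian noise of standard deviation `sigma` (a variance-exploding parameterization), in which case the posterior mean is a softmax-weighted average of the patches and the score is `(denoised - noisy) / sigma**2`. The function names `optimal_denoise` and `score` are hypothetical.

```python
import math


def optimal_denoise(noisy, patches, sigma):
    """Posterior-mean denoiser for a uniform empirical distribution over `patches`.

    Assumes noisy = clean + sigma * gaussian_noise. Each patch is weighted by
    exp(-||noisy - patch||^2 / (2 sigma^2)) and the weights are normalized.
    """
    logits = [
        -sum((a - b) ** 2 for a, b in zip(noisy, p)) / (2.0 * sigma ** 2)
        for p in patches
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [
        sum(w * p[d] for w, p in zip(weights, patches)) / total
        for d in range(len(noisy))
    ]


def score(noisy, patches, sigma):
    """Score of the noised patch distribution, derived from the denoiser above."""
    denoised = optimal_denoise(noisy, patches, sigma)
    return [(d - y) / sigma ** 2 for d, y in zip(denoised, noisy)]
```

At large `sigma` the weights flatten and the output approaches the mean of all patches. At small `sigma` the output snaps to the nearest patch, which is why no network training is needed when the patch set is small enough to enumerate.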
Paperid: 2508,   Poster  
Authors: Zequn Yang, Yake Wei, HaoTian Ni, Zhihao Xu, Di Hu
Title: Information-Theoretic Decomposition for Multimodal Interaction Learning
Abstract: Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis highlighting why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals deficits in conventional paradigms at learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. This highlights the need for an approach that can adaptively learn from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components in a fine-tuning process to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting to holistic sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning.
Paperid: 2509,   Poster  
Authors: Yan-Ting Chen, Hao-Wei Chen, Tsu-Ching Hsiao, Chun-Yi Lee
Title: Landscape-Awareness for Geometric View Diffusion Model
Abstract: Accurate camera viewpoint estimation under sparse-view conditions remains challenging, particularly in two-view scenarios. Recent approaches leverage diffusion models such as Zero123, which synthesize novel views conditioned on relative viewpoint, and have demonstrated promising performance when repurposed for viewpoint estimation via optimization with MSE loss. However, existing methods often suffer from a non-convex loss landscape with numerous local minima, which makes them sensitive to initialization and reliant on naïve multi-start strategies to achieve reasonable results. We analyze these optimization challenges and visualize failure cases, showing that ambiguities in object geometry, such as symmetry and self-similarity, can mislead gradient-based updates toward incorrect viewpoints. To address these limitations, we propose a score-based method that reshapes the optimization landscape to guide updates toward the ground-truth viewpoint, followed by a refinement stage using a viewpoint-conditioned diffusion model. Experiments show that our method improves convergence, reduces reliance on brute-force sampling, and achieves competitive accuracy with higher sample-efficiency.
Paperid: 2510,   Poster  
Authors: Jiang Xu, Bin Chen, Gehui Li, Yule Duan, Ronggang Wang, Jian Zhang
Title: OctoT2I: A Self-Evolving Agentic Text-to-Image Router
Abstract: The explosive growth of Text-to-Image (T2I) models, from large-scale versions to lightweight, real-time ones, now faces diminishing marginal returns from single-model scaling. Agentic T2I methods emerged to alleviate this bottleneck by using multiple models. However, existing agentic T2I methods suffer from three key challenges: reliance on expensive handcrafted priors or human annotations, rigid single-path decision mechanisms, and a neglect of inference efficiency. To address these challenges, we introduce OctoT2I, a novel agentic framework that reformulates the T2I task as a joint optimization of generation quality and inference efficiency. OctoT2I implements a stateful, multi-round routing strategy that adaptively selects the most suitable tool based on its knowledge and memory. This strategy is enabled by a knowledge base built from scratch by our novel Self-Evolving Mechanism. This mechanism, which requires no human supervision, first autonomously defines foundational Conceptual Dimensions (e.g., style, color, count) and then intelligently explores their combinations via an iterative "Propose-Solve-Evaluate-Learn" (PSEL) loop. The PSEL loop efficiently discovers each tool's capability frontier, driving continuous improvement without external guidance. Extensive experiments demonstrate that OctoT2I achieves competitive performance (0.96) on GenEval while delivering a 90.3% inference speedup and a 56.6% energy-efficiency gain over the leading baseline (Flow-GRPO), striking an exceptional balance between performance and efficiency. Code and models will be made available.
Paperid: 2511,   Poster  
Authors: Zhiqiang Yan, Yuan Wu, Gim Hee Lee
Title: Zero-Shot Depth Completion with Vision-Language Model
Abstract: Vision-language models (VLMs) have achieved remarkable success in semantic understanding tasks under language guidance, yet their potential for geometric perception remains largely underexplored. This paper introduces the first VLM-based depth completion framework. With almost no architectural modifications, we propose a sparse depth injection mechanism that extends the capability of VLMs toward 3D perception through three key aspects: visual tokenization, textual prompt, and textual supervision. At the visual input side, sparse depth is tokenized to provide absolute scale and accurate geometric cues, alleviating the scale and camera ambiguities of RGB-only inputs. At the textual input side, a binary mask derived from sparse depth serves as a prompt, instructing the model where to complete and where to preserve. At the supervision side, the model is fine-tuned using text labels generated from sparse depth, requiring no ground-truth depth. Benefiting from the strong semantic priors and cross-modal expressiveness of VLMs, our framework achieves superior zero-shot performance across diverse sensors, sparsity levels, and scenes.
Paperid: 2512,   Poster  
Authors: Lalit Manam, Venu Madhav Govindu
Title: Parallel Rigidity Matters for Bundle Adjustment
Abstract: Bundle adjustment is a longstanding problem in computer vision that solves for camera parameters and 3D point coordinates from 2D image observations. While there has been much work on various aspects, like adaptation to different camera models and sensors, and considerations for solving the optimization problem, in this paper, we deal with a fundamental and distinct aspect of the uniqueness of its solution. In particular, we examine the unique solvability of the 3D reconstruction problem using parallel rigidity theory. We design an algorithm to ensure that the topology of the bipartite graph formed by the camera-3D point relations in bundle adjustment does not result in independent scaling of the edges in its subgraphs. To tackle the generally large-sized bipartite graph, we leverage camera-camera relationships in 3D reconstruction problems for efficiency. We demonstrate the benefits of our analysis on a global structure-from-motion pipeline. Applying our proposed algorithm results in significantly cleaner reconstructions by removing misplaced cameras and 3D points.
Paperid: 2513,   Poster  
Authors: Shilin Xu, Dezhong Peng, Zhenwen Ren, Yuan Sun
Title: EXOTIC: External Vision-driven Incomplete Multi-view Classification
Abstract: Due to sensor failures and occlusions during data acquisition, multi-view data often suffer from partially missing samples, thereby producing incomplete multi-view data. Recently, Incomplete Multi-View Classification (IMVC) has become a hot research topic, and numerous IMVC methods have been proposed. Although these methods have achieved promising performance by exploiting internal semantic information from partially observed data, they primarily rely on limited internal supervision for view completion, which largely constrains their performance ceiling. To overcome this limitation, we propose an EXternal visiOn-driven incomplete mulTi-vIew Classification (EXOTIC) paradigm that incorporates external vision knowledge as semantic guidance, thereby assisting in imputing incomplete views. To the best of our knowledge, it is the first work that leverages external vision knowledge as supervision signals to guide missing-view completion. Specifically, we first introduce an external vision knowledge library based on a pre-trained vision–language model. Then, we design a Knowledge Filtering module to adaptively select task-relevant knowledge. Afterwards, we present a Knowledge Purification module to align external knowledge with internal representations. Finally, we propose External Completion that leverages the refined knowledge to impute missing views, thereby enhancing the classification decision ability. Extensive experiments on multiple incomplete multi-view datasets demonstrate that the proposed EXOTIC consistently outperforms existing methods, especially under high missing rates.
Paperid: 2514,   Poster  
Authors: Yajie Liu, Jinjin Zhang, Qingjie Liu, Di Huang
Title: Reconstructing CLIP for Open-Vocabulary Dense Perception
Abstract: Large-scale vision–language models (VLMs) such as CLIP have excelled in zero-shot image classification, yet they struggle to achieve the dense cross-modal alignment required by open-vocabulary dense perception (OVDP). While recent self-distillation methods address this by aligning dense features with the generalizable global semantics, a key question remains: how should such dense features be constructed to achieve optimal alignment? To address this, we propose DenseRC, a principled Dense Representations Construction framework that reconstructs CLIP for OVDP based on two key insights. First, by analyzing the internal semantics encoded in the global cls token, we identify that multi-layer value embeddings serve as an informative basis for dense features. Second, we reveal that spatial aggregation tends to amplify semantic misalignment. Motivated by this, we design a lightweight Head-Selective Gating (HSG) module that adaptively reweights feature heads according to their intrinsic heterogeneity, enabling discriminative and alignment-friendly dense representations construction. Extensive experiments demonstrate that DenseRC delivers consistent and substantial gains across OVDP tasks including object detection and semantic segmentation, setting new state-of-the-art performance on multiple benchmarks.
Paperid: 2515,   Poster  
Authors: Junhao Chen, Gao Kejun, Yuehan Cui, Mingze Sun, Mingjin Chen, Shaohui Wang, Xiao-Xiao Long, Fei Ma, Qi Tian, Hao Zhao, Ruqi Huang
Title: Tokenizing Vector Animation for Autoregressive Generation
Abstract: Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides, 3D meshes, LEGO sequences, and indoor layouts, suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence. To support large-scale training, we also construct LottieAnimation-660K, the largest and most diverse vector animation dataset to date, consisting of 660k real-world Lottie animations and 15M static Lottie image files curated from broad Internet sources. Building upon these components, we finetune Qwen-VL to create LottieGPT, a native multimodal model capable of generating coherent, editable vector animations directly from natural language or visual prompts. Experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT exhibits strong generalization across diverse animation styles and outperforms previous state-of-the-art models on SVG generation (a special case of single-frame vector animation).
Paperid: 2516,   Poster  
Authors: Weiying Xie, Xiaoyu Chen, Xin Zhang, Chenhe Hao, Jitao Ma, Yunsong Li, Leyuan Fang
Title: Saliency-Driven Token Merging for Vision Transformers
Abstract: Vision Transformers (ViTs) exhibit robust performance across diverse visual scenarios. However, their efficiency is constrained by excessive token counts. Token merging offers a viable solution for achieving efficient ViTs. Existing methods merge tokens based solely on specific characteristics within the attention mechanism, which change significantly across different layers. In this paper, we propose a novel training-free SAliency-Driven Token Merging (SAD-TM) approach by leveraging not only the semantic relevance in the attention space but also the latent visual saliency of input patches. Our SAD-TM is inspired by the discovery that saliency-based statistics can directly capture the causal relationship between model input and output, regardless of the layers. Based on this observation, we develop a method that is mathematically formulated to merge tokens with high saliency outliers. The principle behind our merging is that tokens with high saliency outliers usually imply inconsistencies with the global gradient direction, and thus can be merged safely. In addition, our systematic analysis indicates that class attention shows considerable variation across early blocks, so a deferred merging strategy is introduced to optimize the selection of merging rates. In a training-free manner, SAD-TM demonstrates superior performance across various ViT architectures. Notably, with a FLOPs compression of 23.08% on DeiT-Tiny, SAD-TM achieves a Top-1 accuracy comparable to that of the pretrained baseline on the ImageNet dataset. The code will be available soon.
Paperid: 2517,   Poster  
Authors: Peiyang Ni, Longyu Yang, Lu Zhang, Kuniaki Saito, Yap-Peng Tan, Fumin Shen, Heng Tao Shen, Xiaofeng Zhu, Ping Hu
Title: Structure-to-Intensity Diffusion for Adverse-Weather LiDAR Generation
Abstract: Adverse-weather LiDAR point cloud generation is challenged by complex weather-induced degradations. These degradations affect geometry and reflectance in fundamentally different ways, making joint modeling difficult and ambiguous, especially when diverse real-world training data is limited. To address this, we propose Structure-to-Intensity Diffusion (SiD), a diffusion-based framework that explicitly factorizes the denoising process at each time step: it first reconstructs the geometric structure, then conditions reflectance intensity denoising on the estimated structure. This structure-conditioned design decomposes the joint distribution, reduces modeling ambiguity, and leads to point clouds that are both geometrically coherent and radiometrically realistic. To mitigate data scarcity, we introduce Real-Prior Weather Simulation (RPWS), a degradation module that leverages real-world sensor statistics to synthesize physically plausible adverse-weather point clouds from clear scans. Extensive experiments demonstrate that, with similar model complexity, our approach outperforms the previous state-of-the-art in generating adverse-weather LiDAR scans with both structural and radiometric properties more closely aligned with real-world data.
Paperid: 2518,   Poster  
Authors: Sheldon Fung, Wei Pan, Ling Cao, Fei Hou, Ling Chen, Shasha Mao, Hongdong Li, Xuequan Lu
Title: SuP: Sub-cloud Driven Point Cloud Registration
Abstract: While existing point cloud registration methods handle high-overlap scenarios of two point clouds well, they often struggle with low-overlap scenarios, due to inevitable geometric/semantic ambiguities in the non-overlapping regions. In this paper, we introduce SuP, a novel framework that reformulates low-overlap registration as a high-overlap sub-cloud pairs (anchor pairs) mining problem. Central to SuP is our Dual-phase Sub-cloud Anchor Mining (DSAM) module, which first subdivides the source and target point clouds into multiple sub-clouds and then applies a dual-phase weighting pipeline: 1) an efficient overlap-guided prior-weighting scheme (OPS) that leverages feature salience to identify candidate anchor pairs, and 2) a multi-scale post-weighting network (MPN) that exploits neighborhood feature consensus to further identify anchor pairs. Subsequently, final correspondences are generated through a merge-to-match module using the anchor pairs. To train DSAM, we design an alignment-aware weighting loss that uses on-the-fly alignment errors as supervision. Comprehensive experiments on the color-enhanced 3DMatch and 3DLoMatch demonstrate that SuP significantly outperforms state-of-the-art methods, achieving higher registration recall and more accurate alignment, especially under challenging low-overlap conditions.
Paperid: 2519,   Poster  
Authors: Joonki Min, Chaeyun Kim, Hyungwook Choi, Yejin Kim, Kihyun Kim, Yohan Jo, Joonseok Lee
Title: Fine-Grained Multi Image Object Hallucination Benchmark
Abstract: Multimodal Large Language Models (MLLMs) are increasingly deployed in multi-image scenarios requiring complex reasoning across visual contexts. However, current MLLMs remain fundamentally limited by object hallucination: generating plausible yet factually inconsistent descriptions about objects. Existing benchmarks, designed primarily for single-image settings or providing only high-level multi-image assessments, cannot systematically diagnose how visual complexity and reasoning demands trigger hallucination. To address this gap, we introduce MIOH, a fine-grained multi-image object hallucination benchmark that systematically evaluates object hallucination across four foundational tasks (existence, counting, attribute, position) through three multi-image reasoning patterns (comprehensive, comparative, selective) under three controlled adversarial pressures (visual context scale, perceptual difficulty, contextual bias). Through evaluation of 30 models, we reveal that even state-of-the-art systems like GPT-5 and Gemini-2.5-Pro exhibit distinct failure patterns across different reasoning patterns and tasks. Our evaluation reveals that hallucination stems not merely from perceptual failures but from integration-stage limitations when maintaining object representations across multiple images. MIOH provides a controlled framework for analyzing multi-image object hallucination and serves as a critical evaluation tool for developing more reliable multimodal AI systems.
Paperid: 2520,   Poster  
Authors: Wangwang Jia, Zijian Gao, Tianjiao Wan, Yuan Cao, Yong Dou, Kele Xu
Title: Geometry-driven OOD Detectors Are Class-Incremental Learners
Abstract: Class-Incremental Learning (CIL) seeks to acquire new classes over time without erasing prior knowledge. While recent methods leverage pre-trained models (PTMs) to curb forgetting, they largely optimize the feature extractor and overlook the crucial classification head. In this work, we advance a simple view: if each task is equipped with a classifier that has the ability to both recognize in-distribution (IND) classes and reject out-of-distribution (OOD) inputs, CIL arises naturally: inputs are accepted only by heads that deem them in-distribution and rejected otherwise. Supported by rigorous theoretical and empirical studies, we find that this ability is characterized by Inter-class Separation and Intra-class Compactness; lacking these, standard linear and cosine-similarity heads remain closed-set and fail to yield a usable OOD signal. To address this, we propose GOD (Geometry-driven OOD Detectors), which unifies IND recognition and OOD rejection in a single geometric space by replacing the learnable head with fixed Equiangular Tight Frame (ETF) anchors; an ETF loss enforces inter-class separation, and an ArcFace loss further tightens intra-class compactness. For efficiency, we further introduce a parameter-efficient hybrid architecture and an efficient inference strategy, thus reducing both parameter footprint and inference cost. Extensive experiments on multiple incremental settings and datasets show that GOD achieves state-of-the-art results. Code and datasets are available in the supplementary material.
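Illustrative note on fixed ETF anchors: the abstract does not say how the anchors are built, so the sketch below assumes the standard simplex ETF construction, in which K unit-norm class anchors have pairwise cosine similarity -1/(K-1). The function name `simplex_etf` and its parameters are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def simplex_etf(num_classes: int, feat_dim: int, seed: int = 0) -> np.ndarray:
    """Return a (feat_dim, num_classes) matrix whose columns are fixed ETF anchors.

    Assumes feat_dim >= num_classes so that an orthonormal basis U exists.
    """
    if feat_dim < num_classes:
        raise ValueError("this sketch assumes feat_dim >= num_classes")
    rng = np.random.default_rng(seed)
    # Orthonormal columns U (feat_dim x K) from a reduced QR decomposition.
    u, _ = np.linalg.qr(rng.standard_normal((feat_dim, num_classes)))
    k = num_classes
    # Centering matrix I - (1/K) * 1 1^T removes the common mean direction.
    center = np.eye(k) - np.ones((k, k)) / k
    return np.sqrt(k / (k - 1)) * u @ center

anchors = simplex_etf(num_classes=5, feat_dim=16)
gram = anchors.T @ anchors  # diagonal is 1, off-diagonal is -1/(K-1)
```

Because the anchors are fixed rather than learned, only the feature extractor has to move features toward them; maximal equal separation between classes is guaranteed by construction.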
Paperid: 2521,   Poster  
Authors: Koshiro Nagano, Ryo Fujii, Ryo Hachiuma, Fumiaki Sato, Taiki Sekii, HIDEO SAITO
Title: Synthetic Knowledge-Guided Learning via Target-Region Gradients
Abstract: Training with synthetic data has become a standard strategy for improving robustness to distribution shifts. However, most existing approaches exploit synthetic samples only indirectly, for example, by enriching backgrounds, contexts, or negative examples, while providing no explicit signal about where the true target content resides. As a result, models can continue to rely on spurious correlations, which ultimately limit their robustness. In this work, we convert a basic but under-utilized provenance of synthetic data into explicit supervision: during synthesis, we know which pixels or elements originate from which source instances. We formalize this provenance as synthetic knowledge and propose a Synthetic Knowledge-Guided (SKG) training framework that uses it to shape gradients toward target regions and away from irrelevant ones via a Gradient Guide Loss. Our framework is generic and can be seamlessly integrated into diverse synthesis pipelines, including mixing-based synthesis and generative editing-based synthesis, without additional human annotations. Experiments on image classification, weakly supervised object localization, and weakly supervised spatio-temporal action localization show consistent gains over strong baseline methods. These results demonstrate that making provenance in synthetic data explicit is an effective and widely applicable mechanism for mitigating shortcut learning and enhancing robustness.
Paperid: 2522,   Poster  
Authors: Rashi Sharma, Justin Timothy C. Bersamin, Karthikk Subramanian
Title: Portable Active Learning for Object Detection
Abstract: Annotating bounding boxes is costly and limits the scalability of object detection. This challenge is compounded by the need to preserve high accuracy while minimizing manual effort in real-world applications. Prior active learning (AL) methods often depend on model features or modify detector internals and training schedules, increasing integration overhead. Moreover, they rarely jointly exploit the benefits of image-level signals, class-imbalance cues, and instance-level uncertainty for comprehensive selection. We present Portable Active Learning (PAL), a detector-agnostic, easily portable framework that operates solely on inference outputs. PAL combines class-wise instance uncertainty with image-level diversity to guide data selection. At each round, PAL trains lightweight class-specific logistic classifiers to distinguish true from false positives, producing entropy-based uncertainty scores for proposals. Candidate images are then refined using global image entropy, class diversity, and image similarity, yielding batches that are both informative and diverse. PAL requires no changes to model internals or training pipelines, ensuring broad compatibility across detectors. Extensive experiments on COCO, PASCAL VOC, and BDD100K demonstrate that PAL consistently improves label efficiency and detection accuracy compared to existing active learning baselines, making it a practical solution for scalable and cost-effective deployment of object detection in real-world settings.
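Illustrative note on entropy-based uncertainty: a minimal sketch, assuming each proposal already has a true-positive probability from a class-specific logistic classifier. It scores proposals with binary entropy (in bits), so probabilities near 0.5 count as most uncertain. The function names and example probabilities are hypothetical; the abstract does not give PAL's exact scoring or refinement rules.

```python
import numpy as np

def binary_entropy(p_true_positive):
    """Binary entropy in bits of a true-positive probability (scalar or array)."""
    p = np.clip(np.asarray(p_true_positive, dtype=float), 1e-12, 1.0 - 1e-12)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def rank_by_uncertainty(p_true_positive):
    """Indices of proposals sorted from most to least uncertain."""
    return np.argsort(-binary_entropy(p_true_positive), kind="stable")

# Hypothetical classifier outputs for three proposals.
probs = np.array([0.99, 0.55, 0.10])
order = rank_by_uncertainty(probs)  # the proposal with p = 0.55 comes first
```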
Paperid: 2523,   Poster  
Authors: Yang Zheng, Jiahua Liu, Tongyao Pang, Wen Li, Zhaoqiang Liu
Title: Outlier-Robust Diffusion Solvers for Inverse Problems
Abstract: Methods based on diffusion models (DMs) for solving inverse problems (IPs) have recently achieved remarkable performance. However, DM-based methods typically struggle against outliers, which are common in real-world measurements. In this work, to tackle IPs with outliers, we first refine the measurement via explicit noise estimation to mitigate the effect of noise. Subsequently, we formulate an iteratively reweighted least squares objective based on the Huber loss to address the outliers. We propose a method utilizing gradient descent to approximately solve the corresponding optimization problem for the robust objective. To avoid delicate tuning of the learning rate required by the gradient descent method, we further employ the conjugate gradient method with an efficient strategy for updating. Extensive experiments on multiple image datasets for linear and nonlinear tasks under various conditions demonstrate that our proposed methods exhibit robustness to outliers and outperform recent DM-based methods in most cases.
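Illustrative note on Huber-based iteratively reweighted least squares (IRLS): the sketch below applies textbook Huber IRLS to a plain linear system y = A x with outliers. It is not the diffusion-based solver from the abstract, and it uses a direct least-squares solve where the abstract describes gradient descent and conjugate gradient. The names `huber_weights` and `irls_huber`, the threshold `delta`, and the synthetic data are all assumptions.

```python
import numpy as np

def huber_weights(residuals, delta):
    """Huber IRLS weights: 1 inside the threshold, delta/|r| outside it."""
    abs_r = np.abs(residuals)
    return np.where(abs_r <= delta, 1.0, delta / np.maximum(abs_r, 1e-12))

def irls_huber(A, y, delta=1.0, iters=20):
    """Approximately minimize the Huber loss of y - A x via reweighted least squares."""
    x = np.linalg.lstsq(A, y, rcond=None)[0]  # ordinary least-squares start
    for _ in range(iters):
        sqrt_w = np.sqrt(huber_weights(y - A @ x, delta))
        # A weighted least-squares step equals plain least squares on scaled rows.
        x = np.linalg.lstsq(A * sqrt_w[:, None], y * sqrt_w, rcond=None)[0]
    return x

# Synthetic example: 10% of the measurements carry a large outlier.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 3))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true + 0.01 * rng.standard_normal(200)
y[:20] += 50.0

x_ols = np.linalg.lstsq(A, y, rcond=None)[0]
x_robust = irls_huber(A, y)
```

Each outlier's influence on the weighted solve is capped at roughly `delta`, whereas in ordinary least squares it grows with the size of the residual.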
Paperid: 2524,   Poster  
Authors: Jian Zhang, Xincheng Yu, Yi Lin
Title: Rethinking Occlusion Modeling for UAV Tracking
Abstract: Occlusion remains one of the major challenges in UAV tracking, where dynamic viewpoints and complex environments often cause partial or complete visibility loss. Existing transformer-based trackers typically regard occlusion as random information dropout, overlooking its structured and spatially correlated nature in real-world scenes. We rethink occlusion modeling in UAV tracking as a structured process governed by spatial dependencies. Based on this insight, we introduce Clustered Occlusion Modeling (COM) to generate realistic, density-adaptive occlusion patterns that enhance feature robustness under partial visibility. Furthermore, we design Cost-Aware Depth Bias (CADB), which employs a depth-dependent prior to adjust inference depth, yielding better efficiency while maintaining competitive accuracy. Integrating COM and CADB into a unified single-stream transformer framework, termed OCTrack, our tracker achieves robust and efficient UAV tracking in occlusion-prone environments. Extensive experiments on multiple UAV benchmarks validate its effectiveness and demonstrate state-of-the-art performance. Code will be released.
Paperid: 2525,   Poster  
Authors: Hua Chang, Xin Xu, Wei Liu, Jiayi Wu, Kui Jiang, Fei Ma, Qi Tian
Title: TextOVSR: Text-Guided Real-World Opera Video Super-Resolution
Abstract: Many classic opera videos exhibit poor visual quality due to the limitations of early filming equipment and long-term degradation during storage. Although real-world video super-resolution (RWVSR) has achieved significant advances in recent years, directly applying existing methods to degraded opera videos remains challenging. The difficulties are twofold. First, accurately modeling real-world degradations is complex: simplistic combinations of classical degradation kernels fail to capture the authentic noise distribution, while methods that extract real noise patches from external datasets are prone to style mismatches that introduce visual artifacts. Second, current RWVSR methods, which rely solely on degraded image features, struggle to reconstruct realistic and detailed textures due to a lack of high-level semantic guidance. To address these issues, we propose a Text-guided Dual-Branch Opera Video Super-Resolution (TextOVSR) network, which introduces two types of textual prompts to guide the super-resolution process. Specifically, degradation-descriptive text, derived from the degradation process, is incorporated into the negative branch to constrain the solution space. Simultaneously, content-descriptive text is incorporated into a positive branch and our proposed Text-Enhanced Discriminator (TED) to provide semantic guidance for enhanced texture reconstruction. Furthermore, we design a Degradation-Robust Feature Fusion (DRF) module to facilitate cross-modal feature fusion while suppressing degradation interference. Extensive experiments on our constructed benchmark for opera videos, OperaLQ, demonstrate that TextOVSR outperforms state-of-the-art methods in both qualitative and quantitative evaluations.
Paperid: 2526,   Poster  
Authors: jusheng zhang, Kaitong Cai, Jing Yang, Jian Wang, Keze Wang
Title: Dynamics-Aware Preference Optimization for Vision-Language Models
Abstract: Preference-based fine-tuning of vision-language models (VLMs) is notoriously unstable, as trivially wrong negatives inject uninformative gradients that distort optimization and degrade calibration. This work revisits this issue through the lens of learning dynamics and identifies a core pathology, the squeezing effect, where easy negatives retain large, misaligned gradients despite having negligible loss. To address this, we propose Cooling-Weighted Direct Preference Optimization (CW-DPO), a two-stage framework that first smooths and then stabilizes the alignment process. Stage 1 employs a constrained SFT phase with low-weight “gentle negatives” to regularize overconfident distributions and flatten the loss landscape. Stage 2 introduces a competence-aware cooling weight that adaptively scales negative gradients according to the model’s average per-token log-probability, suppressing uninformative updates while emphasizing hard, on-policy contrasts. This dynamics-aware weighting effectively mitigates the squeezing effect and enables smoother convergence. Extensive experiments on mainstream benchmarks, including COCO, Flickr30k, NoCaps, MMMU, and MMBench1.1, show that CW-DPO achieves state-of-the-art performance, for example +3.4 CIDEr over PPO and +2.4% absolute accuracy on MMMU, while improving calibration and halving convergence steps. This demonstrates that smoothing before cooling constitutes a simple yet general principle for robust VLM preference optimization.
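Illustrative note on the DPO objective that CW-DPO builds on: a minimal sketch of the standard Direct Preference Optimization loss for one preference pair, computed from sequence log-probabilities under the policy and a frozen reference model. The cooling weight described in the abstract is deliberately not reproduced, because its formula is not given there. The function and argument names are hypothetical.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)), evaluated stably with logaddexp.
    return np.logaddexp(0.0, -margin)

# When the policy matches the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Raising the chosen response and lowering the rejected one reduces the loss.
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```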
Paperid: 2527,   Poster  
Authors: jin zhang, Zhe Cao, Biwen Yang, Ruiheng Zhang
Title: Annotation-Efficient Coreset Selection for Context-dependent Segmentation
Abstract: Context-dependent (CD) tasks, such as recognizing camouflaged objects and medical lesions, require models to have advanced visual understanding ability. Current CD methods rely heavily on pixel-level annotated training sets, neglecting issues from redundant samples and the high annotation costs. In this paper, we address the pruning needs of CD datasets, focusing on selecting the most valuable samples for labeling and training using weak annotations. To achieve this, we decompose CD coreset selection into two steps, sample evaluation and coreset selection, and propose a corresponding solution for each: points-based optimal transport and a maximum distance entropy strategy. Specifically, we formulate sample evaluation as an optimal transport problem between foreground and background distributions, designing a foreground destruction-reconstruction process based on points to compute transport costs and score samples. For samples of varying importance, our selection strategy balances coreset coverage and diversity. We validate our method on six CD tasks, achieving 1% accuracy loss relative to full training under a 40% pruning rate.
Paperid: 2528,   Poster  
Authors: Hoonhee Cho, Yuhwan Jeong, Kuk-Jin Yoon
Title: Event-based Motion Deblurring with Unpaired Data
Abstract: Event cameras provide high-temporal-resolution, motion-centric measurements that remain reliable under fast motion and challenging illumination, making them a promising sensing modality for motion deblurring. However, existing deblurring methods typically require large-scale paired blur–sharp datasets, which are extremely difficult to obtain in real-world settings, especially when an additional modality such as events is involved. In this work, we introduce EMP, an event-based motion deblurring framework that operates entirely in an unpaired setting, removing the need for aligned blur–sharp supervision. EMP bridges the disjoint blur and sharp domains through event information and leverages two complementary training mechanisms tailored to the unpaired regime: (1) an event-based physical prior with confidence masking that provides reliable self-supervisory signals for blurry inputs, and (2) a generative blur modeling process that extracts blur-related frequency-domain cues from blur–event pairs and transfers them to sharp images to synthesize realistic blur. As a result, these mechanisms enable stable and effective deblurring without requiring paired labels. Extensive experiments on various real-event datasets, including REBlur, EventAid, and HighREV, show that EMP outperforms existing unpaired baselines and achieves performance competitive with paired methods. We will make our code publicly available to the research community.
Paperid: 2529,   Poster  
Authors: Juntong Fang, Zequn Chen, Weiqi Zhang, Donglin Di, Xuancheng Zhang, Chengmin Yang, Yu-Shen Liu
Title: MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer
Abstract: Reconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruction. Extensive experiments on multiple benchmarks demonstrate that MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.
Paperid: 2530,   Poster  
Authors: Xuanchen Lu, Ang Cao, Chao Feng, Andrew Owens
Title: Generative Point Tracking and Trajectory Forecasting
Abstract: Motion forecasting predicts where points will move in the future, while motion tracking predicts where they are in the present. Despite these conceptual similarities, existing approaches to these two problems are quite different. In this paper, we propose a unified model that can address both tasks. We train a causal, video-conditioned flow matching model to predict point positions. The resulting model can easily toggle between point tracking and forecasting by changing its visual signal. Despite our model's simplicity, we find that it outperforms prior work in point forecasting and obtains performance that is competitive with the state-of-the-art on the TAP-Vid DAVIS benchmark.
Paperid: 2531,   Poster  
Authors: xiangpeng yang, Ji Xie, Yiyuan Yang, Yue Ma, Yan Huang, Min Xu, Qiang Wu
Title: Unified Video Editing as Temporal Reasoner
Abstract: Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "seeing, reasoning, then editing" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach.
Paperid: 2532,   Poster  
Authors: Masahiro Kada, Ryota Yoshihashi, Satoshi Ikehata, Rei Kawakami, Ikuro Sato
Title: Teacher-Guided Routing for Sparse Vision Mixture-of-Experts
Abstract: Recent progress in deep learning has been driven by increasingly large-scale models, but the resulting computational cost has become a critical bottleneck. Sparse Mixture of Experts (MoE) offers an effective solution by activating only a small subset of expert networks for each input, achieving high scalability with limited computation. Although effective, sparse MoE training exhibits characteristic optimization difficulties. Because the router receives gradients only from the experts it selects in each forward pass, its learning signal is highly localized, with little information about the broader expert space. This limited gradient feedback can lead the router toward suboptimal configurations, for example collapsing to only a few experts when no auxiliary losses are used, and it has also been associated with fluctuating expert selections during training. These behaviors suggest that task-driven signals alone do not provide sufficient guidance for learning robust routing behavior in sparse MoE. To address this issue, we propose TGR-MoE: Teacher-Guided Routing for Sparse Vision Mixture-of-Experts, a simple yet effective method that stabilizes router learning using supervision derived from a pretrained dense teacher model. TGR-MoE constructs a teacher router from the teacher's intermediate representations and uses its routing outputs as pseudo-supervision for the student router, suppressing frequent routing fluctuations during training and enabling knowledge-guided expert selection from the early stages of training. Extensive experiments on ImageNet-1K and CIFAR-100 demonstrate that TGR consistently improves both accuracy and routing consistency, while maintaining stable training even under highly sparse configurations.
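Illustrative note on sparse top-k routing: a minimal sketch, assuming a generic top-k softmax router, of why the router's learning signal is localized. Experts outside the top k get a gate value of exactly zero, so nothing flows back through them for that input. The function `top_k_gate` and the example logits are hypothetical; the teacher router and pseudo-supervision in TGR-MoE are not shown.

```python
import numpy as np

def top_k_gate(logits, k):
    """Softmax over the k largest router logits per row; all other gates are zero."""
    logits = np.asarray(logits, dtype=float)
    top_idx = np.argsort(logits, axis=-1)[:, -k:]
    mask = np.zeros_like(logits, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=-1)
    # Unselected experts get -inf so that exp(-inf) = 0 after the softmax.
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

# One token, four experts, two active experts.
gates = top_k_gate(np.array([[1.0, 3.0, 2.0, 0.0]]), k=2)
active_experts = np.flatnonzero(gates[0])  # experts 1 and 2
```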
Paperid: 2533,   Poster  
Authors: Dekel Galor, Adam Pikielny, Zhoutong Zhang, Ke Wang, Laura Waller, Jiawen Chen, Ilya Chugunov
Title: Hist2Style: Histogram-Guided Stylization with Bilateral Grids
Abstract: Photorealistic style transfer aims to match the color and tone of an input image to that of a style target while preserving the content and details of the original scene. Although existing large image models can facilitate these kinds of appearance edits, their high computational demands, potential for hallucinations, and limited user control make them unsuitable for high-resolution, real-time workflows. We introduce Hist2Style, a bilateral-grid formulation for fast, edge-aware stylization that preserves visual fidelity by constraining operations to locally affine transforms in bilateral space. Our model is trained to reproduce the spatially varying color edits available in larger image editing models. This training paradigm involves generating a large supervised corpus with language and vision-language models and distilling a high-capacity editor into a lightweight model. The model conditions on a histogram-based embedding of the style target, which provides an interpretable interface for adjusting the output style by modifying the target color distribution. Overall, Hist2Style maintains content structure by construction, avoids hallucinations, and supports real-time, high-resolution photorealistic stylization with interactive user-controllable color and tone adjustments.
Paperid: 2534,   Poster  
Authors: Zhijing Sun, Senyan Xu, Ruixuan Jiang, Kean Liu, Runze Tian, Xueyang Fu, Zheng-Jun Zha
Title: Time-Specialized Event-Image Alignment for Blur-to-Video Decomposition
Abstract: Motion blur is a common degradation in dynamic imaging. Recent studies have moved beyond restoring a single sharp image from a blurred input and instead target blur decomposition: recovering a temporally continuous sharp video sequence from one motion-blurred image. Event cameras, with their microsecond temporal resolution, can effectively alleviate motion ambiguity. However, existing event-based methods often fail to explicitly model time-aligned event–image features. How to accurately exploit event data to reconstruct frames at different time instants remains largely underexplored. In this paper, we propose TSANet, an event-based blur-to-video decomposition method that time-specializes both event features and image features for alignment. Specifically, we introduce a Relative Time-Encoded Attention module that steers event features toward motion information relevant to a given target time, and a Timesurface Dynamic Warping Module that warps image features into the spatial configuration corresponding to that time. With time-specialized motion features and image features that are explicitly aligned at arbitrary query times, our framework can decompose a single blurred image into a high-frame-rate sharp video sequence. In addition, we collect a new dataset containing real events and high-quality color video, and synthesize blurred inputs by averaging sharp frames to evaluate our method. Experiments on multiple datasets with both synthetic and real events demonstrate that our approach consistently outperforms previous state-of-the-art methods on the blur decomposition task.
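Editorial sketch: the abstract above mentions synthesizing blurred inputs by averaging sharp frames. The snippet below illustrates only that averaging step, assuming frames are equally sized 2D lists of grayscale intensities. The function name `synthesize_blur` and the toy data are hypothetical and not taken from the paper.

```python
def synthesize_blur(frames):
    """Average a list of equally sized 2D frames pixel by pixel."""
    if not frames:
        raise ValueError("need at least one frame")
    height, width = len(frames[0]), len(frames[0][0])
    n = len(frames)
    return [
        [sum(f[r][c] for f in frames) / n for c in range(width)]
        for r in range(height)
    ]


# Toy data (assumption): a bright pixel moving left to right over three frames.
sharp = [
    [[9.0, 0.0, 0.0]],
    [[0.0, 9.0, 0.0]],
    [[0.0, 0.0, 9.0]],
]
blurred = synthesize_blur(sharp)  # the moving pixel is smeared across the row
```

Averaging spreads the moving pixel's energy evenly along its trajectory, which is why a single blurred image leaves the motion direction ambiguous without extra cues such as events.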
Paperid: 2535,   Poster  
Authors: Kartik Patwari, Noranart Vesdapunt, Chien-Yi Wang, Dawei Li, Cong Phuoc Huynh, Ning Zhou, Chen-Nee Chuah, Kah Fu Fu
Title: Composite-Attribute Person Re-Identification via Pose-Guided Disentanglement
Abstract: Recent advances in vision-language models have enabled multi-modal person re-identification (Re-ID), where the system takes both an image and a text query to identify matching individuals. While previous state-of-the-art methods perform well with detailed, sentence-level descriptions, we found that their Recall@1 drops by half when using short, keyword-based queries due to ambiguity, training biases, and under-represented attributes. Despite this challenge, short queries provide a more natural and efficient user experience, requiring less effort and allowing for iterative refinement. To address this limitation, we introduce a new problem setting, Composite-Attributes Person Re-ID (CA-ReID), along with a fine-grained composite attribute dataset with queries belonging to varying levels of ambiguity. We further propose two methods: Dense Disentangling Loss to promote attribute-specific embeddings, and Part-Aware Representations that use pose estimation to align textual attributes with relevant body regions. Our method sets a new state of the art on the new CA-ReID benchmark (up to +17% Recall@1) and performs on par with prior methods on existing CC-ReID benchmarks. We will release our dataset to support this emerging direction.
Paperid: 2536,   Poster  
Authors: Jianzhe Gao, Churan Wang, Weiyi Zhang, Jianghua Li, Lian Li, Wenguan Wang, Yixin Zhu, Yizhou Wang
Title: Medical Video Diagnosis via Counterfactual Reasoning
Abstract: Medical video diagnosis involves inferring clinical decisions from dynamic tissue responses throughout examination processes. Existing methods rely on an end-to-end learning paradigm that i) focuses on appearance rather than pathology, ii) lacks clinical priors, and iii) reasons solely from observations without counterfactual comparison. This work introduces MedVCR, a counterfactual reasoning framework that mimics clinical diagnostic thinking. MedVCR comprises three components: a Counterfactual Generator that synthesizes tissue evolution under specified pathological states in a diffusion-based manner; a Counterfactual Representation Learning module that encodes diagnostic knowledge through clinical rules (i.e., temporal consistency, pathological separability, and counterfactual alignment); and a Dual Diagnostic Prediction strategy that integrates video-level assessment with frame-level counterfactual analysis. MedVCR is evaluated under both fully supervised (e.g., colposcopy) and weakly supervised (e.g., colonoscopy) video diagnosis settings, yielding 2.6%-10.2% performance gains compared with leading baselines. Comprehensive ablation studies further validate the effectiveness of each component. The code will be released.
Paperid: 2537,   Poster  
Authors: Zhe Chen, Fanhui Meng, Tianyang Xu, Xiaojun Wu
Title: Cluster-aware Anchor Learning for Multi-View Clustering
Abstract: Anchor-based multi-view clustering is attractive for its efficiency, yet most methods fix the number of anchors a priori, implicitly assuming uniform needs across clusters. In practice, clusters differ in information richness, scale, and intrinsic structure, motivating adaptive per-cluster anchor allocation. We propose Cluster-aware Anchor Learning (CAL), which learns a consensus anchor matrix and organizes its columns into cluster-specific anchor groups. CAL imposes an $\ell_{2,1}$-norm column-sparsity penalty on each group to suppress redundancy and preserve cluster-discriminative features, thereby automatically determining how many anchors each cluster retains. To further enhance separability, CAL introduces an inter-cluster regularization that constrains relationships among groups, promoting mutual dissimilarity. This data-driven design learns higher-quality, cluster-aware anchors and yields a more discriminative representation matrix across multiple views. Extensive experiments on multiple benchmarks show that CAL outperforms state-of-the-art multi-view clustering methods, demonstrating superior effectiveness, robustness, and adaptability to heterogeneous cluster structures.
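Editorial sketch: the column-sparsity penalty named in the abstract above can be illustrated with the column-wise form of the ℓ2,1 norm, i.e., the sum of the Euclidean norms of a matrix's columns. The snippet assumes this column-wise convention (some texts define the norm over rows instead). The helper name `l21_norm_columns` and the toy anchor group are hypothetical and not taken from the paper.

```python
import math


def l21_norm_columns(matrix):
    """Sum of the Euclidean norms of the columns of a 2D list."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return sum(
        math.sqrt(sum(matrix[r][c] ** 2 for r in range(n_rows)))
        for c in range(n_cols)
    )


# Hypothetical anchor group: 2 features x 3 anchors, middle anchor zeroed out.
group = [
    [3.0, 0.0, 6.0],
    [4.0, 0.0, 8.0],
]
penalty = l21_norm_columns(group)  # column norms are 5, 0 and 10
```

Because each column contributes its whole norm, minimizing this penalty tends to drive entire columns to zero, which is how such a term can prune anchors within a group.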
Paperid: 2538,   Poster  
Authors: Paul Walker, James Gardner, Andreea Ardelean, William Smith, Bernhard Egger
Title: VENI: Variational Encoder for Natural Illumination
Abstract: Inverse rendering is an ill-posed problem, but priors, such as illumination priors, can simplify it. Existing work either disregards the spherical and rotation-equivariant nature of illumination environments or does not provide a well-behaved latent space. We propose a rotation-equivariant variational autoencoder that models natural illumination on the sphere without relying on 2D projections. To preserve the SO(2)-equivariance of environment maps, we use a novel Vector Neuron Vision Transformer (VN-ViT) as encoder and a rotation-equivariant conditional neural field as decoder. In the encoder, we reduce the equivariance from SO(3) to SO(2) using a novel SO(2)-equivariant fully connected layer, an extension of Vector Neurons. We show that our SO(2)-equivariant fully connected layer outperforms standard Vector Neurons when used in our SO(2)-equivariant model. Compared to previous methods, our variational autoencoder enables smoother interpolation in latent space and offers a more well-behaved latent space.
Paperid: 2539,   Poster  
Authors: Fiona Ryan, Ishwarya Ananthabhotla, Yijun Qian, Judy Hoffman, James Rehg, Vamsi Krishna Ithapu, Calvin Murdock
Title: Forecasting 3D Scanpaths in Egocentric Video
Abstract: Forecasting gaze behavior is an important task for understanding user intent and creating AR/VR systems that can anticipate where users will look and interact next. While prior works have addressed predicting scanpaths in static images, forecasting gaze in egocentric videos presents new challenges due to the dynamic nature of the scene and the camera wearer’s continuous movement through the 3D environment. To address these challenges, we formulate the novel task of egocentric scanpath prediction as forecasting a sequence of future fixations in 3D Cartesian coordinates relative to the last observed camera pose, producing a 3D scanpath that is grounded in the environment. We propose a transformer architecture that leverages egocentric video frames, head pose, and past 3D gaze observations to predict future 3D fixation sequences. We evaluate our method on the Aria Digital Twin dataset. Our findings establish a baseline for the novel task of 3D scanpath prediction and highlight important architectural elements for our task.
Paperid: 2540,   Poster  
Authors: Jongmin Lim, Soobin CHA, Jaehun Park, Inho Oh, Minho Park, Kwangsu Kim
Title: Data-Centric Meta-Learning for Robust Few-Shot Generalization
Abstract: Few-shot learning aims to enable rapid adaptation to unseen tasks using limited data. Optimization-based meta-learning addresses this challenge by acquiring shared prior knowledge across diverse tasks. However, its effectiveness degrades in cross-domain scenarios where unseen tasks differ significantly from training tasks. We identify this degradation as a failure to acquire generalizable prior knowledge, which is fundamentally caused by gradient discrepancies: conflicting update directions arising in the meta-training environment with diverse task distributions. To achieve robust few-shot generalization, we propose Data-Centric Meta-Learning (DCML), a novel framework that mitigates gradient discrepancies by aligning task-specific input distributions with shared prior knowledge. DCML accomplishes this alignment through a meta-learnable visual prompt that is integrated into the entire meta-learning process, unlike previous prompt-based methods restricted solely to test-time adaptation. During meta-training, the prompt transforms each task’s inputs to induce more consistent gradients, thereby facilitating the learning of generalizable prior knowledge. Leveraging this robust knowledge, DCML enables rapid and parameter-efficient test-time adaptation by updating only the lightweight prompt and classifier while keeping the backbone frozen. Extensive experiments demonstrate that DCML consistently outperforms baselines, particularly in challenging few-shot cross-domain scenarios, establishing a data-centric perspective for robust meta-learning.
Paperid: 2541,   Poster  
Authors: Zeyu Chen, Fangmin Zhao, Yan Shu, Yichao Liu, Liu Yu, Yu ZHOU
Title: StyleTextGen: Style-Conditioned Multilingual Scene Text Generation
Abstract: Style-conditioned scene text generation faces unique challenges in extracting precise text styles from complex backgrounds and maintaining fine-grained style consistency across characters, especially for multilingual scripts. We propose StyleTextGen, a novel framework that learns to perceive and replicate visual text styles across different languages and writing systems. Our approach features three key contributions: First, we introduce a dual-branch style encoder dedicated to style modeling, yielding robust multilingual text style representations in complex real-world scenes. Second, we design a text style consistency loss that enhances style coherence and improves overall visual quality. Third, we develop a mask-guided inference strategy that ensures precise style alignment between generated and reference text. To facilitate systematic evaluation, we construct StyleText-CE, a bilingual scene text style benchmark covering both monolingual and cross-lingual settings. Extensive experiments demonstrate that StyleTextGen significantly outperforms existing methods in style consistency and cross-lingual generalization, establishing new state-of-the-art performance in multilingual style-conditioned text generation.
Paperid: 2542,   Poster  
Authors: Yanlin Li, Minghui Guo, Kaiwen Zhang, Shize Zhang, Yiran Zhao, Haodong Li, Congyue Zhou, Weijie Zheng, Yushen Yan, Shengqiong Wu, Wei Ji, Lei Cui, Furu Wei, Hao Fei, Mong-Li Lee, Wynne Hsu
Title: Benchmarking Unified Any-to-Any Interleaved Multimodal Learning
Abstract: In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. All of our resources will be openly released.
Paperid: 2543,   Poster  
Authors: Young-Han Son, Dong-Hee Shin, Deok-Joong Lee, Hyun Jung Lee, Tae-Eui Kam
Title: Optical Diffraction-based Convolution for Semiconductor Lithography
Abstract: In recent years, the increasing demand for smaller and more powerful semiconductors has highlighted the critical role of lithography, a key stage in semiconductor manufacturing responsible for precise mask design and wafer patterning. To meet these demands, the semiconductor industry has increasingly adopted computational lithography, employing machine learning and deep learning techniques to accelerate advancements in lithographic technology. Despite the various research efforts and successes in computational lithography, there remains a lack of explicit incorporation of physical principles. This gap limits the ability of existing methods to fully capture the complex physical phenomena inherent in lithography behaviors. To bridge this gap, we propose OptiCo, a novel convolutional neural network that seamlessly integrates optical diffraction principles into its architecture. At its core, OptiCo employs an optical phase kernel to model phase variations resulting from light propagation, effectively capturing the physical interactions among light, masks, and wafers. We evaluate OptiCo on semiconductor lithography benchmarks, demonstrating its superior performance in mask optimization tasks, along with remarkable generalization to out-of-distribution (OOD) datasets.
Paperid: 2544,   Poster  
Authors: Haoyu He, Yue Zhuo, Yu Zheng, Qi Wang
Title: Structural Graph Probing of Vision–Language Models
Abstract: The internal organization of vision–language models (VLMs) remains poorly understood, particularly how they distribute and fuse information across layers. We take a topology-first perspective and analyze VLMs through the interaction graphs induced by neuron–neuron correlations, treating each layer as a structured computational network rather than a sequence of token transformations. Operating solely on these graphs, we show that global connectivity patterns are strongly predictive of model behavior across grounded reasoning, counting, and hallucination tasks. Modality-separated graphs reveal that cross-modal fusion strengthens sharply in mid-to-late layers, while contrastive graph alignment exposes how multimodal training reorganizes topology relative to text-only backbones. Targeted interventions on high-degree neurons further demonstrate their causal influence, indicating that VLMs route multimodal reasoning through sparse but structurally critical hubs. These results highlight interaction topology as a powerful, model-agnostic lens for interpreting and comparing multimodal transformers.
Paperid: 2545,   Poster  
Authors: Xiaodi Li, Pan Xie, Yi Ren, Qijun Gan, Chen Zhang, Fangyuan Kong, Xiang Yin, Zehuan Yuan, BINGYUE PENG
Title: InfinityHuman: Towards Long-Term Audio-Driven Human Animation
Abstract: Audio-driven human animation has attracted wide attention thanks to its practical applications. However, critical challenges remain in generating high-resolution, long-duration videos with consistent appearance and natural hand motions. Existing methods extend videos using overlapping motion frames but suffer from error accumulation, leading to identity drift, color shifts, and scene instability. Additionally, hand movements are poorly modeled, resulting in noticeable distortions and misalignment with the audio. In this work, we propose InfinityHuman, a coarse-to-fine framework that first generates audio-synchronized representations, then progressively refines them into high-resolution, long-duration videos using a pose-guided refiner. Since pose sequences are decoupled from appearance and resist temporal degradation, our pose-guided refiner employs stable poses and the initial frame as a visual anchor to reduce drift and improve lip synchronization. Moreover, to enhance semantic accuracy and gesture realism, we introduce a hand-specific reward mechanism trained with high-quality hand motion data. Experiments on the EMTD and HDTF datasets show that InfinityHuman achieves state-of-the-art performance in video quality, identity preservation, hand accuracy, and lip-sync. Ablation studies further confirm the effectiveness of each module.
Paperid: 2546,   Poster  
Authors: Hongjun Wang, Lin Liu, Jianguo Li, Tao Lin
Title: Dual-Granularity Memory for Efficient Video Generation
Abstract: Video generation using recurrent architectures offers compelling efficiency advantages over attention-based transformers, particularly for long-sequence generation. However, chunked processing in recurrent models creates temporal discontinuities that harm long-range consistency. We introduce two complementary memory mechanisms to address this challenge at different granularities: (1) Context Memory maintains persistent global context within attention chunks through learnable sink columns and boundary buffers, adding only 150K parameters (<0.1% overhead); (2) Latent Context-as-Memory (LCaM) extends memory across video segments by storing and retrieving historical latent embeddings, enabling cross-segment consistency without requiring camera annotations or frame reconstruction. Applied to Generalized Spatial-temporal Propagation Networks (GSTPN), our dual-memory approach achieves 1.54× faster inference than attention-based transformers, while excelling in visual quality metrics. Our approach is particularly effective for knowledge distillation scenarios where only pre-extracted latent embeddings are available. This work demonstrates compelling efficiency-quality trade-offs for practical long video generation.
Paperid: 2547,   Poster  
Authors: Yu Ma, Zizhan Guo, Zuyi Xiong, Haoran Zhang, Yi Feng, Hongbo Zhao, Hanli Wang, Rui Fan
Title: The Midas Touch for Metric Depth
Abstract: Recent advances have markedly improved the cross-scene generalization of relative depth estimation, yet its practical applicability remains limited by the absence of metric scale, local inconsistencies, and low computational efficiency. To address these issues, we present Midas Touch for Depth (MTD), a mathematically interpretable approach that converts relative depth into metric depth using only extremely sparse 3D data. To eliminate local scale inconsistencies, it applies a segment-wise recovery strategy via sparse graph optimization, followed by a pixel-wise refinement strategy using a discontinuity-aware geodesic cost. MTD exhibits strong generalization and achieves substantial accuracy improvements over previous depth completion and depth estimation methods. Moreover, its lightweight, plug-and-play design facilitates deployment and integration on diverse downstream 3D modeling tasks.
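Editorial sketch: as a point of reference for the abstract above, the simplest way to attach metric scale to a relative depth map is a single global scale and shift fitted by least squares to a few sparse metric points. The snippet shows only that global baseline; it is not the segment-wise graph optimization or geodesic refinement of MTD. It assumes an affine relation between relative and metric values, and the function name `fit_scale_shift` and sample values are hypothetical.

```python
def fit_scale_shift(relative, metric):
    """Least-squares scale s and shift t minimizing sum((s * r + t - m) ** 2)."""
    n = len(relative)
    if n < 2 or n != len(metric):
        raise ValueError("need at least two matched samples")
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative samples must not all be equal")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


# Toy data (assumption): relative depth sampled at three sparse 3D points,
# with metric depth generated as 5 * r + 2.
rel_at_sparse = [0.1, 0.4, 0.9]
metric_at_sparse = [2.5, 4.0, 6.5]
scale, shift = fit_scale_shift(rel_at_sparse, metric_at_sparse)
```

A single global fit cannot correct scale errors that vary across the image, which is the kind of local inconsistency the abstract says its segment-wise strategy targets.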
Paperid: 2548,   Poster  
Authors: Zhengqing Wang, Saurabh Nair, Prajwal Chidananda, Pujith Kachana, Samuel Li, Matthew Brown, Yasutaka Furukawa
Title: Latent Action Pretraining Meets Pose Estimation
Abstract: This paper revisits camera pose estimation through the lens of self-supervised pretraining, focusing on inverse-dynamics pretraining as a scalable alternative to the current trend of fully supervised training with 3D annotations. Concretely, we employ inverse- and forward-dynamics models to learn latent action representations, similar to Genie, from large-scale driving videos. Our idea is simple yet effective. Existing methods use latent actions in their original capacity, that is, as action conditioning of world-models or as proxies of robot action parameters in policy networks. Our method, dubbed LA-Pose, repurposes the latent action features as inputs to a camera pose estimator, finetuned on a limited set of high-quality 3D annotations. This formulation enables accurate and generalizable pose prediction while maintaining feed-forward efficiency. Extensive experiments on driving benchmarks show that LA-Pose achieves competitive and even superior performance to state-of-the-art methods while using orders of magnitude less labeled data. Concretely, on the Waymo and PandaSet benchmarks, LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods. To our knowledge, this work is the first to demonstrate the power of inverse-dynamics self-supervised learning for pose estimation.
Paperid: 2549,   Poster  
Authors: Xuesong Liu, Anke Xu, Wenbo Cao, Emmett Ientilucci
Title: Structure-Aware Representation Distillation for Tiny-Dense Object Segmentation
Abstract: Dense scenes containing numerous tiny objects pose a fundamental challenge for segmentation models, where small localization errors can significantly degrade downstream measurements. We present Structure-Aware Representation Distillation (SARD), a teacher-compatible framework that transfers structural knowledge from a large teacher to a compact student via feature-space alignment rather than mask imitation. SARD constructs a structure-importance map that combines boundary salience, local density, and teacher confidence, and uses it to weight a unified representation loss integrating feature consistency, distribution alignment, and structural contrast. This encourages the student to allocate capacity to geometrically informative regions while preserving global context. Experiments on Cityscapes, ADE20K, and a challenging rock fragmentation benchmark (RockFrag) show that SARD consistently improves both mIoU and boundary IoU over strong distillation baselines; on RockFrag, SARD improves a Swin-T student over CWD by +4.3 mIoU and +6.7 bIoU. A ResNet-50 student distilled from a Swin-L teacher achieves up to 7.7× parameter reduction and 9× higher throughput than the teacher, with no additional inference overhead beyond the student network, demonstrating that structure-aware representation distillation is effective and efficient for tiny-dense segmentation.
Paperid: 2550,   Poster  
Authors: Shasha Han, Chong Li, Xinning Wang, Xuebo Li
Title: YOLO-ULM: Ultra-Lightweight Models for Real-Time Object Detection
Abstract: The YOLO series leads object detection with superior accuracy and speed. However, both convolutional and self-attention-based architectures suffer from parameter redundancy and insufficient computational efficiency. Existing lightweight methods excessively pursue speed while ignoring the loss of important information during feature extraction and spatial transformation across different stages. Thus, effective lightweighting is crucial for detection performance. We propose YOLO-ULM, an ultra-lightweight real-time detector that achieves accelerated inference while preserving high accuracy. We design a variety of modules driven by both efficiency and accuracy, including efficient feature aggregation and multi-scale downsampling modules, as well as a more focused complete-IoU loss function. To validate our approach, we train it from scratch on the COCO dataset without pretrained weights. By refining backbone parameters, we extend it to YOLO-ULM-Turbo for accelerated inference. YOLO-ULM surpasses state-of-the-art real-time detectors like YOLOv11/YOLOv12/YOLOv13 and RT-DETR. On a T4 GPU, YOLO-ULM-N achieves 41.6% mAP with an inference latency of 1.52 ms, outperforming YOLOv11-N (2.2%↑) and YOLOv12-N (1.0%↑). YOLO-ULM-S exceeds RT-DETR-R18 by 1.6% mAP with 64.7% fewer FLOPs and 63% fewer parameters. YOLO-ULM-L / X surpass YOLOv13-L / X by 0.7% and 0.8% respectively in mAP. YOLO-ULM-Turbo matches YOLOv12-Turbo's performance but uses less computation, with the Turbo-N variant achieving 0.3% higher mAP and 16% fewer parameters than YOLOv12-Turbo-N.
Paperid: 2551,   Poster  
Authors: Heasung Kim, Taekyun Lee, Hyeji Kim, Gustavo De Veciana
Title: Efficient Weighted Sampling via Score-based Generative Models
Abstract: Weighted sampling, i.e., sampling from a probability density function (PDF) proportional to the product of a base PDF and a weight function, is a fundamental technique with wide-ranging applications in variance reduction, biased sampling, data augmentation, and more. Leveraging the increasing availability of pretrained score-based generative models (SGMs), we propose a training-free weighted sampling framework that approximates the backward diffusion process of the target distribution by augmenting the pretrained base score function with an auxiliary guidance term, in a principled and computationally efficient manner. Our approach builds on two key components: a lightweight approximation of the guidance that avoids costly higher-order derivatives of both the score and weight functions, and an uncertainty-aware scheduler that dynamically adjusts the guidance strength based on a temporal analysis of approximation error. Together, these components enable accurate and stable sampling without relying on particle-based resampling or Hessian evaluations commonly required by existing methods. We validate the effectiveness of our method from synthetic to large-scale settings such as Stable Diffusion XL, where our framework achieves 1.2× to 4.7× speedups while consistently matching or outperforming state-of-the-art baselines in task performance. These results position our method as a scalable and inference-efficient solution for task-adaptive, time-sensitive sampling in generative applications.
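Editorial sketch: the idea of augmenting a base score with a guidance term can be seen in the noise-free case, where a target density q(x) proportional to p(x)·w(x) has score ∇log q = ∇log p + ∇log w. The snippet checks this identity on a 1D toy problem chosen for illustration: base p = N(0, 1) and weight w(x) = exp(a·x), for which q is N(a, 1). It does not reproduce the paper's guidance approximation or scheduler for noisy diffusion times, and all function names are hypothetical.

```python
def base_score(x):
    """Score of the base density N(0, 1): d/dx log p(x) = -x."""
    return -x


def grad_log_weight(x, a):
    """Gradient of log w(x) for the toy weight w(x) = exp(a * x)."""
    return a


def target_score(x, a):
    """Score of q(x) proportional to p(x) * w(x): base score plus guidance."""
    return base_score(x) + grad_log_weight(x, a)


def analytic_target_score(x, a):
    """For this toy choice q is N(a, 1), so its score is -(x - a)."""
    return -(x - a)


# Compare the composed score with the closed form at a few points.
gap = max(
    abs(target_score(x, 0.7) - analytic_target_score(x, 0.7))
    for x in (-2.0, 0.0, 1.5)
)
```

At intermediate diffusion times the weight acts on the clean sample rather than the noisy one, so the guidance term is no longer simply ∇log w; approximating it cheaply is the problem the abstract describes.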
Paperid: 2552,   Poster  
Authors: Hanqing Liu, Mingjie Liu, Luoping Cui, Endian Lin, Donghong Jiang, Chuang Zhu
Title: RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding
Abstract: Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question–answer (QA) pairs. To develop and evaluate RE-VLM, we construct two datasets: PEOD-Chat, targeting illumination-challenged scenes, and RGBE-Chat, covering diverse scenarios. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions. These results demonstrate the effectiveness of event-augmented VLMs in achieving robust vision-language understanding across a wide range of real-world environments. Code and datasets will be released.
Paperid: 2553,   Poster  
Authors: Shih-Wen Liu, Yen-Chang Chen, Wei-Ta Chu, Fu-En Yang, Yu-Chiang Frank Wang
Title: Frequency Switching Mechanism for Parameter-Efficient Multi-Task Learning
Abstract: Multi-task learning (MTL) aims to equip a single model with the ability to solve multiple tasks efficiently; however, current parameter-efficient fine-tuning (PEFT) methods remain largely limited to single-task adaptation. We introduce Free Sinewich, a parameter-efficient multi-task learning framework that achieves efficient weight reuse through frequency switching. A lightweight Clock Net first determines task-dependent frequencies with negligible overhead (Free). These frequencies modulate a Sine-AWB (Sinewich) layer, where low-rank factors and convolutional priors are combined into a single kernel and transformed via an elementwise sinusoidal transformation to produce task-specialized weights. Theoretically, sine modulation enhances the rank of low-rank adapters, while frequency separation decorrelates the weights of different tasks. On dense prediction benchmarks, Free Sinewich achieves state-of-the-art performance-efficiency trade-offs (e.g., up to +5.39% improvement over single-task fine-tuning with only 6.53M trainable parameters), offering a compact and scalable paradigm based on frequency-based parameter sharing. Our code is publicly available.
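Editorial sketch: the claim above that sine modulation enhances the rank of low-rank adapters can be illustrated on a toy matrix. The snippet builds a rank-1 outer product, applies an elementwise sine with frequency 1, and compares numerical ranks. It is not the Sine-AWB layer, which also combines convolutional priors. The vectors, the frequency, and the helper `matrix_rank` are assumptions made for illustration.

```python
import math


def matrix_rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    rows = [list(map(float, row)) for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) < tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank


u = [1.0, 2.0, 3.0]
v = [1.0, 2.0, 3.0]
low_rank = [[ui * vj for vj in v] for ui in u]  # rank-1 outer product
modulated = [[math.sin(x) for x in row] for row in low_rank]  # elementwise sine

rank_before = matrix_rank(low_rank)
rank_after = matrix_rank(modulated)
```

The elementwise nonlinearity breaks the proportionality between rows, so the modulated matrix is no longer confined to the rank of its low-rank argument while still being parameterized by the same factors.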
Paperid: 2554,   Poster  
Authors: Mingyu Zhang, lifeng zhuo, Tianxi Tan, Guocan Xie, Xian Nie, Yan Li, Renjie Zhao, Zizhu He, Ziyu Wang, Jiting Cai, Yonglu Li
Title: Survive the 1001$^{st}$ Night: Interactive Physical Reasoning
Abstract: Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. We study this in a Game-to-Unseen (G2U) setting, curating 1,000+ heterogeneous games with diverse physical and causal mechanisms, and evaluate at three human-like levels: Survival, Curiosity, Utility, from primitive intuition to goal-driven reasoning. Our analysis reveals complementary failures: VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoning), using world-model rollouts to score and reinforce a VLM’s policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on three levels, matches GPT-5 overall, and surpasses it on Curiosity. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning. Our code will be publicly available.
Paperid: 2555,   Poster  
Authors: Yijia Guo, Tong Hu, Liwen Hu, Lei Ma, Tiejun Huang
Title: 3D Gaussian Splatting from unposed Spike Stream
Abstract: 3D Gaussian Splatting (3DGS) has significantly advanced 3D reconstruction with its impressive performance. However, its reliance on sharp images and precise camera pose priors limits its effectiveness in high-speed scenarios. Recent advances have integrated the spike camera, a bio-inspired sensor with high temporal resolution, to enhance 3DGS in such conditions. Although spike-based methods reduce the need for sharp images, they still face challenges in achieving precise camera pose estimation due to unstable observations and visual texture deficiency. To address these challenges, we propose Nope-SGS, the first framework that reconstructs high-speed 3D scenes from unposed captures of the bio-inspired high-temporal-resolution spike camera. To achieve robust 3D reconstruction and pose estimation, we first reformulate the spike model from a probabilistic perspective and extend its application to keyframing, effectively alleviating the instability caused by the spike stream. Building upon this foundation, we devise a progressive optimization framework to facilitate swift 3D reconstruction. The experimental results demonstrate that our method achieves up to 7.4 dB higher PSNR and 40% lower Absolute Trajectory Error (ATE) compared to state-of-the-art methods under challenging high-speed scenarios while maintaining the fastest reconstruction speed among spike-based methods.
Paperid: 2556,   Poster  
Authors: Rajeev Dwivedi, Anshuman Dangwal, Vinod Kurmi
Title: Rank-Guided Pseudo-Bias Learning for Robust Black-Box Adaptation
Abstract: Pretrained vision encoders are widely used as frozen, black-box feature extractors, yet they often inherit spurious correlations that disproportionately harm underrepresented groups. We introduce PLD-Debias, a fully black-box debiasing framework that requires neither access to backbone parameters nor demographic annotations. Our method integrates three components: (1) Rank-Regularized Amplification, a lightweight adapter that exaggerates latent spurious directions; (2) Unsupervised Pseudo-Bias Induction, which clusters amplified features to infer high-fidelity proxy bias labels; and (3) Bias-Guided Refinement, combining supervised contrastive alignment with cluster-aware adaptive margins to purify representations and equalize decision boundaries. We theoretically show that these components jointly tighten a worst-group risk bound under spurious correlations. Empirically, PLD-Debias achieves state-of-the-art worst-group accuracy across CelebA, Waterbirds, and CMNIST, improving performance by 3 to 5 points over prior black-box methods while maintaining average accuracy. Remarkably, our pseudo-bias labels align with ground-truth bias annotations at over 90% fidelity, enabling oracle-level robustness without demographic supervision. Our results demonstrate that fairness and utility can be achieved through a plug-and-play classifier adapter for any frozen foundation model.
Paperid: 2557,   Poster  
Authors: Shuyi Ouyang, Gongfan Fang, Xinyin Ma, Yen-Wei Chen, Lanfen Lin, Xinchao Wang
Title: Language-guided Frequency Modulation for Large Vision-Language Models
Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual reasoning across diverse tasks. These tasks place different demands on visual representations: some prioritize high-level global context, while others emphasize fine-grained local details. However, most existing methods operate on visual representations primarily in the spatial domain, lacking an explicit mechanism for distinguishing between high-frequency local details and low-frequency global context. This limitation hinders fine-grained control of visual representations and complicates their hierarchical alignment with language. To address this issue, we introduce Language-guided Frequency Modulation (LFM), a plug-and-play approach that adaptively refines visual signals in the frequency domain under linguistic guidance. By selectively enhancing critical regions and details, LFM enables more structured and precise visual processing. Crucially, it adds no extra training parameters, relying solely on a lightweight learnable projector to refine visual tokens before integration into the LLM, thereby ensuring minimal computational overhead. Extensive experiments across diverse vision-language benchmarks highlight LFM’s scalability, effectiveness, and broad applicability to LVLMs. The code will be publicly available.
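The distinction the abstract above draws between low-frequency global context and high-frequency local detail can be sketched with a plain FFT split. This is a generic illustration, not the LFM module: the circular mask, the radius, and the two scalar gains (stand-ins for whatever language-derived weights the method uses) are all assumptions.

```python
import numpy as np


def split_frequencies(feature_map, radius):
    """Split a 2D map into low- and high-frequency parts with a circular FFT mask."""
    h, w = feature_map.shape
    # Move the zero-frequency component to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    low_mask = dist <= radius
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high


def modulate(feature_map, radius, gain_low, gain_high):
    """Re-weight the two bands; the gains are hypothetical guidance weights."""
    low, high = split_frequencies(feature_map, radius)
    return gain_low * low + gain_high * high


rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))           # one channel of a toy visual feature map
low, high = split_frequencies(x, radius=4)
sharpened = modulate(x, radius=4, gain_low=1.0, gain_high=1.5)
```

Because the two masks are complementary, low + high reproduces the input, so unit gains leave the features unchanged and any other gains act as a band-wise emphasis.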
Paperid: 2558,   Poster  
Authors: Zhengyao Lv, Menghan Xia, Xintao Wang, Kwan-Yee K. Wong
Title: DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution
Abstract: Diffusion-based video super-resolution (VSR) achieves remarkable fidelity but suffers from prohibitive sampling cost. While distribution matching distillation (DMD) accelerates diffusion models to one-step generation, directly applying it to VSR leads to training instability and degraded, insufficient supervision. To address these issues, we propose DUO-VSR, a three-stage framework centered on a DUal-Stream Distillation strategy that integrates distribution matching and adversarial supervision for One-step VSR. We first adopt a Progressive Guided Distillation Initialization to stabilize subsequent training through trajectory-preserving distillation. We then introduce a Dual-Stream Distillation Strategy to jointly optimize DMD and Real–Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision using features from both real and fake score models. Finally, a Preference-Guided Refinement aligns the student with perceptual quality preferences. Comprehensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR methods.
Paperid: 2559,   Poster  
Authors: Qinghao Zhong, Bingzhi Chen, Yishu Liu, Minhua Lu, Guangming Lu
Title: ReMoE: Region-Mixture Experts for Adversarially-Robust Vision Transformers
Abstract: Vision Transformers (ViTs) achieve state-of-the-art performance on a wide range of vision tasks, yet they remain highly vulnerable to adversarial perturbations due to the lack of explicit region-level semantic modeling. Adversarial perturbations are typically local and spatially structured, whereas the globally coupled self-attention and spatially uniform feed-forward networks in ViTs propagate local corruptions across the whole image without enforcing consistency within semantically coherent regions. To mitigate this mismatch, we propose Region-aware Mixture-of-Experts, namely "ReMoE", a plug-and-play module that replaces the standard feed-forward network (FFN) with a region-aware expert layer. Specifically, our ReMoE strategically introduces multi-granularity experts (i.e., global, center, and regional) and couples them with an attention-guided routing mechanism that operates on patch-to-region (P2R) and region-to-patch (R2P) transformations. This mechanism adaptively activates the most relevant experts for each spatial location according to its attention profile, enabling the model to capture region-level semantics and local context while preserving global consistency, thereby providing a stronger inductive bias for adversarially robust ViT representations. Extensive experiments demonstrate that our ReMoE substantially improves the adversarial robustness of ViTs with only marginal additional computational cost.
Paperid: 2560,   Poster  
Authors: Mosam Dabhi, Irhas Gill, Laszlo Jeni, Simon Lucey
Title: 2D-LFM: Lifting Foundation Model without 3D supervision
Abstract: Recent vision foundation models give the impression that 3D reconstruction from RGB is largely solved. Yet these systems struggle with object-specific 3D structure: the fine-grained geometry implied by an object’s landmarks or skeleton. In this paper, we show that when a model is given only 2D landmarks, it can recover more accurate 3D structure than state-of-the-art depth-from-RGB foundation models. Classical lifting approaches such as PAUL demonstrate this principle but do not scale beyond single categories, while methods like 3D-LFM scale but require extensive 3D supervision. We present the first lifting foundation model that learns object-specific 3D geometry using only 2D supervision. The key idea is to inject correspondence structure into the model via a positional encoding inspired by classical structure-from-motion. This simple inductive bias enables robust, object-agnostic 3D lifting that rivals or exceeds recent 3D-supervised approaches, revealing that landmark-based lifting remains a powerful and under-exploited paradigm for 3D understanding.
Paperid: 2561,   Poster  
Authors: Yaomin Cai, C. L. Philip Chen, Shiting Xu, Haiqi Liu, Tong Zhang
Title: Region-Aware Instance Consistency Learning for Micro-Expression Recognition
Abstract: Micro-expression Recognition (MER) is challenging because the underlying motion is subtle. Existing methods rely heavily on the onset/apex pair to capture the most discriminative motion clues. This paradigm requires labor-intensive apex annotation and makes poor use of the available data. In this paper, we propose a novel paradigm for MER that eliminates the need for expensive apex annotations while effectively capturing subtle motion dynamics. Our key insight is that frames within the sequence exhibit spatially consistent and intensity-varied motion cues relative to the onset frame. Motivated by this, our method treats each sequence as a set of multiple onset/near-median motion instances. To fully exploit weaker motion information conveyed by these varied instances, our framework introduces an Instance Region Consistency (IRC) module that enforces visual attention consistency on similar facial activation regions across different instances within the same set. Furthermore, we present a Multi-Region Discovery (MRD) module with self-supervised learning to expand attention to more subtle activation regions, which are typically neglected. Extensive experiments on four public micro-expression datasets demonstrate that our proposed approach surpasses state-of-the-art methods without using any apex frame annotations.
Paperid: 2562,   Poster  
Authors: Xiao Zhang, Ruoxi Jiang, Will Gao, Rebecca Willett, Michael Maire
Title: Residual Connections Harm Self-Supervised Abstract Feature Learning
Abstract: We show that introducing a weighting factor to reduce the influence of identity shortcuts in residual networks significantly enhances semantic feature learning in generative representation learning frameworks, such as masked autoencoders (MAEs) and diffusion models. Our modification improves linear probing accuracy for both, notably increasing ImageNet accuracy from 67.8% to 72.7% for MAEs with a ViT-B/16 backbone, while also boosting generation quality for diffusion models. This significant gap suggests that, while residual connection structure serves an essential role in facilitating gradient propagation, it may have a harmful side effect of reducing capacity for abstract learning by virtue of injecting an echo of shallower representations into deeper layers. We ameliorate this downside via a fixed formula for monotonically decreasing the contribution of identity connections as layer depth increases. Our design promotes the gradual development of feature abstractions, without impacting network trainability. Analyzing the representations learned by our modified residual networks, we find correlation between low effective feature rank and downstream task performance.
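A depth-dependent shortcut weight of the kind the abstract above describes can be sketched as follows. The abstract does not state the fixed formula, so the linear schedule, the floor alpha_min, and the toy tanh layers are assumptions chosen only to show the structure x <- alpha_l * x + f_l(x) with alpha_l shrinking as depth grows.

```python
import numpy as np


def identity_weights(num_layers, alpha_min=0.5):
    """Assumed schedule: shortcut weight decays linearly from 1.0 to alpha_min."""
    return np.linspace(1.0, alpha_min, num_layers)


def forward(x, layer_weights, alphas):
    """Toy residual stack in which layer l computes alpha_l * x + f_l(x)."""
    for w, alpha in zip(layer_weights, alphas):
        x = alpha * x + np.tanh(x @ w)   # down-weighted identity plus residual branch
    return x


rng = np.random.default_rng(0)
depth, dim = 6, 8
layer_weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(depth)]
alphas = identity_weights(depth)
out = forward(rng.standard_normal((4, dim)), layer_weights, alphas)
print(alphas)
```

Setting alpha_min to 1.0 recovers a standard residual stack, which makes the schedule easy to ablate against the unmodified network.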
Paperid: 2563,   Poster  
Authors: Xi Liu, Weiwei Sun, Joe Ren, Christopher Broaddus, Siyu Huang, Laurent Guigues
Title: HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction
Abstract: Diffusion priors have recently demonstrated strong capability in enhancing the quality of sparse-view 3D reconstruction by augmenting training views at novel viewpoints, but they inevitably introduce hallucinated content (artifacts inconsistent with the input views) into the final 3D model. To address this challenge, we propose Hallucination-Aware Diffusion prior (HAD), which estimates pixel-wise hallucination score maps for augmented images by leveraging multi-view reasoning capabilities from a feedforward novel view synthesis (NVS) network pre-trained on large-scale 3D data. These hallucination scores enable selective masking of unreliable pixels during the progressive 3D reconstruction procedure, preventing the introduction of non-existent artifacts into the 3D model. To further enhance performance, we create multiple versions of augmented images at each novel view by conditioning the diffusion prior on different input views, which are then fused into a final image that leverages the broader context across all input views. We show that our method substantially reduces hallucination artifacts in diffusion-assisted 3D reconstruction, thereby achieving state-of-the-art performance across multiple benchmarks on novel view synthesis.
Paperid: 2564,   Poster  
Authors: Danrui Li, Jiahao Zhang, Bernhard Egger, Moitreya Chatterjee, Suhas Lohit, Tim Marks, Anoop Cherian
Title: AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
Abstract: Assembling objects from parts requires understanding multimodal instructions, linking them to 3D components, and predicting physically plausible 6DoF motions for each assembly step. Existing datasets focus on simplified scenarios, overlooking shape complexities and assembly trajectories in industrial assemblies. We introduce AssemblyBench, a synthetic dataset of 2,789 industrial objects with multimodal instruction manuals, corresponding 3D part models, and part assembly trajectories. We also propose a transformer-based model, AssemblyDyno, which uses the instructional manual and the 3D shape of each part to jointly predict assembly order and part assembly trajectories. AssemblyDyno outperforms prior works in both assembly pose estimation and trajectory feasibility, where the latter is evaluated by our physics-based simulations.
Paperid: 2565,   Poster  
Authors: Youyi Zhan, He Wang, Tianjia Shao, Kun Zhou
Title: High-Fidelity Mobile Avatars with Pruned Local Blendshapes
Abstract: We propose a method to reconstruct high-fidelity human avatars from multi-view video that can run on mobile devices. Many works can model high-quality Gaussian-based full-body avatars from multi-view video. However, these methods require heavy computation to obtain pose-dependent appearance, making deployment on mobile devices very difficult. Recent methods distill from pretrained models and model pose-dependent nonlinear Gaussian attributes by linearly combining global pose features with blendshapes. Although they can run on mobile devices, they suffer some loss of detail. We observe that nearby Gaussians are often highly correlated within a local region of the body, and can be linearly modeled with less error. Therefore, we use local linear blendshapes in small body parts to capture global nonlinear changes of Gaussian attributes. To further reduce computation and model size, we propose to remove blendshapes for Gaussians whose attributes change little, yielding a minimal blendshape representation. Our method is trained end-to-end without a pretrained model. To make it run on multiple devices, we implement our method using WebGPU. Experiments show that our method can render high-quality human avatars with better details, and can reach 120 FPS at 2K resolution on mobile devices.
Paperid: 2566,   Poster  
Authors: Yiheng Zhang, Zhaofan Qiu, Zunxu Liu, Yingwei Pan, Ting Yao, Tao Mei
Title: EvoID: Reinforced Evolution for Identity-Preserving Video Generation
Abstract: We present EvoID, a novel framework that reformulates Identity-Preserving Video Generation as a self-evolving process through Reinforcement Learning. Moving beyond the static paradigm of imitation learning, EvoID enables a generative model to actively learn and optimize the complex trade-offs between identity fidelity, motion naturalness, and temporal coherence. At the heart of our EvoID is a dynamic, dual-path reward mechanism, which acts as an intrinsic critic by adaptively combining objective metric indicators and MLLM-based holistic quality assessment. This allows the model to "evolve" its generation strategy, focusing on different aspects of quality at different stages of training. To ensure stable and coherent evolution, we anchor the exploring Student model with a frozen Teacher, preserving robust world priors while allowing for creative refinement when generating videos. Extensive experiments demonstrate the superiority of our proposal, and EvoID achieves a total score of 0.687 on the Human-Domain of the OpenS2V-Eval dataset, surpassing 0.658 of the open-source VACE and 0.653 of the commercial Hailuo. Moreover, EvoID also obtains a new record of 0.718 on our newly minted MLLM-based metric, prioritizing human perception and more comprehensively reflecting video quality.
Paperid: 2567,   Poster  
Authors: Borui Zhang, Bo Zhang, Bo Wang, Wenzhao Zheng, Yuhao Cheng, Liang Tang, Yiqiang Yan, Jie Zhou, Jiwen Lu
Title: BAMI: Training-Free Bias Mitigation in GUI Grounding
Abstract: GUI grounding is a critical capability for enabling GUI agents to execute tasks such as clicking and dragging. However, in complex scenarios like the ScreenSpot-Pro benchmark, existing models often suffer from suboptimal performance. Utilizing the proposed Masked Prediction Distribution (MPD) attribution method, we identify that the primary sources of errors are twofold: high image resolution (leading to precision bias) and intricate interface elements (resulting in ambiguity bias). To address these challenges, we introduce Bias-Aware Manipulation Inference (BAMI), which incorporates two key manipulations, coarse-to-fine focus and candidate selection, to effectively mitigate these biases. Our extensive experimental results demonstrate that BAMI significantly enhances the accuracy of various GUI grounding models in a training-free setting. For instance, applying our method to the TianXi-Action-7B model boosts its accuracy on the ScreenSpot-Pro benchmark from 51.9% to 57.8%. Furthermore, ablation studies confirm the robustness of the BAMI approach across diverse parameter configurations, highlighting its stability and effectiveness.
Paperid: 2568,   Poster  
Authors: Zhengzhong Zhu, Liangjin Liu, Pei Zhou, Shiquan Min, Jiangping Zhu
Title: Multi-View Hierarchical Alignment Learning for Spatial Transcriptomics
Abstract: Spatial transcriptomics provides both spatial coordinates and gene expression profiles, enabling the study of tissue organization and cellular heterogeneity. Despite recent progress, current spatial clustering methods still face two major limitations. First, representations learned from spatial and expression views often differ due to view-specific noise and incomplete structural information. Without enforcing sample-level cross-view consistency, embeddings from the two views may not correspond to the same biological identity, reducing discriminative capability. Second, existing approaches lack effective semantic-level supervision. Although node embeddings capture local neighborhood patterns, they do not explicitly reflect high-level semantic structures. Prototype-based modeling can provide such semantic abstraction, yet current methods seldom align prototypes with node representations, leading to weak semantic consistency. To overcome these issues, we propose Multi-View Hierarchical Alignment Learning for Spatial Transcriptomics (MHAL). At the sample level, MHAL introduces positive sample alignment to enforce consistency between spatial and expression embeddings. At the semantic level, MHAL designs prototype-level contrastive learning, where prototypes act as semantic anchors and guide the formation of coherent cluster structures. Together, these two alignment mechanisms progressively ensure both local consistency and global semantic discrimination. Extensive experimental results demonstrate that the proposed hierarchical contrastive multi-view clustering method achieves competitive performance in spatial domain identification compared to other state-of-the-art methods.
Paperid: 2569,   Poster  
Authors: Guangxun Zhang, Mason Haberle, Davi Geiger
Title: Stable Mean Flow: Lyapunov-Inspired One-Step Flow Matching
Abstract: The Mean Flow Matching algorithm is the state-of-the-art for one-step generative models. Building on this idea, we propose the Stable Mean Flow algorithm and introduce a Lyapunov-inspired stability regularizer that enforces local non-expansivity of the single-step transport map. This design guarantees uniqueness of characteristics and bounds trajectory drift. We conduct experiments that show improved output quality and convergence speed over Mean Flow. Moreover, we establish explicit upper bounds on error growth for both one-step and multi-step generation.
Paperid: 2570,   Poster  
Authors: David Pujol-Perich, Albert Clapés, Dima Damen, Sergio Escalera, Michael Wray
Title: Beyond Caption-Based Queries in Video Moment Retrieval
Abstract: Current Video Moment Retrieval (VMR) models are trained on videos paired with captions, which are written by annotators after watching the videos. These captions are used as textual queries, which we term caption-based queries. This annotation process induces a visual bias, leading to overly descriptive and fine-grained queries, which significantly differ from the more general search queries that users are likely to employ in practice. In this work, we investigate the degradation of existing VMR methods, particularly of DETR architectures, when trained on caption-based queries but evaluated on search queries. For this, we introduce three benchmarks by modifying the textual queries in three public VMR datasets, i.e., HD-EPIC, YouCook2 and ActivityNet-Captions. Our analysis reveals two key generalization challenges: (i) a language gap, arising from the linguistic under-specification of search queries, and (ii) a multi-moment gap, caused by the shift from single-moment to multi-moment queries. We also identify a critical issue in these architectures, an active decoder-query collapse, as a primary cause of the poor generalization to multi-moment instances. We mitigate this issue with architectural modifications that effectively increase the number of active decoder queries. Extensive experiments demonstrate that our approach improves performance on search queries by up to 14.82% mAP_m, and up to 21.83% mAP_m on multi-moment search queries.
Paperid: 2571,   Poster  
Authors: Xinyu Zhang, Yuxuan Dong, Lingling Zhang, Chengyou Jia, Zhuohang Dang, YiXing Yao, Yaqiang Wu, Basura Fernando, Jun Liu
Title: Beyond Layer-Wise Merging: Chain-of-Merging for Vision-Language Models
Abstract: While model merging has demonstrated remarkable success across diverse domains for large language models (LLMs), its application to vision-language models (VLMs) remains largely underexplored. Recent methods attempt to enhance VLM reasoning capabilities by integrating specialized LLM parameters through layer-wise merging. However, existing paradigms suffer from two critical limitations: (1) strict positional correspondence, which enforces rigid one-to-one layer alignment, and (2) uniform merging weights applied indiscriminately across all layers. These constraints fail to account for substantial functional disparities between corresponding layers in VLMs and LLMs, potentially misaligning incompatible layers and leading to detrimental parameter combinations. To address these, we propose the Chain-of-Merging (CoM) framework, which adaptively adjusts merging plans for different images and questions and includes two key stages: (1) Adaptive Layer Matching, which identifies optimal layer pairings based on structural and semantic matching scores while filtering incompatible pairings, and (2) Dynamic Weight Merging, which determines layer-specific merging weights based on matching scores and employs spherical linear interpolation to minimize memory overhead. Extensive experiments demonstrate that CoM achieves substantial performance improvements, with Qwen2.5-VL-7B + Qwen2.5-Math-7B attaining a 4.4% average improvement on mathematical reasoning benchmarks while enhancing general visual understanding, significantly outperforming existing training-free methods.
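Spherical linear interpolation, which the abstract above names as the merging operator, can be sketched for a single pair of weight tensors. This is the textbook slerp formula rather than the CoM implementation: the function name, the fallback threshold, and the idea of passing a per-layer weight t taken from a matching score are assumptions for illustration.

```python
import numpy as np


def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two weight tensors of equal shape."""
    a = w_a.ravel()
    b = w_b.ravel()
    # Angle between the two tensors, treated as flat vectors.
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    sin_theta = np.sin(theta)
    if sin_theta < 1e-6:
        # Nearly parallel or anti-parallel: fall back to linear interpolation.
        return (1.0 - t) * w_a + t * w_b
    coef_a = np.sin((1.0 - t) * theta) / sin_theta
    coef_b = np.sin(t * theta) / sin_theta
    return coef_a * w_a + coef_b * w_b


vlm_layer = np.array([1.0, 0.0])    # toy stand-in for a VLM layer's weights
llm_layer = np.array([0.0, 1.0])    # toy stand-in for the matched LLM layer
merged = slerp(vlm_layer, llm_layer, t=0.5)   # t would be a layer-specific weight
print(merged, np.linalg.norm(merged))
```

Unlike a straight average, the interpolated vector stays on the arc between the two inputs, so for unit-norm inputs the merged weights keep unit norm.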
Paperid: 2572,   Poster  
Authors: Hang Dai, Hongwei Fan, Han Zhang, Duojin Wu, Jiyao Zhang, Hao Dong
Title: FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario
Abstract: Growing demand from augmented reality and robotics calls for highly scalable articulated object reconstruction. However, existing settings, which reconstruct from discrete articulation states or casual monocular video, either need nontrivial axis alignment or suffer from insufficient coverage, which limits their applications. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under the free-moving scenario, a new setting with a simpler setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, free-moving part segmentation discovers rigid parts from relative motion in unconstrained capture. The joint estimation module proposes a noise-resistant approach to recover joint type and axis robustly from part segmentation. Finally, 3DGS-based end-to-end optimization is implemented to jointly reconstruct visual textures, geometry and joint angles of the articulated object. We perform experiments on two benchmarks and real-world free-moving articulated objects. Experiments show that FreeArtGS consistently outperforms prior methods in free-moving articulated object reconstruction and remains competitive in the similar previous setting, underscoring the potential of FreeArtGS to serve as an engine for realistic articulated asset building. Code and data will be released.
Paperid: 2573,   Poster  
Authors: Tongkun Guan, Zhibo Yang, Jianqiang Wan, Mingkun Yang, Zhentao Guo, Zijian Hu, Ruilin Luo, Ruizhe Chen, Sontao Jiang, Peng Wang, Wei Shen, Junyang Lin, Xiaokang Yang
Title: CodePercept: Code-Grounded Visual STEM Perception for MLLM
Abstract: When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium—executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code will be available soon.
Paperid: 2574,   Poster  
Authors: Yuekun Dai, Zhoutong Zhang, Shangchen Zhou, Nanxuan Zhao
Title: Linear Image Generation by Synthesizing Exposure Brackets
Abstract: The life of a photo begins with photons striking the sensor, whose signals are passed through a sophisticated image signal processing (ISP) pipeline to produce a display-referred image. However, such images are no longer faithful to the incident light, being compressed in dynamic range and stylized by subjective preferences. In contrast, RAW images record direct sensor signals before non-linear tone mapping. After camera response curve correction and demosaicing, they can be converted into linear images, which are scene-referred representations that directly reflect true irradiance and are invariant to sensor-specific factors. Since image sensors have better dynamic range and bit depth, linear images contain richer information than display-referred ones, leaving users more room for editing during post-processing. Despite this advantage, current generative models mainly synthesize display-referred images, which inherently limits downstream editing. Generating linear images, however, is quite challenging. Pre-trained VAEs in latent diffusion models struggle to reconstruct linear images due to their higher dynamic range and bit depth, where extreme highlights and shadows cannot be simultaneously preserved. To this end, we represent a linear image as a sequence of exposure brackets, i.e., linear sub-images that each capture a specific portion of the overall dynamic range. Based on this representation, we propose a new DiT-based flow-matching architecture to generate exposure brackets, which can be post-processed to produce a high-quality linear image. We further demonstrate that our approach enables downstream applications such as linear image editing and conditional linear image generation through ControlNet guidance.
Paperid: 2575,   Poster  
Authors: Gong Chen, Chaokun Zhang, Xinyan Zhao
Title: WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration
Abstract: Collaborative perception is vital for autonomous driving yet remains constrained by tight communication budgets. Earlier work reduced bandwidth by compressing full feature maps with fixed-rate encoders, which adapt poorly to changing environments. Later spatial selection methods improved efficiency by focusing on salient regions, but this object-centric approach often sacrifices global context, weakening holistic scene understanding. To overcome these limitations, we introduce WhisperNet, a bandwidth-aware framework that proposes a novel, receiver-centric paradigm for global coordination across agents. Senders generate lightweight saliency metadata, while the receiver formulates a global request plan that dynamically budgets feature contributions across agents and features, retrieving only the most informative features. A collaborative feature routing module then aligns related messages before fusion to ensure structural consistency. Extensive experiments show that WhisperNet achieves state-of-the-art performance, improving AP@0.7 on OPV2V by 2.4% with only 0.5% of the communication cost. As a plug-and-play component, it boosts strong baselines with merely 5% of full bandwidth while maintaining robustness under localization noise. These results demonstrate that globally-coordinated allocation across what and where to share is the key to achieving efficient collaborative perception.
Paperid: 2576,   Poster  
Authors: Junhyoung Lee, Seongwoon Jo, Jeong-Hun Park, Yeonji Ryou, Jeongha Yang, Jangho Kim
Title: Nonlinear Color Transfer via Learnable Bezier Flows
Abstract: Color transfer aims to match the color distribution of a content image (source) to that of a style image (target) while preserving structure and perceptual realism. Yet modulation-based flow models such as ModFlows often produce trajectory misalignment and artifacts because they rely on strictly linear transport paths. We propose NCT, a nonlinear color transfer framework that replaces linear paths with Bezier trajectories, enabling smooth, nonlinear, and perceptually coherent color transfer. This parameterization lets the transport bend toward plausible intermediate color regimes, improving content–style alignment and reducing chromatic distortion. We further incorporate a Mixture of Experts (MoE) module in the encoder to select trajectory experts for different chromatic regimes, improving generalization to heterogeneous data with complex illumination and materials. Experiments show that NCT reduces artifacts and achieves more stable color transfer than prior flow-based methods, especially on 3D-rendered or highly textured images. The code is provided in supplementary materials.
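The difference between a linear transport path and a Bezier trajectory, as contrasted in the abstract above, can be shown on a single RGB colour. This is a generic quadratic Bezier curve, not the NCT model: in the paper the trajectory is learned, whereas here the control point is a hand-picked assumption and the function names are illustrative.

```python
import numpy as np


def linear_path(x0, x1, t):
    """Straight-line transport between source colour x0 and target colour x1."""
    return (1.0 - t) * x0 + t * x1


def quadratic_bezier(x0, ctrl, x1, t):
    """Quadratic Bezier path from x0 to x1 that bends toward the control point."""
    return (1.0 - t) ** 2 * x0 + 2.0 * (1.0 - t) * t * ctrl + t ** 2 * x1


source = np.array([0.9, 0.2, 0.1])     # toy content colour (RGB in [0, 1])
target = np.array([0.1, 0.3, 0.8])     # toy style colour
control = np.array([0.6, 0.6, 0.3])    # hypothetical intermediate colour regime

midpoint = 0.5 * (source + target)
for t in (0.0, 0.5, 1.0):
    print(t, linear_path(source, target, t), quadratic_bezier(source, control, target, t))
```

Both paths share their endpoints, and placing the control point at the midpoint of source and target makes the Bezier path coincide with the linear one, so the linear model is a special case of this parameterization.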
Paperid: 2577,   Poster  
Authors: Kewei Gao, Jiayi Xie, Zhengda Shen, Weijun Qin, Lingxiang Jia, Kejia Chen, Zunlei Feng, Yijun Bei
Title: Anomaly-Related Residual Fields for Cross-domain Anomaly Detection
Abstract: Label-free image anomaly detection is difficult because anomalies must be separated from intra-normal variability. Diffusion models learn a manifold for normal data, and, under the common assumption that off-manifold anomalies are harder to generate and yield larger prediction errors, many methods build detectors from prediction residuals; yet reverse-process stochasticity and complex but normal structure also produce large residuals, so magnitude alone is non-diagnostic. To clarify what is recoverable from such noisy residuals, the theory examines how residual signals propagate through later reverse steps, showing that variability consistent with normal statistics is gradually absorbed toward stationarity, whereas anomalous regions retain an additional non-stationary signal that persists. Building on this insight, the Residual–Evolution Field (REF) isolates this persistent signal, with labeled source data calibrating the extractor and Cross-domain Field Alignment (CFA) transferring it to unlabeled targets. A theoretical framework with formal guarantees is established, and experiments across multiple benchmarks under substantial domain shifts demonstrate state-of-the-art performance, improving over strong baselines by 2.01–14 percentage points (pp).
Paperid: 2578,   Poster  
Authors: Omar Elezabi, Eduard Zamfir, Zongwei Wu, Radu Timofte
Title: Language-Free Generative Editing from One Visual Example
Abstract: Text-guided diffusion models have advanced image editing by enabling intuitive control through language. However, despite their strong capabilities, we find, surprisingly, that SOTA methods struggle with simple, everyday transformations such as rain or blur. We attribute this limitation to weak and inconsistent textual supervision during training, which leads to poor alignment between language and vision. Existing solutions often rely on extra finetuning or stronger text conditioning, but suffer from high data and computational requirements. We contend that the capability for diffusion-based editing is not lost but merely hidden from text. The door to cost-efficient visual editing remains open, and the key lies in a vision-centric paradigm that perceives and reasons about visual change as humans do, beyond words. Inspired by this, we introduce Visual Diffusion Conditioning (VDC), a training-free framework that learns conditioning signals directly from visual examples for precise, language-free image editing. Given a paired example (one image with and one without the target effect), VDC derives a visual condition that captures the transformation and steers generation through a novel condition-steering mechanism. An accompanying inversion-correction step mitigates reconstruction errors during DDIM inversion, preserving fine detail and realism. Across diverse tasks, VDC outperforms both training-free and fully fine-tuned text-based editing methods. Code and models will be publicly released upon acceptance.
Paperid: 2579,   Poster  
Authors: Hanyu Chen, Haiwei Wu, Jinyu Tian, Jianqing Li, Jiantao Zhou
Title: Forensic-Friendly Image Manipulation via Controllable Latent Diffusion
Abstract: With diffusion models demonstrating superior capabilities in image editing, more users now rely on online servers for content manipulation via textual prompts rather than traditional offline tools. Although servers attempt to prevent the proliferation of maliciously edited content via active defenses such as watermarking, this approach is not conducive to passive detection by third-party forensics. To address this limitation, we propose a plug-and-play controllable denoising method termed Forensic-Friendly Image Manipulation (FFIM), which satisfies user editing requirements while facilitating forensic analysis. Specifically, FFIM comprises three phases: Controllable Projection, Implicit Detection, and Explicit Guidance. Phase I enforces orthogonality between the variance of random noise and image features to ensure clear demarcation between the edited and unedited regions. Phase II implicitly evaluates whether this demarcation meets detection requirements; if not, Phase III explicitly introduces a surrogate detection model and adversarially adjusts the random noise to maximize the feature discrepancy between these regions. Experiments across four datasets demonstrate the superiority of FFIM over baseline methods, achieving up to +6.6% F1 in pixel-level localization and +27.3% AUC in image-level detection. Importantly, these forensic gains are attained without compromising visual quality, as evidenced by comparable manipulation in both subjective user studies and objective quality assessments. We envision that the proposed method will be widely adopted by generative AI service providers, enabling more comprehensive information authenticity from a passive defensive standpoint.
Paperid: 2580,   Poster  
Authors: Yuanfan Zheng, Kunyu Peng, Xu Zheng, Kailun Yang
Title: Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation
Abstract: Cross-domain panoramic semantic segmentation has attracted growing interest as it enables comprehensive 360° scene understanding for real-world applications. However, it remains particularly challenging due to severe geometric Field of View (FoV) distortions and inconsistent open-set semantics across domains. In this work, we formulate an open-set domain adaptation setting and propose the Extrapolative Domain Adaptive Panoramic Segmentation (EDA-PSeg) framework that trains on local perspective views and tests on full 360° panoramic images, explicitly tackling both geometric FoV shifts across domains and semantic uncertainty arising from previously unseen classes. To this end, we propose the Euler-Margin Attention (EMA), which introduces an angular margin in the Euler rotation space, thereby enhancing viewpoint-invariant semantic representation for panoramic geometry and improving generalization across novel viewpoints. Additionally, we design the Graph Matching Adapter (GMA), which builds high-order graph relations to align shared semantics across FoV shifts while effectively separating novel categories through structural adaptation. Extensive experiments on four benchmark datasets under camera-shift, weather-condition, and open-set scenarios demonstrate that EDA-PSeg achieves state-of-the-art performance, robust generalization to diverse viewing geometries, and resilience under varying environmental conditions. The source code will be released.
Paperid: 2581,   Poster  
Authors: Kewei Wu, Chong Liang, Zhao Xie, Dan Guo
Title: Progressive mask distillation for self-supervised video representation
Abstract: Masked visual modeling is a self-supervised learning task that does not use visual annotations. It aims to learn discriminative representations via a mask-reconstruction task. A single mask ratio in reconstruction may fail to capture complex semantics, which motivates dynamic masking strategies. In this work, we propose Progressive Mask Distillation (PMD), which utilizes dynamic mask ratios to facilitate progressive semantic learning from easy to hard. PMD integrates three key components: a progressive student distiller, a difficulty-aware region enhancer, and a cross-layer feature aligner. First, to capture dynamic visual semantics, we design a progressive student distiller that trains multiple student models with progressively increasing mask ratios. The early-phase student (with a low mask ratio) learns easy, low-level semantics from more visible tokens. This learned knowledge then guides the next-phase student (with a higher mask ratio) to capture hard, high-level semantics from fewer visible tokens. This progressive distillation mechanism enhances detail reconstruction at a high mask ratio. Second, to alleviate insufficient learning of semantic regions, we design a difficulty-aware region enhancer. We first smooth the region reconstruction loss to reduce large fluctuations across training epochs. The smoothed loss is then used to learn region-level weights, prioritizing accurate learning of regions with large reconstruction losses. Third, to further bridge the semantic gap across network layers, we design cross-layer feature alignment. This module aligns features across shallow, middle, and deep encoder layers, ensuring that shallow-layer features incorporate semantic information from deeper layers. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Something-Something V2, Kinetics-400, UCF-101, and HMDB-51 datasets.
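The difficulty-aware region enhancer described above (smooth the per-region loss across epochs, then weight regions by the smoothed loss) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which smoother is used or how the weights are learned, so the exponential moving average, the softmax weighting, and the names `ema_update`, `region_weights`, `momentum`, and `temperature` are all hypothetical stand-ins.

```python
import math
from typing import List


def ema_update(prev: List[float], current: List[float],
               momentum: float = 0.9) -> List[float]:
    """Exponential moving average of per-region losses (assumed smoother)."""
    return [momentum * p + (1.0 - momentum) * c for p, c in zip(prev, current)]


def region_weights(smoothed: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over smoothed losses: harder regions get larger weights.

    A fixed softmax is an assumption; the paper learns its weights.
    """
    m = max(smoothed)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in smoothed]
    total = sum(exps)
    return [e / total for e in exps]


# Two regions; the second one spikes in the current epoch.
smoothed = ema_update(prev=[1.0, 1.0], current=[1.0, 5.0])
weights = region_weights(smoothed)
weighted_loss = sum(w * l for w, l in zip(weights, [1.0, 5.0]))
```

The smoothing step damps the one-epoch spike before it influences the weights, which is the stated motivation for smoothing in the first place.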
Paperid: 2582,   Poster  
Authors: Xinhang Liu, Pedro Miraldo, Suhas Lohit, Huaizu Jiang, Naoko Sawada, Yu-Wing Tai, Chi-Keung Tang, Moitreya Chatterjee
Title: Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting
Abstract: Understanding how the 3D world evolves over time is a fundamental task in computer vision, essential for embodied settings, autonomous driving, etc. It requires not only the reconstruction of the observed scene but also the anticipation of how the scene dynamics will unfold in the future. While the area of 3D reconstruction has progressed rapidly with the advent of recent feedforward neural networks, forecasting future dynamics in 3D, given the 2D frames of a video, remains unexplored. We present Point4Cast, a unified framework that processes streaming 2D frame sequences of a video to estimate the past, present, and future of the underlying dynamic scene in 3D. At the core of our approach lies a persistently evolving latent spacetime representation that models the environment’s evolution across time. Upon receiving a new 2D frame, an update operation integrates the incoming evidence to refine the latent spacetime representation. When queried for any time instant, whether before, at, or beyond the timestamp of the last update, a readout procedure predicts temporally conditioned point maps and camera parameters describing the scene geometry at the queried time. Unlike prior approaches for online dynamic scene reconstruction that estimate each frame’s point map solely at the timestamp of the last observed frame, Point4Cast achieves coherent reconstruction across any queried time. Empirical evaluations show that Point4Cast achieves state-of-the-art performance on streaming dynamic scene reconstruction and forecasting benchmarks, across multiple challenging datasets, while providing scene flow estimation and forecasting for free. The code will be released publicly.
Paperid: 2583,   Poster  
Authors: Zihang Lai, Eldar Insafutdinov, Edgar Sucar, Andrea Vedaldi
Title: WarpTracker: Tracking by Warping instead of Correlation
Abstract: Dense point tracking is a fundamental problem in computer vision, with applications ranging from video analysis to robotic manipulation. State-of-the-art trackers typically rely on cost volumes to match features across frames, but this approach incurs quadratic complexity in spatial resolution, limiting its scalability and efficiency. In this paper, we propose WarpTracker, a novel dense point tracker that eschews cost volumes in favor of warping. Inspired by recent advances in optical flow, our approach iteratively refines track estimates by warping features from the target frame to the query frame based on the current estimate. Combined with a transformer architecture that performs joint spatiotemporal reasoning across all tracks, our design establishes long-range correspondences without computing feature correlation. Our model is simple and achieves state-of-the-art performance on standard dense point tracking benchmarks, including TAP-Vid-DAVIS, TAP-Vid-Kinetics, and Robo-TAP. Remarkably, the model also excels at optical flow, sometimes outperforming specialized methods on the Sintel, KITTI, and Spring benchmarks. These results suggest that warping-based architectures can unify dense point tracking and optical flow estimation.
Paperid: 2584,   Poster  
Authors: Tao Gong, Dayong Wang, Qi Chu, Bin Liu, Nenghai Yu
Title: Cross-modal Representation Learning for Diffusion-generated Image Detection
Abstract: The astonishing proficiency and unprecedented level of realism of diffusion models in creating and manipulating images have raised concerns. Many methods have been proposed to detect generated images. They typically take RGB images as input and use backbones such as ResNet or the CLIP visual encoder to extract features. Even though these backbones are capable of detecting fake images, they are mainly designed to extract high-level semantic information rather than for fake image detection. To this end, we optimize an embedding space tailored for detecting fake images via representation learning. We notice that Neighboring Pixel Relationships (NPR) can capture intrinsic forgery clues, which suggests that NPR may be a good input for representation learning aimed at such an embedding space. Therefore, we leverage features from both the RGB modality and the NPR modality in two proposed representation learning methods, Cross-Modal Contrastive Learning (CMCL) and Cross-Modal Mutual Distillation (CMMD), in order to learn a forgery-aware embedding space. CMCL boosts the discrimination of features between real and fake images, while CMMD transfers the learned knowledge between the two modalities, yielding compact intra-class features. CMCL and CMMD work collaboratively so that each modality learns a more comprehensive forgery-aware representation to distinguish real and fake images. Extensive experiments on the GenImage, DRCT-2M, and Co-Spy-Bench datasets show that our method achieves state-of-the-art results.
Paperid: 2585,   Poster  
Authors: Rui Zhu, Liang Bai, Yanming Guo, Yirun Ruan, Tianyuan Yu, Zhihe Lu
Title: Controllable Federated Prompt Learning at Test Time
Abstract: Federated Prompt Learning (FPL) has recently attracted increasing attention for its ability to leverage large-scale vision-language models such as CLIP within federated learning frameworks. While existing studies have advanced FPL through personalization strategies to enhance client-specific performance, personalized models often suffer severe degradation when deployed across unseen domains due to distribution shifts. In this paper, we take the first step toward exploring Test-Time FPL (TTFPL), aiming to bridge the cross-domain performance gap with minimal effort, requiring only unlabeled target-domain data. We propose COTE, a tri-prompt controllable TTFPL framework that dynamically balances three complementary prompts: the global prompt from standard FPL, the local prompt from personalized FPL, and the frozen CLIP prompt. Specifically, we introduce a novel confidence-guided Model-Data Alignment (MoDA) metric in COTE that quantifies alignment at both macro and micro levels, capturing the consistency between model predictions and data distributions. By integrating MoDA with model confidence, COTE adaptively adjusts the contribution of each prompt at test time, enabling robust generalization across heterogeneous clients and unseen domains without requiring labeled data. Extensive experiments on multiple benchmark datasets demonstrate that our method consistently improves target-domain performance, setting a new direction for adaptive FPL.
Paperid: 2586,   Poster  
Authors: Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman
Title: Stitch-a-Demo: Creating Video Demonstrations from Multistep Descriptions
Abstract: When obtaining visual illustrations from text descriptions, today’s methods take a description with a single text context (a caption or an action description) and retrieve or generate the matching visual context. However, prior work does not permit visual illustration of multistep descriptions, e.g., a cooking recipe or a gardening instruction manual, and simply handling each step description in isolation would result in an incoherent demonstration. We propose Stitch-a-Demo, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions, while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse procedures and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Demo achieves state-of-the-art performance, with gains up to 29% as well as dramatic wins in a human preference study.
Paperid: 2587,   Poster  
Authors: Bing Li, Qiang Wang, JUNDA LU, Le Zhang, Yun Liu, Ce Zhu, Wei Cui
Title: WiTTA-Bench: Benchmarking Test-Time Adaptation for WiFi Sensing
Abstract: WiFi sensing offers passive and privacy-preserving perception that complements vision-based sensing, but its performance degrades sharply under domain shifts caused by changes in environment, users, or hardware. This challenge is exacerbated in real-world deployments where source data are unavailable, motivating test-time adaptation (TTA) as a practical solution for self-calibration using only unlabeled target samples. We introduce WiTTA-Bench, the first comprehensive benchmark for WiFi TTA, covering 20 representative methods, two adaptation protocols (OTTA and TTDA), and three major physics-induced shifts in WiFi: cross-environment, cross-subject, and cross-device. Furthermore, we contribute a new dataset featuring paired recordings from heterogeneous devices to bridge the cross-device gap. Extensive experiments reveal three key insights unique to WiFi sensing: (i) WiFi domain shifts exhibit a physics-induced hierarchy: environmental changes alter multipath statistics, subject variation perturbs temporal–spectral geometry, and hardware differences reshape the entire feature manifold; (ii) OTTA and TTDA are complementary: lightweight OTTA handles mild statistical drift, while TTDA is necessary to correct deep, hardware-induced structural distortions; and (iii) OTTA is hyperparameter-robust and scales linearly with source quality, whereas TTDA is more sensitive due to recursive self-training. WiTTA-Bench establishes the first systematic foundation for adaptive, robust, and deployable WiFi sensing under realistic wireless conditions.
Paperid: 2588,   Poster  
Authors: Ziqian Yang, Xianglin Qiu, Xinqiao Zhao, Xiaolei Wang, Quan Zhang, Jimin Xiao
Title: Frequency-Aware Affinity for Weakly Supervised Semantic Segmentation
Abstract: Weakly Supervised Semantic Segmentation (WSSS) typically utilizes Class Activation Maps (CAMs) to provide pixel-wise localization. However, CAMs tend to activate only the most discriminative regions, leading to suboptimal WSSS performance. Although existing CAM refinement methods leverage pair-wise relations in affinity to expand the activation regions, these affinities derived from Vision Transformers (ViTs) exhibit a smoothing property, neglecting crucial high-frequency relations and failing to accurately refine object boundaries. In this work, we propose the Dual Frequency-Aware framework (DFA) to address this limitation. Specifically, the Low-Frequency-Aware Alignment (LFAA) module generates low-frequency-aware affinity that captures salient semantic relations to enhance object interior semantic consistency on CAMs, while the High-Frequency-Aware Rectification (HFAR) module produces high-frequency-aware affinity that models precise relations to preserve object boundary structure on CAMs. By effectively integrating these two complementary affinities, we design a novel Frequency-Guided (FG) CAM Generation based on Optimal Transport theory, which largely dispenses with the complex refinement process. Extensive experiments demonstrate that our DFA framework achieves state-of-the-art performance on both PASCAL VOC and MS COCO benchmarks. Code will be released.
Paperid: 2589,   Poster  
Authors: Chenxu Bai, Boyu Li, Peiqi Duan, xinyu zhou, Hanyue Lou, Boxin Shi
Title: AE2VID: Event-based Video Reconstruction via Aperture Modulation
Abstract: Event-based video reconstruction seeks to recover high-speed, high-dynamic-range videos from event streams. While existing approaches rely exclusively on motion-triggered events, these events are inherently sparse and primarily capture dynamic regions. Therefore, they often suffer from error accumulation and degraded quality in regions with few events. In this work, we introduce aperture-modulation-triggered events as a complementary mechanism to enrich the captured scene information. Specifically, we periodically modulate the aperture to actively generate dense event signals, thereby encoding intensity cues even in static or low-motion regions. Building upon this idea, we design the AE2VID framework, which jointly leverages aperture-modulation-triggered and motion-triggered events to enhance the fidelity of predictions. The proposed framework consists of two subnetworks, one dedicated to each event type. We further collect a real dataset and validate the effectiveness of our method. Extensive experiments show that our method outperforms state-of-the-art methods.
Paperid: 2590,   Poster  
Authors: Jean-Guillaume Durand, Panagiotis Kouvaros, Maxime Gariel, Alessio Lomuscio
Title: Lipschitz Optimization for Formal Verification of Homographies
Abstract: The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, aerospace, and autonomous vehicles. However, current approaches are confined to incomplete statistical verification, or robustness to $\ell_p$-norm or affine transforms which represent a limited subset of perturbations to the image formation process. In this paper, we present a formal verification approach when the capturing camera undergoes 3D motion perturbations. We first establish a closed-form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. While our formulae are grounded in the vision-based landing problem, they generalize to other scenes with predominantly planar features (e.g., augmented reality, traffic signs). This enables formal verification against a broad class of projective geometry transformations, without requiring simulation or complex modeling of image formation. We first validate our implementation, and show up to 89% speedup and 7% tighter bounds than the latest work. We then evaluate our method on established benchmarks from the VNN Competition, and highlight key model vulnerabilities to 3D transforms. Finally, we perform the first formal verification of a vision-based landing system under 3D perturbations, addressing a key challenge in the regulatory certification of learned models for real-world systems. The data and code used for this paper are publicly available.
Paperid: 2591,   Poster  
Authors: Ludvig Dillén, Magnus Oskarsson, Viktor Larsson
Title: Sparse–View Localization via Online Neural 3D Regression
Abstract: We present ON3R, an online-trained neural regressor addressing sparse-view structureless localization, where database images have limited visual overlap and no prebuilt 3D map. Given any sparse matches between a query and a K-tuple of posed database views, ON3R predicts 3D coordinates for matched query keypoints, supervised by database reprojection residuals and a monocular depth prior. Afterwards, the absolute pose of the query is estimated via P3P-RANSAC and refined with lightweight bundle adjustment. Across MegaDepth, Cambridge Landmarks, and a sparsified version of Aachen Day-Night, ON3R outperforms existing methods. ON3R is particularly effective when the data is extremely sparse: we focus on $K \leq 10$ database images. The code, data splits, and SfM models will be made available for full reproducibility.
Paperid: 2592,   Poster  
Authors: Hongjin Lian, Jian Ma, Hongjie Chen, Jia Li, Ruizhen Hu, Yu-Kun Lai, Kun Li
Title: CG-Floor: Centroid-Guided Diffusion for Large-Scale Floorplan Generation
Abstract: Large-scale floorplan generation is critical for virtual space planning and architectural simulation. Although existing methods have shown success in generating small-scale floorplans with simple room shapes, they struggle to handle the complex room connections and irregular room shapes that arise in large-scale floorplans. In this paper, we propose CG-Floor, a centroid-guided hierarchical framework that explicitly decouples topology and geometry to address these issues. We first introduce the size-aware semantic centroid heatmap, derived from predicted room centroids, which provides a structured representation to precisely guide the effective generation of a coarse-to-fine floorplan generator while ensuring semantic alignment. Additionally, we train a vector quantized codebook of floorplans with complex room shapes to capture the diversity of room shapes and employ a latent diffusion transformer to generate large-scale floorplans featuring non-Manhattan room shapes. CG-Floor achieves state-of-the-art performance on the large-scale MSD dataset, and supports 3D floorplan conversion and editing, demonstrating the practicality of our approach. The code will be publicly available for research purposes.
Paperid: 2593,   Poster  
Authors: Yinan Han, Qing-Yuan Jiang
Title: PAF: Perturbation-Aware Filtering for Open-Set Semi-Supervised Learning
Abstract: Open-set semi-supervised learning (OSSL) has achieved notable progress in exploiting unlabeled data, yet most existing methods overlook the distinct sensitivities of in-distribution (ID) and out-of-distribution (OOD) samples to semantic-preserving perturbations, resulting in unreliable OOD sample filtering. We address this gap by leveraging the behavioral difference between ID and OOD samples under perturbations and extend it into a representation-level signal for reliable OOD filtering. Specifically, we propose a novel filtering strategy, Perturbation-Aware Filtering (PAF), which identifies OOD samples by measuring the representation instability under semantic-preserving perturbations. We then integrate PAF into a carefully designed two-stage training framework, allowing the model to exploit abundant unlabeled data in the first stage and gradually adapt to the open-set setting with limited labels in the second stage. Extensive experimental results on widely-used OSSL benchmarks demonstrate that our proposed PAF approach achieves superior performance compared to state-of-the-art OSSL methods. The source code will be released publicly.
Paperid: 2594,   Poster  
Authors: Yan Wang, Fuyuan Cao, Xingwang Zhao
Title: Conflict-Aware Adaptive Cross-Reconstruction for Multimodal Sentiment Analysis
Abstract: Disentanglement-based methods for learning shared representations are widely used in multimodal sentiment analysis. However, most of them adopt an intra-modal reconstruction strategy and rely on similarity losses to align shared representations, while often ignoring potential emotional conflicts across modalities within the same sample, thereby distorting the shared semantics. To address these issues, we propose a Conflict-aware Adaptive Cross-Reconstruction approach (CACR). First, we formally define emotional conflict and design a conflict-aware weighting strategy. This strategy calculates sample-level conflict scores based on modality consistency metrics and maps them into dynamic weights for the cross-reconstruction loss of each modality. Second, based on this, we construct a cross-reconstruction module: for each modality, it reconstructs the modality's representation from its own specific features and the shared features of the other modalities, adaptively weighting each cross-reconstruction term with the aforementioned weights, thereby achieving implicit alignment of shared representations while mitigating semantic ambiguity. Extensive experiments on three widely used benchmarks show that CACR outperforms existing state-of-the-art methods on six evaluation metrics, demonstrating its effectiveness in handling modality-level emotional conflict.
Paperid: 2595,   Poster  
Authors: Dongxin Xie, Yan Huang, Yong Xu, Hui Ji
Title: Physically-Grounded Turbulence Mitigation with Frame-Shared Degradation Parameters
Abstract: Atmospheric turbulence severely degrades long-range images with distortions and blur, hindering downstream applications. While supervised methods rely on synthetic data with limited real-world generalization, existing unsupervised approaches often ignore the underlying physics, leading to suboptimal restoration. We propose TMFS, an optimization-based and physically-grounded approach for unsupervised turbulence mitigation. The method operates by optimizing an imaging model with frame-shared degradation parameters under physically-motivated regularization. Inspired by sampling procedures in physical simulators, the degradation parameters are further decomposed into a frame-shared correlation function and per-frame noise maps. TMFS gains a strong inductive bias that improves generalization and mitigates overfitting. In extensive experiments, TMFS achieves state-of-the-art results among unsupervised methods. In contrast, supervised methods show a significant domain gap on real data, thereby validating the advantage of our physics-aware, unsupervised approach.
Paperid: 2596,   Poster  
Authors: Ashish Kumar, A. N. Rajagopalan
Title: Disco-GS: Gaussian Splatting in Dynamic Color Lighting
Abstract: Recent advances in Gaussian Splatting (GS) have significantly improved 3D scene reconstruction and novel view synthesis. However, most existing methods assume that training inputs are captured under stable lighting conditions and achromatic light. In contrast, scenes recorded under temporally varying color light, as in “disco lights” commonly seen in events, performances, and decorative settings, introduce severe ambiguities in both scene photometry and geometry. We propose Disco-GS, a framework that leverages GS for reconstructing the 3D scene while simultaneously recovering the underlying canonical appearance from videos captured under dynamic lighting conditions. Disco-GS estimates the effective per-pixel transient light, which, when applied to the canonical image, results in the observed color image of the scene, thereby enabling self-supervised learning. Disco-GS is an end-to-end framework that does not rely on any prior knowledge, such as color values, ambient lighting conditions, or scene properties. It effectively handles both global and spatially localized transient color variations. It also enables controllable brightness manipulation of the canonical scene, facilitating applications such as simulating low-light and well-lit scene conditions. To the best of our knowledge, Disco-GS is the first method to simultaneously perform 3D scene reconstruction and canonical appearance recovery from inputs captured under artificially varying, disco-style colored light. To enable quantitative and qualitative evaluations, we also introduce the Disco dataset, a collection of 25 videos of real-world scenes exhibiting diverse and random color variations. The dataset will be released. Extensive experiments demonstrate the robustness and fidelity of Disco-GS.
Paperid: 2597,   Poster  
Authors: Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Shengyin Jiang, Long Chen, Zhixin Yang, Jiwen Lu
Title: DVGT: Visual Geometry Transformer for Autonomous Driving
Abstract: Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, the field still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Visual Geometry Transformer specifically designed for autonomous Driving (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. Finally, we use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego pose for each frame. Our DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets, including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms other geometry prediction models across various scenarios.
Paperid: 2598,   Poster  
Authors: Yeliduosi Xiaokaiti, Yakun Chang, Yang Bai, Zhaojun Huang, Peiqi Duan, Boxin Shi
Title: 240FPS Stereo Vision from Monocular Mixed Spikes
Abstract: Stereo vision is fundamental for enabling machines to perceive and interact with the world. While monocular stereo methods offer hardware compactness, they struggle with generalization due to reliance on data-driven priors. Binocular and multi-view systems improve accuracy but incur higher hardware complexity and data inefficiency. In this paper, we introduce a monocular solution for high-frame-rate stereo vision via temporal optical modulation. The modulation directs light from two views in a mixed manner while periodically attenuating one view at 60 Hz. To capture the temporal variations introduced by this modulation, we employ a high-speed spike camera that records the mixed scene as temporally dense spikes. The high temporal resolution of these spikes enables the construction of a linear system for efficient binocular video decoupling. Consequently, we introduce a two-stage decoding methodology for achieving high-quality stereo vision: an efficient least-squares-based baseline reconstruction followed by a deep learning refinement module. Experimental results demonstrate that our approach achieves 240FPS binocular video reconstruction with superior accuracy compared to monocular systems, while maintaining hardware compactness and data efficiency.
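To make the "linear system" in this abstract concrete, the sketch below solves a per-pixel least-squares problem for two mixed views. It is a simplified illustration, not the paper's decoder: it assumes each mixed intensity sample follows `mixed = left + a * right` with a known attenuation factor `a` per sample, and it works on intensities rather than raw spikes. The function name and the toy numbers are hypothetical.

```python
def decouple_two_views(mixed, attenuation):
    """Least-squares estimate of (left, right) from mixed samples.

    Assumed model (not taken from the paper): mixed[k] = left + attenuation[k] * right,
    where attenuation[k] is the known transmission of the modulated view at sample k.
    Solves the 2x2 normal equations in closed form.
    """
    n = len(mixed)
    s_a = sum(attenuation)
    s_aa = sum(a * a for a in attenuation)
    s_m = sum(mixed)
    s_am = sum(a * m for a, m in zip(attenuation, mixed))
    det = n * s_aa - s_a * s_a
    if abs(det) < 1e-12:
        # All attenuation factors equal: the two views cannot be separated.
        raise ValueError("attenuation must vary across samples")
    left = (s_aa * s_m - s_a * s_am) / det
    right = (n * s_am - s_a * s_m) / det
    return left, right


# Toy pixel: true left = 0.6, true right = 0.3, second view alternately
# passed fully (1.0) and attenuated by half (0.5). Made-up values.
toy_attenuation = [1.0, 0.5, 1.0, 0.5]
toy_mixed = [0.6 + a * 0.3 for a in toy_attenuation]
toy_left, toy_right = decouple_two_views(toy_mixed, toy_attenuation)
```

The system is only solvable when the attenuation actually changes over the sampling window, which is why dense temporal sampling of the modulation matters in this kind of setup.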
Paperid: 2599,   Poster  
Authors: Weiran Wang, Jialing Wu, Yaqi Chang, Gang He, Li Xu, Chang Wu, Yunsong Li
Title: ALLNet: Multi-task Dense Prediction for Degraded Images
Abstract: Multi-task dense prediction aims to simultaneously address multiple pixel-level tasks through a unified network for visual scene understanding. However, adverse environmental conditions limit the generalization and practicality of such tasks. To address this, we propose ALLNet, a novel framework that effectively explores degradation patterns and integrates multi-task collaborative information. Specifically, we design a MoE-based Mixture of Adaptive Experts (MaE) restoration component network that enhances degradation features through dynamic routing and guides task-specific feature extraction. Furthermore, we formulate a Task-aware Collaborative Refinement (TCR) module to capture global semantic correlations and cross-task dependencies, facilitating bidirectional collaboration between restoration and task-specific features on degraded images. To the best of our knowledge, this represents the first attempt at multi-task dense prediction under image degradation. Experimental results on degraded NYUD-v2 and PASCAL-Context benchmarks demonstrate that our architecture significantly outperforms existing methods in degraded scenarios.
Paperid: 2600,   Poster  
Authors: Lance Moore, Aranyo Mitra, Ryan Truong, Karoline Kallis, Kelly Kisling, Sandra Meyers, Nuno Vasconcelos
Title: D2T2 - Multimodal Automated Planning for Brachytherapy
Abstract: Brachytherapy is a complex radiation oncology problem that requires the simultaneous prediction of radiation dose, which is used for treatment planning, and a set of machine parameters, known as dwell times, used for treatment delivery. We propose Direct Dwell Time Transformer (D2T2), the first deep learning architecture that directly predicts dwell times during dose prediction. D2T2 is a two-stage model, where the first stage predicts a vector of dwell times and the second implements the physical model of radiation delivery, namely a linear combination of radiation dose kernels. Besides the automatic prediction of dwell times, this has the benefit of constraining the model to make physically plausible dose predictions when trained end-to-end. To enhance this training, we also propose a new loss function, denoted as the gamma loss, based on the prediction of the gamma index, which is the gold standard of dose comparisons. This is implemented by training a model to predict the latter using a synthetic dataset of ground-truth and predicted dose pairs. We train D2T2 on a large dataset of ~5,000 clinical brachytherapy plans (the largest such dataset to date) spanning gynecological, breast, and other treatment sites. Results demonstrate that D2T2 outperforms existing methods in both accuracy and speed. Notably, D2T2 produces deliverable plans and physically valid dose distributions in a single forward pass, for any application of brachytherapy, hours faster than manual planning and minutes faster than more recent automated methods.
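The physical model named in this abstract, dose as a linear combination of dose kernels weighted by dwell times, can be illustrated with a short sketch. This is not the authors' implementation: the function name `dose_from_dwell_times`, the flat voxel layout, and the toy kernel values are assumptions made for the example.

```python
def dose_from_dwell_times(dwell_times, kernels):
    """Return per-voxel dose as sum_i dwell_times[i] * kernels[i][voxel].

    dwell_times: one non-negative float per dwell position.
    kernels: one per-voxel dose-rate list per dwell position
             (hypothetical flat voxel layout, for illustration only).
    """
    if len(dwell_times) != len(kernels):
        raise ValueError("need exactly one kernel per dwell position")
    n_voxels = len(kernels[0])
    dose = [0.0] * n_voxels
    for t, kernel in zip(dwell_times, kernels):
        for v in range(n_voxels):
            dose[v] += t * kernel[v]
    return dose


# Toy example: two dwell positions, three voxels (made-up numbers).
toy_kernels = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]
toy_dose = dose_from_dwell_times([2.0, 4.0], toy_kernels)
```

Because the dose is linear in the dwell times, any dose produced this way is deliverable by construction, which matches the abstract's point about physically plausible predictions.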
Paperid: 2601,   Poster  
Authors: Bofan Chen, Hongyu Zhu, Yi He, Sichu Liang, Shi-Lin Wang
Title: Learning Forgery-Aware Lip Representations Without Forgery Priors
Abstract: Visual Speaker Authentication (VSA) verifies identity by analyzing lip dynamics during prompted speech, offering enhanced privacy compared to full-face methods while maintaining discriminability for high-security applications. However, recent advances in talking face generation (TFG) have enabled realistic forgeries that closely mimic lip dynamics in sync with speech, posing severe threats to VSA systems. Prevailing defenses rely heavily on supervised classifiers trained on known forgeries via empirical risk minimization, resulting in poor generalization to unseen attacks, dependency on continuously updated fake data, and complete failure in the absence of effective forgery priors. In this paper, we revisit the design of forgery detectors and argue that over-reliance on fake priors hinders the exploitation of rich authenticity signals inherently present in real videos. We propose a novel detector trained exclusively on authentic data, learning forgery-aware representations through three key components: (1) lightweight modules that capture forgery-indicative statistics from real videos; (2) an asymmetric contrastive objective that compacts real samples while repelling potential forgeries in representation space; and (3) a theoretically grounded regularizer that shapes real representations into a tractable, isotropic Gaussian. To support rigorous evaluation, we introduce a benchmark suite spanning diverse TFG forgeries. Across eight modern forgery attacks and ten state-of-the-art (SOTA) detectors, our method achieves over a 10% reduction in error rates while preserving identity-verification capability with minimal overhead, and demonstrates consistent gains on datasets that better emulate real-world scenarios.
Paperid: 2602,   Poster  
Authors: Xiao Liang, Huaizhi Tang, Feiyang Zhang, Shiji Yuan, Chun Hu, Dezhi Zheng, Kang Ma
Title: UniGeoRS: A Unified Benchmark for Tri-view Geo-Localization
Abstract: Cross-view geo-localization (CVGL) aims to estimate an image’s geographic location by matching it with geo-referenced images from different viewpoints, supporting applications such as autonomous driving, UAV navigation, and visual surveillance. However, due to the high cost of image collection, current CVGL datasets often suffer from limited diversity in both drone and ground imagery, which constrains model generalization. Furthermore, existing methods primarily focus on either ground-to-satellite or drone-to-satellite matching, lacking a unified framework capable of handling image matching across all three platforms: satellite, drone, and ground. To this end, we introduce the Unified Geo-localization dataset with Real-world and Synthetic imagery (UniGeoRS), a comprehensive benchmark featuring satellite, drone, and ground-view images, with a particular emphasis on the richness and diversity of drone and ground perspectives, enabling more realistic and flexible evaluations of CVGL. Additionally, we propose Cross-Attention-based Matching Enhancement (CAME), a unified framework for CVGL. By dynamically aggregating contextual information from top-ranked candidates, CAME refines feature representations and enhances cross-view matching robustness. Experimental results show that (1) the proposed UniGeoRS benchmark is necessary for training and evaluating CVGL models across all three platforms; (2) UniGeoRS improves model generalization across diverse conditions; and (3) CAME consistently boosts performance across state-of-the-art CVGL approaches.
Paperid: 2603,   Poster  
Authors: Tao Jun Lin, Yujiao Shi, Hongdong Li
Title: VGA: Empowering Aerial-Ground Localization by Visual Geometry Alignment
Abstract: Aerial-ground visual localization is a challenging task due to the significant differences in scene scale and viewpoint between the two views. In this work, we explore the practical benefit of jointly learning camera calibration and bird’s-eye-view (BEV) projection for estimating the full 6-degrees-of-freedom relative camera pose between uncalibrated aerial and ground views. We present Visual Geometry Alignment (VGA), a unified framework that jointly learns a global gravity-alignment prior inferred from dense monocular perspective fields, and a planar alignment prior complementing the unobserved azimuth angle through Procrustes alignment in a shared BEV plane. At inference, we jointly refine the relative camera pose by integrating the predicted per-camera gravity alignment and relative planar azimuth angle, yielding improved orientation and translation alignment from visual input with extremely wide baselines and limited overlap. We evaluate our method on challenging MatrixCity, ACC-NVS1 and ULTRRA ground-aerial pairs, demonstrating that optimizing with learned geometric priors can further improve camera pose estimation across diverse altitudes and environments.
Paperid: 2604,   Poster  
Authors: Kangye Ji, Yuan Meng, Jianbo Zhou, Ye Li, Chen Tang, Zhi Wang
Title: Test-time Sparsity for Extreme Fast Action Diffusion
Abstract: Action diffusion excels at high-fidelity action generation but incurs heavy computational costs owing to its iterative denoising nature. Although current techniques show promise in accelerating diffusion transformers by reusing cached features, they struggle to adapt to policy dynamics arising from diverse perceptions and multi-round rollout iterations in open environments. We propose test-time sparsity to tackle this challenge, which aims to accelerate action diffusion by dynamically predicting prunable residual computations for each model forward pass at test time. However, two bottlenecks remain in this paradigm: 1) repetitive conditional encoding and pruning offset most potential speed gains, and 2) the features cached from previous denoising timesteps cannot constrain large pruning errors under aggressive sparsity. To address the first bottleneck, we design a highly parallelized inference pipeline that minimizes the non-decoder delay to milliseconds. Specifically, we first design a lightweight pruner that shares the encoder with the diffusion transformer. Then, we decouple the encoding and pruning from the autoregressive denoising loop by processing all denoising timesteps in parallel, and overlap the pruner with the decoder forward inference through asynchronism. To overcome the second bottleneck, we introduce an omnidirectional reusing strategy, which achieves 95% sparsity by selectively reusing the features cached from the current forward pass, previous denoising timesteps, and earlier rollout iterations. To learn the rollout-level reusing strategies, we sample a few action trajectories to supervise the actions generated by the sparsified diffusion step by step. Extensive experiments demonstrate that our method reduces FLOPs by 92% and accelerates action generation by 5×, achieving lossless performance with an inference frequency of 47.5 Hz.
Paperid: 2605,   Poster  
Authors: Ligong Cao, Yeting Guo, Haoang Chi
Title: Structural–Semantic Perception for Diffusion-Guided Temporal Forgery Localization
Abstract: Temporal Forgery Localization (TFL) is crucial for enhancing the interpretability and accountability of deepfake forensics by precisely pinpointing the manipulated segments. However, existing methods face two limitations: (1) localization precision, where one-shot boundary prediction models fail to rectify inherent initial prediction biases, and temporal emphasis overlooks modality-internal semantic forgery cues, resulting in noise-sensitive localization, and (2) cross-dataset generalization, where fixed-scale temporal receptive fields struggle to accommodate varying manipulation durations across real-world scenarios. To address these challenges, we propose a unified framework based on structural–semantic perception and diffusion-guided refinement. The structural–semantic perception comprises two complementary components: (1) structural perception, which adaptively models manipulation durations across varying temporal spans using a designed scale weight allocation network, and (2) semantic perception, which analyzes the semantic consistency within each modality through intra-modal distillation. In this way, it first suppresses low-quality forgery localization proposals, yielding a structurally and semantically reliable candidate set. Then a diffusion-based regression head iteratively refines the candidates into precise and temporally coherent boundary trajectories. Extensive experiments on multiple TFL benchmarks demonstrate that our method achieves state-of-the-art performance.
Paperid: 2606,   Poster  
Authors: Qixiu Li, Xiang Zhu, Xiaoyong Li, Xiaolong Xu
Title: PhyOceanCast: Global Ocean Forecasting with Physics-Informed Diffusion
Abstract: Ocean dynamics drive global climate patterns and extreme weather events, making accurate spatiotemporal forecasting essential for climate monitoring and marine operations. Traditional Global Ocean Forecasting Systems (GOFSs) offer high-accuracy predictions, yet remain computationally expensive and fail to fully leverage growing historical data. Recent deep learning models have achieved notable success, but still face three fundamental challenges: (1) they homogenize ocean variables despite strong physical coupling via equation-of-state relationships; (2) they neglect spherical geometry, resulting in severe distortions at high latitudes; and (3) they struggle to model multi-scale temporal dynamics. We introduce PhyOceanCast, a physics-informed diffusion model that overcomes these limitations through two key innovations. First, the Spherical Graph Attention Network for Multi-scale Ocean Coupling (SGAN-MOC) preserves spherical topology while enabling cross-variable interactions via heterogeneous encoding and k-hop-constrained attention. Second, the Physics-Informed Wavelet Temporal Coherence (PWTC) module decomposes ocean dynamics across multiple scales with advection-diffusion constraints. PhyOceanCast forecasts 145 ocean variables, including temperature, salinity, and velocity fields, across 36 depth levels plus sea surface height. Extensive experiments demonstrate superior performance over diffusion, transformer, and hybrid baselines, promising a new paradigm for global ocean canonical variable forecasting. Code is available in the supplementary materials.
Paperid: 2607,   Poster  
Authors: Ligeng Zou, Guihu Zhao
Title: Convolutional Neural Networks Driven by Content Similarity
Abstract: Although convolutional neural networks (CNNs) have continued to evolve in recent years, Transformers have become increasingly popular in the field of computer vision. In this work, we open a new avenue for CNNs, enabling them to aggregate information based on content similarity, an ability analogous to the self-attention mechanism. We adopt reverse thinking to transform the feature similarity between tokens into relative positional information: specifically, the closer the positions of two tokens are, the higher their feature similarity. This approach allows convolution operations to be indirectly transformed into an aggregation mode driven by content similarity. Experiments show that our proposed model, named Ego, achieves excellent performance across various tasks, underscoring the untapped potential of CNNs. Code and models will be made publicly available.
Paperid: 2608,   Poster  
Authors: Haipeng Fang, Yu Li, Fan Tang, Yixing Lu, Juan Cao, Sheng Tang
Title: ResCa: Residual Caching for Diffusion Transformers Acceleration
Abstract: Diffusion transformers have achieved remarkable progress in high-quality image and video generation, but their computational overhead remains a significant challenge. Existing token reduction-based acceleration techniques, such as caching and merging, attempt to reduce this cost from both temporal and spatial perspectives, but often compromise generation quality by introducing non-updated or non-self denoising directions. In this paper, we propose Residual Caching (ResCa), a novel, training-free framework that introduces a proxy denoising perspective to overcome these limitations. ResCa achieves acceleration while maintaining a denoising trajectory that is both self and updated. The core idea is to perform true denoising on only one proxy token within each trajectory-based cluster, and use its computed multi-order residuals to guide the simulated denoising of all other tokens. ResCa can be seamlessly integrated into various diffusion models, including DiT, FLUX, and HunyuanVideo. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method, achieving up to a 5.5× acceleration in GFLOPs while maintaining near-lossless generation quality on FLUX.
Paperid: 2609,   Poster  
Authors: Wenhao Sun, Ji Li, Zhaoqiang Liu
Title: Just-in-Time: Tuning-Free Spatial Acceleration for Diffusion Transformers
Abstract: Diffusion Transformers have established a new state-of-the-art in image synthesis, but the high computational cost of iterative sampling severely hampers their practical deployment. While existing acceleration methods often focus on the temporal domain, they overlook the substantial spatial redundancy inherent in the generative process, where global structures emerge long before fine-grained details are formed. The uniform computational treatment of all spatial regions represents a critical inefficiency. In this paper, we introduce Just-in-Time (JiT), a novel tuning-free framework that addresses this challenge by accelerating the spatial domain. JiT formulates a spatially approximated generative ordinary differential equation (ODE) that drives the full latent state evolution based on computations from a dynamically selected, sparse subset of "anchor" tokens. To ensure seamless transitions as new tokens are incorporated to expand the dimensions of the latent state, we propose a deterministic micro-flow, a simple and effective finite-time ODE that maintains both structural coherence and statistical correctness. Extensive experiments on the state-of-the-art FLUX.1-dev model demonstrate that JiT achieves up to a 7× speedup with nearly lossless performance, significantly outperforming existing acceleration methods and establishing a new and superior trade-off between inference speed and generation fidelity.
Paperid: 2610,   Poster  
Authors: Zhizhen Pan, Hesong Wang, Huan Wang
Title: QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer
Abstract: Estimating 3D attributes directly from images has advanced rapidly with the Visual Geometry Grounded Transformer (VGGT), which predicts camera parameters, depth maps, and point clouds in a single forward pass. However, its 1.2B-parameter scale severely limits deployment on resource-constrained platforms such as UAVs and mobile AR devices. To address this limitation, we introduce QVGGT, a tailored quantization framework designed to compress VGGT. Our approach starts from the observation that transformer blocks within VGGT exhibit heterogeneous sensitivity to quantization. We thus analyze per-block quantization sensitivity and propose a selective mixed-precision strategy that allocates higher precision to the most fragile transformer blocks. To address the amplification of quantization error caused by high-variance camera and register tokens, we further introduce token filtering with camera information compensation, which removes these outliers from activation calibration and restores their geometric cues using a PCA-derived global compensation token. Finally, we develop a task-aware scale search mechanism that evaluates candidate quantization scales not only through layer reconstruction but also through multi-head supervision and cross-head geometric consistency among camera poses, depth maps, and point maps. Extensive experiments on multiple geometry perception benchmarks demonstrate that QVGGT achieves near-lossless W4A16 quantization, preserving the accuracy of all 3D prediction heads while delivering 3× to 4.9× memory reduction and up to 2.8× real hardware speedup over FP32. Our approach makes high-fidelity 3D perception feasible on edge devices, enabling practical deployment of feed-forward 3D reconstruction models in real-world constrained environments.
Paperid: 2611,   Poster  
Authors: Haoyang Cui, Hao Jiang, Yadong Mu
Title: ShreddingNet: Coarse-to-Fine Restoration for Multi-Source Shredded Manuscripts
Abstract: The restoration of artworks and calligraphy is an important research task in preserving human cultural heritage. Few existing works have considered the multi-source fragment-oriented restoration task, in which fragments are not guaranteed to come from the same artwork. We propose ShreddingNet, a coarse-to-fine two-stage pipeline for multi-source manuscript restoration that operates without restrictive conditions. The coarse stage compares the features of each fragment, selecting top-K candidates and clustering fragments by source. This design leverages the key insight that erroneous matches rarely cross source boundaries, enabling high-precision clustering. The fine-grained stage evaluates the candidates, yields matching scores, and filters out erroneous matching pairs from the candidate set, producing more precise final matching pairs for global assembly. Experiments conducted on more than 4,000 images from two datasets demonstrate an average reconstruction F1-score of 98.37%, which is 5.72% higher than the current state-of-the-art method, confirming the method’s effectiveness and robustness. Source code is available in the supplementary material.
Paperid: 2612,   Poster  
Authors: Ni Tang, Shenghao Nie, Xiaotong Luo, Yuan Xie, Yanyun Qu
Title: Degradation-Consistent Test-Time Adaptation for All-in-One Image Restoration
Abstract: All-in-one image restoration (AiOIR) methods have made remarkable progress in handling diverse degradations. However, their performance often deteriorates when the test distribution deviates from the training distribution. Exploring test-time adaptation for AiOIR is therefore crucial. To adapt a pre-trained AiOIR model to unseen degradation distributions without access to source data or retraining, two key challenges must be addressed: designing reliable pseudo-supervision and stabilizing adaptation. Observing that multiple degraded versions of the same scene should map to a consistent clean image, we propose Degradation-Consistent Test-Time Adaptation (DCTTA). DCTTA comprises three core components: (1) test-time redegradation generation, which leverages a diffusion-based generator to construct pseudo degraded–clean pairs for distribution alignment; (2) degradation-guided image restoration, which enforces domain adaptation via self-supervised consistency loss; and (3) test-time important parameter selection, which selectively updates degradation-sensitive parameters to ensure stable adaptation while preserving pre-trained knowledge. Extensive experiments across multiple tasks and challenging domain shifts demonstrate that DCTTA consistently outperforms state-of-the-art AiOIR baselines, achieving up to +4.57 dB PSNR improvement on the Rain100H dataset.
Paperid: 2613,   Poster  
Authors: Dongyue Lu, Rong Li, Alan Liang, Lingdong Kong, Wei Yin, Lai Xing Ng, Benoit Cottereau, Camille Chane, Wei Tsang Ooi
Title: EventDrive: Event Cameras for Vision–Language Driving Intelligence
Abstract: Event cameras sense the world through asynchronous brightness changes with microsecond latency and high dynamic range, offering motion fidelity far beyond frame-based sensors and capturing temporal structure that conventional exposures often miss. These properties make events a powerful complement to RGB in autonomous driving, especially under blur, glare, and rapid motion where frame-based perception can become unreliable. However, existing event-aware vision–language models remain limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop. We present EventDrive, a large-scale benchmark and model suite that unifies event streams, RGB frames, and language supervision across four core dimensions: Perception, Understanding, Prediction, and Planning, covering captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning tasks. Building on this foundation, EventDrive-VLM introduces a multi-horizon event pyramid and a modality-routing mixture-of-experts to adaptively encode and fuse asynchronous and frame-based information for downstream reasoning. Comprehensive evaluation across diverse tasks shows that event streams provide substantial gains in temporal precision, motion awareness, and robustness, bringing event sensing into the center of driving intelligence.
Paperid: 2614,   Poster  
Authors: Jongoh Jeong, Hoyong Kwon, Minseok Kim, Kuk-Jin Yoon
Title: Multimodal Distribution Matching for Vision-Language Dataset Distillation
Abstract: Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision–language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computation and overlook the correlations between modalities. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image–text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal alignment and discrepancy directions along with symmetric contrastive learning. Across image–text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.
Paperid: 2615,   Poster  
Authors: Aiqiu Wu, Zhaofan Qiu, Ting Yao, Tao Mei
Title: PS-SR: Pseudo-Single-Step Video Super-Resolution via Speculative Diffusion
Abstract: Video Super-Resolution (VSR) fundamentally struggles with a critical trade-off: single-step models offer unmatched efficiency but often lack the high-frequency detail, creativity, and visual quality of their multi-step diffusion counterparts, which are computationally prohibitive for practical use. In this paper, we propose PS-SR, a novel "pseudo" single-step VSR framework that transcends this trade-off through a computationally asymmetric sampling pipeline. The key to PS-SR lies in its speculative diffusion mechanism: a powerful base model performs only a single, comprehensive sampling step, establishing the global structure and content fidelity, after which a lightweight draft model, directly augmented by the base model's features, speculatively performs subsequent refinements. Crucially, we further enforce a frequency-domain update rule that constrains these refinements to exclusively inject high-frequency details, preserving the foundational low-frequency content and preventing semantic drift across sampling steps. By doing so, PS-SR creates the "illusion" of a single-step model, delivering similar inference speed and input-output content consistency while achieving the visual richness and creativity typically reserved for costly multi-step generative models. We demonstrate that our "pseudo-single-step" paradigm achieves state-of-the-art quality at a speed comparable to single-step models, paving the way for real-time, high-fidelity video enhancement.
Paperid: 2616,   Poster  
Authors: Mengyang Li, Pinlong Zhao
Title: DABO: Difficulty-Aware Bayesian Optimization with Diffusion-Learned Priors
Abstract: The efficiency of hyperparameter optimization (HPO) is critical for deep learning, yet state-of-the-art methods share a fundamental flaw: they are difficulty-agnostic, treating all hyperparameter configurations homogeneously. This approach leads to inefficient resource allocation, wasting budget in simple regions while under-exploring complex, rugged landscapes, and thereby critically undermining both search efficiency and final performance. To address this universal challenge, we introduce DABO, a framework that pioneers difficulty-aware tuning within the efficient context of Freeze-Thaw Bayesian Optimization. We first model optimization difficulty hierarchically. Then, departing from hand-crafted priors, we train a conditional diffusion model on 120,000 real learning curves, generating synthetic data with 2.3× higher fidelity. This data trains our difficulty-aware surrogate model and acquisition function to dynamically adapt the search strategy. Across 75 tasks, DABO reduces regret by 11–18% compared to the leading difficulty-agnostic method, ifBO. Our work establishes a new paradigm for HPO, shifting the focus from configuration-centric to difficulty-aware resource allocation to enable more robust and efficient optimization.
Paperid: 2617,   Poster  
Authors: Yabing Wang, Zhuotao Tian, Le Wang, Zheng Qin, Sanping Zhou
Title: Spatial Matters: Position-Guided 3D Referring Expression Segmentation
Abstract: 3D Referring Expression Segmentation (3D-RES) is an emerging field that segments 3D objects in point cloud scenes based on given referring expressions. Although existing methods have achieved substantial progress, they primarily focus on semantic cues and often overlook spatial relations, which are essential for segmenting the referred objects in complex 3D scenes, especially those containing multiple visually similar instances. In this paper, we propose Position3D, a novel approach that explicitly incorporates spatial relation modeling into 3D-RES. Specifically, we introduce a spatial-aware query generation module that constructs point proxies by aggregating local context and incorporating spatial relations, from which the most text-relevant are selected as queries. Furthermore, we design a position-guided deformable attention in the decoder, which progressively refines attention to concentrate on the target object under positional relationship guidance. Extensive experiments on two benchmark datasets, i.e., ScanRefer and Multi3DRefer, validate the effectiveness of Position3D.
Paperid: 2618,   Poster  
Authors: Wen Wen, Hao CHEN, Shiliang Zhang
Title: Prompt-Anchored Vision–Text Distillation for Lifelong Person Re-identification
Abstract: Lifelong person re-identification (LReID) aims to train a generalizable model with sequentially collected data. However, such models often suffer from semantic drift, limited adaptability, and catastrophic forgetting as new domains emerge. Existing exemplar-free approaches mainly focus on visual encoder distillation or parameter regularization, while overlooking the potential of auxiliary modalities, such as text, to preserve semantic stability and enable incremental plasticity. We observe that the frozen text encoder in pretrained vision–language models can serve as a stable semantic anchor, offering consistent guidance throughout lifelong learning. To leverage the synergy between vision and text, we propose Prompt-Anchored vision–text Distillation (PAD), a unified framework that enhances semantic alignment and cross-domain generalization. On the textual side, we distill semantic prompts that maintain vision–text alignment under a fixed semantic coordinate system. On the visual side, an EMA-based teacher performs model distillation assisted by an adaptive prompt pool that allocates new slots for each incoming domain while freezing past ones, achieving both adaptability and memory retention. Extensive experiments demonstrate that PAD substantially outperforms state-of-the-art methods across multiple LReID benchmarks.
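The "EMA-based teacher" mentioned in this abstract refers to the standard exponential-moving-average update, in which teacher weights slowly track student weights. The sketch below shows only that generic update rule; the dictionary-of-floats parameter layout, the function name `ema_update`, and the momentum value are assumptions, and nothing here reproduces the paper's prompt pool or distillation losses.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return new teacher parameters: momentum * teacher + (1 - momentum) * student.

    teacher, student: dicts mapping parameter names to floats
    (a hypothetical stand-in for real model tensors).
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy parameters (made-up values): the teacher moves 10% of the way to the student.
toy_teacher = {"w": 1.0, "b": 0.0}
toy_student = {"w": 0.0, "b": 2.0}
toy_updated = ema_update(toy_teacher, toy_student, momentum=0.9)
```

A momentum close to 1 keeps the teacher stable across domains, while a smaller value lets it follow the student more quickly; the trade-off chosen in the paper is not stated in the abstract.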
Paperid: 2619,   Poster  
Authors: Rakshith Madhavan, Matteo Forlivesi, Marina Bertolini, Cristina Turrini, Federica Arrigoni, Luca Magri
Title: Homaloidal parametrization for detecting critical two-view configurations
Abstract: We consider the problem of identifying degenerate configurations while estimating the fundamental matrix from (at least) 8 point correspondences. It is known that such configurations correspond to an ill-posed estimation of the fundamental matrix, so it is important to identify them in practice. So far, a practical degeneracy test is only available for the cases of planar scenes and pure rotation, while the case of the general critical surface (e.g., a hyperboloid/cone/cylinder containing 3D points and camera centres) is less studied, and the only available method is highly unstable, involving a pre-computed fundamental matrix. In this paper, we propose a novel degeneracy test for detecting points on the critical surface. By exploiting the geometry of the so-called "homaloidal net of conics", we are able to design a simple and very practical test that requires the linear estimation of a quadratic transformation from image correspondences. Our test does not require a fundamental matrix in advance and turns out to be more stable than its closest competitor, as shown in our experiments on both synthetic and real-world degenerate configurations.
Paperid: 2620,   Poster  
Authors: Nicolas Michel, Maorong Wang, Jiangpeng He, Toshihiko Yamasaki
Title: Continual Distillation of Teachers from Different Domains
Abstract: Deep learning models continue to scale, with some requiring more storage than numerous datasets. We introduce a new paradigm: Continual Distillation (CD), where a student learns sequentially from a stream of teacher models without retaining earlier teachers. CD faces two challenges: teacher training data is unavailable, and teachers have varying expertise. We show that external unlabeled data enables Unseen Knowledge Transfer (UKT), allowing the student to acquire information from domains that are not present in the training data but are known to the teacher. We also show that sequential distillation causes Unseen Knowledge Forgetting (UKF), where transferred knowledge is lost after training on later teachers. To better trade off between UKT and UKF, we propose Self External Data Distillation (SE2D), a method that preserves logits on external data to stabilize learning across heterogeneous teachers. Experiments on multiple benchmarks show that SE2D reduces UKF and improves cross-domain generalization. The code will be released upon acceptance.
Paperid: 2621,   Poster  
Authors: Mengnan Zhao, Lihe Zhang, BoWang BoWang, Tianhang Zheng, Hong Zhong, Geyong Min
Title: Mitigating Error Amplification in Fast Adversarial Training
Abstract: Fast Adversarial Training (FAT) has proven effective in enhancing model robustness by encouraging networks to learn perturbation-invariant representations. However, FAT often suffers from catastrophic overfitting (CO), where the model overfits to the training attack and fails to generalize to unseen ones. Moreover, robustness-oriented optimization typically leads to notable performance degradation on clean inputs, and such degradation becomes increasingly severe as the perturbation budget grows. In this work, we conduct a comprehensive analysis of how guidance strength affects model performance by modulating perturbation and supervision levels across distinct confidence groups. The findings reveal that low-confidence samples are the primary contributors to CO and the robustness–accuracy trade-off. Building on this insight, we propose a Distribution-aware Dynamic Guidance (DDG) strategy that dynamically adjusts both the perturbation budget and supervision signal. Specifically, DDG scales the perturbation magnitude according to the sample confidence at the ground-truth class, thereby guiding samples toward consistent decision boundaries while mitigating the influence of learning spurious correlations. Simultaneously, it dynamically adjusts the supervision signal based on the prediction state of each sample, preventing overemphasis on incorrect signals. To alleviate potential gradient instability arising from dynamic guidance, we further design a weighted regularization constraint. Extensive experiments on standard benchmarks demonstrate that DDG effectively alleviates both CO and the robustness–accuracy trade-off.
Paperid: 2622,   Poster  
Authors: Yutao Qin, Gang Dai, Yifan Zhang, Youwei Han, Qisheng He, Shuangping Huang
Title: Towards Human-Like Robot Handwriting via Contour-Aware Generation
Abstract: Empowering machines to simulate human handwriting is a promising research direction. Most existing methods, however, primarily focus on reproducing the writing trajectory to capture the overall character structure, while neglecting the critical aspect of stroke contour modeling. Consequently, these methods struggle to generate visually realistic, human-like handwriting, limiting their applicability in scenarios such as calligraphy robots. To address this issue, we propose a new task, called Contour-aware Handwriting Trajectory Reconstruction (CHTR). This task presents two major challenges: 1) Existing handwriting datasets lack stroke contour annotations, making supervised learning difficult; 2) Previous methods are unable to recover stroke contour and preserve the overall character structure jointly. To address the dataset limitation, we present CHTR-110K, a large-scale character dataset with refined stroke contour annotations. To tackle the technical challenge, we propose Graph-based Handwriting Trajectory Reconstruction (G-HTR), a novel method using contour-aware graphs to jointly model stroke contour and character structure. We use a Graph Neural Network to capture structural relationships among nodes and introduce a multi-scale graph learning strategy to encode both fine-grained stroke details and global character structure. Extensive experiments verify the effectiveness of G-HTR, outperforming previous state-of-the-art methods on both our CHTR-110K and the widely-used CASIA-OLHWDB dataset. G-HTR further shows strong real-world results when deployed on robots, confirming its practical value. To support future research, we will release source code and dataset.
Paperid: 2623,   Poster  
Authors: Tim Strohmeyer, Lucas Morin, Gerhard Ingmar Meijer, Valery Weber, Ahmed Nassar, Peter W. J. Staar
Title: MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures
Abstract: Automatically extracting chemical structures from documents is essential for information retrieval in the chemistry domain and for creating training datasets for machine learning models. Automatic pipelines have been developed to recognize molecules represented either in figures or in text independently. However, methods for recognizing chemical structures from multimodal descriptions (Markush structures) lag behind in precision and cannot be used for automatic large-scale processing. In this work, we present MarkushGrapher-2, an end-to-end approach for the multimodal recognition of chemical structures in documents. First, our method employs a dedicated OCR model to extract text from chemical images. Second, the text, image, and layout information are jointly encoded through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. Finally, the resulting encodings are effectively fused through a two-stage training strategy and used to auto-regressively generate a representation of the Markush structure. To address the lack of training data, we introduce an automatic pipeline for constructing a large-scale dataset of real-world Markush structures. In addition, we present IP5-M-1k, a large manually-annotated benchmark of real-world Markush structures, designed to advance research on this challenging task. Extensive experiments show that our approach substantially outperforms state-of-the-art models in multimodal Markush structure recognition, while maintaining strong performance in molecule structure recognition. Code, models, and datasets will be released publicly.
Paperid: 2624,   Poster  
Authors: Hyun Lee, Hyemin Jeong, Yejin Kim, Hyungwook Choi, Hyunsoo Cho, Soo Kyung Kim, Joonseok Lee
Title: A More Word-like Image Tokenization for MLLMs
Abstract: Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps the pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy–compute trade-off while modifying neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses standard LLaVA 1.5 using only about 11% of the original visual tokens, substantially reducing memory and latency and making visual inputs more compatible with LLM-based reasoning.
Paperid: 2625,   Poster  
Authors: Zhiyuan Hua, Cornelia Fermuller, Yiannis Aloimonos
Title: Moving Border Ownership for Event-based Motion Segmentation
Abstract: Event cameras provide accurate information at motion boundaries, exactly where disentangling egomotion, object motion, and border ownership determines segmentation quality. We argue that the missing ingredient in dynamic scene interpretation is moving border ownership: detecting motion boundaries and assigning which side is foreground so occlusions are resolved by design. Traditional geometric motion segmentation pipelines (e.g., flow clustering, simple motion models) remain assumption-heavy and slow, while deep models often fail to generalize across sensors or datasets. We introduce a lightweight, ownership-aware predictor trained solely on synthetic events with perfect supervision for boundaries, ownership, and motion, generated via a Blender pipeline. Its key targets, a signed-distance ownership field and a motion mask, focus learning where events occur and yield stable gradients. The model runs in real time and generalizes without tuning: trained on synthetic events, it achieves zero-shot transfer on EED, EVIMO1, EVIMO2, and EMSMC, delivering state-of-the-art performance. By casting motion segmentation as ownership-aware edge understanding, we combine the robustness of model-based reasoning with the scalability of learning.
Paperid: 2626,   Poster  
Authors: Minh Dinh, SouYoung Jin
Title: Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
Abstract: Large-scale image datasets frequently contain identifiable or sensitive content, raising privacy risks when training models that may memorize and leak such information. We present Unsafe2Safe, a fully automated pipeline that detects privacy-prone images and rewrites only their sensitive regions using multimodally guided diffusion editing. Unsafe2Safe operates in two stages. Stage 1 uses a vision–language model to (i) inspect images for privacy risks, (ii) generate paired private and public captions that respectively include and omit sensitive attributes, and (iii) prompt a large language model to produce structured, identity-neutral edit instructions conditioned on the public caption. Stage 2 employs instruction-driven diffusion editors to apply these dual textual prompts, producing privacy-safe images that preserve global structure and task-relevant semantics while neutralizing private content. To measure anonymization quality, we introduce a unified evaluation suite covering Quality, Cheating, Privacy, and Utility dimensions. Across Caltech101 and MIT Indoor67, Unsafe2Safe reduces face similarity, text similarity, and demographic predictability by large margins, while maintaining downstream model accuracy comparable to training on raw data. Fine-tuning diffusion editors on our automatically generated triplets (private caption, public caption, edit instruction) further improves both privacy protection and semantic fidelity. Unsafe2Safe provides a scalable, principled solution for constructing large, privacy-safe datasets without sacrificing visual consistency or downstream utility.
Paperid: 2627,   Poster  
Authors: Zirui Xu, Biao Yang, rongrong Ni, Zhongkai Zhou, Shaobo Shen
Title: W2W: Language-Model-Based Trajectory Prediction with Reinforcement Learning
Abstract: Pedestrian trajectory prediction is crucial for applications such as autonomous driving and social robots. Recently, language model (LM)–based trajectory prediction has offered both prediction accuracy and interpretability. However, the L2 loss commonly used in trajectory prediction cannot be directly applied to LM optimization, resulting in degraded prediction performance. Moreover, current LM-based trajectory prediction methods lack explicit expressions of social interactions, and their scene descriptions are overly simplistic, making it challenging to impose practical scene constraints. To address these issues, we propose Write-to-Walk (W2W). First, we construct a pedestrian trajectory dataset with explicit interaction semantics and generate parsable prompts based on observed trajectories and interaction cues (companion/following/obstacle), alleviating the lack of interaction semantics in prompts. Afterward, a T5-Small backbone is trained in a two-stage manner: (1) Full-parameter supervised fine-tuning with cross-entropy loss for language learning, endowing W2W with the capability for formatted question answering; (2) Reinforcement Learning (RL) to optimize W2W, where a reward function that integrates ADE error and off-road penalties strengthens scene constraints, producing future trajectories consistent with the scene context and further improving prediction accuracy. Experiments on the benchmarking datasets (ETH/UCY and SDD) demonstrate that W2W outperforms LM-based prediction methods on ADE/FDE metrics and achieves competitive results compared with SOTA trajectory prediction methods. Meanwhile, the interpretability of LMs further enhances W2W’s prospects for deployment in safety-critical domains, such as autonomous driving.
Paperid: 2628,   Poster  
Authors: Miaotian Guo, Shuguang Dou, Yin Li, Aidong Men, Dongsheng Jiang
Title: Agentic Video Summarization via Self-Reflecting Multimodal Understanding
Abstract: The rise of AI agents powered by large language models (LLMs) has transformed intelligent systems by enabling autonomous tool use, reasoning, and action across diverse tasks. Despite this rapid progress, existing video summarization approaches primarily focus on feature extraction or frame-level importance regression but lack the autonomous reasoning, self-correction, and decision-making capabilities that define true agent-based intelligence. To bridge this gap, we propose AgenticVS, the first agentic workflow for video summarization that leverages multimodal large language models (MLLMs) to complete the summarization–verification–reflection loop in a fully autonomous manner. Rather than designing new architectures for feature extraction or regression, we exploit the understanding and reflective reasoning abilities of MLLMs to build an adaptive summarization framework with a self-reflecting workflow. Experiments on SumMe and TVSum demonstrate that our agentic workflow outperforms state-of-the-art methods while enhancing interpretability and adaptability, paving the way for agent-based multimodal video understanding.
Paperid: 2629,   Poster  
Authors: Zhehao Huang, Baijiong Lin, JINGYUAN ZHANG, Jingying Wang, Yuhang Liu, Ning Lu, Tao Li, Xiaolin Huang
Title: VL-RouterBench: A Benchmark for Vision–Language Model Routing
Abstract: Multi-model routing has evolved from an engineering technique into essential infrastructure, yet existing work lacks a systematic, reproducible benchmark for evaluating vision–language models (VLMs). We present VL-RouterBench to assess the overall capability of VLM routing systems systematically. The benchmark is grounded in raw inference and scoring logs from VLMs and constructs quality and cost matrices over sample–model pairs. In scale, VL-RouterBench covers 14 datasets across 3 task groups, totaling 30,540 samples, and includes 15 open-source models and 2 API models, yielding 519,180 sample–model pairs and a total input–output token volume of 34,494,977. The evaluation protocol jointly measures average accuracy, average cost, and throughput, and builds a ranking score from the harmonic mean of normalized cost and accuracy to enable comparison across router configurations and cost budgets. On this benchmark, we evaluate 10 routing methods and baselines and observe a significant routability gain, while the best current routers still show a clear gap to the ideal Oracle, indicating considerable room for improvement in router architecture through finer visual cues and modeling of textual structure. We will open-source the complete data construction and evaluation toolchain to promote comparability, reproducibility, and practical deployment in multimodal routing research.
Paperid: 2630,   Poster  
Authors: Shuo Li, Bingchen Miao, Wendong Bu, Juncheng Li, Hanwang Zhang, Fei Wu
Title: DeepAlign: Mitigating Modality Conflict through Modality-Specific Alignment
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated promising advancements in augmenting the capabilities of LLMs to comprehend visual input. However, modality misalignment between vision and text remains a key challenge in MLLMs, which can be attributed to two aspects: misalignment of modality-specific representations and depletion of modality-specific details. To address the issue of modality misalignment, we propose DeepAlign, a novel multimodal alignment framework to mitigate modality conflict, which employs representation intervention and structure-induced knowledge distillation to prevent the misalignment and depletion of modality-specific information. Extensive experiments demonstrate that DeepAlign significantly mitigates modality conflicts, leading to substantial performance improvements compared to backbone models across multiple vision-language tasks. It also stimulates some emergent abilities in MLLMs, such as multimodal in-context learning on interleaved text-image sequences.
Paperid: 2631,   Poster  
Authors: Zhiyu Zhou, Feng Hui, Yu Liu
Title: AERGS-SLAM: Auto-Exposure-Robust Stereo 3D Gaussian Splatting SLAM
Abstract: 3D Gaussian splatting (3DGS) has emerged as a revolutionary scene representation in simultaneous localization and mapping (SLAM) research. However, existing research on 3DGS-based SLAM fails to accurately address the appearance variations induced by camera auto-exposure in prevalent real-world scenarios, resulting in reduced localization and photorealistic mapping accuracy. To address this issue, we propose a stereo auto-exposure-robust Gaussian splatting SLAM (AERGS-SLAM), a framework that is robust to such variations and enables both reliable localization and exposure-controlled photorealistic mapping. Our key contributions are twofold. Firstly, we propose a camera exposure network to model the camera exposure process, which we integrate with Gaussian splatting to achieve exposure-controlled novel view synthesis. Secondly, we exploit an illumination-robust geometric feature for localization and Gaussian map initialization, enhancing localization accuracy under exposure-varying scenarios. Extensive experiments on public datasets and our self-collected real-world dataset demonstrate that AERGS-SLAM outperforms baselines in both localization performance and photorealistic mapping quality.
Paperid: 2632,   Poster  
Authors: Han Han, Wei Zhai, Tiesong Zhao, Bin Li, Yang Cao, Zheng-Jun Zha
Title: Unsupervised 3D Motion Estimation Using Event Camera
Abstract: Estimating the 3D motion of scene points from 2D observations, typically parameterized by optical flow and motion in depth, is a fundamental problem in computer vision. Existing learning-based methods usually rely on supervised regression from densely labeled data, but their dependence on annotations and limited use of geometric constraints restricts generalization, motivating unsupervised solutions. Unsupervised 3D motion estimation is challenging because motion along the viewing direction is unobservable, and optical flow and motion in depth are geometrically coupled, making their separation ambiguous. Event cameras capture per-pixel brightness changes asynchronously with microsecond latency, providing high temporal resolution and motion continuity. Projecting event streams along different axes reveals spatiotemporal expansion and contraction patterns that encode depth variation and geometric structure, offering rich cues for unsupervised estimation. Leveraging these properties, we propose an unsupervised event-based 3D motion estimation framework that jointly models optical flow and motion in depth. We first derive an analytical relationship to infer initial motion in depth from estimated flow and further refine it using a directional expansion modulation module that captures horizontal and vertical expansion–contraction patterns in event projections. Finally, motion in depth is incorporated into optical flow warping under a contrast maximization objective. Experiments on the CarlaEvent3D dataset show that our method achieves competitive accuracy and strong generalization, advancing unsupervised 3D motion estimation in the event domain.
Paperid: 2633,   Poster  
Authors: Huaning Li, Ziming Wang, Runhao Jiang, Rui Yan, Huajin Tang
Title: Spike-driven Discrete Aggregation for Event-based Object Detection
Abstract: With their high dynamic range and temporal resolution, event cameras are well-suited for object detection, especially under motion blur and extreme illumination. Recent state-of-the-art works for event-based object detection primarily focus on the high-level design of backbones. However, developing effective event representations is equally crucial, as it bridges asynchronous event streams with the dense tensors required by detection networks. Most existing aggregation strategies for event representation continuously accumulate all events within sampled intervals without selective filtering, inevitably introducing uninformative events that degrade detection accuracy. To address this limitation, we introduce a novel perspective, termed Discrete Aggregation, which adaptively and discretely selects informative events for differentiable aggregation. We realize this through the Spiking Discrete Aggregation (SDA) module, which is inspired by the threshold-based spike firing mechanism in Spiking Neural Networks (SNNs) and implemented using gated recurrent spiking neurons. Additionally, we introduce the Multi-Timescale Fusion (MTF) method which leverages coarse-grained temporal features from continuous event streams to further enhance the representation capability of SDA. Experimental results on neuromorphic datasets demonstrate that our method achieves state-of-the-art performance among all fully spiking architectures while using fewer parameters, reaching 43.4% mAP_50:95 on Gen1 (+4.5% over prior art). Moreover, our method exhibits superior robustness under noisy conditions and shows strong compatibility with non-spiking models.
Paperid: 2634,   Poster  
Authors: Tianrui Yu, Xiubo Liang, Hongzhi Wang
Title: RetFormer: Multimodal Retrieval for Enhancing Image Recognition
Abstract: The expansion of Transformers and the collection of high-quality multimodal datasets have propelled deep neural networks to achieve unprecedented performance in vision and language tasks. However, applying these advances is non-trivial in real-world applications. The extensive number of parameters complicates model updates, and real-world data often features a long-tailed distribution along with noisy labels. To address the above issues, we propose to explore the internal structure of the neural network for learning with sample relationships, rather than just increasing the number of model parameters. Specifically, we introduce RetFormer, a model enhanced with a multimodal knowledge base for storing world knowledge, and a retrieval cross-fusion module designed to establish robust multimodal sample relationships by leveraging content from the knowledge base. RetFormer establishes a robust relationship between image and text modalities by integrating information from external knowledge bases into the model's decision-making process, thus overcoming the limitations of traditional approaches on model size and datasets. Our experiments demonstrate the benefits of integrating large-scale image-text datasets into vision tasks and exemplify the importance of modeling the relationship between image and text modalities. We have evaluated our approach on the task of long-tailed recognition and learning with noisy labels and have shown that it achieves state-of-the-art accuracies.
Paperid: 2635,   Poster  
Authors: Kuan Heng Lin, Zhizheng Liu, Pablo Salamanca, Yash Kant, Ryan Burgert, Yuancheng Xu, Koichi Namekata, Yiwei Zhao, Bolei Zhou, Micah Goldblum, Paul Debevec, Ning Yu
Title: Vista4D: Video Reshooting with 4D Point Clouds
Abstract: We present Vista4D, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method resynthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and maintain precise camera control for challenging new trajectories. We build a 4D-grounded point cloud representation with static pixel segmentation and 4D reconstruction to explicitly preserve seen content and provide rich camera signals, and we train with reconstructed multiview dynamic data for robustness against point cloud artifacts during real-world inference. Our results demonstrate improved 4D consistency, camera control, and visual quality compared to state-of-the-art baselines under a variety of videos and camera paths. Moreover, our method generalizes to real-world applications such as dynamic scene expansion and 4D scene recomposition. Results are best viewed as videos in the Supplement.
Paperid: 2636,   Poster  
Authors: Haoru Tan, WU Sitong, Yanfeng Chen, Shizhen Zhao, Yangtian Sun, Tianjia Liu, Chirui Chang, Shaofeng Zhang, Xingwu Sun, Xiuzhe Wu, Ruobing Xie, Xiaojuan Qi
Title: Dynamic Important Example Mining for Reinforcement Finetuning
Abstract: Reinforcement finetuning (RFT) is increasingly used to strengthen the reasoning abilities of large models, yet its effectiveness is bounded by how training data are selected and used. Most data-centric RFT methods rely on static or heuristic sample selection, implicitly assuming a sample’s value is fixed over training. This overlooks the non-stationary dynamics of policy learning and can lead to suboptimal updates. We propose Dynamic Important Example Mining (DIEM), a principled and fully automated framework that makes data utilization adaptive throughout RFT. DIEM integrates two components into each optimization step: (i) a gradient-alignment importance estimator that efficiently approximates each sample’s marginal contribution to policy improvement; and (ii) a constrained batch reweighting scheme that maximizes aggregate utility while preserving the update’s gradient magnitude to stabilize optimization. This converts data selection from a one-time preprocessing heuristic into an intrinsic part of the learning algorithm, yielding a self-organizing, curriculum-like training trajectory driven by model dynamics rather than external scores. Across several multimodal reasoning benchmarks, DIEM consistently outperforms strong static and dynamic baselines, providing a significant performance uplift to the base RFT algorithm of approximately 1% to 6%, while introducing only a minimal 1.2% training overhead.
Paperid: 2637,   Poster  
Authors: Sara Aghajanzadeh, Xiaoyang Bai, Zhongmin Zhu, David Forsyth, Viktor Gruev
Title: Global Underwater Geolocation from Time-Lapse Polarization Imagery
Abstract: It is extremely hard for an underwater agent to know where it is. Satellite signals disappear within centimeters of the surface; acoustic baselines require heavy infrastructure to instrument small regions. The polarization of the sky, visible underwater, reveals the elevation of the sun. The pattern of elevation over the day reveals location to an agent with a clock. However, recovering elevation from polarization images is very difficult. SOTA geolocalization methods can localize well for locations where they have seen data, but accuracy collapses when the data comes from a new location. Our physics-guided synthesis pipeline expands a huge library of polarization images from a small set of sites to 2.8 million solar-elevation–matched training sequences spanning latitudes, seasons, and water types. A compact two-stage transformer reconstructs the solar-elevation curve and predicts geolocation. Under leave-one-site-out tests, the site-averaged median geodesic error is ~500 km, about an eightfold improvement over previous deep-learning baselines; with limited target-site data, the median error contracts to single-digit kilometers.
Paperid: 2638,   Poster  
Authors: Keyang Ye, Hongzhi Wu, Kun Zhou
Title: Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering
Abstract: Novel view synthesis has been significantly advanced by NeRFs and 3D Gaussian Splatting (3DGS), which require ordering volumetric samples or primitives for correct color blending. While the recent Gaussian-Enhanced Surfels (GES) enable high-performance, sort-free rendering, they suffer from aliasing artifacts and suboptimal reconstruction. To address these limitations, we propose DP-GES, a novel representation that augments opaque surfels with semi-transparent boundaries and leverages Depth Peeling to establish accurate per-pixel ordering. This design enables sort-free Gaussian splatting with correct transmittance modulation, effectively eliminating aliasing and popping artifacts while facilitating a fully differentiable joint optimization. Extensive experiments demonstrate that our method achieves superior reconstruction quality and compares favorably against state-of-the-art techniques across a wide range of scenes.
Paperid: 2639,   Poster  
Authors: Zihan Zheng, Zhaoyang Jia, Naifu Xue, Jiahao Li, Bin Li, Zongyu Guo, Xiaoyi Zhang, Zhenghao Chen, Houqiang Li, Yan Lu
Title: Generative Video Compression with One-Dimensional Latent Representation
Abstract: Recent generative video codecs (GVCs) typically encode video into a 2D latent grid and employ high-capacity generative decoders for reconstruction. However, this paradigm still leaves two key challenges in fully exploiting spatial-temporal redundancy: Spatially, the 2D latent grid inevitably preserves intra-frame redundancy due to its rigid structure, where adjacent patches remain highly similar, thereby necessitating a higher bitrate. Temporally, the 2D latent grid is less effective for modeling long-term correlations in a compact and semantically coherent manner, as it hinders the aggregation of common contents across frames. To address these limitations, we introduce Generative Video Compression with One-Dimensional (1D) Latent Representation (GVC1D). GVC1D encodes the video data into extremely compact 1D latent tokens conditioned on both short- and long-term contexts. Without the rigid 2D spatial correspondence, these 1D latent tokens can adaptively attend to semantic regions and naturally facilitate token reduction, thereby reducing spatial redundancy. Furthermore, the proposed 1D memory provides semantically rich long-term context while maintaining low computational cost, thereby further reducing temporal redundancy. Experimental results indicate that GVC1D attains superior compression efficiency, achieving bitrate reductions of 60.4% under LPIPS and 68.8% under DISTS on the HEVC Class B dataset and surpassing previous video compression methods.
Paperid: 2640,   Poster  
Authors: Jaejin Lee, Minjae JEONG, Joonhyuk Park, Yechan Hwang, Seunghun Baek, Won Hwa Kim
Title: Mark4D: Temporally-Consistent Watermarking for 4D Gaussian Splatting
Abstract: Embedding invisible and temporally consistent watermarks into dynamic 4D Gaussian Splatting (4DGS) models poses unique challenges due to continuous spatiotemporal deformation of Gaussians and diverse motion dynamics. Existing 3DGS watermarking methods, which directly fine-tune parameters within Gaussian splats, fail to preserve geometric fidelity and temporal coherence when extended to dynamic 4D settings. For this setting, we propose Mark4D, a temporally consistent watermarking technique that achieves robust, imperceptible, and motion-aware watermark embedding for 4DGS. Mark4D comprises 1) a decoder for watermark recovery in the latent video–text space for robustness against pixel-level distortions, 2) trajectory-aligned offsets that embed watermark signals along Gaussian motion paths to preserve geometry, and 3) a motion-adaptive loss weighting strategy that balances supervision across frames with varying motion intensities. Extensive experiments on synthetic and real-world dynamic scene datasets demonstrate that Mark4D achieves superior bit accuracy, visual fidelity, and robustness under diverse distortions, establishing a foundation for secure and reliable protection of dynamic 4D scene assets. The implementation of Mark4D will be released upon publication.
Paperid: 2641,   Poster  
Authors: Tianyi Lyu, Mingye Ju, Kai-Kuang Ma
Title: Disentanglement-wise Image Dehazing through Cross-Domain Manifold Consensus
Abstract: Current dehazing methods face two intertwined challenges: (1) the misidentification of haze-related features due to domain-specific interference in both single-domain and empirically integrated multi-domain approaches, and (2) severe chromatic distortion caused by haze-induced inherent color entanglement. To overcome these limitations, we propose a unified framework centered on a Cross-domain Invariant Manifold (CIM), which constructs a consistent latent representation space by aligning multi-domain features through shared scattering semantics. The manifold is optimized via consensus density-driven contrastive learning, effectively enhancing cross-domain consistency while eliminating domain-specific biases. Building upon this structured foundation, we further introduce a disentanglement-wise architecture, i.e., the Physics-Guided HSV Decomposition Network, that explicitly separates entangled color components to ensure robust color fidelity. Comprehensive experiments demonstrate that our CIM-D framework achieves state-of-the-art performance, effectively eliminating haze-induced color shifts and restoring natural scene appearance. The code will be made publicly available.
Paperid: 2642,   Poster  
Authors: Mia Polansky, George Kopanas, Stephan J. Garbin, Todd Zickler, Dor Verbin
Title: Eulerian Gaussian Splatting using Hashed Probability Pyramids
Abstract: We introduce a probabilistic splat-based radiance field framework that retains the fast rasterization and test-time efficiency of 3D Gaussian Splatting (3DGS) while replacing heuristic primitive manipulation with gradient-based optimization of a volumetric probability density. Rather than relocating, splitting, or culling Gaussians via hand-tuned densification (e.g., ADC), we treat primitive locations as samples drawn from a persistent, learnable density. We instantiate this density with a novel, memory-efficient multi-scale hierarchical grid that enables end-to-end gradient-based control over primitive population density. To stabilize stochastic training, we derive an unbiased gradient estimator with control variates that markedly reduces variance. By allowing probability mass to flow to where the loss demands, our method eliminates brittle priors and naturally explores the volume, achieving state-of-the-art reconstruction quality on mip-NeRF360 while preserving 3DGS-level rendering speed.
Paperid: 2643,   Poster  
Authors: Ji Shi, Xianghua Ying, Bowei Xing, Ruohao Guo, Wenzhen Yue
Title: RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting
Abstract: 3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual quality. However, existing methods struggle with semi-transparent specular surfaces that exhibit both complex reflections and clear transmission, often producing blurry reflections or overly occluded transmission. To address this, we present RT-Splatting, a framework that disentangles each Gaussian's geometric occupancy from its optical opacity. This factorization yields a unified surface-volume scene representation with a single set of Gaussian primitives. Our hybrid renderer interprets this representation both as a surface to capture high-frequency reflections and as a volume to preserve clear transmission. To mitigate the ambiguity in jointly optimizing reflection and transmission, we introduce Specular-Aware Gradient Gating, which suppresses misleading gradients from highly specular regions into the transmission branch, effectively reducing distracting floaters. Experiments on challenging semi-transparent scenes show that RT-Splatting achieves state-of-the-art performance, delivering high-fidelity reflections and clear transmission with real-time rendering. Moreover, our factorization naturally enables flexible scene editing.
Paperid: 2644,   Poster  
Authors: Junwei Xu, Mengzu Liu, Zhenyu Wang, Fangfang Wu, Sijia Wu, Tao Huang, Weisheng Dong
Title: IAFMNet: Information-Aware Feature Modulation for Efficient Super-Resolution
Abstract: Single Image Super-Resolution (SISR) aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) input, which becomes increasingly challenging under real-world computational constraints. However, most efficient SISR methods adopt lightweight, spatially uniform strategies that allocate equal computation and focus across all regions, ignoring the uneven distribution of visual complexity. From an information theory perspective, textures and edges inherently carry more critical information, resulting in reconstruction errors that are disproportionately concentrated in these regions. This motivates allocating more resources and attention to these informative areas. In this paper, we propose IAFMNet, an Information-Aware Feature Modulation network for efficient SR. At its core lies the Information Density Map (IDM), which is estimated in an unsupervised manner by minimizing the information entropy of features, thereby highlighting informative regions with high theoretical encoding costs. Guided by the IDM, IAFMNet adopts a synergistic dual-branch design: (1) a sparse convolution branch that dynamically allocates computation to informative areas while bypassing low-information regions, and (2) an implicit modulation branch that adaptively emphasizes complex regions via information-aware affine transformations. Extensive experiments demonstrate that IAFMNet effectively identifies informative regions and achieves superior visual fidelity with reduced computational overhead.
Paperid: 2645,   Poster  
Authors: Jiandong Jin, Chenglong Li, Hao Feng, Andong Lu, Lili Huang, Jin Tang
Title: Progressive Multi-cue Alignment for Unaligned RGBT Tracking
Abstract: Unaligned RGBT tracking aims to achieve robust target localization across spatially misaligned RGB and thermal infrared (TIR) videos, a crucial challenge for applying RGBT tracking in real-world scenarios. Existing methods often calculate all cross-modal alignment parameters (i.e., spatial shift and scale change) simultaneously, but suffer from two major limitations. 1) They struggle to adapt to different degrees of misalignment difficulty during tracking. 2) They usually require complex models to handle challenging scenarios, resulting in a large computational burden. To overcome these limitations, we propose a novel Progressive Multi-cue Alignment framework called PMATrack, which disentangles the calculation of cross-modal alignment parameters in a progressive manner and dynamically selects appropriate cues to handle different challenges, thereby enabling robust and efficient unaligned RGBT tracking. In particular, PMATrack divides the cross-modal alignment parameter estimation into three stages to progressively perform center offset computation, scale transformation estimation, and global refinement. At each stage, we design a difficulty-aware router to adaptively select the appropriate alignment expert based on the cross-modal alignment complexity, thereby reducing computational redundancy. In addition, we build a high-quality video benchmark called MUART244 to facilitate the comprehensive evaluation of different unaligned RGBT tracking algorithms. Extensive experiments on our MUART244 and public LasHeR-Unaligned datasets demonstrate the outstanding performance of PMATrack against existing state-of-the-art methods.
Paperid: 2646,   Poster  
Authors: Zeyi Huang, Yuyang Ji, Anirudh Sundara Rajan, Zefan Cai, Wen Xiao, Haohan Wang, Junjie Hu, Yong Jae Lee
Title: Learning to Select Visual Tools from Experience
Abstract: We introduce VisualToolAgent (VisTA), a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and compose tools from a diverse library based on empirical performance. Existing methods for tool-augmented visual reasoning either rely on training-free prompting or large-scale supervised fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, guided solely by task outcomes. Leveraging reinforcement learning with verifiable rewards (RLVR), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, MathVerse, and BlindTest benchmarks demonstrate that VisTA achieves significant performance gains over training-free and fine-tuning baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.
Paperid: 2647,   Poster  
Authors: Wanhee Lee, Klemen Kotar, Rahul Mysore Venkatesh, Jared Watrous, Daniel LK Yamins
Title: Perceptual 3D Simulation With Physical World Modeling
Abstract: Predicting how a scene will evolve after a desired 3D transformation from images is a central goal in vision, graphics, and robotics. Yet unlike ideal simulators with full access to 3D geometry and dynamics, real-world systems must rely on perceptual inputs and local actions that are inherently partial and incomplete. In this work, we present P3Sim, a physical world modeling system that simulates future scene states under both partial observations and incomplete 3D transformation signals. P3Sim is composed of three interacting components: a learned physical world model, a geometric conditioning module, and a persistent scene memory. The world model interprets perception as probabilistic inference over multimodal scene variables, providing predictions of the distributions of any scene variable conditioned on any combination of others. The geometric conditioning module provides a partial 3D transform signal for conditioning the world model at inference time. The persistent scene memory integrates predictions over time, enabling online updates and consistency under uncertainty. By combining learned inference with explicit geometric structure, P3Sim balances data-driven flexibility with built-in inductive bias. This design yields a flexible perceptual simulator that generalizes across diverse 3D transformation tasks, such as novel view synthesis, object manipulation, and dynamic scene prediction, advancing toward general-purpose 3D scene understanding and transformation.
Paperid: 2648,   Poster  
Authors: Ti Wang, Xiaohang Yu, Mackenzie Mathis
Title: FMPose: 3D Pose Estimation via Flow Matching
Abstract: Monocular 3D pose estimation is fundamentally ill-posed due to depth ambiguity and occlusions, thereby motivating probabilistic methods that generate multiple plausible 3D pose hypotheses. In particular, diffusion-based models have demonstrated strong performance, but their iterative denoising process typically requires many time steps for each prediction, making inference computationally expensive. In contrast, Flow Matching (FM) learns an ODE-based velocity field, enabling efficient generation of 3D pose samples with only a few integration steps. Inspired by this capability, we propose a novel generative pose estimation framework, FMPose, that formulates 3D pose estimation as a conditional distribution transport problem. It continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned on 2D inputs. While the ODE trajectories are deterministic, FMPose naturally generates diverse pose hypotheses by sampling different noise seeds. To obtain a single accurate prediction from those hypotheses, we further introduce a Reprojection-based Posterior Expectation Aggregation (RPEA) module, which approximates the Bayesian posterior expectation over 3D hypotheses. FMPose surpasses existing methods on the widely used human pose estimation benchmarks Human3.6M and MPI-INF-3DHP, and further achieves state-of-the-art performance on the 3D animal pose datasets Animal3D and CtrlAni3D, demonstrating strong performance across both human and animal 3D pose domains.
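The few-step sampling idea described in this abstract can be illustrated with a minimal sketch. It is not the FMPose model: `toy_velocity`, `mu`, and `s` are hypothetical stand-ins for a learned, 2D-conditioned velocity network. The toy field is the closed-form velocity that transports a standard Gaussian to N(mu, s^2) along straight lines, so a handful of Euler steps suffices, and different noise seeds give different hypotheses even though each ODE trajectory is deterministic.

```python
import random

def toy_velocity(x, t, mu, s):
    # Hypothetical closed-form field (assumption, not the paper's network):
    # it moves N(0, 1) samples to N(mu, s^2) along straight-line paths.
    return [(s - 1.0) * (xi - t * mi) / (1.0 - t + t * s) + mi
            for xi, mi in zip(x, mu)]

def sample_hypothesis(seed, mu, s, steps=4):
    # Draw noise from the standard Gaussian prior, then integrate the ODE
    # dx/dt = v(x, t) from t = 0 to t = 1 with a few explicit Euler steps.
    rng = random.Random(seed)
    x0 = [rng.gauss(0.0, 1.0) for _ in mu]
    x = list(x0)
    h = 1.0 / steps
    for k in range(steps):
        v = toy_velocity(x, k * h, mu, s)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x0, x

# Three flattened "pose" coordinates; mu and s are illustrative values only.
mu, s = [0.5, -1.0, 2.0], 0.1
hypotheses = [sample_hypothesis(seed, mu, s)[1] for seed in range(5)]
```

Because the toy trajectories are straight, the Euler result matches the exact transport `mu + s * x0` regardless of the step count; a learned field would generally only approximate this.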
Paperid: 2649,   Poster  
Authors: Shaojin Wu, Mengqi Huang, Yufeng Cheng, wenxu wu, Jiahe Tian, Yiming Luo, Fei Ding, Qian HE
Title: Unified Customized Generation by Disentangled Reward Modeling
Abstract: Existing literature typically treats various customized generation tasks (e.g., subject-customized generation, style-customized generation) as distinct and disjoint problems, with each task focusing solely on customizing a specific aspect of the reference image. However, we argue that the objectives of these different customization tasks are inherently complementary and can be mutually enhanced within a unified framework, as they fundamentally involve the disentanglement of multiple feature aspects from the reference image. To this end, we introduce USO, a Unified Simultaneous Optimization framework to simultaneously unify different customized tasks (i.e., subject and style). Specifically, USO introduces a cyclical data-model framework that connects these two tasks by a subject-for-style data curation pipeline and a style-for-subject model training pipeline. The subject-for-style data curation pipeline leverages a state-of-the-art subject-customized model to generate high-quality triplet data comprising content images, style images, and their corresponding stylized content images. Building on this foundation, the style-for-subject model training pipeline introduces an auxiliary style reward to simultaneously align style and content features, thereby reinforcing the model’s ability to extract the desired style or content features from the reference image. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models, excelling in both subject consistency and style similarity.
Paperid: 2650,   Poster  
Authors: Wenxuan Guo, Xiuwei Xu, Yichen Liu, Xiangyu Li, Hang Yin, Huangxing Chen, Wenzhao Zheng, Jianjiang Feng, Jie Zhou, Jiwen Lu
Title: AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
Abstract: Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in the Habitat simulator show that AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods.
Paperid: 2651,   Poster  
Authors: Mingwen Shao, Lingzhuang Meng, Xiang Lv, Mengyao Wu, Xinyuan Chen, Qiao Zhang, Chang Liu, Yuanjian Qiao, Chao Dong
Title: UniDef: Universal Defense Against Unauthorized Image Manipulation
Abstract: Image protection against unauthorized diffusion-based editing has achieved encouraging progress. However, existing methods face two critical limitations: (1) They only disturb the denoising direction at local steps, so generated images still retain the original or edited semantics. (2) Their optimization relies heavily on model-specific gradients, limiting transferable protection across different models and tasks. To address these challenges, we propose a Universal Defense (UniDef) framework for protection against unauthorized image manipulation. Specifically, we first discover that different variants of diffusion models tend to pursue a consistent distribution objective during the complete denoising process. Based on this discovery, we design a Consistent Distribution Deviation strategy to perturb the diffusion direction across the global denoising process, thereby disrupting the overall image semantics. Furthermore, to mitigate model dependency, we devise a Finite Difference-based Jacobian Estimation module to approximate the global gradient in a model-agnostic manner, thus ensuring more transferable protection. Benefiting from the above designs, our method yields generated images that no longer preserve the image semantics, while generalizing well. Extensive experiments demonstrate that UniDef not only outperforms existing methods, but also exhibits universal protection across diverse models and tasks.
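The model-agnostic gradient approximation mentioned in this abstract rests on finite differences, which can be sketched generically. This is not UniDef's Jacobian Estimation module; `black_box` is a hypothetical scalar objective standing in for a model that can only be queried, and the central-difference scheme and `eps` value are assumptions chosen for illustration.

```python
def finite_difference_gradient(f, x, eps=1e-4):
    # Central differences: perturb one coordinate at a time and query f twice.
    # Only function evaluations are needed, never the model's internal gradients.
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * eps))
    return grad

def black_box(x):
    # Hypothetical query-only objective (sum of squares) used for the demo.
    return sum(v * v for v in x)

estimated = finite_difference_gradient(black_box, [1.0, -2.0, 0.5])
```

The cost is two queries per input coordinate, which is why practical methods typically estimate along a small number of directions rather than every pixel.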
Paperid: 2652,   Poster  
Authors: Hee Suk Yoon, Eunseop Yoon, Ji Woo Hong, SooHwan Eom, Gwanhyeong Koo, Mark A. Hasegawa-Johnson, Qi Dai, Chong Luo, Chang D. Yoo
Title: PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that providing a fine-grained, model-intrinsic signal, rewarding the confidence growth in the ground-truth answer, effectively improves language reasoning training by providing step-level guidance without costly external models. While effective for unimodal text, we find that naively applying this global reward to vision-language (V-L) reasoning is a suboptimal strategy, as the task is a heterogeneous mix of sparse visual perception and dense textual reasoning. This global normalization creates mixture-induced signal degradation, where the training signal for visual steps is statistically distorted by the predominant textual steps. We propose Perception-Decomposed Confidence Reward (PDCR), a framework that solves this by aligning the reward structure with the task's heterogeneous nature. PDCR first performs an unsupervised skill decomposition, introducing a model-internal Visual Dependence Score to quantify visual reliance and applying a clustering algorithm to separate perception and reasoning steps. Based on this, PDCR computes a decomposed advantage by normalizing confidence gains within each skill cluster. This intra-cluster normalization provides a stable, correctly-scaled signal for both perception and reasoning. We demonstrate that PDCR outperforms the naive, global-reward formulation and sparse-reward baselines on key V-L reasoning benchmarks.
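The contrast between global and intra-cluster normalization can be made concrete with a small sketch. It assumes z-score normalization and takes the cluster labels as given; the Visual Dependence Score and the clustering step are not modeled, and the function names and numbers are hypothetical.

```python
from statistics import mean, pstdev

def z_normalize(values, eps=1e-8):
    # Standard z-score; eps guards against a zero standard deviation.
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def decomposed_advantage(gains, clusters):
    # Normalize confidence gains separately inside each skill cluster.
    out = [0.0] * len(gains)
    for label in set(clusters):
        idx = [i for i, c in enumerate(clusters) if c == label]
        for i, a in zip(idx, z_normalize([gains[i] for i in idx])):
            out[i] = a
    return out

# Illustrative per-step confidence gains: perception steps have small gains,
# reasoning steps have large ones (values are made up for the demo).
gains = [0.01, 0.03, 0.40, 0.60, 0.50]
clusters = ["perception", "perception", "reasoning", "reasoning", "reasoning"]

global_adv = z_normalize(gains)                    # both perception steps come out negative
local_adv = decomposed_advantage(gains, clusters)  # the better perception step comes out positive
```

Under global normalization the larger reasoning gains dominate the statistics, so even the better perception step is penalized; normalizing within each cluster restores a usable sign and scale for it.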
Paperid: 2653,   Poster  
Authors: Min Shi, Xiaohui Zeng, Jiannan Huang, Yin Cui, Francesco Ferroni, Jialuo Li, Max Li, Yogesh Balaji, Haoxiang Wang, Tsung-Yi Lin, Xiao Fu, Yue Zhao, Chieh-Yun Chen, Ming-Yu Liu, Humphrey Shi
Title: DuetGen: Towards General Purpose Interleaved Multimodal Generation
Abstract: Unified multimodal generation aims to jointly model image-to-text and text-to-image tasks within a single architecture. However, current approaches struggle to produce coherent, interleaved sequences of text and images. This limitation hinders applications that rely on tightly integrated multimodal outputs, such as step-by-step instructional guides, visual planning tools, and interactive content editing, where textual explanations and visual elements must be generated in a coordinated manner. We introduce DuetGen, a general-purpose interleaved multimodal generation model, and investigate data curation, architecture design, and evaluation. In terms of data, we construct a large-scale high-quality instruction-tuning corpus combining curated web content, rewritten multimodal conversations, and diverse synthetic examples covering everyday scenarios. Architecturally, DuetGen builds upon a pretrained MLLM and diffusion transformer (DiT) pretrained on video generation, avoiding costly unimodal pretraining while remaining scalable. A two-stage decoupled training strategy first instruct-tunes the MLLM and then aligns it with the DiT using large-scale curated interleaved image–text sequences. Experiments on public and newly constructed benchmarks show that DuetGen substantially outperforms prior open-source systems across text quality, image fidelity, and image–context alignment, achieving substantial gains on text-to-image and image-editing benchmarks. Code and data will be released.
Paperid: 2654,   Poster  
Authors: Junha Song, Yongsik Jo, So Yeon Min, Quanting Xie, Taehwan Kim, Yonatan Bisk, Jaegul Choo
Title: M${^2}$SeR: Multimodal Self-Refinement for Lightweight Image Captioning
Abstract: Systems such as video chatbots and navigation robots often depend on streaming image captioning to interpret visual inputs. Existing approaches typically employ large multimodal language models (MLLMs) for this purpose, but their substantial computational cost hinders practical application. This limitation motivates our development of a lightweight captioning model. Our investigation begins by replacing the large-scale language component in MLLMs with a compact 125M-parameter model. Surprisingly, this compact model, despite a 93x reduction in size, achieves comparable performance to MLLMs, suggesting that factual image captioning does not significantly require the complex reasoning abilities of LLMs. Despite this promising result, our lightweight model still lacks reliability. To address this, we draw inspiration from the human visual process: perceiving a global and coarse understanding of the scene before attending to finer details. Accordingly, we propose a multimodal self-refinement framework that guides the model to utilize features from salient regions, identified by referencing the previous coarse caption, and to produce a refined description. Experimental results demonstrate the superiority of our model in both single-sentence and detailed captioning, extending even to long-range video QA tasks.
Paperid: 2655,   Poster  
Authors: Qianwei Tang, Baile Xu, Jian Zhao, Furao Shen
Title: Coordinate Denoising for Non‑Equilibrium Molecular Representation Learning
Abstract: Three-dimensional molecular representation learning has shown great promise in modeling chemical structures and their properties. However, most existing approaches implicitly assume molecules are at or near equilibrium states. This assumption breaks down for non-equilibrium structures, which are ubiquitous in molecular dynamics (MD) trajectories, where standard coordinate denoising techniques fail because the direct equivalence between denoising scores and atomic forces no longer holds. To bridge this gap, we propose Node Denoising on non-Equilibrium Molecules (NDeM), a novel auxiliary task grounded in a second-order finite difference approximation of the potential energy surface. By explicitly accounting for the non-zero inherent forces in non-equilibrium states, NDeM provides a theoretically sound denoising objective applicable to arbitrary molecular conformations. Crucially, our method is designed as a lightweight, architecture-agnostic plugin that requires no pre-training and can be seamlessly integrated into various supervised learning pipelines. Extensive experiments across diverse benchmarks, including MD17, QM9, and the large-scale OC20 dataset, demonstrate that NDeM consistently improves baseline models, yielding highly competitive performance and validating its robustness across both equilibrium and non-equilibrium regimes.
Paperid: 2656,   Poster  
Authors: Moru Liu, Hao Dong, Olga Fink, Mario Trapp
Title: Adaptive Confidence Regularization for Multimodal Failure Detection
Abstract: The deployment of multimodal models in high-stakes domains, such as self-driving vehicles and medical diagnostics, demands not only strong predictive performance but also reliable mechanisms for detecting failures. In this work, we address the largely unexplored problem of failure detection in multimodal contexts. We propose Adaptive Confidence Regularization (ACR), a novel framework specifically designed to detect multimodal failures. Our approach is driven by a key observation: in most failure cases, the confidence of the multimodal prediction is significantly lower than that of at least one unimodal branch, a phenomenon we term confidence degradation. To mitigate this, we introduce an Adaptive Confidence Loss that penalizes such degradations during training. In addition, we propose Multimodal Feature Swapping, a novel outlier synthesis technique that generates challenging, failure-aware training examples. By training with these synthetic failures, ACR learns to more effectively recognize and reject uncertain predictions, thereby improving overall reliability. Extensive experiments across four datasets, three modalities, and multiple evaluation settings demonstrate that ACR achieves consistent and robust gains. The source code will be publicly released.
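The confidence degradation observation can be sketched as a simple check. The abstract does not give the form of the Adaptive Confidence Loss, so the following is an assumption: confidence is taken as the maximum softmax probability, and the penalty is a hinge on the gap between the most confident unimodal branch and the multimodal prediction. All names are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence(logits):
    # Assumed confidence measure: maximum softmax probability.
    return max(softmax(logits))

def degradation_penalty(multimodal_logits, unimodal_logits_list):
    # Positive only when some unimodal branch is more confident than the fused
    # prediction, i.e. when confidence degradation occurs.
    c_multi = confidence(multimodal_logits)
    c_best_uni = max(confidence(l) for l in unimodal_logits_list)
    return max(0.0, c_best_uni - c_multi)

# Fused prediction is more confident than both branches: no penalty.
no_degradation = degradation_penalty([2.0, 0.0], [[0.5, 0.0], [0.1, 0.0]])
# One branch is far more confident than the fused prediction: penalized.
degradation = degradation_penalty([0.1, 0.0], [[3.0, 0.0], [0.0, 0.0]])
```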
Paperid: 2657,   Poster  
Authors: Yi YANG, Zheng Wang, Xing Xu, Jingkuan Song, Heng Tao Shen
Title: Gravitation-Driven Semantic Alignment for Text Video Retrieval
Abstract: The inherent “many-to-many” semantic ambiguity, where one video matches multiple texts and vice versa, aggravates the difficulty of text-video retrieval. Dominant deterministic embeddings capture only the mean semantics, while existing probabilistic methods fail to distinguish hard negatives because they impose rigid uncertainty priors or ignore the interaction between similarity and uncertainty. To this end, we propose a novel physics-inspired framework (GraviAlign) that decomposes the alignment of cross-modal semantic distributions into two orthogonal factors inspired by the Gravitational Force: (1) Semantic Attraction, measuring gravitational alignment between distribution centers via uncertainty-derived “semantic mass” and “semantic distance”; (2) Geometric Overlap, quantifying distribution intersection. Each factor has independent veto power to reject matches with misalignment or poor overlap. Additionally, GraviAlign offers an efficient (O(D)), theoretically grounded alternative to intractable joint integrals. Extensive experiments on DiDeMo, MSR-VTT, and ActivityNet demonstrate the effectiveness and superiority of GraviAlign, and ablation studies confirm the indispensability of the two novel components.
Paperid: 2658,   Poster  
Authors: Aoyu Liu, Zhen Liu, Ziyi Wang, Dian Chen, Bing Zeng, Shuaicheng Liu
Title: ExpoCM: Exposure-Aware One-Step Generative Single-Image HDR Reconstruction
Abstract: Single-image HDR reconstruction aims to recover high dynamic range radiance from a single low dynamic range (LDR) input, but remains highly ill-posed due to detail saturation in over-exposed regions and noise amplification in under-exposed areas. While recent diffusion-based approaches offer powerful generative priors, they often overlook the exposure-dependent nature of the degradation and incur substantial computational costs from iterative sampling. To address these challenges, we propose ExpoCM, a novel one-step generative HDR reconstruction framework that reformulates HDR reconstruction as a Probability Flow ODE (PF-ODE) and constructs exposure-aware consistency trajectories via exposure-dependent perturbations. Specifically, a soft exposure mask is first constructed to separate the LDR image into over-, under-, and well-exposed regions. Based on this partition, region-conditioned consistency trajectories are designed to hallucinate saturated details, suppress noise in dark regions, and preserve reliable structures within a single, distillation-free inference step. To further enhance perceptual quality, we introduce an Exposure-guided Luminance-Chromaticity Loss in the CIE Lab space, which assigns exposure-aware weights to luminance and chromaticity components, effectively mitigating brightness bias and color drift. Extensive experiments on the HDR-REAL, HDR-EYE, and AIM2025 benchmarks demonstrate that ExpoCM achieves state-of-the-art fidelity and perceptual accuracy, while enabling over 400× and 20× faster inference compared to DDPM (1000 steps) and DDIM (50 steps), respectively. Code will be released to facilitate future research.
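A soft three-way exposure partition of the kind described above can be sketched per pixel. The abstract does not specify how ExpoCM builds its mask, so the sigmoid form, the thresholds `low` and `high`, and the `sharpness` parameter are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_exposure_mask(lum, low=0.1, high=0.9, sharpness=40.0):
    # lum is a normalized luminance in [0, 1]. The three weights are soft
    # memberships that sum to 1; thresholds and sharpness are hypothetical.
    under = sigmoid(sharpness * (low - lum))
    over = sigmoid(sharpness * (lum - high))
    well = 1.0 - under - over
    return under, well, over

dark = soft_exposure_mask(0.0)    # dominated by the under-exposed weight
mid = soft_exposure_mask(0.5)     # dominated by the well-exposed weight
bright = soft_exposure_mask(1.0)  # dominated by the over-exposed weight
```

A soft mask avoids hard region boundaries, so pixels near a threshold contribute to two regions with smoothly varying weights.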
Paperid: 2659,   Poster  
Authors: Bing Cai, Xiaoli Wang, Gui-Fu Lu, Zechao Li
Title: Multi-Hierarchical Contrastive Spectral Fusion for Multi-View Clustering
Abstract: Multi-view contrastive clustering has emerged as a powerful paradigm for learning comprehensive representations from heterogeneous data sources. However, prevailing approaches typically overlook the intrinsic geometric and clustering structures, rendering them structure-agnostic. In this paper, we propose a novel framework that performs Multi-Hierarchical Contrastive Spectral Fusion (MCSF) to address these limitations. MCSF integrates deep spectral embedding into the encoder to preserve local manifold structure, guiding the learned representations to be clustering-friendly. To enhance cross-view consistency, MCSF introduces a multi-hierarchical contrastive loss jointly optimizing (1) view-specific structure preservation, (2) view-consensus alignment, and (3) consensus structure refinement. This mechanism enables the construction of an accurate and semantically consistent consensus representation, effectively fusing multi-view information and uncovering authentic cluster structures. Extensive experiments on benchmarks validate the effectiveness of multi-hierarchical contrastive spectral fusion in clustering accuracy and representation quality.
Paperid: 2660,   Poster  
Authors: Jiancheng Huang, Gengwei Zhang, Zequn Jie, Siyu Jiao, Yinlong Qian, Ling Chen, Yunchao Wei, Lin Ma
Title: M4V: Multimodal Mamba for Efficient Text-to-Video Generation
Abstract: Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multimodal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a multimodal Mamba framework for efficient text-to-video generation. Specifically, a MultiModal diffusion Mamba (MM-DiM) block is designed within the framework to enable seamless integration of multimodal information and spatiotemporal modeling. In detail, we introduce a novel multimodal token re-composition design, which employs a bidirectional scheme for multimodal information integration through simple token arrangement, along with visual registers to enhance spatial–temporal consistency. As a result, the MM-DiM blocks in M4V reduce FLOPs by 45% compared with the attention-based alternative when generating videos at 768×1280 resolution. Additionally, several training strategies are explored in this work to provide a better understanding of training text-to-video models using only publicly available datasets. Extensive experiments on text-to-video benchmarks demonstrate M4V's ability to produce high-quality videos while significantly lowering computational costs. Code will be made publicly available.
Paperid: 2661,   Poster  
Authors: Seokju Cho, Abhishek Badki, Hang Su, Jindong Jiang, Ziyao Zeng, Seungryong Kim, Sifei Liu, Orazio Gallo
Title: Enhancing Vision Language Models for 4D Perception
Abstract: Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about 3D motion, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection on 2D images. Second, existing datasets fail to disentangle object and camera motion. To address these, we present a QA generation pipeline that focuses on motion-related scene understanding. We take particular care of the entanglement of camera and object motion by casting tracking in both the traditional way and in a novel, fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate large-scale 400K training samples and a 2.2K-sample benchmark. Training existing models on our dataset yields performance improvements on an external benchmark, validating the effectiveness of our method.
Paperid: 2662,   Poster  
Authors: Fei Tang, Zhangxuan Gu, Zhengxi Lu, Shangzhan Zhang, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang
Title: GUI-SAGE: Enhancing GUI Automation with Self-Explanatory Learning
Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown promise for GUI automation, enabling agents to learn from binary task completion signals. However, when task difficulty exceeds model capacity, on-policy exploration fails to discover correct actions, creating zero-advantage traps that eliminate learning signals. While incorporating off-policy expert demonstrations seems intuitive, it causes persistent high-entropy states due to distribution mismatch, disrupting effective learning. We propose GUI-SAGE, a self-explanation framework that generates in-distribution reasoning trajectories for GUI automation. By conditioning on ground-truth actions, our method produces in-distribution guidance that avoids the confusion caused by out-of-distribution expert demonstrations. We further introduce Entropy-Modulated Credit Assignment, which recalibrates learning weights by jointly considering prediction confidence and reward signals, enabling amplified updates for confident correct actions and attenuated updates for uncertain explorations. Extensive experiments on AndroidControl and GUI-Odyssey demonstrate that GUI-SAGE-3B achieves competitive performance with 81.1% success rate, substantially outperforming existing methods. Our analysis validates that self-explanations maintain stable learning dynamics while expert demonstrations cause entropy collapse, and that entropy modulation provides the largest improvements on in-distribution samples.
Paperid: 2663,   Poster  
Authors: Haoze Zheng, Zihao Wang, Xianfeng Wu, Yajing Bai, Yexin Liu, Yun LI, Xiaogang Xu, Harry Yang
Title: Learning Latent Proxies for Controllable Single-Image Relighting
Abstract: Single-image relighting is highly under-constrained: small illumination changes can produce large, nonlinear variations in shading, shadows, and specularities, while geometry and materials remain unobserved. Existing diffusion-based approaches either rely on intrinsic- or G-buffer–based pipelines that require dense and fragile supervision, or operate purely in latent space without physical grounding, making fine-grained control of direction, intensity, and color unreliable. We observe that full intrinsic decomposition is unnecessary for accurate relighting. Instead, sparse but physically meaningful cues, indicating where illumination should change and how materials should respond, are sufficient to guide a diffusion model. Based on this insight, we introduce LightCtrl, which integrates minimal physical priors at two levels: a few-shot latent proxy encoder that extracts compact material–geometry cues from limited PBR supervision, and a lighting-aware mask that identifies illumination-sensitive regions and steers the denoiser toward shading-relevant pixels. To compensate for scarce PBR data, we refine the proxy branch using a DPO-based objective that aligns predicted cues with perceptually preferred relighting behavior. We further present ScaLight, a large-scale object-level dataset with systematically varied illumination and complete camera–light metadata, enabling physically consistent and controllable training. Across object- and scene-level benchmarks, our method achieves photometrically faithful relighting with accurate continuous control, surpassing prior diffusion- and intrinsic-based baselines, including gains of up to +2.4 dB PSNR and 35% lower RMSE under controlled lighting shifts.
Paperid: 2664,   Poster  
Authors: Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, Aleksander Holynski
Title: FRM: Linear-Time 3D Reconstruction via Test-Time Training
Abstract: Feed-forward transformer models such as VGGT and π³ are highly accurate, but their computational cost grows quadratically with the number of input images, making them slow to evaluate on large collections. More efficient approaches ameliorate this cost at the expense of reconstruction quality. We introduce Fast Reconstruction Model (FRM), a stateful feed-forward reconstruction model that uses a bidirectional architecture that scales linearly in the number of input views, while matching or surpassing the reconstruction quality of quadratic-time methods. FRM employs test-time training layers to compress images into a compact hidden scene state during a single forward pass, enabling our model to reconstruct 3D scenes at speeds up to 75 FPS on a single H100 GPU, over 20 times faster than SOTA methods such as VGGT. This hidden state also serves as an implicit scene representation which can be queried at real-time rates to produce colored point maps from novel views.
Paperid: 2665,   Poster  
Authors: Junming Zhang, Shuyu Yin, Peilin Liu, Rendong Ying, Fei Wen
Title: Curvature-Aware Zeroth-Order Optimization for Memory-Efficient Test-Time Adaptation
Abstract: Test-time adaptation (TTA) aims to enhance the cross-domain performance of pre-trained models by adapting to unlabeled test data. While most existing TTA methods rely on backpropagation (BP) for fine-tuning, BP-free methods such as zeroth-order (ZO) methods are more desirable in practical on-device scenarios. ZO methods rely only on forward computation, which can largely reduce the complexity and memory overhead of on-device deployment. However, ZO methods suffer from much higher variance than first-order methods when estimating the gradient. To address this, we propose an improved ZO method to substantially boost the performance of ZO-optimization-based TTA. First, we provide an observation that reveals the persistent low-rank Hessian structure of the loss during the adaptation process. Based on this insight, we then propose a loss-landscape curvature-aware zeroth-order (CAZO) method, which leverages a sliding-average estimation of the diagonal Hessian to construct a covariance matrix for anisotropic perturbation sampling. CAZO operates by freezing pretrained weights and optimizing minimal adapter parameters via gradient estimation based on forward-only passes, which can substantially reduce the memory overhead compared to BP-based methods. Extensive experiments demonstrate that CAZO significantly outperforms existing TTA methods, achieving state-of-the-art performance while maintaining an excellent balance between accuracy and memory efficiency. Code is provided in supplemental material.
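As a rough illustration of the idea of zeroth-order gradient estimation with anisotropic perturbations mentioned in this abstract, the following is a minimal sketch of a generic two-point estimator with per-coordinate perturbation scales. It is not the authors' CAZO implementation: the function name `zo_gradient`, the `scales` argument (standing in for the square root of a diagonal covariance), and all default values are assumptions made for illustration, and the Hessian-based construction of the covariance is not shown.

```python
import random


def zo_gradient(f, x, scales, mu=1e-3, n_samples=8, rng=None):
    """Two-point zeroth-order gradient estimate (illustrative sketch).

    `scales` holds hypothetical per-coordinate standard deviations, i.e. the
    square root of a diagonal covariance. In expectation the estimate equals
    diag(scales**2) @ grad f(x) for smooth f, a preconditioned gradient.
    """
    rng = rng or random.Random()
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        # Anisotropic perturbation: u_i = scales_i * z_i with z_i ~ N(0, 1).
        u = [s * rng.gauss(0.0, 1.0) for s in scales]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Directional derivative along u from two forward evaluations only.
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i, ui in enumerate(u):
            grad[i] += slope * ui / n_samples
    return grad
```

Only forward evaluations of `f` are needed, which is the property that makes such estimators attractive when backpropagation memory is unavailable; the variance of the estimate shrinks as `n_samples` grows.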
Paperid: 2666,   Poster  
Authors: Yangshi Ge, Zheng Liu, Feng Lu
Title: Render-to-Adapt: Unsupervised Personal Adaptation for Gaze Estimation
Abstract: Deep learning-based gaze estimation methods tend to suffer from a substantial performance drop in real-world scenarios with varying users and environments. To tackle this issue, most recent approaches employ Unsupervised Domain Adaptation (UDA) to bridge the gap between source and target domains. However, this paradigm is misaligned with real-world scenarios, where the system typically needs to adapt to only a single new user. Therefore, this paper advocates a more practical paradigm: Unsupervised Personal Adaptation (UPA), which calibrates a pre-trained model using a few unlabeled images from a single new user. Conventional UDA methods do not guarantee improvements for every user and often yield lower average performance in this setting. To address this problem, we propose Render-to-Adapt (R2A), a self-supervised framework specifically designed for the UPA task. Given a pretrained gaze model, R2A utilizes a gaze-conditioned renderer to synthesize new images based on the model's gaze predictions, and enforces eye-region consistency as a label-free signal to enhance personalized gaze estimation. We evaluate R2A on a re-designed cross-dataset personal adaptation benchmark. Experimental results show that R2A consistently improves performance across all individuals and significantly outperforms existing SOTA methods.
Paperid: 2667,   Poster  
Authors: Liyi Chen, Pengfei Wang, Guowen Zhang, Zhiyuan Ma, Lei Zhang
Title: Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass
Abstract: Most instruction-driven 3D editing methods rely on 2D models to guide the explicit and iterative optimization of 3D representations. This paradigm, however, suffers from two primary drawbacks. First, it lacks a universal design for different 3D editing tasks because the explicit manipulation of 3D geometry necessitates task-dependent rules, e.g., 3D appearance editing demands inherent source 3D geometry, while 3D removal alters source geometry. Second, the iterative optimization process is highly time-consuming, often requiring thousands of invocations of 2D/3D updating. We present Omni-3DEdit, a unified, learning-based model that generalizes various 3D editing tasks implicitly. One key challenge in achieving our goal is the scarcity of paired source-edited multi-view assets for training. To address this issue, we construct a data pipeline, synthesizing a relatively rich number of high-quality paired multi-view editing samples. Subsequently, we adapt the pre-trained generative model SEVA as our backbone by concatenating source view latents along with conditional tokens in sequence space. A dual-stream LoRA module is proposed to disentangle different view cues, largely enhancing our model's representational learning capability. As a learning-based model, our model is free of the time-consuming online optimization, and it can complete various 3D editing tasks in one forward pass, reducing the inference time from tens of minutes to approximately two minutes. Extensive experiments demonstrate the effectiveness and efficiency of Omni-3DEdit.
Paperid: 2668,   Poster  
Authors: XIAOHUI HAO, Yanglin Pu, Yongjun Wang, Rui She
Title: Distribution-Aligned Multimodal Fusion for Robust Object Detection
Abstract: Cross-degradation generalization remains a critical challenge for RGB-infrared multimodal object detection, especially when training data covers limited degradation types. This paper presents a distribution alignment framework with a key insight: aligning fused features to the pretrained distribution where the frozen detector performs optimally, rather than adapting to training-specific degradations. By freezing the pretrained detector and training only a lightweight fusion module (15% of total parameters), our approach leverages complementary infrared information to reduce distribution shift while maintaining computational efficiency. The method achieves state-of-the-art results on three benchmarks (LLVIP, FLIR, DroneVehicle) with 4× faster training. Critically, we demonstrate that aligning to the pretrained distribution substantially outperforms aligning to training degradations when generalizing to unseen scenarios.
Paperid: 2669,   Poster  
Authors: TIANQI ZHAO, Di Wu, Liangrui Peng, Yifan Huang, Kemeng Zhao, Shuo Li, Zhiyu Li, Yizhu Wang, Borui Jiang, Yuyang Li
Title: DREAM: Document Recognition with Explicit Adaptive Memory
Abstract: Large multimodal models (LMMs) have shown promising performance for various document recognition tasks. However, LMMs adopt implicit modeling, and the parameters lack interpretability. Inspired by recent advances in human memory and learning research, we propose an explicit multi-scale prototype memory that augments document recognition models, explicitly modeling recurrent layout and stylistic patterns across different spatial resolutions. A Memory Retrieval Mechanism enables local regions to sparsely attend to a few prototypes (e.g., image borders, tilted text); the retrieved compositional factors are concatenated with visual features and passed to the decoder, providing explicit regionwise structural context. Prototype memory consolidation updates and stabilizes prototypes via an attention-weighted exponential moving average (EMA) strategy, while sparsity and anti-collapse regularization promote selective activation and disentanglement. We further adopt hierarchical memory and a scale-adaptive attention module for multi-resolution encoding, trained with a multi-task, entropy-regularized objective. We validate on two tasks: document recognition on the Fox and the self-built DreamDoc datasets, and handwriting recognition on the SCUT-HCCDoc and SCUT-EPT Chinese handwriting datasets. Experimental results show that the proposed method is effective.
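The attention-weighted EMA consolidation described in this abstract could, under some assumptions, look like the sketch below. The abstract does not give the update rule, so the form used here (each prototype moves toward the attention-weighted mean of the features that attended to it) is an assumption, as are the function name `ema_update`, the `momentum` default, and the plain-list data layout.

```python
def ema_update(prototypes, features, attention, momentum=0.9, eps=1e-8):
    """Attention-weighted EMA prototype update (illustrative sketch).

    prototypes: K x D list of lists, features: N x D, attention: N x K
    non-negative weights. All names and shapes are hypothetical.
    """
    updated = []
    for k, proto in enumerate(prototypes):
        # Total attention mass received by prototype k.
        mass = sum(row[k] for row in attention)
        if mass < eps:
            # No region attended to this prototype: leave it unchanged.
            updated.append(list(proto))
            continue
        # Attention-weighted mean of the features assigned to prototype k.
        target = [
            sum(attention[i][k] * features[i][d] for i in range(len(features))) / mass
            for d in range(len(proto))
        ]
        # Exponential moving average toward that mean.
        updated.append([momentum * p + (1.0 - momentum) * t for p, t in zip(proto, target)])
    return updated
```

A higher `momentum` keeps prototypes more stable across batches, while skipping prototypes with no attention mass avoids dragging unused entries toward zero.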
Paperid: 2670,   Poster  
Authors: Jaxon Zhang, Binxin Yang, Hubery Yin, Chen Li, Jing LYU
Title: DRM: Diffusion-based Reward Model With Step-wise Guidance
Abstract: Current mainstream methods of aligning diffusion models with human preferences typically employ VLM-based reward models. However, these reward models, pre-trained for semantic alignment, struggle to capture essential perceptual qualities such as aesthetics, composition, and visual harmony. In this work, we argue that a model capable of high-fidelity generation must possess a profound understanding of these visual attributes. Based on this insight, we introduce the Diffusion-based Reward Model (DRM), a novel paradigm that uses the pre-trained diffusion model as a powerful evaluative backbone. A key advantage of the DRM is its unique ability to assess not only the final image but also the noisy intermediate latents at any stage of the generative process. We leverage this step-wise evaluative capacity in two ways. First, we propose Step-wise GRPO, a reinforcement learning algorithm that provides dense, per-step rewards to resolve the imprecise credit assignment problem in the GRPO algorithm, leading to more stable and effective alignment. Second, we introduce Step-wise Sampling, a novel inference strategy that employs the DRM as a dynamic guide to evaluate multiple generation paths at each step, steering the process towards higher-quality outcomes. Extensive experiments confirm that our approach significantly enhances the final quality of generated images.
Paperid: 2671,   Poster  
Authors: Dong Wang, XIANGYU HE, Xinqi Lyu, Bin Xiao
Title: Breaking Multimodal LLM Safety via Video-Driven Prompting
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual reasoning tasks, such as image and video understanding. Recent studies have introduced several effective image-based jailbreak methods. However, these approaches are often mitigated by pre-defined system prompts and overlook vulnerabilities within the video encoder. In this work, we show that video-based attacks are significantly more effective than image-based ones. Specifically, we find that simply repeating a harmful image across multiple frames to construct a video can bypass the safety mechanisms of MLLMs. Our analysis reveals that unsafe videos are embedded more similarly to safe videos in the model’s representation space than individual harmful images, making them harder to detect. Moreover, videos composed of identical frames are processed more like static images and are more likely to trigger safety defenses compared to videos with diverse frames. Motivated by these findings, we propose an algorithm that injects harmful content into typographic videos by interleaving it with diverse, safety-proximal frames, thereby evading MLLM safety alignment. Extensive experiments demonstrate that our approach achieves state-of-the-art jailbreaking performance on several widely-used MLLMs (e.g., VideoLLaMA-2, Qwen2.5-VL, GPT-4.1, and Gemini-2.5) under 16 different safety policies.
Paperid: 2672,   Poster  
Authors: Jiahao Wang, Bo Sun, Yijing Bai, Vincent Casser, Songyou Peng, Zehao Zhu, Meng-Li Shih, Xander Masotto, Shih-Yang Su, Kanaad Parvate, Tiancheng Ge, Linn Bieske, Dragomir Anguelov, Mingxing Tan, Chiyu “Max” Jiang
Title: Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
Abstract: Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured, in-the-wild video data is incompatible with ADS expecting structured, multi-modal sensor inputs for system validation and training purposes. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite that we refer to as the AV log, which includes multi-view camera images and LiDAR point clouds. A core challenge that arises is the lack of paired training data. We address this by converting real AV logs into dashcam-style videos via 4D Gaussian Splatting (4DGS) reconstruction and novel-view rendering. Sensor2Sensor then utilizes a diffusion architecture to perform the generative conversion. We perform a comprehensive set of quantitative evaluations on the fidelity and realism of the generated sensor data. We demonstrate Sensor2Sensor's practical utility by converting challenging in-the-wild internet and dashcam footage into realistic, multi-modal data formats, further unlocking vast external data sources for AV development.
Paperid: 2673,   Poster  
Authors: Yisheng He
Title: Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction
Abstract: We introduce a feed-forward framework for one-shot animatable mesh head reconstruction that generates high-fidelity, directly animatable 3D head avatars from a single image. Unlike previous work that relies on time-consuming test-time optimization or extensive multi-view data, our method produces complete mesh representations with inherent animatability from a single image in a single forward pass. Our approach employs a dual shape and texture map architecture that simultaneously processes mesh vertices and texture map with extracted image features from a shared transformer backbone, allowing for coherent shape carving and appearance modeling. To prevent mesh collapse and ensure topological integrity during feed-forward deformation, we propose an iterative GRU-based decoding mechanism with progressive geometry deformation and texture refinement, coupled with a novel reprojection-based texture guidance mechanism that anchors appearance learning to the input image. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in reconstruction quality, animation capability, and computational efficiency. Code will be made publicly available.
Paperid: 2674,   Poster  
Authors: Jiwon Kim, SeonHwa Kim, Soobin Park, Eunju Cha, Kyong Hwan Jin
Title: Semantic Alignment for Pose-Invariant Identity Preserving Diffusion
Abstract: Recent T2I diffusion models have evolved to control multiple conditions, including structure, appearance, and text prompt. Despite this progress, training-based methods demand heavy computation, whereas training-free methods often 're-imagine' the subject to satisfy the given structure, thereby compromising identity preservation and attenuating fine textures. We propose SeAl (Semantic Alignment for Pose-Invariant Identity Preserving Diffusion), a novel training-free framework that addresses the 're-imagining' problem from the perspective of 'infusion'. SeAl integrates structure, appearance, and text prompt with three modules: AnchorAlign pre-aligns spatial discrepancies, Reference-guided Appearance Infusion injects identity via semantic matching, and Delta-Bridge leverages the guidance delta to mediate text–appearance conflicts. We demonstrate that our method faithfully reflects all three control factors and dramatically reduces the identity leakage endemic to prior methods. Notably, SeAl excels on challenging datasets where identity preservation typically fails (e.g., distinctive animal features or complex human attire), establishing a novel paradigm for training-free identity preservation in diffusion models.
Paperid: 2675,   Poster  
Authors: Zicheng Zhang, Hongyi Jing, Rui Lv, Shuo Fang, Shiai Zhu, Junying Wang, Chunyi Li, Xiaohong Liu, Chenguang Ma, Guangtao Zhai
Title: Exposing and Evaluating Hallucinations for GUI Grounding
Abstract: Existing GUI benchmarks primarily focus on evaluating models’ comprehensive capabilities but largely overlook hallucination phenomena in grounding tasks, which are crucial to the reliability of GUI understanding. In this work, we expose two major types of hallucinations in GUI grounding: 1) Confusion Hallucination, where distractor elements are mistakenly selected, and 2) Fabricated Hallucination, where nonexistent elements are hallucinated with plausible coordinates. To systematically investigate their origins, we introduce GUIHalluBench, a benchmark comprising two complementary subsets: a parsing subset for assessing structural representation of GUI elements and a hallucination subset for measuring grounding robustness under challenging conditions. This design allows us to associate hallucination patterns with deficiencies in prerequisite abilities: parsing errors are closely tied to both fabricated and confusion hallucinations. Experiments on state-of-the-art models confirm these connections, offering new insights into the root causes of hallucinations and guiding the development of more reliable GUI understanding tools.
Paperid: 2676,   Poster  
Authors: Sen Liang, Fengbin Guan, Youliang Zhang, Xin Li, Zhibo Chen
Title: CoT-Edit: Let CoT Guide Instruction Video Editing
Abstract: Text-driven instruction-based video editing in complex scenes remains challenging: purely textual prompts often fail to capture precise spatial relationships and physical constraints, resulting in target ambiguity and physically implausible outcomes. To address this, we propose a plan-guide-edit framework that explicitly bridges semantic intent and spatial execution. In our framework, a Chain-of-Thought (CoT)-enhanced multimodal large language model (MLLM) serves as a planner, performing structured reasoning over the video and instructions to derive a precise sequence of bounding boxes and attribute-enriched editing directives. These spatial priors then guide a box-conditioned mask generator, transforming ambiguous global retrieval into localized, context-aware refinement and producing masks that more accurately capture object scale, contact relationships, and placement. Building on these spatial and semantic signals, a diffusion-based editor integrates the masks, enriched instructions, and frame features to render high-fidelity edits that remain temporally coherent and spatially well aligned. Trained first in a modular manner and then jointly, our framework achieves superior performance with reduced data requirements, delivering precise localization in scenes with multiple similar objects and physically consistent object additions, and extensive experiments demonstrate state-of-the-art performance over multiple strong baseline methods.
Paperid: 2677,   Poster  
Authors: Johannes Schusterbauer, Jannik Wiese, Nick Stracke, Timy Phan, Björn Ommer
Title: Probabilistic Precipitation Nowcasting with Rectified Flow Transformers
Abstract: Accurate weather forecasts are essential across various domains and are safety-critical in extreme weather conditions. Compared to simulation-based forecasting, data-driven approaches show greater efficiency, enabling short-term, high-resolution nowcasting. In particular, diffusion models proved effective in weather nowcasting due to their strong probabilistic foundation. However, existing methods rely on deterministic compression to reduce the complexity of high-dimensional weather data, limiting their ability to capture uncertainty in the decoding process. In this work, we introduce FREUD, a Frame-wise Encoder and United Decoder model based on rectified flow transformers for efficient compression of spatio-temporal weather data. Frame-wise encoding enables continuous forecast updates, while the unified video decoder ensures temporal consistency. Our uncertainty-preserving first stage allows us to capture aleatoric uncertainty through ensembling, which is particularly beneficial for extreme weather events with high decoding variability. We achieve state-of-the-art performance in precipitation nowcasting with a compact latent-space rectified flow transformer on the SEVIR benchmark and show further performance gains by scaling model size. With FREUD and the latent rectified flow model, we aim to push the boundaries of data-driven weather nowcasting.
Paperid: 2678,   Poster  
Authors: Zhiyang Lu, Ming Cheng
Title: Text-guided Feature Disentanglement for Cross-modal Gait Recognition
Abstract: Gait recognition is a biometric technique that identifies individuals based on their walking patterns, offering advantages in long-range, non-intrusive scenarios. However, real-world scenarios often involve heterogeneous sensing modalities such as LiDAR and RGB cameras, making LiDAR-Camera Cross-modal Gait recognition (LCCGR) a critical yet challenging task due to the substantial modality gap between 2D videos and 3D point cloud sequences. To address this challenge, we propose TCFDNet, a Text-guided Cross-modal Feature Disentanglement Network, which leverages modality-aware textual priors as semantic anchors to guide the learning of disentangled modality-shared representations. Specifically, we construct a Gait Modality Text Dictionary (GMTD) using large language models to generate rich semantic descriptions of gait across modalities and viewpoints. A CLIP-based Multi-grained Feature Encoder then aligns visual and textual features within a unified vision-language space. Furthermore, the Text-guided Feature Disentanglement (TFD) module selects the top-k matched textual descriptions to reconstruct modality-specific representations and derive modality-shared features via residual decomposition and orthogonality constraints. To mitigate the fragility of the disentangled shared features, we propose a Feature Stability Enhancement (FSE) module, which models spatial and channel-wise correlations to improve feature robustness. In addition, a cross-modal patch exchange strategy is introduced to further improve generalization. Extensive experiments on SUSTech1K and FreeGait datasets demonstrate that TCFDNet achieves new state-of-the-art results and validate the effectiveness of the proposed modules.
Paperid: 2679,   Poster  
Authors: Hai Duong, Son Vu, Thanh Le, ThanhVu Nguyen
Title: Verifying Neural Network Robustness with Dual Perturbations
Abstract: Safety-critical deep learning systems must be robust against real-world corruptions combining spatially correlated distortions and independent noise. Current deep neural network verification methods handle these perturbations separately, either checking independent pixel-wise perturbations or restricted convolutional transformations using predefined patterns. This gap prevents assessing robustness under realistic conditions where both perturbation types occur simultaneously. To address these limitations, we propose VeriDou, a framework that introduces: (i) universal convolutional perturbations that enable verification across continuous spatial distortion spaces, and (ii) dual perturbations that capture both convolutional distortions and independent pixel-level variations. Our evaluation on a set of diverse benchmarks with 14340 instances shows VeriDou's dual perturbations approach found substantially more adversarial examples on networks that existing methods claimed to be highly robust. This shows that VeriDou is able to explore a broader range of unsafe regions and thus enhances formal assessment of robustness.
Paperid: 2680,   Poster  
Authors: Jianxun Mi, Xuanhui Zhong, Weisheng Li
Title: Improving Adversarial Transferability with Local Perturbation Augmentation
Abstract: Adversarial examples expose fundamental vulnerabilities within deep neural networks, and their transferability highlights shared weaknesses across diverse models. Existing mainstream attack methods often rely on iterative processes with various strategies to improve transferability, but the limited knowledge of the target model restricts the success of these approaches. In this paper, we reveal that the iterative optimization process tends to overspecialize adversarial perturbations to the local gradient characteristics of the surrogate model, thereby hindering their transferability to other models. To address this limitation, we propose a novel attack method called Local Perturbation Augmentation Attack (LPAA). The key innovation of our approach lies in constructing multiple augmented local subspaces during each iteration, which steers perturbation updates towards a more generalizable direction, effectively reducing over-reliance on the surrogate model. Additionally, to improve the initial performance and overcome sensitivity to initial perturbation, we introduce a dedicated perturbation initialization strategy that ensures the optimization process starts from a direction with greater ability for transferability. Compared with existing random neighborhood sampling strategies, the LPAA serves as an effective approach that leverages the directional characteristics of perturbations to overcome their limitations. Extensive experiments on CNNs and ViTs demonstrate that LPAA consistently generates highly transferable adversarial examples, significantly surpassing the performance of state-of-the-art methods.
Paperid: 2681,   Poster  
Authors: Youqi Pan, Wugen Zhou, Hongbin Zha
Title: Learning Differentiable Hierarchies in 3D Gaussian Splatting
Abstract: Although 3D Gaussian Splatting (3DGS) has achieved impressive performance in real-time rendering, its unordered Gaussians make level-of-detail (LoD) construction and model compression highly challenging, limiting its applicability in customized scenarios. In this work, we propose a learning-based Gaussian hierarchy representation that ranks Gaussians by their contribution to the scene, enabling flexible LoD representations across arbitrary Gaussian counts. We first introduce a unified, continuous formulation and metric for Gaussian hierarchy. Then, we introduce a hierarchy-based modulated rendering method built upon a Differentiable Decreasing Step Function, which enables efficient hierarchy learning while maintaining approximately equivalent rendering. Moreover, we develop a PDF-Guided Active-Region Sampling strategy that encourages the learned hierarchy to become widely distributed within its value range. Our method requires no additional training stages and produces Gaussian hierarchies within training time comparable to classical 3DGS. Experiments on multiple datasets show that our approach achieves performance comparable to or surpassing state-of-the-art methods in both LoD rendering and model pruning.
Paperid: 2682,   Poster  
Authors: Yinhan Zhang, Yue Ma, Fangqiu Yi, Chenyang Qi, Chi Zhang, Kunyu Feng, Zeyu Wang
Title: Tea-Adapter: Teacher Adapter for Efficient Conditional Generation
Abstract: We propose Tea-Adapter, a plug-and-play adapter designed to efficiently integrate conditional knowledge from a smaller teacher model into a larger student video diffusion model. Existing controllable video DiT methods face critical challenges: full fine-tuning of billion-parameter models is extremely expensive, while cascaded ControlNets introduce substantial parameter overhead and exhibit limited flexibility for novel multi-condition compositions. To overcome these issues, Tea-Adapter introduces a novel reversed distillation method, enabling large video diffusion models to inherit precise control capabilities from smaller, efficiently-tuned teacher diffusion models, eliminating full fine-tuning. Moreover, recognizing the intrinsic relationships between different conditions, we replace the cascaded ControlNet design with a Mixture of Condition Experts (MCE) layer. This structure dynamically routes diverse conditional inputs within a unified architecture, supporting both single-condition control and multiple condition combinations without additional training cost. To achieve cross-scale knowledge transfer, we further develop a Feature Propagation Module to ensure efficient and temporally consistent feature propagation across video frames. Experiments demonstrate that Tea-Adapter enables high-fidelity multi-condition video synthesis, making advanced controllable video generation feasible on low-resource hardware and establishing a new efficiency standard for the field.
Paperid: 2683,   Poster  
Authors: Jeimin Jeon, Hyunju Lee, Bumsub Ham
Title: TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts
Abstract: Transformer architecture search (TAS) discovers optimal vision transformer (ViT) architectures automatically, reducing the human effort needed to design ViTs manually. However, existing TAS methods suffer from the feature collapse problem, where subnets within a supernet fail to learn subnet-specific features, mainly due to the shared weights in a supernet, limiting the performance of individual subnets. To address this, we propose TAS-LoRA, a novel method that introduces parameter-efficient low-rank adaptation (LoRA) to enable subnet-specific feature learning, while maintaining computational efficiency. TAS-LoRA incorporates a Mixture-of-LoRA-Experts (MoLE) strategy, where a lightweight router dynamically assigns LoRA experts based on subnet architectures, and introduces a group-wise router initialization technique to encourage diverse feature learning across experts early in training. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate that TAS-LoRA mitigates feature collapse effectively, improving performance over state-of-the-art TAS methods significantly. Our code will be made publicly available upon acceptance.
Paperid: 2684,   Poster  
Authors: Yang Liu, Daxuan Ren, Yijie Ding, Jianmin Zheng, Fang Deng
Title: Bidirectional Query-Driven Generation of Parametric CAD Sketch
Abstract: Learning-based CAD modeling shows great promise in automating parametric design, yet existing approaches often overlook the incremental and state-dependent nature of sketch construction. We present CADSketcher, a query-driven bidirectional framework for completing partial parametric sketches by internalizing the non-linear construction logic of interactive CAD processes. At the core of CADSketcher are two key innovations. First, a bidirectional sketch learner recovers both prior and posterior contexts from arbitrary-span partial sketches via a bidirectional query mechanism, enabling exploration of multiple plausible modeling trajectories. Second, a confidence-guided completion pipeline adaptively determines the expansion direction through a confidence gate and ensures executable instruction generation using a validity compiler, while a progressive context updater preserves sketch consistency throughout the evolving sketch state. In addition, a hybrid positional encoding integrates global modeling progression with local geometric semantics, reinforcing structural coherence during both learning and completion. Extensive experiments demonstrate that CADSketcher achieves superior geometric validity and instruction consistency across diverse sketch completion tasks, offering a robust and interpretable framework toward intelligent CAD automation.
Paperid: 2685,   Poster  
Authors: Van-Nguyen Pham, Duc-Tai Le, Junghyun Bum, Hyunseung Choo
Title: Post-training feature pruning for fundus images classification
Abstract: Deep neural networks have achieved strong performance in fundus image classification, yet their flattened feature representations are often highly redundant. Such redundancy can lead to poor generalization across imaging devices, reduced interpretability, and inefficient use of model capacity. To address this issue, this study proposes a post-training feature pruning framework, termed greedy feature pruning (GFP), which removes weak or redundant dimensions from the flattened features of trained backbones. GFP employs a greedy build-up process guided by performance metrics on the training set, constrained by a minimum feature keeping ratio, to identify compact yet discriminative subsets of features. Experiments are conducted on five public fundus datasets covering multiple tasks, including diabetic retinopathy detection (DDR, Messidor-2), glaucoma detection (PAPILA), multi-label classification (ODIR) and multi-class retinal disease classification (RETINA), using EfficientNetV2, ViT, and CoAtNet as backbones. Results show that GFP consistently improves AUROC and AUPRC across datasets while reducing the number of flattened features by up to 96%. Feature visualizations and quantitative analyses confirm that GFP enhances the compactness and separability of latent features. Moreover, cross-dataset evaluation demonstrates that GFP improves transferability between datasets, indicating better domain robustness. Overall, the proposed GFP framework provides a simple yet effective approach for compressing feature representations and improving both discriminability and generalization in fundus image classification.
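Illustrative note (not from the paper): the greedy build-up with a minimum keeping ratio described in the abstract above could look roughly like the following. This is only a sketch of generic greedy forward selection; the function name, the `score_fn` callback, and the stopping rule are assumptions made for illustration, not the authors' GFP implementation.

```python
import math


def greedy_feature_build_up(num_features, score_fn, min_keep_ratio):
    """Greedy forward selection over feature indices (hypothetical sketch).

    score_fn(indices) is an assumed callback returning a training-set metric
    (higher is better) for a model restricted to the given feature indices.
    """
    # Assumed reading of the "minimum feature keeping ratio" constraint.
    min_keep = max(1, math.ceil(min_keep_ratio * num_features))
    selected = []
    remaining = set(range(num_features))
    best_score = float("-inf")
    while remaining:
        # Score every candidate extension and take the best one
        # (ties resolve to the lowest index because candidates are sorted).
        candidate, candidate_score = max(
            ((f, score_fn(selected + [f])) for f in sorted(remaining)),
            key=lambda pair: pair[1],
        )
        # Stop once the minimum is met and no candidate improves the score.
        if len(selected) >= min_keep and candidate_score <= best_score:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


def toy_score(indices):
    # Toy metric: features 0 and 3 are useful, all others slightly hurt.
    return sum(1.0 if i in (0, 3) else -0.1 for i in indices)


kept = greedy_feature_build_up(5, toy_score, min_keep_ratio=0.2)
```

In this toy run only the two useful features survive; a real `score_fn` would retrain or re-evaluate a classifier head on the selected dimensions.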
Paperid: 2686,   Poster  
Authors: Zhi-Hao Guan, Longfei Huang, Yang Yang
Title: Anchor-Guided Gradient Alignment for Incomplete Multimodal Learning
Abstract: Vision-language pre-training (VLP) has achieved remarkable performance across diverse multimodal learning (MML) tasks. Recently, many efforts have focused on reconstructing missing modalities to improve the adaptability of VLP models in incomplete MML scenarios. However, these approaches overlook the learning imbalance under severe missing-modality conditions, i.e., the optimization process is dominated by reconstructed samples, thereby weakening complete-sample representations. In this paper, we propose a novel ANchor-guided Gradient Alignment (ANGA) framework to address these issues. Specifically, we first retrieve similar instances to reconstruct the missing modalities, thereby alleviating information deficiency. We then introduce an entropy-driven curriculum that progressively integrates reliable reconstructed samples with complete ones to form an optimization anchor, which guides gradient alignment to mitigate learning imbalance. Furthermore, we design a semantic-enhanced adapter that leverages the retrieved instances to generate dynamic prompts, further enhancing the robustness of the VLP model. Extensive experiments on widely used datasets demonstrate the superiority of ANGA over state-of-the-art (SOTA) baselines across various missing-modality scenarios.
Paperid: 2687,   Poster  
Authors: Minguk Kang, Suha Kwak
Title: FlashDecoder: Real-Time Latent-to-Pixel Streaming Decoder with Transformers
Abstract: Recent progress in video generation has shifted large-scale models from convolutional architectures to Diffusion Transformers (DiT), yet latent-to-pixel video decoders remain predominantly convolutional. These decoders rely on heavy 3D convolutions, which slow down streaming generation and require spatial–temporal tiling to handle high-resolution or long-duration outputs. We introduce FlashDecoder, the first Transformer-based latent-to-pixel video decoder designed for streaming. FlashDecoder processes video latents frame-by-frame during both training and inference, applying bidirectional spatial attention within each frame while maintaining causal temporal dependencies through a rolling KV cache. Crucially, causality is enforced by sequential frame processing rather than explicit attention masks, enabling the use of memory-efficient bidirectional attention kernels throughout. This unified streaming approach ensures constant per-frame computation and bounded memory via a fixed-size KV cache with automatic eviction of older frames, enabling stable training at resolutions up to 720p. Integrated into the Wan2.2 video VAE, FlashDecoder matches the reconstruction quality of the convolutional decoder (PSNR 38.38 vs. 38.29; LPIPS 0.046 vs. 0.039) while decoding up to 4x faster (139 FPS at 480p and 69.6 FPS at 720p), achieving real-time high-resolution video decoding on a single H100 GPU.
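Illustrative note (not from the paper): the fixed-size rolling KV cache with automatic eviction mentioned in the abstract above can be pictured with a bounded queue. The sketch below is a toy under stated assumptions: the class name `RollingKVCache`, the helper `decode_stream`, and the choice to store the current frame before attending are all hypothetical, and placeholder tuples stand in for real key/value tensors.

```python
from collections import deque


class RollingKVCache:
    """Holds per-frame (key, value) entries; the oldest frame is evicted first."""

    def __init__(self, max_frames):
        if max_frames < 1:
            raise ValueError("max_frames must be at least 1")
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._frames = deque(maxlen=max_frames)

    def append(self, key, value):
        self._frames.append((key, value))

    def context(self):
        # Everything currently visible to attention, oldest frame first.
        return list(self._frames)

    def __len__(self):
        return len(self._frames)


def decode_stream(latents, max_frames):
    """Process frames in order and record which frame indices each one can see.

    Causality comes from the processing order alone: a frame can only attend
    to entries that were appended before (or together with) it, so no
    attention mask is needed in this toy model.
    """
    cache = RollingKVCache(max_frames)
    visible = []
    for t, _latent in enumerate(latents):
        # Placeholder key/value; a real decoder would project the latent.
        cache.append(("k", t), ("v", t))
        visible.append([key[1] for key, _value in cache.context()])
    return visible


windows = decode_stream(range(5), max_frames=3)
```

With `max_frames=3`, the fourth frame no longer sees frame 0, which is how memory stays bounded regardless of video length.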
Paperid: 2688,   Poster  
Authors: Haicheng Wang, Yuan Liu, Yikun Liu, Zhemeng Yu, Zhongyin Zhao, Yangxiu You, Zilin Yu, Le Tian, Zhou Xiao, Jie Zhou, Weidi Xie, Yanfeng Wang
Title: POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences, especially in long-video and streaming scenarios, poses a major challenge to their scalability and real-world deployment. Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains the optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40 to 1/10 of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive visual reasoning and efficient long-form visual understanding. Model weights will be released upon publication.
Paperid: 2689,   Poster  
Authors: Masaki Kawamura, Nakamasa Inoue, Rintaro Yanagi, Hirokatsu Kataoka, Rio Yokota
Title: PowerCLIP: Powerset Alignment for Fine-Grained Contrastive Pre-Training
Abstract: Contrastive pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics spanning multiple image regions. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees. As this approach increases computational complexity exponentially due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from $\mathcal{O}(2^M)$ to $\mathcal{O}(M)$ with respect to the number of regions $M$, provably approximating the exact loss value with arbitrary precision. Our extensive experiments demonstrate that PowerCLIP outperforms state-of-the-art methods in zero-shot classification and retrieval tasks, underscoring the compositionality and robustness of our approach. Our code will be made publicly available.
Paperid: 2690,   Poster  
Authors: Hao Zhou, Tingjin Luo
Title: Trust-calibrated Collaborative Learning for Long-Tailed Visual Recognition
Abstract: Real-world visual recognition faces the fundamental challenge of long-tailed distributions. While state-of-the-art methods often employ multi-expert models to address different frequency categories, we find that the mutual knowledge distillation used in these models enhances collaboration at the cost of introducing two critical limitations: indiscriminate knowledge transfer leads to bias propagation, where a single expert's error can spread and contaminate others, and error consolidation, where mutual reinforcement of incorrect predictions solidifies erroneous consensus. To overcome these issues, we propose Trust-calibrated Collaborative Learning (TCL). Our framework introduces the trustworthy knowledge orchestration module, which enables reliable distillation and precise collaboration through a knowledge quality gate that blocks erroneous information and a tail-class compensation mechanism that alleviates knowledge scarcity for tail categories. Furthermore, we design a consensus error calibration module that suppresses consensus high-confidence negative classes to correct collective misjudgments and steer optimization in the right direction. Extensive experiments on five long-tailed benchmarks demonstrate that TCL achieves the best performance, raising Top-1 accuracy on CIFAR100-LT to 58.7%, a gain of 2.4% over previous SOTA methods.
Paperid: 2691,   Poster  
Authors: Zekun Qi, Xuchuan Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Wenyao Zhang, XinQiang Yu, He Wang, Li Yi
Title: Humanoid Generative Pre-Training for Zero-Shot Motion Tracker
Abstract: We introduce Humanoid-GPT, the first GPT-style humanoid motion Transformer trained with causal attention on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility–generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks arbitrary humans executing highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.
Paperid: 2692,   Poster  
Authors: Yunsong Wang, Gim Hee Lee
Title: Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM
Abstract: Handling dynamic environments is a significant research challenge in Visual Simultaneous Localization and Mapping (SLAM). Recent research combines 3D Gaussian Splatting (3DGS) with SLAM to achieve both robust camera pose estimation and photorealistic renderings. However, using SLAM to efficiently reconstruct both static and dynamic regions remains challenging. In this work, we propose an efficient framework for dynamic 3DGS SLAM guided by optical flow. Using the input depth and prior optical flow, we first propose a category-agnostic motion mask generation strategy by fitting a camera ego-motion model to decompose the optical flow. This module separates dynamic and static Gaussians and simultaneously provides flow-guided camera pose initialization. We boost the training speed of dynamic 3DGS by explicitly modeling their temporal centers at keyframes. These centers are propagated using 3D scene flow priors and are dynamically initialized with an adaptive insertion strategy. Alongside this, we model the temporal opacity and rotation using a Gaussian Mixture Model (GMM) to adaptively learn the complex dynamics. The empirical results demonstrate our state-of-the-art performance in tracking, dynamic reconstruction, and training efficiency. Our code will be made publicly available upon paper acceptance.
Paperid: 2693,   Poster  
Authors: Peng Su, Xi Yang
Title: Rethinking Glyph Spatial Information in Font Generation
Abstract: Few-shot Font Generation (FFG) aims to create a complete font from a limited number of references, offering significant practical value. However, existing methods neglect glyph spatial information, which leads to two critical limitations. At the pipeline level, distorted rendering introduces spatial bias, impairing vectorization and dataset quality, and this problem is compounded by the lack of unified standards, which undermines a unified benchmark. At the model level, the implicit coupling of shape and position hinders fine-grained optimization and generalization. We address these challenges in the context of Chinese font generation, where glyph complexity demands superior model capability. We first propose a Spatial-Preserving Rendering (SPR) protocol, which eliminates spatial bias and enables accurate vectorization. Alongside this, we release an OFL-licensed Chinese font dataset to establish a unified benchmark. Then, on the technical side, we propose GlyphSpatialNet, a two-stage framework to explicitly model glyph spatial information in pixel space. In the first stage, we design a Shape-Position Decoupling (SPD) architecture and a Gradient Broadcasting Module (GBM) to achieve font style transfer at low resolution. In the second stage, we design Style Detail Enhancement (SDE), which refines the style details for high-resolution outputs. Extensive experiments demonstrate the effectiveness of our approach. Code and dataset are provided in the supplementary materials.
Paperid: 2694,   Poster  
Authors: Yanmin Li, Zhilong Mao, Mao Wang, Lihua Liu, Jibing Wu, Weidong Bao
Title: Prototype-based Causal Intervention for Multi-Label Image Classification
Abstract: Modern multi-label image classification models suffer from a critical reliance on spurious correlations, failing to learn the underlying causal mechanisms. Many causality-inspired methods are impractical, demanding box-level supervision that is rarely available in real-world datasets. Others rely on static confounder dictionaries, which are inherently inflexible and fail to capture complex biases or adapt to feature space changes during training. To address this, we present prototype-based causal intervention (ProCI), a novel framework that approximates the backdoor adjustment using only image-level supervision. It models confounders as learnable contextual prototypes which, unlike traditional prototypes designed for discriminative features, are engineered to represent class-wise co-occurring bias. These prototypes are learned dynamically within a stable memory and leveraged to construct sample-specific bias vectors for an adaptive feature adjustment, effectively counteracting spurious correlations. Experiments on MS-COCO, Pascal VOC, and the challenging Sewer-ML dataset validate our approach. ProCI achieves competitive performance on standard benchmarks while setting a new state-of-the-art on the highly-confounded Sewer-ML. It outperforms the previous best model by a remarkable +5.44 points on the primary F2_CIW metric. These results demonstrate the effectiveness of our approach in mitigating complex real-world biases using only image-level supervision.
Paperid: 2695,   Poster  
Authors: Jiaming Li, Jiacheng Zhang, Zequn Jie, Lin Ma, Ming Li, Xiaonan Luo, Guanbin Li
Title: Cross-Modal Attention Calibration for LVLM Hallucination Mitigation
Abstract: Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding. Despite their success, LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. To address this issue, some approaches have introduced inference-time interventions, such as contrastive decoding, to reduce overreliance on language priors. However, these approaches overlook hallucinations stemming from position bias and spurious inter-modality correlations. In this paper, we propose a Cross-Modal Attention Calibration (CMAC) method to mitigate hallucinations in LVLMs in a training-free manner. In this method, we design an Inter-Modality Decoding (IMD) module to alleviate hallucination by a novel contrastive decoding mechanism. IMD masks the value vectors associated with significant cross-modal attention weights as distortion, which addresses both uni-modality overreliance and misleading inter-modality correlations. Additionally, a Cross-Modal Position Calibration (CMPC) module shrinks the position gap of image tokens, alleviating the position bias in cross-modal attention. Experimental results on diverse hallucination benchmarks validate the superiority of our method over existing state-of-the-art techniques in reducing hallucinations in LVLMs. Our code will be made available.
Paperid: 2696,   Poster  
Authors: Zhonghan Zhao, Yiming Zhang, Wenwei Zhang, Haiteng Zhao, Xingguang Wei, Zhangwei Gao, Kuikun Liu, Yuzhe Gu, Size Wu, Haian Huang, Jianfei Gao, haijun Lv, Demin Song, Yunhua Zhou, Qipeng Guo, Gaoang Wang, Kai Chen
Title: Exploring Visual Pretraining for Learning Language Intelligence
Abstract: While the most fundamental pretraining paradigm typically trains modality-specific models on their respective datasets, the Platonic Representation Hypothesis, which holds that representations eventually align across modalities as data and models scale, suggests an intriguing possibility: large language models (LLMs) could be pretrained on visual corpora to reach parity with text-pretrained models, thereby expanding data sources to break text-scaling bottlenecks and leveraging richer visual cues for more comprehensive corpus understanding. This paper makes the first attempt to demonstrate the feasibility of this implication by introducing Masked Autoregressive Pretraining for Learning language intelligencE (MAPLE), a novel visual pretraining paradigm for LLMs that leverages raw document images to improve language intelligence. MAPLE can integrate masked autoregressive models with various LLM backbones, where the LLMs are incentivized to generate latent hypotheses for the masked regions based on the unmasked regions. We verify MAPLE in the domain of math reasoning with multiple LLM backbones and show that MAPLE consistently surpasses text-only pretraining, by up to 40.2% (relative) in average accuracy across four math reasoning benchmarks. Further analyses show that visually pretrained LLMs learn a shared latent space that aligns document visuals with text and exploits layout and structural cues, supporting visual pretraining as a feasible and scalable route to stronger language models.
Paperid: 2697,   Poster  
Authors: Guangrui Li, Zhengyu Zhu, Yongxin Ge
Title: Mixture of Prototypes for Test-time Adaptive Segmentation
Abstract: Test-Time Adaptive Segmentation (TTA-Seg) aims to adapt a trained segmentation model to test data under distribution shift in an unsupervised manner. Existing approaches typically utilize class-wise prototypes to capture and transfer the source distribution, but inevitably neglect the diversity within source samples. In this paper, we propose a new test-time adaptation paradigm based on the mixture-of-experts (MoE), where domain experts are designed to 1) better capture the source distribution, and 2) dynamically adjust their contribution in test case prediction. Specifically, during source training, prototypes are derived as the class-wise average of source pixel features. We then generate multiple experts by clustering these prototypes, providing each class with several experts with enhanced representativeness. At test time, each pixel's prediction is drawn from all experts' knowledge in an adaptive manner, i.e., a gating network assigns weights according to pixel-expert correlation. To optimize the system, we devise a min-max entropy optimization scheme for the gating network while keeping the rest frozen, minimizing the entropy of the model prediction while maximizing the entropy of expert selection. Consequently, the model is urged to derive confident predictions with effective utilization of domain experts, hence promoting the adaptation. Experiments on two scenarios, Test-time Adaptation (TTA) and the more challenging continual TTA, demonstrate that our approach achieves the new state-of-the-art.
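Illustrative note (not from the paper): the min-max entropy idea in the abstract above, confident predictions together with spread-out expert usage, can be written as a single scalar objective. The sketch below is an assumption-laden toy for one pixel: the function names, the weighting factor `lam`, and the choice to compute the gate entropy per pixel rather than over a batch average are hypothetical and may differ from the authors' formulation.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution; 0*log(0) is treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def min_max_entropy_loss(class_probs, gate_weights, lam=1.0):
    """Toy objective: minimize prediction entropy, maximize expert-selection entropy.

    class_probs  : predicted class distribution for one pixel
    gate_weights : gating distribution over experts for the same pixel
    lam          : assumed trade-off weight between the two terms
    """
    return entropy(class_probs) - lam * entropy(gate_weights)


# Confident prediction with evenly used experts gives the lowest loss ...
good = min_max_entropy_loss([1.0, 0.0], [0.5, 0.5])
# ... while an uncertain prediction relying on a single expert gives the highest.
bad = min_max_entropy_loss([0.5, 0.5], [1.0, 0.0])
```

In a real setup only the gating network's parameters would receive gradients from this loss, in line with the abstract's statement that the rest of the model stays frozen.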
Paperid: 2698,   Poster  
Authors: Sheng Li, Connelly Barnes, Mamshad Nayeem Rizve, Hongwu Peng, Zhengang Li, Ohi Dibua, Alireza Ganjdanesh, Xulong Tang, Yan Kang, Yifan Gong
Title: Content-Aware Dynamic Patchification for Efficient Video Diffusion
Abstract: Diffusion Transformers (DiTs) achieve strong video generation performance but suffer from prohibitive computation cost due to dense spatiotemporal tokenization. Most existing works rely on uniform patchification, tokenizing non-overlapping spatiotemporal patches with a fixed patch size regardless of the underlying content. This content-agnostic tokenization results in substantial redundant computation, especially in visually simple or static areas. To address this inefficiency while preserving the video generation quality, we propose DynaPatch, a fine-grained dynamic patchification framework that adaptively selects patch sizes for each spatiotemporal region based on content complexity. A lightweight router predicts patch sizes directly from the latents encoded by a 3D Variational Autoencoder (VAE), and is jointly optimized with the diffusion model through a diffusion loss, an attention-guided saliency alignment loss, and a token-budget regularizer. Learnable patchify/unpatchify layers integrate seamlessly with standard DiT backbones, allowing flexible tokenization without architectural changes. Experiments demonstrate that DynaPatch can effectively reduce redundant computations while preserving fine details, achieving 1.3–1.8× acceleration with minimal quality degradation. On VBench, DynaPatch attains a Total Score of 83.42 at 30% token reduction, significantly outperforming prior patchification and token pruning approaches. These results indicate that content-aware patchification offers an effective direction for efficient and scalable video diffusion.
Paperid: 2699,   Poster  
Authors: Kaede Shiohara, Toshihiko Yamasaki
Title: Unified Vector Floorplan Generation via Markup Representation
Abstract: Automatic residential floorplan generation has long been a central challenge bridging architecture and computer graphics, aiming to make spatial design more efficient and accessible. While early methods based on constraint satisfaction or combinatorial optimization ensure feasibility, they lack diversity and flexibility. Recent generative models achieve promising results but struggle to generalize across heterogeneous conditional tasks, such as generation from site boundaries, room adjacency graphs, or partial layouts, due to their suboptimal representations. To address this gap, we introduce Floorplan Markup Language (FML), a general representation that encodes floorplan information within a single structured grammar, which casts the entire floorplan generation problem into a next-token prediction task. Leveraging FML, we develop a transformer-based generative model, Floorplan Markup Language Model (FMLM), capable of producing high-fidelity and functional floorplans under diverse conditions. Comprehensive experiments on the RPLAN dataset demonstrate that FMLM, despite being a single model, surpasses the previous task-specific state-of-the-art methods.
Paperid: 2700,   Poster  
Authors: Yannan He, Garvita Tiwari, Xiaohan Zhang, Pankaj Bora, Tolga Birdal, Jan Lenssen, Gerard Pons-Moll
Title: MoLingo: Motion–Language Alignment for Text-to-Human Motion Generation
Abstract: We introduce MoLingo, a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. Recent works perform latent space diffusion, either on the whole latent at once or auto-regressively over multiple latents. In this paper, we study how to make diffusion on continuous motion latents work best. We focus on two questions: (1) how to build a semantically aligned latent space so diffusion becomes more effective, and (2) how to best inject text conditioning so the motion follows the description closely. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close, which makes the latent space more diffusion-friendly. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text–motion alignment. With semantically aligned latents, auto-regressive generation, and cross-attention text conditioning, our model sets a new state of the art in human motion generation on standard metrics and in a user study. We will release our code and models for further research and downstream usage.
Paperid: 2701,   Poster  
Authors: Changhao He, Di Xue, Shuxian Li, Yanji Hao, Xi Peng, Peng Hu
Title: Bootstrapping Multi-view Learning for Test-time Noisy Correspondence
Abstract: Multi-view learning fuses complementary views to improve perception, but real-world deployments often suffer from Test-time Noisy Correspondence (TNC), i.e., cross-view misalignment caused by asynchronous sampling, transient network congestion, or other disturbances. Such misalignment introduces semantic inconsistency and significantly degrades performance. Existing remedies typically estimate view-specific reliability from clean, well-aligned training data and then extrapolate to noisy fusion at inference, resulting in a train-test task gap and reduced robustness against TNC. To bridge this gap, we propose Bootstrapping Multi-view Learning (BML), a plug-and-play framework that explicitly learns to fuse under TNC. Specifically, BML performs in-place TNC bootstrapping to construct a controllable noise-augmented training set that simulates realistic correspondence distortion, thereby eliminating the task gap without external data. Unlike prior uncertainty-based approaches that model reliability in an unsupervised manner, BML presents a reveal-supervised paradigm, wherein a lightweight estimator jointly models intra-view predictive uncertainty (view quality) and inter-view prediction discrepancy (correspondence consistency) to produce calibrated reliability weights guided by both task objectives and bootstrapped supervision. Once deployed, these reliability weights directly modulate fusion, suppressing corrupted views while preserving informative ones. Across 11 benchmarks spanning diverse noise ratios, BML consistently outperforms state-of-the-art baselines and maintains robustness against TNC. Code will be released upon acceptance.
Paperid: 2702,   Poster  
Authors: Yang Zhou, Ping Ni, Jin Wang, Senyun Jia, Jingdan Yan, Kaixiang Huang, Guodong Lu, Jingru Yang, Shengfeng He
Title: Modeling the Visual Ambiguity of Human Sketches
Abstract: Human sketches provide a compact and expressive form of visual communication, but their sparse structural cues, while capturing essential object structures, introduce ambiguity because a single sketch can correspond to multiple plausible images, making cross-domain alignment uncertain and unstable. Such ambiguity fundamentally limits sketch-based vision tasks that rely on precise sketch–image correspondence. To address this challenge, we introduce AmbiScore, a metric that quantifies the ambiguity of sketch–image pairs, and use Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) as a testbed to reveal how ambiguous supervision leads to performance collapse in existing methods. We further propose DisAmb (Disentangling Ambiguity), a framework that explicitly models and mitigates ambiguity through two components: (1) Elastic Matching, which adaptively adjusts supervision strength using AmbiScore, and (2) Purified Matching, which employs ambiguity-agnostic masks to disentangle structure and appearance via shape jigsaw and texture swapping. DisAmb establishes new benchmarks under high ambiguity and provides a robust, transferable supervisory signal for downstream sketch-guided tasks.
Paperid: 2703,   Poster  
Authors: David Novikov, Eilon Vaknin Laufer, Narek Tumanyan, Mark Sheinin
Title: Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction
Abstract: The task of capturing and rendering 3D dynamic scenes from 2D images has become increasingly popular in recent years. However, most conventional cameras are bandwidth-limited to 30–60 FPS, restricting these methods to static or slowly evolving scenes. While overcoming bandwidth limitations is difficult in general scenes, recent years have seen a flurry of computational imaging methods that yield high-speed videos using conventional cameras for specific scenarios (e.g., motion capture and particle image velocimetry). However, most of these methods require modifications to camera optics or the addition of mechanically moving components, limiting them to single-view high-speed capture. Consequently, these methods cannot be readily used to capture a 3D representation of rapid scene motion. In this paper, we propose a novel method to capture and reconstruct a volumetric representation of a high-speed scene using only unaugmented low-speed cameras. Instead of modifying the hardware or optics of each individual camera, we encode high-speed scene dynamics by illuminating the scene with a rapid, sequential color sequence. This results in simultaneous multi-view capture of the scene, in which high-speed temporal information is encoded in the images' color. To construct a high-speed volumetric representation of the dynamic scene, we develop a novel dynamic Gaussian Splatting-based approach that decodes the temporal information from the images. We evaluate our approach on simulated scenes and real-world experiments using a multi-camera imaging setup, showing first-of-a-kind high-speed volumetric scene reconstructions.
Paperid: 2704,   Poster  
Authors: Haowei Sun, Kai Zhou, Hao Gao, Shiteng Zhang, Jinwu Hu, Xutao Wen, Qixiang Ye, Mingkui Tan
Title: Instance-level Visual Active Tracking with Occlusion-Aware Planning
Abstract: Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.
Paperid: 2705,   Poster  
Authors: Arsha Nagrani, Jasper Uijlings, Shyamal Buch, Tobias Weyand, Sudheendra Vijayanarasimhan, Bo Hu, Ramin Mehran, David A. Ross, Cordelia Schmid
Title: Ego-STAR: Spatiotemporal Hints for Egocentric Video Understanding
Abstract: Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g., the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce EgoSTAR, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, for which we also have spatio-temporal mask annotations. Through extensive evaluations, we identify that if frontier models are prompted with hints of 'where' and 'when' to look, we can get substantial improvements in performance. EgoSTAR will be released publicly to foster progress in egocentric reasoning.
Paperid: 2706,   Poster  
Authors: Zhiying Jiang, Raihan Seraj, Marcos Villagra, Bidhan Roy
Title: Efficient Decentralized Diffusion with Heterogeneous Training Objectives
Abstract: Training state-of-the-art diffusion models requires massive computational resources concentrated in tightly-coupled clusters, fundamentally limiting participation to well-resourced institutions. While Decentralized Diffusion Models (DDM) enable training multiple experts in isolation, existing approaches require 1176 GPU-days and homogeneous training objectives across all experts. We present an efficient framework that dramatically reduces resource requirements while supporting heterogeneous training objectives. Our approach combines three key contributions: (1) PixArt-α's efficient AdaLN-Single architecture, reducing parameters while maintaining quality; (2) pretrained checkpoint conversion from ImageNet-DDPM to Flow Matching objectives, accelerating convergence and enabling initialization without objective-specific pretraining; and (3) a training-free inference conversion framework that unifies heterogeneous expert predictions (DDPM and Flow Matching) into a common velocity space without any retraining. Experiments on LAION-Aesthetics demonstrate that our decentralized approach achieves comparable results with 16× compute reduction (72 vs 1176 GPU-days) and 14× data reduction (11M vs 158M images). Our heterogeneous variant mixing DDPM and Flow Matching experts exhibits complementary specialization patterns, improving generation diversity and texture quality despite modest FID increases. By eliminating synchronization requirements and enabling arbitrary objective combinations, our framework democratizes large-scale generative model training, allowing contributors with diverse resources to participate using consumer GPUs requiring only 20-48 GB VRAM.
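To illustrate what a shared velocity space can look like, the sketch below converts a noise (epsilon) prediction into a velocity under the linear path x_t = (1 - t) * x_0 + t * eps. The path, the helper name `eps_to_velocity`, and the toy values are assumptions made for illustration; the abstract does not state which conversion the authors use.

```python
def eps_to_velocity(x_t, eps_pred, t):
    """Convert a noise prediction to a velocity (hypothetical helper).

    Assumes the linear path x_t = (1 - t) * x_0 + t * eps. Under that path
    the velocity is v = eps - x_0, and solving the path equation for x_0
    gives v = (eps - x_t) / (1 - t).
    """
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1) for this conversion")
    return [(e - x) / (1.0 - t) for x, e in zip(x_t, eps_pred)]


# Toy check with a known clean sample and known noise.
x0 = [2.0, -1.0]
eps = [-1.0, 0.5]
t = 0.25
x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]
velocity = eps_to_velocity(x_t, eps, t)
```

With a perfect noise prediction, `velocity` matches `eps - x0` elementwise, which is the target a flow-matching expert would regress under the same assumed path.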
Paperid: 2707,   Poster  
Authors: Jiayu Xiong, Jing Wang, Qi Zhang, Wanlong Wang, Jun Xue
Title: Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization
Abstract: Audiovisual deepfake localization demands interval-level outputs that serve as temporal evidence. Despite recent progress, symmetric fusion under single-sided or asynchronous forgeries propagates cross-modal noise, degrading high-precision localization. We present IaMSB, an inconsistency-aware multimodal Schrödinger Bridge (SB) that jointly estimates cross-modal consistency and performs interval-level localization. Unlike diffusion models, SB minimizes path-distribution discrepancy and yields consistency scores without explicit noise injection or denoising. With the Schrödinger Bridge (SB), IaMSB unifies consistency estimation, cross-modal information selection, and bridge-step scheduling in one framework. Specifically, a lightweight coarse bridge first proposes candidate intervals and estimates cross-modal consistency; these statistics select cross-modal witness signals and allocate bridge steps asymmetrically across modalities. A refinement bridge then performs step-tuned fusion and outputs refined, time-aligned intervals. IaMSB anticipates single-sided and asynchronous forgeries and, using bottlenecked cross-modal interaction with step allocation, suppresses noise transfer and avoids unnecessary iterations. Across benchmarks, IaMSB stabilizes strict-IoU boundary precision, raising AP@0.95 by 3-10%, and yields improved high-precision localization, particularly for single-sided forgeries.
Paperid: 2708,   Poster  
Authors: Hanzhong Guo, Jie Wu, Jie Liu, Yu Gao, Zilyu Ye, Linxiao Yuan, Xionghui Wang, Yizhou Yu, Weilin Huang
Title: Leveraging Verifier-Based Reinforcement Learning in Image Editing
Abstract: While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we propose the Verifier-Based Reasoning Reward Model (RRM), which breaks instructions into verifiable principles, evaluates the edited images against each principle, and aggregates fine-grained scores to reduce hallucinations and provide more interpretable criteria. We argue the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework to build a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and apply it to the downstream editing task. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each, and aggregates these checks to provide an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a “cold-start” to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.
Paperid: 2709,   Poster  
Authors: WU Sitong, Haoru Tan, Bin Xia, Xichen Zhang, Jingyao Li, Shaofeng Zhang, Xiaojuan Qi, Bei Yu, Jiaya Jia
Title: Unlocking Token Rewards via Training-Free Reward Attribution
Abstract: In this paper, we propose an extremely efficient, training-free method to extract token-level reward signals directly from an existing deep reward model. Our core idea is to attribute the overall process reward to individual tokens by estimating each token's influence. This influence is defined as the change in the final macroscopic reward (e.g., the process reward) when a token is replaced with a semantically null token. Naively calculating this influence is computationally infeasible, requiring N forward passes through the PRM for an N-token sequence. We overcome this bottleneck by proposing a highly efficient gradient-based estimator. Specifically, we use a first-order Taylor approximation, which simplifies the influence calculation to the inner product of the difference between the token embedding and the null token embedding, and the gradient of the reward with respect to the token embedding. This requires only a single forward and backward pass. The resulting token-level rewards enable standard RL algorithms to perform precise credit assignment without requiring additional reward model training. Experiments on challenging reasoning benchmarks demonstrate that our method substantially improves policy optimization efficiency and enhances the generalization of LLM reasoning capabilities. Our P2T outperforms the outcome reward by +4.9% on MathVista for Qwen2.5-VL-7B-Instruct, and +11.5% on AIME24 for Qwen2.5-Math-7B, while converging around 4× faster. Our results underscore the importance of fine-grained reward shaping and provide a simple, plug-and-play solution to unlock token-level supervision from existing PRMs.
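The first-order estimator described above can be sketched on a toy problem. Everything below is an illustrative assumption rather than the authors' implementation: the scalar reward is a hypothetical `tanh` of summed projections with an analytic gradient (standing in for one backward pass through a reward model), and all function names are made up for this example.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def reward(embeddings, w):
    """Toy scalar reward: tanh of the summed projections of all tokens onto w."""
    return math.tanh(sum(dot(w, e) for e in embeddings))


def reward_grad(embeddings, w):
    """Analytic gradient of the toy reward w.r.t. each token embedding."""
    scale = 1.0 - reward(embeddings, w) ** 2  # derivative of tanh at the current sum
    return [[scale * wj for wj in w] for _ in embeddings]


def taylor_token_rewards(embeddings, null_embedding, w):
    """First-order estimate: (e_i - e_null) . dR/de_i for every token i."""
    grads = reward_grad(embeddings, w)
    return [
        dot([ej - nj for ej, nj in zip(e, null_embedding)], g)
        for e, g in zip(embeddings, grads)
    ]


def exact_token_influences(embeddings, null_embedding, w):
    """Reference: reward drop when token i is swapped for the null token (N extra passes)."""
    base = reward(embeddings, w)
    return [
        base - reward(embeddings[:i] + [null_embedding] + embeddings[i + 1:], w)
        for i in range(len(embeddings))
    ]


w = [0.5, -0.2]
tokens = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.05]]
null = [0.0, 0.0]
approx = taylor_token_rewards(tokens, null, w)
exact = exact_token_influences(tokens, null, w)
```

On this toy input the estimates in `approx` stay within a few thousandths of the brute-force values in `exact`; the gap grows as the reward becomes more nonlinear between a token embedding and the null embedding.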
Paperid: 2710,   Poster  
Authors: Weilong Yan, Li Haipeng, Hao Xu, Nianjin Ye, Yihao Ai, Shuaicheng Liu, Jingyu Hu
Title: LaS-Comp: Zero-shot 3D Completion with Latent–Spatial Consistency
Abstract: This paper introduces LaS-Comp, a zero-shot and category-agnostic approach that leverages the rich geometric priors of 3D foundation models to enable 3D shape completion across diverse types of partial observations. Our contributions are threefold: First, LaS-Comp harnesses these powerful generative priors for completion through a complementary two-stage design: (i) an explicit replacement stage that preserves the partial observation geometry to ensure faithful completion; and (ii) an implicit refinement stage that ensures seamless boundaries between the observed and synthesized regions. Second, our framework is training-free and compatible with different 3D foundation models. Third, we introduce Omni-Comp, a comprehensive benchmark combining real-world and synthetic data with diverse and challenging partial patterns, enabling a more thorough and realistic evaluation. Both quantitative and qualitative experiments demonstrate that our approach outperforms previous state-of-the-art approaches. Code and data will be released upon acceptance.
Paperid: 2711,   Poster  
Authors: Jong Wook Kim, Suyong Bahk, TaeHwa Lee, HyunDong CHO, Donghyun Kim, Sung-Chang Lim, Jin Soo Choi, Hui Yong Kim
Title: Block-based Learned Image Compression without Blocking Artifacts
Abstract: Learned Image Compression (LIC) outperforms traditional codecs but suffers from excessive peak memory usage when handling high-resolution images. Consequently, block-based LIC has been studied to reduce peak memory and computational costs; however, this approach often introduces blocking artifacts that degrade visual quality. To mitigate this, the JPEG-AI standard introduced a patch-based scheme where overlapped blocks are coded independently using empirically determined overlap sizes. However, the experimental search for optimal overlaps is time-consuming and does not guarantee blocking-free reconstruction. To address these limitations, we propose an analytic framework modeling overlap propagation through convolutional and transposed convolutional layers to precisely determine the minimal overlaps for blocking-free reconstruction. Based on the minimum overlaps calculated, we provide the block-based implementation methodology for the convolution networks used in most CNN-based LIC models. Applied to four CNN-based LIC models on 4K images partitioned into 256×256 blocks, our method achieves rate–distortion performance identical to full-image coding while reducing average peak memory usage to 18.7% (encoder) and 17.9% (decoder), with an average computational cost of only 4.23% and 2.34%, respectively. Notably, the proposed block-based framework does not require any re-training of the original model. Furthermore, it can also be applied to most CNN-based image processing neural networks without any performance degradation.
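The idea of propagating overlap through layers can be illustrated with standard receptive-field arithmetic. The sketch below is not the paper's analytic framework: it covers plain convolutions only (no transposed convolutions), assumes odd kernels with padding of `kernel_size // 2` and block boundaries aligned to the stride, and returns a sufficient rather than provably minimal halo. The function name is hypothetical.

```python
def sufficient_input_halo(layers):
    """Return an input-side halo (pixels per side) sufficient for exact block outputs.

    `layers` lists (kernel_size, stride) for each plain convolution in forward
    order. Walking backwards through the stack, a layer that must produce
    `halo` extra output pixels per side needs at most
    halo * stride + kernel_size // 2 extra input pixels per side.
    """
    halo = 0
    for kernel_size, stride in reversed(layers):
        if kernel_size % 2 == 0 or stride < 1:
            raise ValueError("sketch assumes odd kernels and positive strides")
        halo = halo * stride + kernel_size // 2
    return halo


three_convs = sufficient_input_halo([(3, 1), (3, 1), (3, 1)])
downsampling_stack = sufficient_input_halo([(5, 2), (3, 1)])
```

Here three stacked 3×3 stride-1 convolutions need a halo of 3 pixels, and a 5×5 stride-2 convolution followed by a 3×3 stride-1 convolution needs 4, which shows how striding multiplies the overlap demanded by later layers.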
Paperid: 2712,   Poster  
Authors: Yuanbo Wang, Xinning Wang, Zhaoxuan Zhang, Changlong Wang, qianchen xia, Xiaopeng Wei, Xin Yang
Title: TouchDream: 3D Object Completion through Imagined Touch
Abstract: Point cloud completion is crucial for robust 3D perception but remains challenging due to its ill-posed nature. Coarse-to-fine methods can lead to unconstrained local guesses in the absence of key structures, whereas diffusion-based approaches may introduce geometric inconsistencies. To overcome these limitations, we present TouchDream, a novel framework that leverages a diffusion model to 'dream' of tactile sensing on object surfaces, which reformulates the sensing process as a learnable generative modeling task. Unlike visual cues, tactile data provides rich local geometry that can be directly converted into 3D space for point fusion, offering a powerful guide for detail-aware completion. Specifically, our approach generates compact tactile latent representations conditioned on coarse points and sampled touch poses. A touch-guided refinement module then leverages touch features to optimize coarse points. Extensive experiments show that our TouchDream model achieves state-of-the-art performance, significantly enhancing the recovery of local details.
Paperid: 2713,   Poster  
Authors: Manish Bhurtel, Danda Rawat
Title: RMAE-ProGRess: Advancing Semantic Segmentation in Unstructured Environments
Abstract: Semantic segmentation in unstructured environments presents unique challenges due to irregular terrain, occlusions, and complex spatial layouts. While structured settings (e.g., urban scenes) have been widely studied, segmentation in unstructured settings remains relatively underexplored, both in terms of standardized benchmarking and architectural design. In this work, we propose an encoder-decoder-based semantic segmentation architecture that integrates a Reduced Masked Autoencoder (RMAE) as the encoder, a Feature-to-Pyramid (F2P) neck, and a novel decoder called ProGRess. The ProGRess decoder introduces Progressive Leapwise Fusion (PLF) for top-down multi-scale fusion of non-contiguous feature maps, a Lightweight Channel Attention gate with Residuals (LCAR) module, and a Bottleneck Feature Fusion (BFF) block for compact refinement. We establish comprehensive baselines by benchmarking state-of-the-art CNN and transformer-based models on challenging unstructured environment datasets, viz. RELLIS-3D, its coarse-grained variant, and RUGD. Our architecture achieves state-of-the-art performance with 57.41% mIoU on RELLIS-3D, 45.63% mIoU on RUGD, and 78.95% mIoU on RELLIS-3DC, while maintaining competitive parameter count and VRAM usage.
Paperid: 2714,   Poster  
Authors: Zimu Zhang, Yucheng Zhang, Xiyan Xu, Ziyin Wang, Sirui Xu, Kai Zhou, Bing Zhou, Chuan Guo, Jian Wang, Yu-Xiong Wang, Liangyan Gui
Title: HandX+: Scaling Up Text-Conditioned Bimanual Motion Generation
Abstract: Text-conditioned human motion and video generation have progressed rapidly, yet realistic hand motion and bimanual interaction remain significantly underexplored. Existing whole-body models often overlook the fine-grained details required for natural dexterous behavior, such as finger articulation, contact timing, and inter-hand coordination. We aim to close this gap by introducing a hand-centric animation framework. As a foundation, we consolidate large-scale motion data from diverse sources into a unified corpus with rigorous animation quality control. Through this process, we identify a limitation in most of the existing resources: the absence of high-fidelity bimanual motion data that capture nuanced finger dynamics and inter-hand collaboration. To remedy this, we collect a new dataset designed to enrich these underrepresented aspects. To scale motion-language alignment automatically, rather than relying on large language models to directly reason over raw motion sequences, we propose a decoupled paradigm. It extracts representative motion features, such as contact events and finger flexion, and then leverages an LLM's reasoning to generate fine-grained, semantically rich descriptions aligned with these features. Building on our corpus and annotations, we develop benchmark models using diffusion and FSQ-based architectures and enable versatile conditioning modes, including standard text-conditioned generation, hand-reaction synthesis, motion inbetweening, keyframe-guided generation, and long-horizon temporal composition. Experiments show that our approach achieves strong text alignment, high-quality dexterous motion, and accurate contact prediction, supported by newly designed metrics tailored for hand animation. We additionally observe clear scaling behavior: larger models trained on larger, higher-quality datasets produce markedly more semantically coherent bimanual motions. All data will be released to support future research.
Paperid: 2715,   Poster  
Authors: Timothy Ossowski, Sheng Zhang, Qianchu Liu, Guanghui Qin, Reuben Tan, Tristan Naumann, Junjie Hu, Hoifung Poon
Title: Open-Med-Reasoner: Data Recipes for Multimodal Medical Reasoning
Abstract: High-quality, carefully curated data is a cornerstone of training medical large language models, as it directly impacts both generalization and robustness to unseen clinical tasks. We investigate strategies for training and data curation to develop a robust multimodal reasoning model in the medical domain. Our work focuses on supervised fine-tuning (SFT) and explores data recipes that leverage structured reasoning traces. Using our proposed data recipe, we scale experiments to a dataset of over 8 million examples and 6.8 billion response tokens, achieving state-of-the-art performance among open-source models across diverse out-of-distribution medical benchmark tasks. Our results further indicate that curating a high-quality, diverse training dataset with varying structured reasoning trace lengths enables the fine-tuned model to self-calibrate its reasoning trajectory lengths based on the downstream task, without explicit supervision. We present key insights, describe the data curation strategy, and outline next steps toward developing robust medical vision-language reasoning systems.
Paperid: 2716,   Poster  
Authors: Linhan Cao, Wei Sun, Xiangyang Zhu, Kaiwei Zhang, Jun Jia, Yicong Peng, Dandan Zhu, Guangtao Zhai, Xiongkuo Min
Title: Generalizable Video Quality Assessment via Weak-to-Strong Learning
Abstract: Video quality assessment (VQA) seeks to predict the perceptual quality of a video in alignment with human visual perception, serving as a fundamental tool for quantifying quality degradation across video processing workflows. The dominant VQA paradigm relies on supervised training with human-labeled datasets, which, despite substantial progress, still suffers from poor generalization to unseen video content. In this work, we explore weak-to-strong (W2S) learning as a new paradigm for advancing VQA without reliance on human-labeled datasets. We first provide empirical evidence that a straightforward W2S strategy allows a strong student model to not only match its weak teacher on in-domain benchmarks but also surpass it on out-of-distribution (OOD) benchmarks, revealing a distinct weak-to-strong effect in VQA. Building on this insight, we propose a novel framework that enhances W2S learning from two aspects: (1) integrating homogeneous and heterogeneous supervision signals from diverse VQA teachers (including off-the-shelf VQA models and synthetic distortion simulators) via a learn-to-rank formulation, and (2) iterative W2S training, where each strong student is recycled as the teacher in subsequent cycles, progressively focusing on challenging cases. Extensive experiments show that our method achieves state-of-the-art results across both in-domain and OOD benchmarks, with especially strong gains in OOD scenarios. Our findings highlight W2S learning as a principled route to break annotation barriers and achieve scalable generalization in video quality assessment.
Paperid: 2717,   Poster  
Authors: Tianliang Qi, Xinhang Song, Yuyi Liu, Shuqiang Jiang
Title: Rethinking Visual Rearrangement from A Diffusion Perspective
Abstract: Rearranging disarrayed objects to their intended goal states requires the agent to comprehend the changes that have occurred in the scene and to reason about the process of these changes. To address this, we propose a novel perspective on the visual rearrangement task, drawing inspiration from the diffusion processes in molecular thermodynamics. We model the room shuffle and unshuffle stages as the forward and reverse processes of diffusion. In contrast to conventional methods that rely on scene modeling and differential comparisons, our approach provides insight into the intrinsic evolution process between the goal and initial states of the scene, which allows for a more reasonable rearrangement of objects through fine-grained and progressive denoising steps with high confidence. By analyzing the task objectives, we represent the scene via spatial distributions of objects and model the visual rearrangement process using a diffusion bridge model. Building upon this, we introduce the Diffusion Rearrangement model, which takes point cloud data as input, fits it into Gaussian mixture distributions to represent the states of objects, and predicts the rearrangement target through an iterative denoising transformer. Experimental results on the RoomR dataset demonstrate the effectiveness of our approach.
Paperid: 2718,   Poster  
Authors: Sihong Huang, Jiaxin Wu, Dongmei Jiang, Yi Cai, Yaowei Wang, Xiaoyong Wei
Title: Compositional Transformation Reasoning for Composed Video Retrieval
Abstract: Composed Video Retrieval (CoVR) aims to retrieve a target video given a reference video and a textual modification describing the desired change. The core challenge lies in modeling compositional multimodal transformations, i.e., how objects, actions, and scenes evolve across video and language modalities in response to fine-grained textual edits. Existing methods address this issue by training on large-scale video–text–video triplets or by generating dense textual descriptions to capture subtle visual differences. However, these supervised approaches often rely on noisy web-scale data and dataset-specific correspondences, leading to overfitting and limited generalization in diverse or fine-grained scenarios, while also failing to effectively model compositional and temporal transformations. To overcome these limitations, we propose a zero-shot, fine-grained transformation reasoning framework based on Multimodal Large Language Models (MLLMs). Our method decomposes the compositional transformation into three complementary reasoning dimensions, i.e., entity, action, and scene, and performs pairwise candidate reasoning to explicitly capture semantic evolution over time. Furthermore, we introduce a recall-oriented multi-objective candidate selection module that identifies high-quality retrieval targets by jointly balancing visual, textual, and multimodal similarities before transformation reasoning. Experiments on EgoCVR and WebVid-CoVR demonstrate the effectiveness of our method over state-of-the-art approaches under the zero-shot setting, with R@1 improvements of +5.8 and +10.8, respectively.
Paperid: 2719,   Poster  
Authors: Valter Piedade, Lalit Manam, Masashi Yamazaki, Pedro Miraldo
Title: Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling
Abstract: Visual SLAM is one of the most fundamental problems in computer vision, with direct applications to real-time localization tasks such as AR/VR, robotics, and 3D scene reconstruction. Although significant progress has been made in both sparse and dense approaches, real-time monocular SLAM remains challenging, particularly in the uncalibrated setting, where existing methods are often inefficient and lack modularity. In this paper, we present a new visual SLAM pipeline implemented from scratch in C++ that explicitly leverages the spatio-temporal structure of the scene for improved localization, and is designed to be modular so that off-the-shelf components can be easily integrated. We introduce a temporal representation based on a buffer of recent keyframes that preserves short-term scene continuity. To complement this, we incorporate a spatial representation based on a 3D cell-based scene model, enabling efficient retrieval of relevant 3D points from previously reconstructed regions. Leveraging recent feed-forward geometry estimators, our hybrid design combines sparse keypoint-based localization with a dense anchor-point–driven spatial representation. This integration allows us to achieve real-time performance (exceeding 80 FPS) and a substantial efficiency improvement compared to existing uncalibrated monocular SLAM pipelines, while maintaining or improving localization accuracy.
Paperid: 2720,   Poster  
Authors: Sangbeom Lim, Seoung Wug Oh, Gabriel Huang, Heeji Yoon, Seungryong Kim, Joon-Young Lee
Title: VideoMaMa: Mask-Guided Video Matting via Generative Prior
Abstract: Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present the Video Mask-to-Matte Model (VideoMaMa), which converts coarse segmentation masks into pixel-accurate alpha mattes by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in terms of robustness on in-the-wild videos. These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research. All models and the MA-V dataset will be publicly released.
Paperid: 2721,   Poster  
Authors: Mengqi Yuan, Gengyun Jia, Bing-Kun Bao
Title: Self-Critical Distillation Network for Video-based Commonsense Captioning
Abstract: Video-based commonsense captioning aims to generate captions for the video content while providing multiple commonsense about the underlying events. Existing approaches rely on constructing a "video → content caption → commonsense" reasoning chain, which generates visually ungrounded commonsense and neglects inter-category commonsense correlations. Firstly, the existing reasoning chain induces the model's excessive reliance on content caption when generating commonsense, resulting in generic outputs with limited visual relevance. Secondly, the reasoning chain adopts multiple isolated decoders for commonsense generation, which fails to leverage the correlations between different categories of commonsense. To address these limitations, we introduce a novel self-critical distillation network (SCD-Net), which optimizes the reasoning chain by enhancing visual reasoning and establishing inter-category commonsense correlations. Specifically, on the one hand, we introduce self-critical learning and design a reward function to allow the model to refine its output. This mechanism incentivizes the model to maximize the utilization of visual information, thus improving the model's capacity for visual comprehension. On the other hand, we propose a joint reasoning distillation framework that fosters mutual inference among diverse commonsense categories. In this framework, we incorporate the cascaded decoder and knowledge distillation strategy to facilitate inter-category commonsense knowledge transfer while maintaining the fairness of the testing. Our experiments on the large-scale Video-to-Commonsense dataset demonstrate that our approach performs favorably against state-of-the-art methods. The code will be released soon.
Paperid: 2722,   Poster  
Authors: Baoquan Zhang, Zhehao Yu, Lisai Zhang, Kenghong Lin, Tianran Chen, Yuxi Sun, Yunming Ye, Yao He
Title: S$^{2}$FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain
Abstract: Parameter-Efficient Fine-Tuning (PEFT) is a key technique for adapting a large pretrained model to downstream tasks by fine-tuning only a small number of parameters. Recent methods based on Fourier transforms have further reduced the number of fine-tuned parameters by fine-tuning only a few spectral coefficients. Their basic assumption is that the weight change ΔW is a spatial-domain matrix with a sparse spectrum. However, in this paper, we observe that the spectrum of the weight change is not sparse, but instead follows a power-uniform-like distribution. This fact implies that fine-tuning only a few spectral coefficients is insufficient to accurately model a weight change ΔW with a uniform spectrum. To address this issue, we propose to seek an invertible transformation that can transform a latent spatial-domain matrix with a sparse spectrum into the weight change, and then perform PEFT in this sparse spectrum domain with few spectral coefficients, called S$^{2}$FT. To seek such a transformation, we first pre-estimate a coarse weight change as a prior. Then, inspired by the observation that sparse spectra often correspond to locally smooth spatial structures, we regard this transformation as a row and column rearrangement operation on the pre-estimated weight change that smooths spatial structures while keeping the structure information of neurons. Finally, we propose to solve the rearrangement search problem in a simple nearest neighbor search manner, thereby obtaining the invertible transformation. Extensive results show that our S$^{2}$FT achieves superior performance using only 0.08% training parameters.
Paperid: 2723,   Poster  
Authors: Junqi Liao, Yaojun Wu, Chaoyi Lin, Zhipin Deng, Li Li, Dong Liu, Xiaoyan Sun
Title: Content-Adaptive Hierarchical Hyperprior for Neural Video Coding
Abstract: While neural video codecs (NVCs) have recently demonstrated superior performance over traditional codecs through end-to-end learning, existing approaches primarily focus on architectural enhancements and coding module design, with limited exploration into optimizing hierarchical structures, specifically quality and reference configurations. Current hierarchical structure optimization methods face two major limitations: (1) insufficient content-adaptive optimization, and (2) disjointed handling of quality and reference structures. To overcome these challenges, we propose a novel NVC framework that introduces content-adaptive hierarchical structure optimization through a hierarchical hyperprior derived from the current frame. Our NVC integrates two key components: (1) a hierarchical hyperprior extracted from the original frame to enable content-aware adaptation of the hierarchical structure; and (2) an adaptor within the hierarchical hyperprior codec combined with a dual-reference scheme, guided by the hyperprior, to jointly optimize quality and reference structures. By leveraging this content-adaptive hierarchical structure, our NVC achieves state-of-the-art rate-distortion performance, outperforming the previous leading NVC method DCVC-FM with BD-rate reductions of 15.51% and 12.20% relative to VTM-23.4 low-delay B (LDB) under intra-period settings of -1 and 32, respectively.
Paperid: 2724,   Poster  
Authors: Quan Zhang, Zeqiang Cai, Peiming Zhao, Jingze Wu, Cailun Wu, Hongbo Chen, Jianhuang Lai
Title: View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification
Abstract: Aerial-Ground Person Re-Identification (AGPReID) remains highly challenging due to drastic viewpoint variations between drones and fixed cameras. Existing methods typically follow a view-invariant paradigm, aligning shared features across views to achieve robustness. However, the view-invariant paradigm inherently enforces part-level alignment, which ignores view-specific cues and discriminative identity information. To this end, this work proposes ViSA (View-aware Semantic Alignment), a view-aware framework that achieves cross-view semantic consistency and comprises an Expert-driven Token Generation Module (ETGM) and a Dual-branch Local Fusion Module (DLFM). Technically, the former constructs a set of view-aware experts to generate adaptive semantic queries that perceive viewpoint-specific patterns, while the latter leverages graph reasoning to extract and align local regions responsive to different experts under varying viewpoints. Extensive experiments on three large-scale AGPReID benchmarks, AG-ReID.v2, CARGO, and LAGPeR, demonstrate that ViSA consistently achieves superior performance, with a notable 10.06% mAP improvement on the challenging CARGO cross-view protocol. The code will be available.
Paperid: 2725,   Poster  
Authors: Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang, Sirui CHEN, Wenkai Cheng, Kanghao Chen, Hongfei (Faye) Zhang, Zixin Zhang, Rongjin Guo, Yu Cheng, Ying-Cong Chen
Title: TiViBench: Benchmarking Think-in-Video Reasoning for Video Generation
Abstract: The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.
Paperid: 2726,   Poster  
Authors: Vinayak Gupta, Chih-Hao Lin, Shenlong Wang, Anand Bhattad, Jia-Bin Huang
Title: Generalizable Sparse-View 3D Reconstruction from Unconstrained Images
Abstract: Reconstructing 3D scenes from sparse, unposed images remains challenging under real-world conditions with varying illumination and transient occlusions. Existing methods rely on scene-specific optimization with appearance embeddings or dynamic masks, requiring extensive per-scene training and failing under sparse views. Moreover, evaluations on limited scenes raise questions about generalization. We present GenWildSplat, a feed-forward framework for sparse-view outdoor reconstruction that requires no per-scene optimization. Given unposed internet images, GenWildSplat predicts depth, camera parameters, and 3D Gaussians in a canonical space using learned geometric priors. An appearance adapter modulates appearance for target lighting conditions, while semantic segmentation handles transient objects. Through curriculum learning on synthetic and real data, GenWildSplat generalizes across diverse illumination and occlusion patterns. Evaluations on PhotoTourism and a new 20-scene MegaScenes benchmark demonstrate state-of-the-art feed-forward reconstruction quality, achieving real-time inference without test-time optimization.
Paperid: 2727,   Poster  
Authors: Yida Niu, Xinhai Chang, Xin Liu, Ziyuan Jiao, Yixin Zhu
Title: AutoMoMa: Scalable Coordinated Mobile Manipulation Trajectory Generation
Abstract: Mobile robots need coordinated whole-body motion to perform household tasks effectively. Current mobile manipulation datasets rely on expensive teleoperation or slow planning methods, limiting available data to hundreds of demonstrations. This data scarcity severely constrains the development of generalizable learning-based policies. Here, we demonstrate that GPU-accelerated planning generates up to 5,000 episodes per GPU hour, over 80× faster than existing methods. Our AutoMoMa pipeline produces 500K diverse, physically valid whole-body motions across 300 household scenes and multiple robot embodiments, compared to previous datasets limited to narrow robot-scene pairs with a few hundred demonstrations. Downstream validation demonstrates consistent policy improvements with large-scale training data. This work provides the first scalable solution to the mobile manipulation data bottleneck. By enabling massive dataset generation, AutoMoMa accelerates progress toward general-purpose household robots capable of complex coordination tasks.
Paperid: 2728,   Poster  
Authors: Xuwei Qian, Jinghui Zhang, Yuchuan Tan, Wenbo Huang, Zhen Wu, Shen Zhou, LiSha Gao, Ding Ding, Fang Dong
Title: OS-Fed: One Snapshot Is All You Need
Abstract: Reducing communication overhead in federated learning (FL) is challenging but crucial for large-scale distributed privacy-preserving machine learning. Unfortunately, directly compressing model updates often leads to sub-optimal convergence due to information loss, while increasing local computation can cause model divergence. Hence, this paper proposes a drastically different approach that adheres to the maxim that "a picture is worth a thousand words". We observe that the entire gradient information from local training can be effectively reconstructed from a compact, image-like representation. Based on this observation, we propose a novel approach, OS-Fed, which performs One-Shot Federated Learning by transmitting only a single, compact snapshot (comprising an image and a set of learnable labels) per round. To realize this approach, OS-Fed presents new snapshot synthesis techniques to (1) target the accumulated update of a trajectory segment to tackle gradient noise, (2) design a multi-grid snapshot that decouples conflicting gradient directions, and (3) incorporate error compensation to maintain training stability under extreme compression. Extensive experiments on CV and NLP benchmarks show that OS-Fed reduces communication costs by 1.5-16× compared to state-of-the-art algorithms, resulting in 18-45% faster convergence.
Paperid: 2729,   Poster  
Authors: Nan Jiang, Yunhao Li, Lexi Pang, Zimo He, Siyuan Huang, Yixin Zhu
Title: MotionMaster: Generalizable Text-Driven Motion Generation and Editing
Abstract: Text-driven human motion generation struggles with complex multi-action sequences and precise editing tasks due to limited training data diversity, inadequate motion representations, and fragmented generation pipelines. We present MotionMaster, a framework that addresses these challenges. First, we introduce MotionGB, a 10,000-hour motion dataset created from 400 hours of manually verified motion capture data, enriched with multi-level descriptions, then expanded through spatial-temporal editing while maintaining precise motion-text correspondence. Second, we develop a motion representation method that encodes local frame-wise features into discrete tokens while employing sequence-level reconstruction to preserve global trajectory coherence. Third, we fine-tune a pre-trained multimodal LLM with motion and language tokens in a shared embedding space, enabling end-to-end understanding of motion semantics. We also propose a technique to address unbalanced motion semantics in the dataset. Evaluated using a Gemini-based scorer validated against human judgments, MotionMaster achieves state-of-the-art zero-shot motion generation, with a 41.6% relative improvement over baselines in semantic consistency for long multi-action sequences and a 20.8% relative improvement in coordinating complex body part specifications for spatial composition tasks. These results indicate strong generalization across language and motion modalities.
Paperid: 2730,   Poster  
Authors: Jiawei Cao, Junyi Feng, Jiashen Hua, Ziheng Huang, Bing Deng, Kaijie Wu, Chaochen Gu, Jieping Ye
Title: Illuminating Visual Identity in Universal Multimodal Embeddings
Abstract: Universal Multimodal Embeddings (UMEs) aim to unify various modalities and tasks into a shared representation space. In recent years, this field has witnessed substantial progress driven by the development of Multimodal Large Language Models (MLLMs). However, a crucial capability, visual identity discrimination, remains underexplored in existing UME methods, despite its critical role in a wide range of tasks, including instance retrieval, re-identification, and identity preservation in AI-generated content (AIGC). To bridge this gap, we propose a unified formulation for visual identity discrimination and introduce MIEB (Multimodal Visual Identity Embedding Benchmark), a large-scale benchmark curated from both real-world and synthetic datasets to support evaluation and training. Furthermore, we present a simple yet effective learning framework that jointly optimizes general multimodal and visual identity representations through a carefully designed identity-aware sampling mechanism. Extensive experiments demonstrate that our approach successfully endows UMEs with strong identity discrimination capability and maintains competitive general multimodal performance. We believe this work not only illuminates a critical yet neglected capability, but also takes a step toward more holistic universal multimodal embeddings.
Paperid: 2731,   Poster  
Authors: Haifeng Wu, Wei Long, Shuhang Gu, Lixin Duan, Wen Li
Title: iSplat: Iterative Learning for Fine-Grained Gaussian Splatting
Abstract: Recent advances in feed-forward 3D Gaussian splatting have demonstrated remarkable efficiency by reconstructing scenes in a single pass. However, the reconstruction fidelity of these methods lags behind that of traditional optimization-based approaches, which gradually correct reconstruction flaws through a lengthy iterative process. In this paper, we leverage the strengths of both paradigms and introduce iSplat, a novel framework that reformulates reconstruction as an iterative feed-forward process involving multiple (typically three) passes. Central to iSplat is a recurrent GRU-based optimizer that refines both geometry and appearance in a synergistic loop. To address geometric inaccuracies, we propose an uncertainty-driven depth refinement strategy that progressively narrows the search space for each Gaussian based on its estimated uncertainty from the previous step. To further improve appearance details, we design a region-aware enhancement mechanism that applies targeted multi-view and monocular feature aggregation to resolve ambiguities in both overlapping and non-overlapping areas. We validate iSplat's robustness and generalization on in-domain (RealEstate10K, ACID) and cross-dataset (DTU, ACID) benchmarks. With only 42.6M parameters, iSplat surpasses DepthSplat (354M) on RealEstate10K (PSNR: 27.67 vs. 27.47 dB). Crucially, on the cross-dataset DTU benchmark, it further boosts the PSNR by 2.88 dB (18.26 vs. 15.38 dB), showcasing exceptional generalization. These results highlight the significant potential of iterative refinement to overcome the inherent limitations of one-shot approaches.
Paperid: 2732,   Poster  
Authors: Bingliang Zhang, Wenda Chu, Yizhuo Li, Linjie Yang, Yisong Yue, Katie Bouman, Yang Song, Qiushan Guo
Title: SpeeDiff: Scalable Pixel-Anchored End-to-End Latent Diffusion Model
Abstract: We present Scalable Pixel-anchored End-to-end Diffusion (SpeeDiff), a latent diffusion method that jointly trains the VAE and the diffusion model from scratch. In principle, joint training allows the diffusion loss gradient to directly guide the VAE encoder, encouraging the formation of a generation-friendly latent space and potentially yielding faster convergence than the conventional two-stage approach with a pretrained frozen VAE. However, a naive end-to-end implementation severely degrades performance, as unrestricted backpropagation of the diffusion loss leads to latent space collapse. Our main technical contribution is a simple yet effective Tweedie Pixel Reconstruction (TPR) loss, which provides additional pixel-level feedback by decoding a predicted clean latent from an intermediate noisy state using Tweedie's formula, thereby alleviating collapse. Furthermore, our method enables jointly scaling a fully transformer-based architecture and enhances representation alignment within the end-to-end framework. Our SpeeDiff-XL model achieves over 140× and 61× faster training compared to Vanilla SiT and REPA, respectively, while attaining an FID of 1.50 without guidance on ImageNet 256×256 generation. With a more efficient 32× compressed VAE, our model further reaches an FID of 1.53 without guidance on ImageNet 512×512 generation.
Paperid: 2733,   Poster  
Authors: Zijie Chen, Guiyun Fan, Zhaoxing Yang, Rong Ding, Haiming Jin
Title: μVLM: A Vision Language Model for μNPUs
Abstract: The proliferation of low-power intelligent processors with integrated Neural Processing Units (NPUs), called μNPUs, has created new opportunities for on-device generative AI, benefitting end devices like smart wearables and small robots. However, deploying Vision-Language Models (VLMs) on μNPUs is severely hindered by stringent memory constraints and limited operator support. To bridge this critical gap, we propose μVLM, the first lightweight-oriented VLM architecture designed for μNPUs. It comprises our proposed OverMod encoder and AttSSM decoder. OverMod is a lightweight dynamic convolutional network inspired by biomimetic vision, incorporating our novel Global Spatial Modulation mechanism to enable adaptive, high-fidelity feature extraction using only NPU-friendly operators. AttSSM leverages a highly efficient State Space Model (SSM) core, augmented with multi-scale feature fusion and a Global Context Dynamic Modulation mechanism, to perform robust sequential modeling. Furthermore, we introduce a coordinated full-parameter quantization strategy that preserves precision across the encoder-decoder boundary, alongside hand-optimized operators for unsupported modules like SSMs. μVLM achieves a competitive CIDEr score of 117.8 on the COCO Karpathy test split and, for the first time, demonstrates the feasibility of millisecond-level VLM inference on a μNPU platform.
Paperid: 2734,   Poster  
Authors: Kedar Tatwawadi, Parisa Rahimzadeh, Zhanghao Sun, Zhiqi Chen, Ziyun Yang, Sanjay Nair, Divija Hasteer, Oren Rippel
Title: What Matters in Practical Learned Image Compression
Abstract: One of the major differentiators unlocked by learned codecs relative to their hard-coded traditional counterparts is their ability to be optimized directly to appeal to the human visual system. Despite this potential, a perceptual yet practical image codec is yet to be proposed. In this work, we aim to close this gap. We conduct a comprehensive study of the key modeling choices that govern the design of a practical learned image codec, jointly optimized for perceptual quality and runtime, including several novel techniques within the ablations. We then perform performance-aware neural architecture search over millions of backbone configurations to identify models that achieve the target on-device runtime while maximizing compression performance as captured by perceptual metrics. We combine the various optimizations to construct a new codec that achieves a significantly improved tradeoff between speed and perceptual quality. Based on rigorous subjective user studies, it provides 2.3-3× bitrate savings against AV1, AV2, VVC, ECM and JPEG-AI, and 20-40% bitrate savings against the best learned codec alternatives. At the same time, on an iPhone 17 Pro Max, it encodes 12MP images in as little as 230ms and decodes them in 150ms, faster than most top ML-based codecs run on a V100 GPU.
Paperid: 2735,   Poster  
Authors: Timothy Schaumlöffel, Martina G. Vilas, Gemma Roig
Title: Mechanisms of Object Localization in Vision–Language Models
Abstract: Visually-grounded language models (VLMs) are highly effective in linking visual and textual information, yet they often struggle with basic classification and localization tasks. While classification mechanisms have been studied more extensively, the processes that support object localization remain poorly understood. In this work, we investigate two representative families, LLaVA-1.5 and InternVL-3.5, using a suite of mechanistic interpretability tools, including token ablations, attention knockout, and causal mediation analysis. We find that localization is driven by a containerization mechanism in which object-aligned tokens define the spatial extent of the object, while internal structure is largely ignored. Only a very small set of attention heads mediates the causal effect for both classification and localization, concentrating in early–mid layers for LLaVA and mid–late layers for InternVL. The two tasks share some early processing but ultimately depend on largely distinct specialized heads. Overall, we provide the first layer- and head-level account of localization in VLMs, revealing narrow computational pathways that can guide future model design and grounding objectives.
Paperid: 2736,   Poster  
Authors: Jinhyung Park, Navyata Sanghvi, Erica Weng, Shawn Hunt, Shinya Tanaka, Hironobu Fujiyoshi, Kris Kitani
Title: Grounded Latents for Entity-Centric 4D Scene Generation
Abstract: Although recent work has explored generative modeling of 3D or 4D driving scenes, most approaches operate on dense voxel-based representations, which are computationally expensive and struggle to maintain temporal or structural consistency. These methods often produce blurred or merged entities (e.g., cars, trucks, pedestrians) and lack fine-grained control over individual scene elements. We propose to perform generative modeling in a compact, entity-centric latent space, where each grounded 3D latent represents a semantically meaningful local region of the scene. This formulation enables precise, consistent control of both foreground and background elements while preserving geometric detail. We further extend this representation to 4D by learning a motion diffusion model for both ego and dynamic actors, conditioned on the generated 3D scene, and by propagating the grounded latents through time. Our framework produces physically consistent and temporally coherent 4D scenes, supporting controllable and realistic generation.
Paperid: 2737,   Poster  
Authors: Soye Kwon, Keonyoung Lee, Dahuin Jung, Jaekoo Lee
Title: FEAT: Fashion Editing and Try-On from Any Design
Abstract: Fashion design aims to express a designer’s creative intent and to depict how garments interact with the human body. Recent generative approaches condition on multimodal inputs to support garment editing and enable virtual try-on. However, existing methods still (i) confine design to garment-related images, excluding creative design sources such as artwork, abstract imagery, and natural photographs, and (ii) cannot support complete outfits, including accessories. We present FEAT (Fashion Editing and Try-On from Any Design), a method that enables editing and try-on across both garments and accessories using diverse design sources. To achieve this, we introduce Disentangled Dual Injection (DDI), which takes both apparel and non-apparel design sources and selectively injects design cues via content and style disentanglement. Furthermore, we propose Orthogonal-Guided Noise Fusion (OGNF), a training-free mechanism that removes residual garments via orthogonal projection and applies region-specific noise strategies to enable virtual try-on for both garments and accessories. Extensive experiments demonstrate that FEAT achieves state-of-the-art performance in design flexibility, prompt consistency, and visual realism.
Paperid: 2738,   Poster  
Authors: Liuzhuozheng Li, Yue Gong, Shanyuan Liu, Zanyi Wang, Dengyang Jiang, Liebucha Wu, Bo Cheng, Yuhang Ma, Dawei Leng, Yuhui Yin
Title: RefTON: Person-to-Person Virtual Try-On with Unpaired Visual References
Abstract: We introduce RefTON, a flux-based person-to-person virtual try-on framework that enhances garment realism through unpaired visual references. Unlike conventional approaches that rely on complex auxiliary inputs such as body parsing and warped masks, or require finely designed extraction branches to process various input conditions, RefTON streamlines the process by directly generating try-on results from a source image and a target garment, without the need for structural guidance or auxiliary components to handle diverse inputs. Moreover, inspired by human clothing selection behavior, RefTON leverages additional reference images (the target garment worn by different individuals) to provide powerful guidance for refining texture alignment and preserving garment details. To enable this capability, we built a dataset containing unpaired reference images for training. Extensive experiments on public benchmarks demonstrate that RefTON achieves competitive or superior performance compared to state-of-the-art methods, while maintaining a simple and efficient person-to-person design.
Paperid: 2739,   Poster  
Authors: Gaoyang Zhang, Xinguo Liu
Title: LaRP: Efficient Multi-View Inpainting with Latent Reprojection Priors
Abstract: The task of multi-view inpainting necessitates 3D consistency in the inpainted images. Most prior methods first employ single-view 2D inpainting and then enforce multi-view consistency in a post-hoc 3D optimization stage, which leads to undesirable artifacts and lengthy optimization times. The existing single-stage method, MVInpainter, uses video priors and is pose-free, making it less suitable for inputs beyond video sequences. In this paper, we propose a framework that trains an inpainting model to condition on the explicit and reliable multi-view correspondences from a 3D foundation model. Central to our framework is a cross-view conditioning architecture, LaRP, carefully designed to utilize both the generative prior of a pretrained diffusion inpainting model and the reprojected cross-view appearance latents. We additionally propose a scalable data pipeline for stable training of LaRP. Extensive experiments demonstrate that LaRP outperforms prior methods in 3D consistency and achieves novel view synthesis quality competitive with the state of the art, while being ∼50× faster.
Paperid: 2740,   Poster  
Authors: Guanting Guo, Shenglong Hu, Kaihua Zhang, Guangcan Liu, Min Xia
Title: Generalizable Co-Salient Object Detection via Mixed Content-Style Modulation
Abstract: This paper presents a generalizable CoSOD framework via mixed content-style modulation, termed CoMCS, to enhance the robustness of the model to unseen domains. CoMCS, consisting of a mixed content modulator (MCM), a mixed style modulator (MSM), and a collaborative semantic contrast module (SCM), effectively extracts scene structure priors and augments the source domain styles to bridge the domain gap between the source and the unseen domains. Specifically, CoMCS first utilizes the CLIP model to extract conceptual knowledge associated with the semantic classes in the whole scene, resulting in multi-class semantic embeddings that are domain-invariant. Subsequently, the MCM models the semantic relationships between the prototypes of co-salient objects and the multi-class semantic embeddings through the cross-attention mechanism, effectively capturing domain-invariant scene structure priors that aid in reducing scene distribution shift in unseen domains. Meanwhile, to alleviate domain perturbations encountered during testing, the MSM addresses the uncertainty associated with domain shifts by synthesizing feature statistics, such as mean and standard deviation, during training to simulate new stylistic characteristics, thus achieving data augmentation within the source domain. Finally, to reduce the ambiguity of the co-salient object representations within test data from unseen domains, the SCM employs a uniform loss function to ensure that the learned prototypes are uniformly distributed within the hyperspherical space, further enhancing the domain generalization capabilities of the framework. In addition, to further verify the generalization ability of CoMCS to unseen domains, we construct an unseen-domain benchmark dataset (UND) that selects a variety of image groups with unseen classes from CoCA, CoSOD3k, and CoSal2015. Extensive evaluations on the four benchmark datasets demonstrate that CoMCS performs favorably against a variety of state-of-the-art methods.
Paperid: 2741,   Poster  
Authors: Weiran Pan, Wei Wei, Wenfeng Xie
Title: Debiased Sample Selection for Learning with Noisy Labels
Abstract: Existing methods for learning with noisy labels (LNL) predominantly rely on the small-loss trick, assuming that low-loss samples are more likely to be correctly labeled. While effective, this strategy suffers from two overlooked confirmation biases: (1) Class-level confirmation bias: samples from easy-to-learn classes tend to have lower losses, leading to over-selection of easy samples while ignoring hard ones; (2) Instance-level confirmation bias: mislabeled samples with spuriously low loss are mistakenly treated as clean, forcing the model to memorize wrong labels. Both biases accumulate over training and degrade performance. To mitigate these issues, we propose Marginal Distribution Adjustment (MDA) and Candidate Class Selection (CCS). MDA dynamically reshapes the model’s predicted class distribution toward uniformity, ensuring fairer sample selection across classes. CCS leverages training dynamics to identify likely true labels and removes them from the classification task, preventing memorization of incorrect annotations while converting weakly related labels into useful supervision. Both MDA and CCS are plug-and-play modules. Extensive experiments show that integrating MDA and CCS into either existing sample selectors or advanced LNL pipelines consistently enhances performance on both CIFAR-10/100 with synthetic noise and real-world datasets (CIFAR-N, Clothing1M, WebVision), demonstrating their broad applicability in LNL methods. Our code will be publicly available.
Paperid: 2742,   Poster  
Authors: Ted Lentsch, Santiago Montiel-Marín, Holger Caesar, Dariu M. Gavrila
Title: TerraSeg: Self-Supervised LiDAR Foundation Model for Ground Segmentation
Abstract: LiDAR perception is fundamental to robotics, enabling machines to understand their environment in 3D. A crucial task for LiDAR-based scene understanding and navigation is ground segmentation. Existing methods are either handcrafted for specific LiDAR configurations or require costly per-point manual labels, limiting generalization and scalability. We introduce TerraSeg, establishing the first self-supervised LiDAR foundation model for ground segmentation. We train TerraSeg on OmniLiDAR, a unified large-scale dataset that aggregates and standardizes LiDAR data from nine major public benchmarks, spanning over 20 million raw scans and 11 distinct sensor models, providing unprecedented diversity for learning a generalizable ground model. OmniLiDAR is pseudo-labeled by our PseudoLabeler, a novel self-supervised module that generates high-quality ground/non-ground labels through per-scan runtime optimization. Without any manual labels, TerraSeg achieves state-of-the-art results on nuScenes, SemanticKITTI, and Waymo Perception, and delivers close-to-real-time performance. Our code and models will be publicly released upon paper acceptance.
Paperid: 2743,   Poster  
Authors: Wang Ma, Hanjing Wang, Yufei Zhang, Darsha Udayanga, Qiang Ji
Title: Towards Knowledge-augmented Bayesian Deep Learning For Computer Vision
Abstract: Bayesian deep learning (BDL) integrates Bayesian inference with deep learning, improving predictive performance while enabling principled uncertainty quantification. However, existing BDL methods often rely on non-informative random priors, limiting the benefits of Bayesian inference. In contrast, knowledge-augmented deep learning explicitly injects domain knowledge during training, yet lacks a probabilistic foundation. In this paper, we propose a knowledge-augmented BDL framework that integrates domain knowledge both as an informative prior and as an adaptive likelihood under a unified two-stage hybrid formulation. In the first stage, we learn a knowledge-informed prior $p(\theta \mid \mathcal{K})$ by pre-training a model to satisfy domain-specific constraints. In the second stage, we perform Bayesian inference on task data with an adaptive knowledge likelihood $p(\mathcal{K} \mid \theta, \mathcal{D})$, which dynamically enforces these constraints during optimization. This unified framework enables knowledge to guide both initialization and training, significantly improving prediction accuracy, robustness, adaptation, and uncertainty estimation. Experiments on various computer vision tasks, including semi-synthetic and real-knowledge scenarios, demonstrate that our two-stage framework consistently outperforms state-of-the-art Bayesian and knowledge-augmented baselines.
Paperid: 2744,   Poster  
Authors: Luoxi Jing, Dianxi Shi, YuShe Cao, Yuanze Wang, Junze Zhang, Yuning Cui, Mengzhu Wang
Title: AIMDepth: Asymmetric Image-Event Mamba for Monocular Depth Estimation
Abstract: Monocular depth estimation is critical for applications like autonomous driving and robotics. The complementary properties of the event and image modalities motivate fusion-based methods for robust depth estimation. However, existing fusion methods rely on convolutional or attention-based architectures, which either struggle with global dependencies or incur high computational cost, limiting their suitability for long-sequence modeling in depth tasks. Besides, effective image-event fusion remains a key challenge, as most existing methods directly fuse features without addressing the domain gap and differences in representational characteristics between raw events and images, leading to semantic bias and degraded performance. In this work, we propose AIMDepth, an Asymmetric Image-Event Mamba framework for monocular depth estimation, built entirely on state space models to ensure linear computational complexity and accurate prediction. To address input-domain misalignment, we introduce a Spectral Cross-modal Prior Guidance (SCPG) module that performs bidirectional prior injection at the input level. To mitigate representational imbalance between sparse events and dense images, we design an Asymmetric Modal-aware Encoder (AME) that allocates separate encoding paths for each modality and facilitates feature-level alignment tailored to their distinct information densities. To further enhance fusion, we develop a Modality-interactive Local Refinement (ModiLocal) module that enables hierarchical interaction and fine-grained alignment through SSM-based modeling. Extensive experiments on public datasets demonstrate that AIMDepth achieves state-of-the-art performance and strong robustness in complex environments.
Paperid: 2745,   Poster  
Authors: Soo Won Seo, Kyungchae Lee, Hyungchan Cho, Taein Son, Nam Ik Cho, Jun Won Choi
Title: Mining Instance-Centric Vision–Language Contexts for Human–Object Interaction Detection
Abstract: Human–Object Interaction (HOI) detection aims to localize human–object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision–Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods.
Paperid: 2746,   Poster  
Authors: Jinzhou Tang, Sidi Liu, Waikit Xiu, weixing chen, Keze Wang
Title: LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map
Abstract: A fundamental challenge in embodied AI is verifying whether agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This is critical as foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (like topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both episodic experiences and a semantic cognitive map. We then introduce Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive objective designed to train this architecture. ST-CRL leverages spatio-temporal cues from cognitive queries generated through annotated spatio-temporal context in simulation to build sample pairs, thereby forming the internal cognitive map from the agent's experiences. Experiments demonstrate that our method achieves 2%-3.5% gains in zero-shot generalization on both the standard VLN-CE and VSI-Bench benchmarks. We also demonstrate that our proposed cognitive map has high self-consistency.
Paperid: 2747,   Poster  
Authors: Wenxuan Ge, Qu Hongyu, Rui Yan, Guo-Sen Xie, Yazhou Yao, Xiangbo Shu, Jinhui Tang
Title: Condensed Test-Time Adaptation of VLMs for Action Recognition
Abstract: Test-time adaptation for video understanding, which enables vision-language models (VLMs) to generalize to downstream tasks such as action recognition, has demonstrated substantial value in real-world applications. Existing memory-based methods typically build a visual cache from high-confidence test videos and perform inference via a two-step modality mapping chain, i.e., vision-vision and vision-text. However, due to the asymmetry of the two mappings, the chain exhibits non-transitivity, hindering the generalization of VLMs. To this end, we propose a novel training-free Condensed Dynamic Adapter (ConDA) for action recognition, which leverages vision-text alignment to guide vision-vision alignment. It first selects semantic patches based on the semantic activation probability obtained from the vision-text alignment (Probability-based Semantic Patch Selection, PSPS), and then adaptively constructs spatial-temporal video tubes based on patch-level visual similarity (Adaptive Tube Construction, ATC). We conduct extensive experiments on seven benchmarks with different backbones and baselines. The quantitative results demonstrate that ConDA is compatible with arbitrary VLMs and generalizes well across complex scenarios, such as long-term and egocentric scenarios. In addition, qualitative analyses showcase the interpretability of ConDA in terms of capturing semantic cues.
Paperid: 2748,   Poster  
Authors: Tobias Dorszewski, Jens Hjortkjær
Title: Seeing Conversations: Communication Context Identification in Egocentric Video
Abstract: In everyday conversations, humans effortlessly recognize communication partners using visual cues such as gaze or head orientation. Replicating this social reasoning in computer vision is challenging, especially in dynamic, multi-person settings. We introduce Communication Context Identification (CCI) in egocentric vision: Given a first-person video sequence, determine which individuals are engaged in communication with the camera wearer. To support CCI, we collected a challenging large-scale dataset comprising 68.9 hours of egocentric video captured across diverse multi-person, multi-conversation scenarios. We propose CoCoNet, a temporal interaction model for CCI that tracks social dynamics via attention across individuals over long time scales. CoCoNet flexibly handles varying group sizes, maintains predictions through occlusions, and performs robustly even with limited temporal input. Leveraging long temporal contexts, it achieves 96% balanced accuracy on CCI. Performance varies with group size and spatial scene layout, highlighting the importance of dataset diversity. Our work advances vision-based conversational awareness, enabling applications in assistive hearing that use egocentric video to enhance individuals in the user’s conversation group.
Paperid: 2749,   Poster  
Authors: Donglai Xiang, Vismay Modi, Rishit Dagli, Ty Trusty, Gilles Daviet, Anka H. Chen, Nicholas Sharp, David I.W.
Title: FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes
Abstract: We present a novel formulation for mesh-free, reduced-order simulation of deformable hyperelastic objects. Existing work in reduced-order elastodynamic simulation represents the input geometry by either meshes, which can be difficult to obtain due to challenges in scanning and triangulating complex shapes, or by neural fields that require per-shape optimization. We propose to adopt a Reproducing Kernel Particle Method (RKPM) representation, which enables the construction of reduced-order skinning weights by solving a generalized eigensystem on the Hessian matrix of the elastic energy. We demonstrate that this formulation not only leads to a 40× training speedup compared with the per-shape optimization of neural fields, but also achieves lower simulation error when evaluated against the converged results of the finite element method. We show our simulation results on a wide variety of objects in different representations including meshes and Gaussian splats, as well as the application of our method in the downstream task of robot simulation.
Paperid: 2750,   Poster  
Authors: Xinyang Wang, Qian Liu, WENJIE DING, Zhao Yang, Wei Li, Chang Liu, Bailin Li, Kun Zhan, XianPeng Lang, Wei Chen
Title: Unifying Language-Action Understanding and Generation for Autonomous Driving
Abstract: Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language–action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine generation method (C2F) that efficiently decodes the action sequence, saving 86% inference time. Experiments on closed-loop driving benchmarks show consistent gains in instruction following accuracy and driving performance, alongside reduced inference latency.
Paperid: 2751,   Poster  
Authors: jiahan huang, Ran Ran, Junming Hou, Zihao Chen, Xiaofeng Cong, Junling Li, Liang-Jian Deng
Title: Spatial-Spectral Residuals Informed Diffusion Neural Operator for Pan-sharpening
Abstract: Pan-sharpening, a fundamental image preprocessing technique in remote sensing, aims to generate spatially and spectrally enriched multispectral imagery by integrating complementary information from texture-rich panchromatic (PAN) images and paired low-resolution multispectral (LRMS) counterparts. Although recent generative diffusion models have achieved impressive fusion quality, these performance gains often come with substantial computational costs, rendering them impractical for resource-constrained scenarios common in remote sensing applications. This work introduces a function-space diffusion model built upon a neural operator architecture that achieves compelling performance with promising efficiency. Specifically, our framework replaces the standard attention-based denoising backbone with a Galerkin-type neural operator, significantly reducing computational complexity while maintaining excellent representational capacity. Furthermore, by explicitly integrating pixel-wise spatial-spectral consistency residuals into each reverse diffusion step, our method establishes a fine-grained, closed-loop guidance mechanism that dynamically calibrates spatial details and spectral fidelity throughout the generation process. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our approach over state-of-the-art methods.
Paperid: 2752,   Poster  
Authors: Chunpu Xu, Zhixuan Liang, Tianshuo Yang, Chi-Min Chan, Yang Xiao, Jessie Wang, Xiaokang Yang, Yao Mu
Title: MoEActok: A MoE-based Action Tokenizer for Vision-Language-Action Models
Abstract: Recent works on vision-language-action (VLA) models have made great progress in exploring action tokenizers that convert continuous control signals into discrete tokens to align with LLM/VLM training paradigms. These approaches typically train a single tokenizer over entire manipulation trajectories, which often comprise multiple distinct skills and thus pose a challenging optimization trade-off. To address this issue, we introduce MoEActok, a novel action tokenizer that employs a mixture-of-experts (MoE) quantizer to produce skill-aware discrete representations for VLA models. MoEActok utilizes a clustering-driven MoE VQ-VAE mechanism in which each expert specializes in a particular skill. The key components are: (a) an action-skill decoupling strategy that uses k-means clustering to group action chunks, aligning clusters having similar skills; (b) a skill-aware training paradigm that augments VLA models with skill-conditioned context, improving skill grounding; and (c) an adapter that projects shared encoder representations into skill-specific latent spaces for specialized quantization, and subsequently harmonizes the heterogeneous quantized representations back into a unified space for coherent reconstruction by the shared decoder. We evaluate MoEActok-based VLA models against multiple prior action tokenizer baselines in the RoboTwin and Simpler-Env simulators, and further assess zero-shot transfer on three real-world tasks. Across both simulated and real-world settings, MoEActok-based VLA substantially outperforms existing discrete tokenization methods.
Paperid: 2753,   Poster  
Authors: Zheng Liu, Zijian He, Huiguo He, Weizhi Zhong, Yejun Tang, Huan Yang, Kun Gai, Guanbin Li
Title: SpatialDiff: 3D-Aware Object Movement via Implicit Spatial Modeling
Abstract: Although recent advances in image editing allow impressive manipulation of objects, existing methods still struggle to handle spatial manipulation in complex scenes, such as when objects span different depth layers or are partially occluded. Most image editing methods focus solely on prior information from 2D datasets, emphasizing planar features while lacking support for spatial positional structures. Even approaches that incorporate explicit positional information fail to capture true 3D spatial relationships, thus limiting accurate object movement in complex scenes. In this paper, we present SpatialDiff, a method that effectively captures 3D spatial structures, enabling precise and consistent object movements in complex scenes. Our core innovations are twofold: (1) Implicit 3D Spatial Modeling, which introduces 3D prior knowledge and enables the model to internally build a comprehensive understanding of the three-dimensional spatial structure; and (2) Global Spatial Supervision, which constrains the latent spatial features to enable the model to perceive changes in object spatial positions caused by editing operations. Experimental results demonstrate that our method significantly improves the accuracy and fidelity of spatial manipulation in complex scenes.
Paperid: 2754,   Poster  
Authors: shihao Zou, Wei Wei
Title: CoRiM: Conflict-driven Risk Minimization for Dynamic Multimodal Fusion
Abstract: Dynamic multimodal fusion methods lack robust theoretical guidance for handling modal conflicts and inconsistent data quality. While recent theory-based works correlate weights with indirect scalar proxies (e.g., loss or confidence), this paradigm struggles to comprehensively capture the risk driven by direct distribution inconsistencies. In this paper, we propose a Conflict-driven Risk Minimization (CoRiM) dynamic fusion paradigm. Specifically, we redefine dynamic fusion as a principled, per-sample, direct risk minimization task. To this end, we first design a novel, differentiable Modality Conflict Risk (MCR) function, \mathcal{R}(w), which quantifies risk by directly modeling fused uncertainty and inter-modal consistency. Second, we identify that minimizing \mathcal{R}(w) is fundamentally a non-convex constrained optimization problem over the probabilistic simplex. To efficiently solve this specific challenge, we innovatively introduce the projection-free Frank-Wolfe (FW) algorithm, as it is perfectly suited for optimization on the simplex. We prove that our designed \mathcal{R}(w) possesses L-smoothness, which provides theoretical guarantees for the convergence of the FW algorithm on our non-convex objective. Extensive experiments on multiple benchmark datasets demonstrate that CoRiM outperforms current state-of-the-art methods in high-conflict and noisy environments, validating the robustness of our method.
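Illustrative note (not from the paper): the projection-free Frank-Wolfe update on the probability simplex mentioned in the abstract above can be sketched as follows. This is a minimal sketch under stated assumptions. The function names, the toy risk, and the 2/(t+2) step size are all hypothetical choices for illustration. The toy risk is a simple convex stand-in and is not the paper's MCR function \mathcal{R}(w), which the abstract describes as non-convex.

```python
from typing import Callable, List

Vector = List[float]


def frank_wolfe_simplex(grad: Callable[[Vector], Vector], n: int, steps: int = 2000) -> Vector:
    """Minimize a smooth function over the probability simplex without projections."""
    w = [1.0 / n] * n  # start from uniform fusion weights
    for t in range(steps):
        g = grad(w)
        # Linear minimization oracle: on the simplex, the minimizer of <g, s>
        # is the vertex e_i with the smallest gradient coordinate.
        i = min(range(n), key=lambda k: g[k])
        gamma = 2.0 / (t + 2.0)  # classic diminishing step size (an assumption here)
        # A convex combination of w and e_i stays on the simplex, so no projection is needed.
        w = [(1.0 - gamma) * wk for wk in w]
        w[i] += gamma
    return w


def toy_risk_grad(w: Vector, uncertainty: Vector, lam: float) -> Vector:
    """Gradient of the hypothetical risk sum_k w_k*u_k + (lam/2)*sum_k w_k**2."""
    return [u + lam * wk for u, wk in zip(uncertainty, w)]


if __name__ == "__main__":
    # Hypothetical per-modality uncertainties; a lower value means a more reliable modality.
    u = [0.9, 0.2, 0.5]
    weights = frank_wolfe_simplex(lambda w: toy_risk_grad(w, u, 1.0), n=3)
    print([round(x, 2) for x in weights])
```

Because each iterate is a convex combination of simplex points, the weights remain non-negative and sum to one throughout, which is the property that makes the method projection-free.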
Paperid: 2755,   Poster  
Authors: Feng Yang, Jie Zhao, Fulin Luo, Anyong Qin, Tiecheng Song, Yue Zhao, CHENQIANG GAO, Junwei Han
Title: MPL: Match-guided Prototype Learning for Few-shot Action Recognition
Abstract: Current few-shot action recognition methods achieve impressive performance by learning representative prototypes and designing diverse video matching strategies. However, these approaches typically face two critical limitations: i) prototypes learned through implicit sample interactions lack clear semantic correspondence between query-support pairs, limiting their class representativeness; ii) the independent design of prototype learning and matching mechanisms creates a potential incompatibility between prototype representations and matching strategies. To address these limitations, we propose a Match-guided Prototype Learning (MPL) method comprising two key components: enhanced match (E-Match) and key-frame extraction match (K-Match). E-Match explicitly enhances prototype learning in class-specific embeddings by incorporating the matched semantics of query samples, while K-Match further refines the prototype representation through key-frame matching at the fine-grained frame level. Additionally, we propose a Cross-Shot Attention Aggregator (CSA-Aggregator) that dynamically aggregates adjacent frames across support samples, thereby obtaining a prototype representation that captures intra-class shared action patterns. In this way, the proposed MPL effectively mines coarse-to-fine, match-guided semantic information from query-support pairs to generate discriminative class prototypes, and improves the compatibility of the prototype representation with the match mechanism. Extensive evaluations on four public datasets confirm that MPL achieves superior performance over leading few-shot action recognition techniques.
Paperid: 2756,   Poster  
Authors: Jiajia Wei, YuJia He, Yuhan Hou, Hang Qi, Sihua Wang, Jincheng Shi, Kwok Li, Zibin Zheng, Weibin Wu
Title: Ref4D-VideoBench: Four-Dimensional Reference-Based Evaluation of Text-to-Video Generative Models
Abstract: Most existing evaluations of generated videos adopt a no-reference paradigm. Although recent benchmarks cover multiple dimensions and show moderate correlation with human preferences, relying solely on textual prompts weakens real-world constraints and makes it difficult to produce accountable and interpretable judgments on instance-level issues such as target behavior deviation, temporal inconsistency, and commonsense violations. In scenarios with explicit expectations, such as controlled generation, reference videos naturally provide rich, unambiguous spatio-temporal evidence, enabling stricter and more trustworthy assessment. Motivated by this, we propose Ref4D, a reference-based, fine-grained, multi-dimensional benchmark for generated video evaluation. Ref4D contains 600 high-quality reference videos with tightly evidence-bounded prompts, and introduces a 12-metric structured evaluation suite along four key dimensions: basic semantic alignment, motion consistency, event temporal consistency, and world knowledge consistency. Experiments on eight text-to-video models show that Ref4D achieves stronger agreement with human judgments than representative no-reference frameworks, while precisely diagnosing the dimensions and causes of failure for each video. By integrating explicit reference evidence with multimodal reasoning, Ref4D provides a practical and human-aligned standard for generated video evaluation and a tool to guide the development of more reliable generative models.
Paperid: 2757,   Poster  
Authors: Yujie Xue, Meng Wang, Ruihui Li, Fan Wu, Liu Zhi-Zhong, Zhuo Tang, Kenli Li
Title: Learning Spatial-Temporal Consistency for 3D Semantic Scene Completion
Abstract: Camera-based Semantic Scene Completion (SSC) is able to comprehensively understand the entire scene, but it suffers from ambiguous predictions due to occlusions and incomplete information. Temporal SSC alleviates this issue, but existing models simply stack multi-frame temporal features, which can lead to inconsistencies between geometry and semantics over time. In this paper, we present ConSSC, a novel SSC method that learns Spatial-Temporal Consistency. It works by lifting historical frames into a 3D scene-level occupancy framework, aggregating 2D and 3D historical features from current voxels, and learning from 2D visibility and similarity cues in a temporal buffer. Specifically, our framework introduces two key components. The Hierarchical Voxel Refinement module extracts a coarse occupancy from depth and refines it through voxel-level representations, recovering missing information. The Temporal Semantic Aggregation module integrates semantic features from different viewpoints and time points and aggregates them into the corresponding voxel features, enabling the reconstruction of occluded regions in the current frame using historical context. Without additional sensors or data, ConSSC improves both geometric and semantic consistency. Extensive experiments on the SemanticKITTI and SSCBench-KITTI-360 datasets show that ConSSC outperforms state-of-the-art camera-based and temporal SSC baselines by a significant margin in terms of IoU and mIoU.
Paperid: 2758,   Poster  
Authors: Kai Li, Wenqi Ren, Wei Wang, Xiaochun Cao
Title: Detecting Compressed AI-Generated Images via Phase Spectrum Robustness
Abstract: This paper presents a robust AI-generated image detection framework designed to address performance degradation caused by image compression in online social networks. The key challenges are twofold: 1) compression destroys fragile artifacts that are crucial to existing methods, and 2) it introduces new compression artifacts that interfere with detection. Existing methods typically enhance compression robustness by collecting original-compression pairs and compression labels. However, the collection and annotation process is highly resource-intensive. To address these issues, we propose a Compression-Robust Phase-Harmonized Transformer, motivated by the observation that the phase spectrum remains stable under compression. The framework consists of a phase-harmonized cross-modal interaction module that leverages phase spectrum information for feature fusion, enhancing compression robustness, and a multi-domain modulation adapter that further refines fused features while enabling parameter-efficient fine-tuning. In particular, the framework operates without requiring compression-original data pairs or compression labels. When limited compression labels are available, we introduce a difficulty-aware consistency loss to maximize their utility by prioritizing hard compressed samples during training, further boosting robustness. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches, exhibiting superior robustness against image compression.
Paperid: 2759,   Poster  
Authors: Zhengbo Xu, Jie Ma, Ziheng Wang, Zhan Peng, Jun Liang, Jing Li
Title: MoCha: End-to-End Video Character Replacement without Structural Guidance
Abstract: Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data. Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies. In this paper, we propose MoCha, a pioneering framework that mitigates these limitations by harnessing the inherent tracking ability of the video diffusion model, therefore requiring only a single arbitrary frame mask and no structural guidance. To effectively adapt the multi-modal input condition and enhance facial identity, we introduce a condition-aware RoPE and employ an RL-based post-training stage. Furthermore, to overcome the scarcity of qualified paired-training data, we propose a comprehensive data construction pipeline. Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized by current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches. We will release the code and dataset to facilitate further research.
Paperid: 2760,   Poster  
Authors: Duc Nguyen, Tat-Jun Chin, Minh Nguyen Nguyen
Title: MoBind: Motion Binding for Fine-Grained IMU–Video Pose Alignment
Abstract: We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on mRi, TotalCapture, and EgoHumans, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities.
Paperid: 2761,   Poster  
Authors: Raziuddin Mahmood, Tanveer Syeda-Mahmood
Title: Phrase-grounded APO for Improving Chest X-ray Report Generation
Abstract: The deployment of automatic radiology report generator (RRG) models in clinical workflows is being hampered by the lack of factual correctness in the produced reports. Existing methods to improve the report generators use alignment approaches that require pairs of ground truth preferred and dispreferred responses. As these are not available at inference time in clinical workflows, new alignment methods are needed to improve report quality at inference time. In this paper, we present a new phrase-grounded automatic preference optimization (APO) alignment method which offers such improvement during inference without needing additional ground truth. Specifically, the method generates surrogate ground truth preference data for alignment automatically from the RRG model response itself through fact-checking and LLM-prompted correction. We also develop a novel APO loss function that combines preference response alignment loss with phrasal grounding loss, paying attention to both the description of the finding and its image location. We show that this method of alignment, on average, improves the report quality at inference time by 30-40% across various SOTA report generators as tested on multi-institutional chest X-ray datasets.
Paperid: 2762,   Poster  
Authors: Shaojie Zhuang, Guangshun Wei, Jiangxin He, Yuanfeng Zhou
Title: Photo-Guided Tooth Segmentation on 3D Oral Scan Model
Abstract: Accurate 3D tooth segmentation is fundamental for digital dentistry, orthodontic analysis, and clinical simulation. Intraoral scan (IOS) models often suffer from incomplete or unreliable texture information, making it difficult to delineate fine boundaries between teeth and gingiva, while 2D intraoral images provide rich semantic and chromatic information that can complement 3D geometry. Thus, we propose a novel Photo-guided 3D Model Tooth Segmentation framework, PMTSeg, that enhances 3D tooth segmentation by integrating texture cues from intraoral photos. Our framework introduces three key components: a Camera Alignment Module (CAM) for accurate image-model registration, a Feature Filtering Gate (FFG) for adaptive multi-view feature selection, and a Consistent Feature Learning (CFL) mechanism for learning texture-geometry correspondence. Our method supports arbitrary numbers and views of intraoral photos. Experiments show significant improvements in distinguishing adjacent teeth and tooth–gingiva boundaries, demonstrating that intraoral photographs serve as an efficient, semantically rich supplement to 3D scans for precise dental segmentation.
Paperid: 2763,   Poster  
Authors: Xianhan Zeng, Xiaoxiao Hu, Sheng Li, Zhenxing Qian, Xinpeng Zhang
Title: Bridging Privacy and Provenance: Traceable Virtual Identity Generation
Abstract: Recent advances in generative models have enabled the creation of high-fidelity human faces, yet constructing reliable virtual identities that preserve user privacy while supporting consistent and verifiable identity assignment remains challenging. In this paper, we propose a diffusion-based framework for generating traceable virtual identities with stable identity semantics and preserved pose and expression. Our framework couples a virtual identity sampler that generates diverse but consistent identity embeddings with a 3D geometry and expression conditioning module that preserves the pose and non-identity characteristics of the input face. In addition, we incorporate a lightweight latent watermarking mechanism that embeds an imperceptible identity signature during generation, enabling a user to verify ownership of the resulting virtual identity through a secure token without revealing their real facial appearance. Quantitative evaluations demonstrate that our method achieves high identity consistency across repeated sampling, strong pose and expression fidelity, and improved anonymity compared with prior work. These results validate the effectiveness of integrating virtual identity sampling, geometric conditioning, and latent watermarking into a single generative framework, and highlight the practical potential of our solution for constructing privacy-aware virtual identities.
Paperid: 2764,   Poster  
Authors: Farhat Shaikh, Ayan Banerjee, Sandeep Gupta
Title: EMMA: Extracting Multiple physical parameters from Multimodal Data
Abstract: We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Unlike prior video-only approaches that struggle with occluded states, hidden actuation inputs, or assumptions about known initial conditions and coordinate frames, EMMA performs joint inference of explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model. EMMA leverages a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities while a physics-constrained loss enforces consistency with the governing differential equations. A unified feature pipeline enables consistent alignment across video trajectories, acoustic signatures, and chart-derived measurements, allowing EMMA to estimate parameters under forced, implicit, and multivariate dynamics without requiring segmentation masks, differentiable rendering, or specialized sensors. Across 100+ scenarios, including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems, EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines. Our results establish EMMA as a general, scalable solution for physics-consistent model extraction from opportunistic multimodal data.
Paperid: 2765,   Poster  
Authors: linchun wu, Qin Zou, Yuanhao Yue, Zhongyuan Wang
Title: Geometry-Aligned and Anomaly-Aware Reconstruction for 3D Anomaly Detection
Abstract: Point cloud anomaly detection is crucial in automated manufacturing, with reconstruction-based diffusion methods emerging as a mainstream solution. However, these approaches still face two major challenges: (1) geometry violation, where random noise perturbations deviate from local surface normals, causing structural distortion; and (2) undistinguished reference regions, where uniformly applied coarse anomaly embeddings during denoising blur normal details and impede accurate anomaly recovery. To address these issues, we propose AARD, a geometry-aligned and anomaly-aware diffusion reconstruction framework. We argue that high-fidelity anomaly detection requires a principled reformulation of the diffusion process: noise should align with geometry to preserve structures, and reconstruction can be better guided by anomaly-aware references to discriminatively recover normal details while correcting defects. AARD progressively aligns noise directions with vertex normals while maintaining vertex-graph consistency, and employs an adaptive transformer that assigns normal references to anomalous regions and input references to normal areas. Experiments on Anomaly-ShapeNet and Real3D-AD show that AARD consistently outperforms state-of-the-art approaches, achieving superior geometric fidelity and robust anomaly localization.
Paperid: 2766,   Poster  
Authors: Hidir Yesiltepe, Koutilya PNVR, Gaurav Suresh Pathak, Navaneeth Bodla, Bharat Singh, Pinar Yanardag, Jinrong Xie
Title: DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution
Abstract: Recent progress in video diffusion models has enabled remarkable generative fidelity, yet leveraging these priors for restoration remains limited by the strong coupling between conditional and unconditional branches in standard classifier-free guidance. We introduce a training-free framework that enhances distorted and low-resolution videos by decoupling these signals in time. Our proposed Decoupled Time Guidance (DTG) evaluates the unconditional branch at a cleaner diffusion timestep, providing a lookahead prior that preserves geometry while suppressing replication of warped content. This temporal bias is annealed throughout sampling, allowing the model to transition from structure correction to detail refinement without retraining. Combined with any off-the-shelf restoration module in a plug-and-play manner, our approach improves perceptual coherence and restores plausible structure in AI-generated and real-world videos alike. To facilitate evaluation, we curate GenWarp480, a benchmark of 4000 distorted 480p videos synthesized from diverse text-to-video models. GenWarp480 focuses on characteristic generative degradations such as warped faces, body misalignments, and spatial artifacts, providing a purpose-built testbed for assessing robustness to generative errors. Extensive experiments demonstrate that our method achieves significant improvements in structural fidelity and temporal stability without any model training.
Paperid: 2767,   Poster  
Authors: Qing Huang, Zhipei Xu, Xuanyu Zhang, Xiangyu Yu, Jian Zhang
Title: ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation
Abstract: The rise of AI-generated images (AIGIs) poses growing challenges for digital authenticity, prompting the need for efficient, generalizable image forgery detection systems. Existing methods, whether non-LLM-based or LLM-based, exhibit distinct advantages and limitations. While non-LLM-based models offer efficient low-level artifact detection, they often lack semantic understanding. Conversely, LLM-based methods provide strong semantic reasoning and explainability but are computationally intensive and less sensitive to subtle visual artifacts. Moreover, the true contribution of explanatory reasoning texts to forgery detection performance remains unclear. In this work, we investigate the intrinsic value and potential of LLM-generated reasoning texts, considering them a source of generalization and semantic-error sensitivity. Based on these findings, we propose ReAlign, a novel framework that distills high-quality reasoning texts generated by a GRPO-optimized LLM into a lightweight AIGI detector via contrastive learning. ReAlign effectively inherits the generalization ability and semantic sensitivity of reasoning textual representations, while remaining efficient and lightweight for deployment. Moreover, ReAlign adopts a tailored joint optimization strategy that integrates contrastive loss for image-text alignment and classification loss for accurate forgery discrimination. Experimental results on AIGCDetectBenchmark, AIGI-Holmes, and our newly constructed UltraSynth-10k demonstrate that ReAlign consistently outperforms existing state-of-the-art detectors in both accuracy and generalization, particularly when facing complex, high-fidelity forgeries from modern generative models.
Paperid: 2768,   Poster  
Authors: JIAMU SUN, Zhiyuan Yan, Ke-Yue Zhang, Taiping Yao, Shouhong Ding
Title: DFD-HR: Generalizable Deepfake Detection via Hierarchical Routing Learning
Abstract: Developing generalizable deepfake detectors has become increasingly important with the rapid advancement of generative models. Adapting visual foundation models (VFMs), e.g., CLIP, through parameter-efficient finetuning (PEFT), with only a small subset of parameters updated, has been proven highly effective for generalizable detection. However, the success of “fewer-parameters” training raises an important question: although only a few parameters are tuned, have existing PEFT-based detectors truly exploited the most informative ones while eliminating redundant parameters for better generalization? In this work, we move beyond standard PEFT by proposing a joint optimization strategy that operates at both the layer and token levels. Since latent features across layers capture different semantic abstractions and tokens within the same layer convey varied forgery cues, we propose integrating both layer-level and token-level routing to maximize representational synergy. Specifically, at the layer level, we introduce "Early Layer Pruning", an adaptive truncation mechanism that enables the model to learn distinct forward depths for different types of instances. At the token level, "Token Selection" is guided by the Spearman rank loss to filter tokens irrelevant to forgery learning, enabling the model to focus on the most discriminative cues. Furthermore, a unified MoE architecture is applied that encourages diversity and thus reduces the model's potential overfitting to specific forgery types. Extensive benchmarking results demonstrate the effectiveness of our designs and show the superior performance of our method over existing state-of-the-art methods.
Paperid: 2769,   Poster  
Authors: Xuelu Feng, Tianyu Luan, Zixin Zhu, Akshobhya Sharma, Phani Nuney, Junsong Yuan, Chunming Qiao
Title: Learning 3D Shape Fidelity Metric from Real-world Distortions
Abstract: 3D generation and reconstruction have become essential in many computer vision applications, where the reconstructed or generated 3D shapes need to appear realistic to human perception. However, traditional metrics for comparing two 3D shapes, such as Chamfer Distance, focus primarily on the matching accuracy of the shape geometry and fail to capture perceptual fidelity in the shape. While frequency-based metrics attempt to analyze shape details in the spectral domain, they still do not fully encapsulate the complexity of human perception. To address this gap, we propose a human-aligned fidelity metric that leverages local shape connectivity through a local attention mechanism to capture rich, detailed shape information. We also introduce the two-branch Real Shape Fidelity (RSF) dataset, including a main subset and a test-only subset. This dataset contains 3D mesh distortions generated using real-world reconstruction and generation methods and is annotated by hundreds of human subjects. Our metric, named Local-Connection-based Shape Evaluation (LoCaSE), utilizes a PointNet-based backbone combined with Low-Rank Adaptation (LoRA)-style pretraining and finetuning to reduce model bias, while maintaining translation, rotation, and scale invariance. Experiments demonstrate that our approach achieves superior alignment with human perception compared to previous metrics.
Paperid: 2770,   Poster  
Authors: Huizhi Liang, Yichao Shen, Yu Deng, Sicheng Xu, ZhiYuan Feng, Tong Zhang, Yaobo Liang, Jiaolong Yang
Title: Towards Hierarchical 3D Spatial Understanding in Vision-Language Models
Abstract: Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex stages, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that generates over 1 billion 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised finetuning. We also develop an RGB-D VLM that incorporates metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence in future VLMs.
Paperid: 2771,   Poster  
Authors: Senyao Li, Haozhao Wang, Zhaobai Jiang, Zhanbo Jin, Hao Fan, Ruixuan Li
Title: E$^2$-SCI: Elastic Edge–Cloud Speculative Decoding via Credit Inertia
Abstract: In edge–cloud environments, the efficiency of speculative decoding is heavily constrained by uplink transmission and cloud-side verification. In this work, we identify a phenomenon we term credit inertia, where the acceptance rates of adjacent token windows exhibit strong temporal consistency. Tokens following recently well-performing windows are likely to pass verification, whereas tokens following poorly performing windows are likely to fail. Motivated by this observation, we propose E^2-SCI, an elastic edge–cloud speculative decoding framework that dynamically adjusts draft token verification thresholds based on recent historical performance. This adaptive mechanism allows the system to be more permissive for windows with strong historical performance and stricter for windows with weak performance, effectively leveraging temporal consistency to reduce overall latency. We further introduce Progressive Lookahead Concurrency (PLC), which pipelines draft generation and verification asynchronously to hide latency. Experiments across multiple benchmarks show that E^2-SCI achieves over 9.4 tokens/s on DeepSeek-R1-Distill-Qwen (1.5B/32B), delivering an 88.5% speed improvement over the FSD baseline while maintaining accuracy. Notably, E^2-SCI integrates seamlessly with existing frameworks (e.g., EAGLE-3), demonstrating broad applicability and superior efficiency–quality trade-offs.
Paperid: 2772,   Poster  
Authors: Yiyu Wang, Xuyang Liu, Xiyan Gui, Xinying Lin, Boxue Yang, Chenfei Liao, Tailai Chen, Linfeng Zhang
Title: Accelerating Streaming Video Understanding via Hierarchical Token Compression
Abstract: Streaming Video Large Language Models (VideoLLMs) have demonstrated impressive performance across various video understanding tasks, but they face significant challenges in real-time deployment due to the high computational cost of processing dense visual tokens from continuous video streams. In streaming video scenarios, the primary bottleneck lies in the Vision Transformer (ViT) encoding stage, where redundant processing of temporally similar frames leads to inefficiency. Additionally, inflated token sequences during LLM pre-filling further exacerbate latency and memory overhead. To address these challenges, we propose Streaming Token Compression (STC), a plug-and-play hierarchical framework that seamlessly integrates into existing streaming VideoLLMs, optimizing both ViT encoding and LLM pre-filling stages to accelerate processing. STC introduces two token-level accelerators: STC-Cacher, which reduces ViT encoding overhead by caching and reusing features from temporally similar frames, and STC-Pruner, which compresses the visual token sequence before it enters the LLM, preserving only the most salient tokens based on both spatial and temporal relevance. Extensive experiments on four baseline streaming VideoLLMs across five benchmarks demonstrate that STC outperforms other compression methods. Notably, STC retains up to 99% of accuracy on the ReKV framework while reducing ViT encoding latency and LLM pre-filling latency by 24.5% and 45.3%, respectively.
Paperid: 2773,   Poster  
Authors: Jinyu Han, changguang wu, Fuming Sun, Jinhui Tang
Title: Beyond Appearance: Camouflaged Object Detection via Geometric Structure
Abstract: Depth priors provide salient geometric structure that benefits camouflaged object detection (COD), but directly using Monocular Depth Estimation (MDE) causes a task misalignment that still fails to identify camouflaged objects. To address this issue, we propose the Depth Segment Anything Model (DepthSAM), an MDE-adapted method specifically designed to mitigate this misalignment. DepthSAM incorporates two core innovations: (1) a Sparse Mixture-of-Experts Adapter (SMEA) that enables DEM to learn semantic information unique to camouflaged scenes, and (2) a Geometric–Semantic Fusion Module (GSFM) that efficiently integrates geometric cues with high-level semantics. With these components, DepthSAM achieves both robust semantic understanding in camouflaged environments and accurate segmentation of camouflaged objects. Extensive experiments show that DepthSAM achieves new SOTA performance on three major benchmarks. For example, on COD10K, its S_\alpha and F_\beta^\omega metrics surpass the best competing methods by 3.0% and 4.3%, respectively.
Paperid: 2774,   Poster  
Authors: Yuanpeng Tu, Yunpeng Chen, Xi Chen, Liang Li, Hengshuang Zhao
Title: Hint2Gen: Bridging Understanding and Generation via Code-structured Hints
Abstract: Recent unified models have made remarkable strides in generating high-quality images, yet they consistently fail on reasoning-intensive tasks, e.g., solving mazes and assembling tangrams. Intriguingly, we find that vision-language models (VLMs) and large language models (LLMs) can accurately solve these tasks, but cannot generate the corresponding images because they lack a structured visual output interface. This reveals that the core bottleneck is not reasoning capacity, but the lack of a structured interface to translate high-level reasoning into precise visual output. To bridge this gap, we propose using code-structured visual hint overlays (i.e., SVG/HTML) that explicitly encode reasoning steps directly on the image plane. Accordingly, we develop an automatic data construction pipeline that can generate high-quality code-structured hints for existing datasets and train a unified model called Hint2Gen based on FLUX.1 Kontext to condition its generation on such hints. Furthermore, to comprehensively evaluate the effectiveness of our approach, we introduce Reason2Gen, a benchmark comprising 4,000 samples spanning 20 categories across 7 core dimensions, including path connectivity, spatial assembly, etc. Extensive experiments demonstrate that even simply providing such hints as extra inputs, without any retraining, boosts the performance of existing models. Moreover, our model significantly outperforms all leading open-source/closed-source methods on reasoning-aware generation and editing across all the dimensions.
Paperid: 2775,   Poster  
Authors: Chaoqun Sun, Zongjing Fu, Powei Chang, Jinpeng Zhang, JianXiang Xiang, Yukang Gao, Chenyu Wang
Title: GeoRK2: Geometry-Guided Runge–Kutta Integration for Diffusion Transformer Acceleration
Abstract: Diffusion transformer models deliver state-of-the-art image synthesis quality but suffer from prohibitively slow iterative sampling. Fewer sampling steps accelerate inference but inevitably distort intermediate features and degrade visual fidelity, while offering little relief in computational cost. To address these limitations, we present GeoRK2, a training-free framework that bridges numerical analysis and information geometry. GeoRK2 couples second-order Runge–Kutta (RK2) integration with a curvature-aware geometric flow derived from the model's noise predictions, establishing provably stable feature evolution dynamics under manifold-aware integration. By leveraging an empirical feature covariance–induced metric estimated from gradient covariances to capture intrinsic feature geometry and applying parallel transport along the manifold connection, GeoRK2 constrains error propagation under large-step integration, ensuring both numerical stability and structural fidelity. As a fully plug-and-play method, GeoRK2 requires no retraining and is compatible with mainstream pretrained diffusion transformers. Comprehensive experiments on image generation and super-resolution tasks across representative diffusion backbones (e.g., DiT-XL, HunyuanVideo, and FLUX.1-dev) demonstrate that GeoRK2 achieves 4–5× faster inference than baseline frameworks (FORA, TaylorSeer) with only marginal perceptual differences (∆FID ≈ 0.81), confirming its effectiveness and generality. All implementation details and code are provided in the supplementary material.
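Illustrative note (not from the paper): the second-order Runge–Kutta (RK2) integration named in the abstract above can be sketched, in its generic textbook form, as a midpoint step on a toy scalar ODE. The abstract does not say which RK2 variant GeoRK2 uses, so the midpoint variant, the function names `rk2_midpoint_step` and `integrate`, and the decay ODE below are all assumptions; the curvature-aware flow and parallel transport are not modeled here.

```python
import math


def rk2_midpoint_step(f, x, t, h):
    """One midpoint RK2 step for dx/dt = f(x, t) (hypothetical helper)."""
    k1 = f(x, t)                              # slope at the start of the step
    k2 = f(x + 0.5 * h * k1, t + 0.5 * h)     # slope at the estimated midpoint
    return x + h * k2                         # advance using the midpoint slope


def integrate(f, x0, t0, t1, steps):
    """Apply `steps` equal-size RK2 steps from t0 to t1."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        x = rk2_midpoint_step(f, x, t, h)
        t += h
    return x


def decay(x, t):
    """Toy ODE dx/dt = -x, whose exact solution is x(t) = exp(-t)."""
    return -x


approx = integrate(decay, 1.0, 0.0, 1.0, 10)
exact = math.exp(-1.0)
print(abs(approx - exact))  # global error shrinks roughly as h**2
```

The point of the sketch is only that a second-order step tolerates larger step sizes than a first-order (Euler) step at the same accuracy, which is the general motivation for higher-order integrators in few-step sampling.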
Paperid: 2776,   Poster  
Authors: Hao Dong, Yujin Liu, Haoyue Liu, Zhenyu Wang, Shihan Peng, Zhiwei Shi, Yi Chang, Luxin Yan
Title: Tracking through Severe Occlusion via Event-Derived Transient Cues
Abstract: Tracking targets with high-speed and nonlinear motion under occlusion remains challenging due to spatial appearance deprivation and temporal trajectory fragmentation caused by missing visual cues. Existing methods typically either dynamically update templates to maintain appearance similarity or employ autoregressive models to predict targets from historical trajectories. However, these methods are ineffective under severe occlusion owing to template contamination and limited frame rates for complex motion. In this work, we observe that occlusion inherently degrades the spatial matching mechanism, highlighting the importance of temporal cues. Meanwhile, event cameras with microsecond-level temporal resolution provide transient dynamic cues that facilitate modeling nonlinear motion. In light of this, we propose EvoTrack, an occlusion-robust tracking framework via event-derived transient evolution, which comprises event-based motion autoregression and target-aware appearance matching. Specifically, for motion autoregression, the fine-grained timestamps of events naturally encode the target's direction and speed, motivating a bidirectional motion consistency that constrains inter-frame displacement prediction under nonlinear motion. For appearance matching, we adopt a Gaussian masking strategy to simulate occlusion degradation, guiding the model to focus on target regions and learn invariant representations. Furthermore, we build a pixel-aligned Frame-Event tracking dataset with higher spatial resolution and explicit occlusion labels. Extensive experiments demonstrate the effectiveness of EvoTrack in challenging occlusion scenes.
Paperid: 2777,   Poster  
Authors: Shuo Wang, Zhichuan Wang, Jun Luo
Title: OntoAug: Rethinking Generative Data Augmentation via Ontology Guidance
Abstract: Generative data augmentation techniques open new avenues for improving image recognition models. The core of image recognition lies in accurately capturing the ontological features of the subject. However, existing methods often treat the image as a whole during augmentation, ignoring the uneven semantic distribution between foreground and background. This can lead to semantic shifts in generated samples, weakening the model’s ability to represent the subject’s ontology. In human perception, category recognition typically relies on the stable essence of the subject while tolerating variations in background and environment. Inspired by this human perceptual mechanism of “stable subjects, diverse backgrounds, and overall coherence,” we propose OntoAug, a data augmentation framework based on the distinction between ontology and environment that redefines the boundary of ontology-oriented enhancement. OntoAug explicitly separates the foreground subject and background context, guiding diffusion models through structured layout control to generate samples with consistent subjects and diverse backgrounds. Experiments show that OntoAug significantly improves performance in image classification, few-shot learning, weakly supervised object localization (WSOL), and large vision-language model (LVLM) reasoning, demonstrating its advantages in semantic fidelity and sample diversity. It offers a new direction for building visual systems more aligned with human perception. Code will be available.
Paperid: 2778,   Poster  
Authors: Yixin Chen, Yaowei Zhang, Huangyue Yu, Junchao He, Yan Wang, Jiangyong Huang, Hongyu Shen, Junfeng Ni, Shaofei Wang, Baoxiong Jia, Song-Chun Zhu, Siyuan Huang
Title: Lifting Unlabeled Internet-scale Data for 3D Scene Understanding
Abstract: Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We systematically identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.
Paperid: 2779,   Poster  
Authors: Boya Liao, Ying Li, Siyong Jian, Huan Wang
Title: Parallel Jacobi Decoding for Fast Autoregressive Image Generation
Abstract: Autoregressive (AR) models have demonstrated remarkable performance in generating high-fidelity images. However, their inherently sequential next-token prediction significantly slows inference. Recent studies have introduced Jacobi-style decoding to accelerate autoregressive image generation. Extending the draft sequence initially improves efficiency, yet the acceleration quickly saturates as error propagation in the one-dimensional sequence hinders convergence. Observing that images exhibit strong local spatial correlations, we propose Parallel Jacobi Decoding (PJD), a training-free decoding approach that expands draft tokens in the two-dimensional spatial domain to enable efficient spatially parallel refinement. PJD adjusts the attention mask to mitigate error accumulation and improve convergence stability. Extensive experiments on diverse datasets show that PJD achieves 4.8×–6.4× acceleration across multiple autoregressive image generation models while preserving image quality.
Paperid: 2780,   Poster  
Authors: Xin Zhang, Liang Bai, Guanchao Wang, Xian Yang
Title: Exemplar-Free Class Incremental Learning via Preserving Class-Discriminative Structure
Abstract: Exemplar-Free Class Incremental Learning (EFCIL) aims to enable models to learn new classes sequentially without retaining samples from previous tasks. While recent approaches leverage pre-trained models with parameter-efficient tuning to mitigate forgetting, they often overlook a crucial cause of forgetting: the collapse of the class-discriminative structure. This structure comprises two interdependent components: intra-class structure, which characterizes the shape of individual classes, and inter-class structure, which characterizes the global geometric relationships among class prototypes. We reveal that catastrophic forgetting stems from the simultaneous deterioration of both intra-class and inter-class structures. To address this, we propose a unified framework that preserves the class-discriminative structure. It preserves the intra-class structure by reshaping class means and covariances to preserve each class’s shape during migration, and maintains inter-class structure by stabilizing angular relationships between samples and old prototypes. Extensive experiments demonstrate that our framework outperforms existing leading methods on multiple EFCIL benchmarks, validating that preserving the class-discriminative structure is crucial for mitigating catastrophic forgetting.
Paperid: 2781,   Poster  
Authors: liang peng, Bohan Tan, Zhipeng Zhang, Haobo Li, Yifan Jiao, Xingping Dong, Libo Zhang
Title: Towards Visual Query Localization in the 3D World
Abstract: Visual query localization (VQL) aims to predict a spatial-temporal response of the most recent occurrence from a sequence given a query. Currently, most research focuses on visual query localization from 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt at visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. Each sequence in 3DVQL is provided with multiple modalities including point clouds (PC), RGB and depth images to support flexible research. To ensure high-quality annotation, each sequence is manually annotated with multiple rounds of verification and refinement. To the best of our knowledge, 3DVQL is the first benchmark towards 3D multimodal visual query localization. To facilitate comparison for subsequent research, we implement a series of representative 3D multimodal VQL baselines using PC and RGB. The experimental results show that existing methods exhibit significant performance variations across different fusion modules. To encourage future research, we propose a lift and attention fusion algorithm named LaF, which significantly outperforms existing baseline models. Our benchmark and model will be publicly released.
Paperid: 2782,   Poster  
Authors: JIan Yu, Yujian Feng, Shuai You, Zhongkai Zhou, Fei Wu, Zhengjun Jing, Yimu Ji
Title: Spatial-Frequency Collaborative Learning for Occluded Visible-Infrared Person Re-Identification
Abstract: Occluded visible–infrared person re-identification (Occluded VI-ReID) remains difficult due to modality heterogeneity and occlusions, both of which break structural consistency and weaken cross-modality feature alignment. Existing methods rely mainly on spatial-domain cues (such as local body parts and salient patches), but their discriminability degrades severely under varying imaging conditions or partial visibility. To address these issues, we introduce a spatial-frequency collaborative perspective that offers global perception and cross-location consistency. Specifically, we propose a Spatial-Frequency Collaborative Learning (SFCL) framework that uses frequency information to complement spatial representations. SFCL comprises a Cross-Modality Frequency Alignment Module (CFAM), a Spatial-Frequency Interaction Module (SFIM), and a Frequency-Aware Discriminative (FAD) loss. The CFAM models the spectral features of visible/infrared images in the frequency domain, establishing modality-consistent spectral priors. The SFIM injects these priors into spatial features, promoting dual-domain interaction and complementary representations of spatial and frequency semantics. In addition, the FAD loss jointly enforces cross-modality frequency alignment and semantic consistency, thus enhancing robustness and discriminability under occlusions. For real-occlusion evaluation, we construct two occluded datasets, Occ-SYSU-MM01 and Occ-RegDB, on which SFCL outperforms the state-of-the-art.
Paperid: 2783,   Poster  
Authors: Qinqin Zhou, Fuhai Chen, Jipeng Wu, Zhiwei Chen, Zhikai Hu, Weiwei Cai
Title: InTrain: Intrinsic Trainability for Zero-Cost Neural Architecture Search
Abstract: Training-free neural architecture search promises efficient discovery of high-performance networks without costly training. However, existing zero-cost proxies rely on fragmented heuristics that fail to capture the fundamental question: what makes an architecture trainable? This paper introduces Intrinsic Trainability (InTrain), a unified theoretical proxy that formalizes trainability as an architectural invariant emerging from two synergistic components: geometric capacity and optimization resilience. We operationalize intrinsic trainability through analysis of neural information processing. Geometric capacity is quantified via the participation ratio of activation covariance eigenspectrum, capturing the effective dimensionality of representation manifolds. Optimization resilience is measured through cumulative gradient health, assessing the robustness of backpropagation across network depth. InTrain synthesizes these dimensions through a scale-invariant multiplicative coupling, which we validate is essential for capturing their synergistic, non-additive relationship. Extensive experiments on standard NAS benchmarks and search spaces demonstrate that InTrain achieves ranking correlations on par with state-of-the-art ensemble-based proxies and outperforms other single-metric methods.
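Illustrative note (not from the paper): the participation ratio of a covariance eigenspectrum, mentioned in the abstract above, is commonly defined as PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues). For a symmetric covariance matrix C this equals trace(C)^2 / trace(C^2), so it can be computed without an eigendecomposition. The sketch below uses that standard definition; the function name `participation_ratio` and the toy activations are assumptions, and how InTrain collects or normalizes activations is not stated in the abstract.

```python
def participation_ratio(activations):
    """Effective dimensionality of a set of activation vectors.

    activations: list of n samples (n >= 2), each a list of d floats,
    with nonzero total variance. Hypothetical helper for illustration.
    """
    n = len(activations)
    d = len(activations[0])
    # Center each feature.
    means = [sum(row[j] for row in activations) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in activations]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # trace(C) is the sum of eigenvalues; trace(C^2) is the sum of their
    # squares, which for symmetric C is the sum of squared entries.
    trace = sum(cov[i][i] for i in range(d))
    trace_sq = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    return trace ** 2 / trace_sq


# Variance spread evenly over two axes: effective dimension 2.
isotropic = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
# All variance along one direction: effective dimension 1.
collapsed = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]
print(participation_ratio(isotropic), participation_ratio(collapsed))
```

The value ranges from 1 (all variance in a single direction) to d (variance spread equally over all d directions), which is why it serves as a measure of effective dimensionality.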
Paperid: 2784,   Poster  
Authors: Shihua Zhang, Tianhao Xu, Zizhuo Li, Qing Ma, Jiayi Ma
Title: SAG-GNN: Semantic-Aware Guided GNN for Descriptor-Free 2D-3D Matching
Abstract: Image-to-point cloud matching (2D-3D matching) establishes accurate correspondences between image keypoints and 3D points for 6-DoF camera pose estimation. Existing methods either suffer from poor generalization due to scene-specific coordinate regression requiring per-scene retraining, or incur high storage and maintenance costs from descriptor-based matching that relies on large descriptor sets. Consequently, descriptor-free approaches have gained attention by avoiding heavy storage while improving generalizability; however, most rely only on low-level geometric cues, which limits performance. Leveraging the benefits of semantics in providing context, resolving ambiguities, and enhancing robustness in challenging scenes, we propose the Semantic-Aware Guided Graph Neural Network (SAG-GNN), integrating high-level semantics into descriptor-free 2D-3D matching. Specifically, we design a compact semantic extraction scheme encoding each 3D point as a low-dimensional semantic probability distribution, offering effective guidance with minimal storage. A bidirectionally-aligned fusion block merges geometric features with semantic context for more unified and consistent representations. Additionally, semantic priors guide the 2D-3D information exchange within the interaction framework from a high-level semantic perspective. Extensive indoor and outdoor experiments validate that SAG-GNN achieves state-of-the-art results in descriptor-free 2D-3D matching and visual localization, with low storage and strong generalization.
Paperid: 2785,   Poster  
Authors: Sixian Zhang, Yiyao Wang, Xinhang Song, Keming Zhang, Zijian Xu, Shuqiang Jiang
Title: Multi-Scale Gaussian-Language Map for Embodied Navigation and Reasoning
Abstract: Understanding the geometric and semantic structure of environments is essential for embodied agents. Existing semantic mapping methods trade off between explicit geometry and multi-scale semantics, and lack a native interface for large models, thus requiring additional training of feature projection for semantic alignment. To this end, we propose the multi-scale Gaussian-Language Map (GLMap), which introduces three key designs: (1) explicit geometry, (2) multi-scale semantics covering both instance- and region-level concepts, and (3) a dual-modality interface where each semantic unit jointly stores a natural language description and a 3D Gaussian representation. The 3D Gaussians enable compact storage and fast rendering of task-relevant images via Gaussian splatting. To enable efficient incremental construction, we further propose a Gaussian Estimator that analytically derives Gaussian parameters from dense point clouds without gradient-based optimization. Experiments on ObjectNav, InstNav, and SQA tasks show that GLMap effectively enhances target localization and contextual reasoning, while remaining compatible with large-model-based methods in a zero-shot manner.
Paperid: 2786,   Poster  
Authors: Zhuojie Wu, Shijie Wang, Xin Yu
Title: MeToM: Metadata-Guided Token Merging for Efficient Video LLMs
Abstract: Video Large Language Models (VLLMs) encounter significant computational challenges due to the large volume of visual tokens generated from multiple frames. Existing visual token pruning methods fail to account for the uneven spatiotemporal information density, thus squandering scarce token budgets on regions with low information density. In this paper, we propose a training-free Metadata-guided Token Merging framework (MeToM) that leverages intrinsic video metadata to adaptively allocate budgets and merge visual tokens based on content complexity. Specifically, MeToM exploits residuals from the metadata as spatial information density cues. It merges less informative regions during tokenization, avoiding redundant encoding and improving the efficiency of the visual encoder. Additionally, MeToM captures temporal variations in information density by utilizing the average Group of Pictures (GoP) size to represent scene complexity. This mechanism enables dynamic per-frame token allocation that adaptively adjusts token budgets across time, assigning more tokens to content-complex frames and fewer to simple ones. Finally, inside the LLM, we merge low-contribution visual tokens via multi-layer attention to reduce the prefill FLOPs and the visual KV cache. Extensive experimental results demonstrate that MeToM outperforms prior SoTA counterparts, achieving a 2.65× inference speedup over the baseline VLLM while still improving performance, without training.
Paperid: 2787,   Poster  
Authors: Mengmeng Ge, Takashi Isobe, Xu Jia, Yanan Sun, Zetong Yang, Weinong Wang, Dong Zhou, Dong Li, Huchuan Lu, Emad Barsoum
Title: Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
Abstract: Understanding physical transformation processes is crucial for both human cognition and artificial intelligence systems, particularly from an egocentric perspective, which serves as a key bridge between humans and machines in action modeling. We define this modeling process as Egocentric Instructed Visual State Transition (EIVST), which involves generating intermediate frames that depict object transformations between initial and target states under a brief action instruction. EIVST poses two challenges for current generative models: (1) understanding the visual scenes of the initial and target states and reasoning about transformation steps from an egocentric view, and (2) generating a consistent intermediate transition that follows the given instruction while preserving object appearance across the two visual states. To address these challenges, we propose the EgoIn framework. It first infers the multi-step transition process between two given states using TransitionVLM, fine-tuned on our curated dataset to better adapt to this task and reduce hallucinated information. It then generates a sequence of frames based on transition conditions produced by the proposed Transition Conditioning module. Additionally, we introduce Object-aware Auxiliary Supervision to preserve consistent object appearance throughout the transition. Extensive experiments on human-object and robot-object interaction datasets demonstrate EgoIn's superior performance in generating semantically meaningful and visually coherent transformation sequences.
Paperid: 2788,   Poster  
Authors: Phuc Nguyen, Anh N Nhu, Ming Lin
Title: OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
Abstract: We introduce OpenVO, a novel framework for Open-world Visual Odometry (VO) with temporal awareness under limited input conditions. OpenVO effectively estimates real-world–scale ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras, enabling robust trajectory dataset construction from rare driving events recorded in dashcam footage. Existing VO methods are trained on a fixed observation frequency (e.g., 10Hz or 12Hz), completely overlooking temporal dynamics information. Many prior methods also require calibrated cameras with known intrinsic parameters. Consequently, their performance degrades when (1) deployed under unseen observation frequencies or (2) applied to uncalibrated cameras. These limitations significantly restrict their generalizability to many downstream tasks, such as extracting trajectories from dashcam footage. To address these challenges, OpenVO (1) explicitly encodes temporal dynamics information within a two-frame pose regression framework and (2) leverages 3D geometric priors derived from foundation models. We validate our method on three major autonomous-driving benchmarks – KITTI, nuScenes, and Argoverse 2 – achieving more than 20% performance improvement over state-of-the-art approaches. Under varying observation rate settings, our method is significantly more robust, achieving 46%–92% lower errors across all metrics. These results demonstrate the versatility of OpenVO for real-world 3D reconstruction and diverse downstream applications.
Paperid: 2789,   Poster  
Authors: Long Ma, Haoze Zheng, Yuhang Mao, Jinyuan Liu, Chengpei Xu, Xinwei Xue, Yi Wang, Xiangjian He, Weimin Wang
Title: BiPA: Bilevel Prompt Adaptation for Underwater Instance Segmentation
Abstract: Underwater instance segmentation is essential for fine-grained scene understanding. However, underwater imagery exhibits a strong domain gap from in-air vision due to severe degradation (e.g., turbidity). Consequently, despite its general segmentation ability, SAM degrades sharply underwater. In this work, we propose BiPA, which effectively adapts SAM to the underwater domain. Concretely, we construct an underwater SAM with dual prompts and introduce a foreground-attentive injection block to enhance local foreground representation. We formulate dense prompt learning as a bilevel optimization, explicitly capturing the mutual dependency between prompt and model. To make this tractable, we design a two-stage learning strategy. The first stage adapts the dense prompt itself, updating it with Bayesian optimization to learn efficiently. The second stage fine-tunes the model parameters under the frozen optimized prompt, which finally enables effective cross-domain adaptation. Extensive experiments and analyses verify the superiority and efficiency of BiPA. Code will be released upon acceptance.
Paperid: 2790,   Poster  
Authors: Die Zuo, Lubo Wang, Ruonan Liu, Qing Guo, Chong Wang, Dongdong Wu, Wei Feng, Kairui Yang, Di Lin
Title: Multi-modal Frequency Decomposition Network for Semantic Scene Completion
Abstract: Based on an RGB-D image pair, semantic scene completion (SSC) provides a description for 3D scene understanding by predicting a 3D semantic occupancy map. Recent methods extract RGB-D multi-modal features and fuse them in the spatial domain, which disregards the misalignment caused by the imperfect raw multi-modal data and the multi-modal feature learning. Moreover, the operations they use to extract high-level features tend to introduce feature smoothing and detail loss, exacerbating this misalignment. To tackle these problems, this paper introduces MFDNet, a lightweight semantic scene completion network based on a multi-modal frequency decomposition strategy. By integrating frequency processing with a limited number of convolution and downsampling layers, MFDNet achieves a balance between modality alignment and detail retention. The network is equipped with Multi-modal Adaptive Frequency Fusion (MAFF) and Frequency Detail Compensation (FDC). MAFF models the intra-modal multi-band dependencies and inter-modal relationships from a global perspective, enabling modality-specific calibration while facilitating the aligned fusion of multi-modal features. FDC extracts the high-frequency cues in shallow features to compensate for the missing local details of the fused feature and achieve fine-grained alignment for completion. MAFF and FDC form a global-to-local alignment and completion paradigm for multi-modal SSC. Extensive experiments demonstrate that MFDNet reduces parameters by 54.4% while achieving state-of-the-art performance on the NYUv2 and NYUCAD datasets.
Paperid: 2791,   Poster  
Authors: Yuming Meng, Dong Wu, Hongbin Zha
Title: ST4R-Splat: Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting
Abstract: Understanding and segmenting objects in dynamic 4D environments from natural language is crucial yet underexplored. Existing works either perform referring segmentation in static 3D scenes or build open-vocabulary 4D language fields, but none of them supports grounding complex spatio-temporal referring descriptions in explicit 4D reconstructions. Based on 4D Gaussian Splatting (4DGS), we formalize this missing setting as Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting (STRS-4DGS): given a 4DGS representation of a dynamic scene and a referring expression, the goal is to identify the target object and segment it across both space and time, resolving where the described instance is and when it exhibits the queried state. To tackle this challenge, we propose ST4R-Splat, the first framework for STRS-4DGS. ST4R-Splat builds on deformable 4D Gaussians and introduces an Instance-Aware 4D Referring Field that assigns each Gaussian a time-invariant embedding, enabling robust instance-level grounding for both time-agnostic and time-sensitive referring queries. On top of this, an Instance-level Temporal State Mapping module models a view-independent mapping from instance identity and time to semantic states directly in feature space. To obtain rich supervision without manual annotation, we design a task-adaptive captioning pipeline that uses multimodal large language models to generate complementary frame-level descriptive captions and time-aware state captions for each object. We construct a new benchmark on dynamic 4D reconstructions with spatio-temporally grounded referring expressions and adapt state-of-the-art 3D/4D language grounding methods as baselines. Extensive experiments show that ST4R-Splat significantly outperforms baselines on both spatial (time-agnostic) and temporal (time-sensitive) metrics, establishing a strong foundation for fine-grained, language-driven understanding of dynamic 4D scenes.
Paperid: 2792,   Poster  
Authors: Thomas Besnier, Emery Pierson, Sylvain Arguillere, Maks Ovsjanikov, Mohamed Daoudi
Title: PaNDaS: Learnable Shape Interpolation Modeling with Localized Control
Abstract: We present PaNDaS, a novel deep learning framework for Partial Non-Rigid Deformations and interpolations of Surfaces. PaNDaS learns a per-face feature field on the source mesh and fuses it with a global encoding of the target. A deformation generator predicts a Jacobian field and recovers a smooth displacement, enabling precise regional control, pose mixing, and transferable local edits. Unlike previous approaches, our method can restrict the deformations to specific parts of the shape in a versatile way. Across various human body part datasets, PaNDaS achieves state-of-the-art interpolation accuracy and stronger locality than methods based on global shape codes or handles, while remaining robust to remeshing. We demonstrate several localized shape manipulation tasks and show that our method can generate new shapes by combining different input deformations.
Paperid: 2793,   Poster  
Authors: Benlei Cui, Fangao Zeng, Weitao Jiang, Yuwen Zhai, Haiwen Hong, Longtao Huang, Hui Xue, Wenxiang Shang, Pipei Huang
Title: SimplePoster: A Simple Baseline for Product Poster Generation
Abstract: Product poster generation presents unique challenges beyond general-purpose design: it demands not only aesthetic composition and accurate text rendering, but also strict preservation of the product subject and precise control over dense, multi-line text layouts. While general image editing models struggle with text layout control and subject consistency, existing specialized approaches, often built upon inpainting frameworks, still suffer from unintended subject extension and inaccurate text synthesis. A common solution involves integrating auxiliary modules such as ControlNet to condition on subject structure and text layout, but these approaches introduce significant architectural complexity and training overhead. In this work, we challenge the necessity of such complexity and demonstrate that minimalist adaptation is sufficient. We introduce SimplePoster, a minimalist yet powerful inpainting-based framework that enables faithful subject preservation and position-controllable text rendering, entirely without external controllers like ControlNet. SimplePoster rests on two key insights: (1) full-parameter fine-tuning alone effectively suppresses subject extension by aligning the model's internal representations with domain-specific priors; and (2) a lightweight character-level position encoding strategy enables end-to-end, spatially grounded text generation. Experiments show that SimplePoster achieves near-perfect subject preservation (98.7% of cases with strict subject preservation), significantly outperforming both the state-of-the-art editing model SeedEdit 3.0 (55.2%) and the specialized approach PosterMaker (85.3%). It further demonstrates superior text rendering accuracy, even in challenging scenarios with complex multi-line layouts. We believe SimplePoster establishes a simple yet strong baseline for product poster generation. Code, models and benchmark will be released upon acceptance.
Paperid: 2794,   Poster  
Authors: Anagh Malik, Dorian Chan, Xiaoming Zhao, David B. Lindell, Oncel Tuzel, Rick Chang
Title: Velox: Learning Representations of 4D Geometry and Appearance
Abstract: We introduce a framework for learning latent representations of 4D objects which are descriptive, faithfully capturing object geometry and appearance; compressive, aiding in downstream efficiency; and accessible, requiring minimal input, i.e., an unstructured dynamic point cloud, to construct. Specifically, Velox trains an encoder to compress spatiotemporal color point clouds into a set of dynamic shape tokens. These tokens are supervised using two complementary decoders: a 4D surface decoder, which models the time-varying surface distribution capturing the geometry; and a Gaussian decoder, which maps the tokens to 3D Gaussians, helping learn appearance. To demonstrate the utility of our representation, we evaluate it across three downstream tasks (video-to-4D generation, 3D tracking, and cloth simulation via image-to-4D generation) and observe strong performance in all settings.
Paperid: 2795,   Poster  
Authors: Tanush Yadav, Reza Salehi, Jae Sung Park, Vivek Ramanujan, Hannaneh Hajishirzi, Yejin Choi, Ali Farhadi, Rohun Tripathi, Ranjay Krishna
Title: VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
Abstract: Videos capture a rich array of subtleties in actions. While large video language models have advanced in understanding long videos, their ability to discern nuanced motions in domain-specific, fine-grained actions remains unclear. Current benchmarks evaluate fine-grained actions in a domain-agnostic manner, making it hard to evaluate models on this task. To address this gap, we introduce VideoNet, a comprehensive benchmark aimed at evaluating the domain-specific, fine-grained action understanding of video models. This benchmark covers 1,087 distinct actions spanning 38 domains, from bouldering to suturing. Our evaluations demonstrate that current video models encounter significant difficulties in recognizing these actions in a zero-shot scenario. We then examine how to improve model performance on this task. To this end, we collect a training dataset of 160K clips of fine-grained, domain-specific actions. Post-training a 4B model on this data, we surpass all Gemini models and GPT-4o on our benchmark. Next, we evaluate few-shot performance and demonstrate that even the best-performing model, GPT-5, struggles in a few-shot evaluation setting. When given three in-context examples, the gap between model and human performance widens, with human accuracy improving by 13% while models only improve by 3%. This suggests that video language models are currently not effective few-shot learners, unlike their text-only counterparts, and that further gains may be elicited from improving these models' few-shot learning capabilities.
Paperid: 2796,   Poster  
Authors: Yiying Yang, Wei Cheng, Sijin Chen, Honghao Fu, Xianfang Zeng, Yujun Cai, Gang Yu, Xingjun Ma
Title: OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens
Abstract: OmniLottie is a versatile framework that generates high-quality vector animations from multi-modal instructions, including interleaved texts, images, and videos. To fully parameterize vector animations for flexible motion and visual content control, we adopt the Lottie representation, which encodes both shapes and animated behaviors in a single JSON file. Building upon a pretrained vision–language model (VLM), OmniLottie produces vivid, semantically aligned vector animations that adhere closely to multi-modal conditions. To avoid the complexity and irregularity of raw JSON structures, we introduce a dedicated Lottie tokenizer that transforms Lottie files into structured sequences of function calls representing shapes, animation commands, and their parameters. This design enables the model to directly learn the underlying shape and animation priors from data, substantially improving generation stability and controllability. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. Leveraging the well-designed tokenizer and our newly established dataset, OmniLottie demonstrates strong multi-modal conditional generation capabilities using a simple next-token prediction objective. For qualitative results, please refer to the generated animations rendered through standard Lottie players on the supplementary website.
Paperid: 2797,   Poster  
Authors: Minghao Yin, Wenbo Hu, Jiale Xu, Ying Shan, Kai Han
Title: Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers
Abstract: Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet truly dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design faithfully models complex spatiotemporal dependencies while sidestepping the quadratic overhead of full attention and reducing the network's total computation by 56%. Consequently, Sculpt4D establishes a new state of the art in temporally coherent 4D synthesis, charting a path toward efficient and scalable 4D generation.
Paperid: 2798,   Poster  
Authors: Tingyun Liu, Licheng Liu, Qibin Zhang, Qiying Feng, C. L. Philip Chen
Title: Graph Attention Prototypical Network for Robust Few-Shot Classification
Abstract: Few-shot learning has attracted extensive attention, with metric-based approaches such as Prototypical Networks establishing strong baselines. These methods construct class prototypes from support samples and classify query samples via distance metrics, but their performance is highly sensitive to label noise. To tackle this challenge, we propose a novel graph attention prototypical network (GAPNet) for robust few-shot classification. GAPNet first extracts local and global features via a classic CNN backbone and a group attention broad learning module, respectively. To mitigate the impact of label noise, the intra-class and inter-class relationships between support and query samples are explicitly modeled via a pseudo-label guided graph constructor, and then processed by an edge-aware graph attention module to capture topological correlations. Furthermore, an adaptive noise-robust prototype generator is introduced to dynamically suppress the contributions of noisy samples, substantially improving the reliability of class prototypes. Extensive experiments demonstrate the effectiveness and robustness of GAPNet to label noise. Compared to state-of-the-art approaches, GAPNet improves accuracy in the 5-way 5-shot setting by 3%–8% on three general image benchmarks and one fine-grained classification dataset.
Paperid: 2799,   Poster  
Authors: Nick Stracke, Kolja Bauer, Stefan Andreas Baumann, Miguel Ángel Bautista, Joshua Susskind, Björn Ommer
Title: Learning Long-term Motion Embeddings for Efficient Kinematics Generation
Abstract: Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.
Paperid: 2800,   Poster  
Authors: Hanyu Chen, Ruojin Cai, Steve Marschner, Noah Snavely
Title: ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild
Abstract: Symmetry detection is a fundamental problem in computer vision, and symmetries serve as powerful priors for downstream tasks. However, existing learning-based methods for detecting 3D symmetries from single images have been almost exclusively trained and evaluated on object-centric or synthetic datasets, and thus fail to generalize to real-world scenes. Furthermore, due to the inherent scale ambiguity of monocular inputs, which makes localizing the 3D plane an ill-posed problem, many existing works only predict the plane's orientation. In this paper, we address these limitations by presenting the first framework for detecting 3D-grounded reflectional symmetries from single, in-the-wild RGB images, focusing on architectural landmarks. We introduce two key innovations: (1) a scalable data annotation pipeline to automatically curate a large-scale dataset of architectural symmetries, ArchSym, from SfM reconstructions by leveraging cross-view image matching; and building on the dataset, (2) a single-view symmetry detector that accurately localizes symmetries in 3D by parameterizing them as signed distance maps defined relative to predicted scene geometry. We validate our symmetry annotation pipeline against geometry-based alternatives and demonstrate that our symmetry detector significantly outperforms state-of-the-art baselines on our new benchmark.
Paperid: 2801,   Poster  
Authors: Zhengzhong Zhu, Pei Zhou, Lanxi Bai, Li Cheng, Jia Nie, Shiquan Min, Jiangping Zhu
Title: Reliable Clustering Number Estimation for Contrastive Multi-View Clustering
Abstract: In recent years, contrastive multi-view clustering has achieved remarkable performance improvements. However, existing methods still face two key challenges: (1) reliance on a predefined number of clusters k, which is often unknown in real-world scenarios; and (2) contrastive learning might cause representation degeneration when the collected views inherently carry inconsistent semantic information. To address these issues, we propose a novel framework, Reliable Clustering Number Estimation for Contrastive Multi-View Clustering (RCNMC). RCNMC consists of a Semantics-Aware Contrastive Learning module and a Reinforcement Learning-based Cluster Number Learning module. Specifically, the Semantics-Aware Contrastive Learning module first measures the discrepancy between pairwise representations and adaptively strengthens useful pairwise views while weakening unreliable ones, thereby alleviating representation degeneration. The Reinforcement Learning-based Cluster Number Learning module infers the optimal number of clusters in an unsupervised manner by using intra-cluster and inter-cluster distances as a reward-driven strategy. The two modules complement each other, making RCNMC more suitable for complex multi-view clustering tasks in real-world scenarios. Extensive experiments on multiple benchmark datasets demonstrate that RCNMC significantly outperforms existing state-of-the-art methods.
Paperid: 2802,   Poster  
Authors: Ali Naseh, Anshuman Suri, Yuefeng Peng, Harsh Chaudhari, Alina Oprea, Amir Houmansadr
Title: When Anonymity Breaks: Identifying Models Behind Text-to-Image Leaderboards
Abstract: Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare their quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts amplify these signatures. Our findings expose fundamental security risks in T2I leaderboards and motivate stronger anonymization defenses.
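The centroid-based idea can be pictured with a generic nearest-centroid classifier over image embeddings. The sketch below is an assumption-laden illustration, not the paper's method: the abstract does not name the embedding model, the similarity measure, or any preprocessing, so the unit normalization, cosine similarity, and all function names here are hypothetical choices.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_centroids(reference):
    """Compute one unit-length centroid per candidate model.

    `reference` maps a model name to a list of embedding vectors for
    images known to come from that model (hypothetical input format).
    """
    centroids = {}
    for name, vectors in reference.items():
        unit = [_normalize(v) for v in vectors]
        dim = len(unit[0])
        mean = [sum(u[j] for u in unit) / len(unit) for j in range(dim)]
        centroids[name] = _normalize(mean)
    return centroids


def identify(embedding, centroids):
    """Return the model whose centroid has the highest cosine similarity."""
    query = _normalize(embedding)
    return max(
        centroids,
        key=lambda name: sum(a * b for a, b in zip(query, centroids[name])),
    )
```

In this toy setup, an anonymized image is attributed to whichever model's reference generations it sits closest to in embedding space; real embeddings would come from an image encoder, which is outside the scope of the sketch.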
Paperid: 2803,   Poster  
Authors: Yanze Ren, Mingyuan Lv, Qinhong Jiang, Yan Jiang, Chen Yan, Xiaoyu Ji, Wenyuan Xu
Title: Physical Adversarial Examples through Camera Power Signal Injection
Abstract: Physical adversarial examples pose a concrete threat to real-world computer vision systems. Existing works mainly generate physical adversarial examples by affixing patches or projecting light onto targets, which are usually visible and can expose the malicious intention. In this work, we reveal a new attack surface that generates invisible adversarial samples by injecting signals into the camera's power supply. We analyze the mechanism of injecting structural stripe patterns into cameras and demonstrate the feasibility of controllable fine-grained injection with signal modulation. We develop a simulation model to emulate the physically injected perturbation, and propose end-to-end optimization methodologies in both white-box and black-box settings to generate the injection signal parameters. We perform a simulated evaluation across seven classification models and carry out physical signal injection experiments with optimized signals. The results show that physical adversarial examples generated through camera power signal injection can disrupt computer vision performance. Our work introduces a new methodology for physical adversarial examples, emphasizing the need for securing computer vision systems in the physical world.
Paperid: 2804,   Poster  
Authors: Kaiyue Sun, Weiyang Jin, Chengqi Duan, Rongyao Fang, Xian Liu, Yuwei Niu, Chunwei Wang, Aoxue Li, Xihui Liu
Title: UniVerse: Empower Unified Generation with Reasoning and Knowledge
Abstract: Current text-to-image (T2I) generation models often struggle with prompts that require complex reasoning or specialized knowledge, failing to accurately interpret implicit user intent. To bridge this gap, we introduce T2I-Reason, a large-scale dataset designed to empower text-to-image generation in unified multimodal models (UMMs) with reasoning and knowledge. The dataset contains 120k pairs, each consisting of a text triplet and an image. The text triplet consists of (1) an implicit prompt, which requires reasoning or knowledge to decipher its underlying meaning; (2) a reasoning chain, which provides a step-by-step analysis to resolve the implicit prompt's meaning; and (3) an explicit prompt, a clear and straightforward visual description prepared for T2I generation. T2I-Reason is meticulously constructed: 65k samples are dedicated to reasoning, specifically targeting arithmetic reasoning, spatial-attribute relationship reasoning, deductive reasoning (cause to effect), and abductive reasoning (effect to cause). The other 55k samples require specialized knowledge, covering multiple disciplines, spatial-temporal concepts, and entity knowledge. To validate the effectiveness of our dataset, we train a unified multimodal model, Bagel, on it. Results across multiple benchmarks that evaluate the reasoning capabilities of T2I generation demonstrate that our model achieves significant and consistent improvements on both composition and reasoning, confirming that explicit training on intermediate reasoning chains is a pivotal step towards more intelligent unified generative models.
Paperid: 2805,   Poster  
Authors: Yuan Zhao, Xiaoqin Zhang, Huchuan Lu, Lihe Zhang
Title: Complementary Prototype Mapping for Efficient Multimodal Anomaly Detection
Abstract: Multimodal unsupervised anomaly detection has garnered increasing attention for robust defect localization. Recent approaches rely on establishing cross-modal matching relationships under normal conditions without explicit guidance. However, in practice, a single modality may have multiple distinct representations corresponding to another modality, and such unconditional mappings struggle to adaptively capture these variations, resulting in mapping ambiguity and the misclassification of diverse yet normal variations as anomalies. Moreover, existing methods suffer from slow inference speed and high memory overhead, hindering their deployment in real-world production lines. To address these issues, we propose an efficient and effective Complementary Prototype Mapping (CPMAD) framework, which dynamically extracts consensus and supplementary prototypes to serve as complementary priors, thereby guiding and disambiguating cross-modal mappings. The framework comprises three key components: (1) Consensus Extraction Module (CEM) learns a dynamic anchor, transforming multimodal features into anomaly-free consensus prototypes to improve cross-modal consistency and suppress latent anomalies; (2) Supplementary Query Module (SQM) employs a Complementary Residual Attention mechanism to capture the discrepancy between the consensus and modality-specific spaces, thereby exploring the most representative and discriminative cues as supplementary prototypes; and (3) Complementary Mapping Module adaptively integrates both prototypes to perform feature mapping. Extensive experiments demonstrate that CPMAD not only achieves superior performance in both full-data and few-shot settings across diverse industrial and medical scenarios but also maintains faster inference speeds and lower memory consumption compared to existing methods. The code will be released upon publication.
Paperid: 2806,   Poster  
Authors: Yu Zheng, Kai Zhang, Wei Zhu, Qingguo Liu, Xiantao Hu, Jun Li, Jian Yang
Title: DVAR: Dynamic Visual Autoregressive Modeling for Image Super-Resolution
Abstract: Visual autoregressive (VAR) models built on the next-scale prediction paradigm have demonstrated significant potential for image super-resolution. However, their practical application is constrained by a rigid, size-specific design. This limitation stems from their reliance on memorizing fixed, absolute scaling schedules, which necessitates a distinct model for each target resolution. We introduce DVAR, a Dynamic Visual AutoRegressive framework that overcomes this fundamental bottleneck. Instead of memorizing these rigid schedules, DVAR learns a canonical scaling dynamic. This dynamic effectively decouples the logic of relative scaling from the absolute target size, thereby preserving a single set of proportions between generative steps that can be applied uniformly to any size. Furthermore, we introduce a dynamic sampling scheduler to mitigate the teacher-forcing problem with negligible computational overhead. By leveraging the geometric proximity of visual tokens in the codebook, it efficiently simulates the model's predictive error distribution to bridge the training-inference gap. To our knowledge, DVAR is the first framework to grant VAR models size-flexibility, breaking their one-to-one dependency on a fixed resolution. Extensive evaluations demonstrate that DVAR achieves superior visual quality over existing Real-ISR methods, proving that a flexible, purely autoregressive approach is a viable path to state-of-the-art image super-resolution.
Paperid: 2807,   Poster  
Authors: Prajnan Goswami, Tianye Ding, Feng Liu, Huaizu Jiang
Title: UniCorn: Unified Correspondence Transformer Across 2D and 3D
Abstract: Visual correspondence across image-to-image (2D-2D), image-to-point cloud (2D-3D), and point cloud-to-point cloud (3D-3D) geometric matching forms the foundation for numerous 3D vision tasks. Despite sharing a similar problem structure, current methods use task-specific designs with separate models for each modality combination. We present UniCorn, the first correspondence model with shared weights that unifies geometric matching across all three tasks. Our key insight is that Transformer attention naturally captures cross-modal feature similarity. We propose a dual-stream decoder that maintains separate appearance and positional feature streams. This design enables end-to-end learning through stackable layers while supporting flexible query-based correspondence estimation across heterogeneous modalities. Our architecture employs modality-specific backbones followed by shared encoder and decoder components, trained jointly on diverse data combining pseudo point clouds from depth maps with real 3D correspondence annotations. UniCorn achieves competitive performance on 2D-2D matching and surpasses prior state-of-the-art by 8% on 7Scenes (2D-3D) and 10% on 3DLoMatch (3D-3D) in registration recall. Code and model checkpoints will be made publicly available.
Paperid: 2808,   Poster  
Authors: Xinqi Lyu, Yihao LIU, Dong Wang, Bin Xiao
Title: Red-teaming Retrieval-Augmented Diffusion Models via Poisoning Knowledge Bases
Abstract: Retrieval-augmented diffusion models (RAG-DMs) have been increasingly deployed across applications, alleviating the data and compute demands of conventional diffusion models. Despite this success, their trustworthiness remains underexplored. Existing backdoor attacks manipulate either the generation phase or the retrieval phase under the white-box setting, and they suffer from knowledge conflicts between retrieved images and user prompts. To bridge this gap, we propose a novel red-teaming approach, JOB, which is the first jointly optimized backdoor attack tailored to black-box RAG-DMs. Specifically, JOB poisons the knowledge base with a small number of target class images and learns a trigger through multi-objective optimization, steering retrieval toward poisoned images and aligning the generated outputs with the target class, while preserving benign performance. Experiments show that JOB effectively attacks black-box RAG-DMs, achieving high success rates and outperforming state-of-the-art baselines.
Paperid: 2809,   Poster  
Authors: Yi Fan, Yu-Bin Yang
Title: Vision-Oriented Lightweight Neural Architecture Search with Budget-Adaptive Evaluation
Abstract: In the deep-learning-based computer vision community, Neural Architecture Search (NAS) has become the de facto tool for acquiring task-optimal network structures. Nevertheless, NAS methods are trapped in a fundamental accuracy-efficiency dilemma: training-based approaches deliver reliable performance but incur prohibitive search costs, whereas training-free strategies are ultra-fast but often yield relatively unreliable rankings. To reconcile this conflict, we propose a vision-oriented lightweight training-based NAS framework. We first design six micro vision tasks whose training time is negligible, yet together they probe a broad spectrum of representational capacities. Built upon these tasks, we introduce a budget-adaptive performance evaluator to produce the most accurate ranking attainable within the limit. Experiments on popular NAS benchmarks show that our method achieves a ranking correlation higher than existing methods. Furthermore, we construct a search space from prevalent neural blocks and run our method at a cost close to training-free methods; the discovered architecture surpasses the current state-of-the-art under identical training recipes. Our code will be released upon publication.
Paperid: 2810,   Poster  
Authors: Xiaoqi An, Lin Zhao, Jun Li, Chen Gong, Jian Yang
Title: Bézier Degradation Modeling for LiDAR-based Human Motion Capture
Abstract: LiDAR-based 3D human motion capture has broad applications in fields such as autonomous driving and robotics, where accurate motion reconstruction is crucial. However, existing methods often struggle with unstable inputs and severe occlusions, leading to jittery or even failed pose predictions. To address these challenges, we propose BMLiCap, a coarse-to-fine framework that models motion using temporally compressible Bézier curves. By reducing control points through a trajectory-preserving strategy, we obtain a coherent and learning-friendly motion representation. To reconstruct human actions from LiDAR point-cloud cues, we design a progressive motion-reconstruction module. Specifically, a Time-scale Motion Transformer (TMT) is introduced to predict motion curves at multiple temporal scales, and a Multi-level Motion Aggregator (MMA) is utilized to adaptively fuse the multi-scale curves to recover detailed, temporally coherent poses, effectively bridging observation gaps caused by occlusions and noise. Across four mainstream benchmarks (LiDARHuman26M, FreeMotion, NoiseMotion, and SLOPER4D), BMLiCap achieves state-of-the-art accuracy and temporal continuity in complex scenes, demonstrating its ability to compensate for severe occlusions and reduce prediction jitter.
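As a rough illustration of the Bézier motion representation mentioned in this abstract, the sketch below evaluates a curve from a handful of control points with De Casteljau's algorithm. It only shows how a few control points can stand in for a dense trajectory; the function name `bezier_point` and the toy 2D control points are assumptions made for the example, and the paper's control-point reduction strategy and network modules are not reproduced here.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at parameter t in [0, 1] (De Casteljau)."""
    # Work on a copy so the caller's control points are left untouched.
    pts = [list(p) for p in control_points]
    # Repeatedly interpolate between neighbouring points until one remains.
    while len(pts) > 1:
        pts = [
            [(1 - t) * a + t * b for a, b in zip(p, q)]
            for p, q in zip(pts, pts[1:])
        ]
    return pts[0]


# Hypothetical example: three control points describing a 2D joint path.
controls = [[0.0, 0.0], [1.0, 2.0], [2.0, 0.0]]
# Sample the curve at five evenly spaced "frame times".
trajectory = [bezier_point(controls, k / 4) for k in range(5)]
```

Sampling the same control points at more values of `t` yields a denser trajectory without storing more parameters, which is the sense in which such a representation is temporally compressible.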
Paperid: 2811,   Poster  
Authors: Dingbang Huang, Etienne Vouga, Qixing Huang, Georgios Pavlakos
Title: Recovering Physically Plausible Human-Object Interactions from Monocular Videos
Abstract: In this paper, we present a method to reconstruct physically plausible human-object interactions (HOI) from monocular videos. While existing kinematic-based approaches produce visually plausible motion, they often result in physical artifacts such as interpenetration and object floating. To overcome these issues, we introduce a physics-guided reconstruction framework that begins with a kinematic estimate and then refines it through a reinforcement learning (RL) policy trained to reproduce the interaction in a physics simulator. Because kinematic estimates are typically noisy, naive RL training can fail. Therefore, we propose an adaptive sampling strategy with a dual self-updating mechanism that automatically identifies the frames with the most informative and reliable kinematic reconstruction. Our process progressively improves reconstruction quality and yields physically consistent HOI sequences. We demonstrate our approach on two standard benchmarks and achieve clear improvements in physical plausibility metrics over state-of-the-art methods.
Paperid: 2812,   Poster  
Authors: Xiantao Ma, Siwei Dong, Lin Zhu, Lizhi Wang, Hua Huang
Title: Seeing Through Blur: Tackling Defocus in Spike-Based Imaging
Abstract: Spike cameras are a novel class of neuromorphic vision sensors that capture scene dynamics with ultra-high temporal resolution via spike planes. While recent methods have addressed motion blur and noise in spike-based reconstruction, defocus blur caused by shallow depth of field or lens adjustment delays remains a critical yet underexplored issue in real-world applications such as autonomous driving. In this work, we present DeSpike, the first end-to-end defocus removal framework specifically designed for spike cameras. Our method begins by explicitly modeling the defocus formation process using a physics-inspired thin-lens approximation to simulate spike responses under optical blur. Guided by this formulation, DeSpike employs multi-temporal-scale integrate-and-fire (IF) neurons to compensate for fixed-pattern noise (FPN) and extract defocus-aware features from spike streams. These features are then processed by a physics-informed deblurring module constructed from learnable discrete point spread function (PSF) priors. To address spatially variant blur, we introduce a Transformer-based fusion mechanism that adaptively weighs multi-scale deblurring results through attention across defocus levels. Finally, a coarse-to-fine iterative refinement stage combines spike features and PSF priors for progressive restoration. Extensive experiments on both synthetic and real-world defocused spike datasets demonstrate that our method achieves superior performance over state-of-the-art deblurring approaches in terms of structural fidelity, perceptual sharpness, and contrast, setting a new benchmark for defocus-aware spike-based image reconstruction.
Paperid: 2813,   Poster  
Authors: Seyeon Lee, Juncheol Ye, Jaehong Kim, Dongsu Han
Title: Neural-Centric Video Processing Pipeline for Unified Multi-Task Inference
Abstract: Videos are increasingly used as inputs to machine learning models, where repeated decoding and processing for diverse downstream tasks dominate computational costs. However, existing video processing pipelines remain inefficient: traditional video codecs (H.264, H.265) are optimized for human visual quality and require full pixel decoding for each inference, Compressed Domain Inference (CDI) is tightly coupled to specific codec structures with limited task flexibility, and Video Coding for Machines (VCM) demands separate representations and task-specific encoders without human visualization support. We propose Neural Video Pipeline (NVP), a framework that leverages Implicit Neural Representations (INR) to directly extract task-specific features from intermediate layers, eliminating pixel reconstruction overhead. NVP employs lightweight Micro Adapters to bridge INR features directly into the feature space of downstream models, bypassing both decoding and early extraction stages. Through comprehensive benchmarks across four representative tasks (image classification, object detection, action recognition, and segmentation), NVP reduces latency by up to 89.5% and inference FLOPs by up to 29.9%, while supporting multiple tasks with a single unified representation.
Paperid: 2814,   Poster  
Authors: Martin Q. Ma, Yuxiao Qu, Aditya Agrawal, Willis Guo, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency
Title: Act2See: Emergent Active Visual Perception for Video Reasoning
Abstract: Vision-Language Models (VLMs) typically rely on static initial frames for video reasoning, restricting their ability to incorporate essential dynamic information as the reasoning process evolves. Existing methods that augment Chain-of-Thought (CoT) with additional frame information often exhibit suboptimal CoT quality and lack the crucial ability to synthesize visual information for hypothetical or counterfactual scenarios. We introduce Act-to-See (Act2See), a novel framework that enables active visual perception by empowering VLMs to actively interleave video frames within text CoTs. Act2See is developed via Supervised Fine-Tuning (SFT) on a high-quality dataset of reasoning traces generated by a frontier VLM. These traces integrate active calls to either retrieve existing frames or generate new ones, and are rigorously verified against human-annotated CoTs to ensure quality. This approach cultivates an emergent capability: at inference time, the model actively determines when to search for or synthesize the necessary visual evidence. Act2See establishes new state-of-the-art results on challenging benchmarks, including VideoEspresso and ViTIB, and outperforms comparable or larger models on Video-MME, EgoNormia, and VCR-Bench, demonstrating an advancement in enabling VLMs with active visual perception for video reasoning.
Paperid: 2815,   Poster  
Authors: Jiaxuan Xu, Lei Duan, Xinye Wang, Liang Du
Title: Large-scale Robust Enhanced Ensemble Clustering via Outlier Decoupling
Abstract: Ensemble clustering aims to derive a consensus partition from multiple base clustering results. Anchor-based methods construct compact similarity representations via anchors, substantially improving computational efficiency. However, when outliers contaminate the data, reconstructing the base clustering results often yields biased anchors. These biased anchors degrade the quality of the anchor similarity matrix and lead to a decline in clustering accuracy. To address this issue, we propose a novel method called large-scale robust enhanced ensemble clustering via outlier decoupling (RANGE). Specifically, RANGE first converts the base clustering results into an initial bipartite graph. To enhance the reliability of this bipartite graph, RANGE designs a high-order fuzzy enhancement strategy (HFES) specifically for initial bipartite graphs. Next, a mapping matrix further filters redundant information from the enhanced bipartite graph. RANGE then reconstructs the mapped bipartite graph via matrix factorization. An anchor matrix is introduced to further enhance computational efficiency. To improve robustness, RANGE incorporates a decoupling term that separates the clean clustering structure and the outlier-contaminated structure in the anchor space. With this decoupling mechanism, RANGE is capable of performing robust ensemble clustering. Moreover, by applying outlier detectors to the decoupled outlier structure, RANGE can be extended to the outlier-detection task. Consequently, RANGE forms a cross-task general framework, and both tasks retain linear time complexity. Extensive cross-domain experiments indicate that RANGE delivers superior performance in both clustering validity and outlier detection. The code is available in the supplementary material.
Paperid: 2816,   Poster  
Authors: Zanxi Ruan, Songqun Gao, Qiuyu Kong, Yiming Wang, Marco Cristani
Title: StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
Abstract: Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them “structure-centric”. Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StructXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder, guiding the model toward more robust and semantically stable minima, enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval on both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and models will be released.
Paperid: 2817,   Poster  
Authors: Zirui Pan, Xin Wang, Yipeng Zhang, Hong Chen, Kecheng Zheng, Wenwu Zhu
Title: Reasoning Diffusion for Unpaired Text-Image to Video Generation
Abstract: Text-image to video generation aims to synthesize a video conditioned on the given text-image inputs. Nevertheless, existing methods generally assume that the semantic information carried in the input text and image tends to be perfectly paired and temporally aligned, occurring simultaneously in the generated video. As such, existing literature struggles with “unpaired” text-image inputs in the more universal and realistic scenario where i) the semantic information carried by the text and image may occur at different timestamps and ii) the condition image can appear at an arbitrary position rather than the first frame of the synthesized video. Video generation under this unpaired setting poses an urgent need to conduct reasoning over the intrinsic connections between the given textual description and referred image, which is challenging and remains unexplored. To address the challenge, in this paper we study the problem of unpaired text-image to video generation for the first time, proposing ReasonDiff, a novel model for accurate video generation from unpaired text-image inputs. Specifically, ReasonDiff designs a VisionNarrator module to harness the powerful reasoning abilities of a multi-modal large language model to analyze the conditioned unpaired text-image inputs, producing coherent per-frame narratives that temporally align them. Building upon this VisionNarrator module, ReasonDiff further introduces a novel AlignFormer module, which employs a Multi-stage Temporal Anchor Attention mechanism to predict frame-wise latent representations. These reasoning-enhanced latents are subsequently fused with the condition frame, providing structured guidance throughout the video generation process. Extensive experiments and ablation studies demonstrate that ReasonDiff significantly beats state-of-the-art baselines in terms of video generation quality with unpaired text-image inputs.
Paperid: 2818,   Poster  
Authors: Ruoran Xu, Haoyu Cheng, Bin Dong, Qiufeng Wang
Title: Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning
Abstract: Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently. However, most works focus on plane geometry and usually fail on solid geometry due to 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose a Parse2Reason method containing two steps: first parsing, then reasoning. In the parsing step, we utilize conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both the problem description (natural text) and solid diagrams (visual image). In the reasoning step, we leverage the formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, our proposed Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions and answers. Extensive experiments show that our proposed method achieves state-of-the-art (SOTA) performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (a small subset of MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs, such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets will be publicly available.
Paperid: 2819,   Poster  
Authors: Aoxiang Ning, Kailong Yu, Minglong Xue, Liyuan Pan, Jinhong He, Wenchao Yan, Mingliang Zhou, Yirui Wu
Title: Language-Guided One-Step Diffusion Model for Nighttime Flare Removal
Abstract: Nighttime photography is susceptible to flare caused by strong light sources, which degrades visual quality and disrupts structural information required by downstream vision tasks. Existing nighttime flare removal methods generally lack semantic priors for flare-occluded regions and thus tend to introduce artifacts and lose details under severe degradation. To address this problem, we propose a language-guided one-step diffusion framework that explicitly aligns flare-occluded regions with the underlying scene content at the semantic level. Specifically, we develop the first flare-specific vision–language model, Flare-VLM, which extracts fine-grained textual descriptions to guide one-step diffusion for high-quality restoration of severely damaged areas. Then, we propose semantics-aware distribution distillation to constrain the noise distribution with high-level semantics, suppressing redundant perturbations on clean backgrounds and improving the stability of distillation. In addition, we design an instruction-driven data synthesis pipeline to generate geometrically and semantically aligned nighttime flare samples, narrowing the gap to real degradations. Experimental results demonstrate that the proposed method achieves better restoration and enhances the performance of downstream vision tasks.
Paperid: 2820,   Poster  
Authors: Bishal Swain, Kyung Joo Cheoi, Jaepil Ko
Title: Coupling Liquid Time‑Constant Encoders with Modern Hopfield Memory
Abstract: Continuous-time neural networks provide adaptive dynamics, but rely on a single hidden state to encode both fast input fluctuations and longer-term context. This shared representation forces rapidly changing inputs to overwrite slower contextual signals, causing the model to lose past information as new observations arrive. In contrast, biological perceptual systems maintain stable behaviour under evolving sensory input by integrating ongoing signals with stored associative patterns rather than relying on a single evolving state. Motivated by this distinction, we study a simple coupling of Liquid Time-Constant Networks (LTCs) with a Modern Hopfield Network (MHN) that serves as a content-addressable memory. At each time step, the liquid state is projected into a query, the MHN retrieves a memory vector, and the two representations are concatenated before a readout layer. We analyse this coupling under standard norm and Lipschitz assumptions and show that the combined representation remains bounded. We further show that the retrieval map contracts gradients for parameters upstream of the memory query, which provides a mechanism for reducing curvature in the loss landscape. On public time-series benchmarks, the coupled LTC-MHN model improves mean accuracy by 2.3% over competitive recurrent and continuous-time baselines and reduces the estimated Hessian trace by about an order of magnitude relative to a standalone LTC encoder, with the largest gains on classification tasks and competitive performance on a regression task. Qualitative analyses of training curves, loss landscapes, and latent embeddings support the interpretation that Hopfield retrieval smooths optimization and encourages more compact, linearly separable class manifolds. Code will be released upon publication.
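The per-step coupling described in this abstract (project the liquid state into a query, retrieve from the Hopfield memory, concatenate) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it uses the standard softmax retrieval rule of modern Hopfield networks, takes the liquid state as given rather than simulating the LTC dynamics, and the names `hopfield_retrieve`, `coupled_features`, `W_query`, and `beta` are assumptions made for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def hopfield_retrieve(query, patterns, beta=1.0):
    """Modern Hopfield retrieval: softmax-weighted average of stored patterns."""
    scores = [beta * sum(q * p for q, p in zip(query, pat)) for pat in patterns]
    weights = softmax(scores)
    dim = len(patterns[0])
    return [
        sum(weights[i] * patterns[i][d] for i in range(len(patterns)))
        for d in range(dim)
    ]


def coupled_features(liquid_state, W_query, patterns, beta=1.0):
    """Project the state to a query, retrieve a memory, and concatenate."""
    query = matvec(W_query, liquid_state)
    memory = hopfield_retrieve(query, patterns, beta)
    return list(liquid_state) + memory  # input to a readout layer


# Hypothetical toy setup: 2D state, identity projection, two stored patterns.
state = [10.0, 0.0]
W_query = [[1.0, 0.0], [0.0, 1.0]]
patterns = [[1.0, 0.0], [0.0, 1.0]]
features = coupled_features(state, W_query, patterns)
```

Because the retrieved vector is a convex combination of the stored patterns, its norm cannot exceed that of the largest pattern, which gives some intuition for the boundedness claim in the abstract.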
Paperid: 2821,   Poster  
Authors: Xiaogang Wu, Jinchao Hu, Zixian Wang, Dun Liu, BoXiang Cheng, Yiqiang Wu
Title: Probabilistic Discrepancy Learning for Roadside LiDAR Scene Completion
Abstract: We propose a probabilistic discrepancy learning approach for roadside LiDAR scene completion (PDL). Conventional methods focus on object-level completion and scene completion from the ego-vehicle viewpoint. These methods struggle to cope with long-term or total occlusions caused by roadside sensors with fixed viewpoints. To address this issue, we compensate for occluded roadside point clouds by introducing external visual information. Specifically, our PDL is mainly divided into probabilistic pose discrepancy minimization and scene discrepancy learning. We employ probabilistic pose discrepancy minimization to correct noisy poses from vision-based detectors, while utilizing a diffusion model within scene discrepancy learning for robust full-scene completion. Furthermore, we introduce regional and global sampling discrepancy learning losses to achieve robust and efficient training. We conducted extensive experiments on the V2X-Seq and TUMTraf-V2X roadside datasets. Results demonstrate that DT-VEM achieves state-of-the-art performance, with average reductions of 14.5% in chamfer distance (CD) and 6% in 3D Jensen-Shannon divergence (JSD) compared to existing methods.
Paperid: 2822,   Poster  
Authors: Chung-Shien Wang, Christian Schmidt, Jens Piekenbrinck, Bastian Leibe
Title: Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers
Abstract: Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT, π³, and MapAnything have demonstrated remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which imposes a significant runtime bottleneck when processing large image sets. In this work, we empirically analyze the global attention matrix of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method accelerates inference by more than 3× while maintaining comparable task performance. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate that our approach seamlessly integrates into existing global attention-based architectures such as VGGT, π³, and MapAnything, while substantially improving scalability to large image collections.
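To make the idea of block-sparse attention in this abstract concrete, the sketch below restricts each block of queries to a few key blocks instead of attending to every key. The abstract does not state how blocks are selected, so the selection rule used here (rank key blocks by the dot product of mean-pooled queries and keys, keep the top `keep_blocks`) is purely an assumption for illustration, as are all function and parameter names. It is written in plain Python for readability and does not reflect the optimized kernels the paper refers to.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def block_sparse_attention(Q, K, V, block_size, keep_blocks):
    """Softmax attention where each query block sees only its top key blocks."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    q_blocks = [range(s, min(s + block_size, len(Q)))
                for s in range(0, len(Q), block_size)]
    k_blocks = [range(s, min(s + block_size, len(K)))
                for s in range(0, len(K), block_size)]
    # Assumed selection rule: score blocks by mean-pooled query/key similarity.
    k_means = [mean_vec([K[j] for j in kb]) for kb in k_blocks]
    out = [None] * len(Q)
    for qb in q_blocks:
        q_mean = mean_vec([Q[i] for i in qb])
        ranked = sorted(range(len(k_blocks)),
                        key=lambda b: dot(q_mean, k_means[b]), reverse=True)
        kept = [j for b in ranked[:keep_blocks] for j in k_blocks[b]]
        for i in qb:
            # Dense softmax attention, but only over the kept key positions.
            logits = [scale * dot(Q[i], K[j]) for j in kept]
            m = max(logits)
            w = [math.exp(l - m) for l in logits]
            z = sum(w)
            out[i] = [sum(w[t] * V[kept[t]][d] for t in range(len(kept))) / z
                      for d in range(len(V[0]))]
    return out


# Hypothetical toy input: two groups of tokens that only match within a group.
Q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
K = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
V = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
sparse_out = block_sparse_attention(Q, K, V, block_size=2, keep_blocks=1)
```

With `keep_blocks` equal to the total number of key blocks, the function reduces to ordinary dense attention; smaller values reduce the number of query-key products roughly in proportion to the fraction of blocks kept.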
Paperid: 2823,   Poster  
Authors: xucong wang, Pengkun Wang, Zhe Zhao, Liheng Yu, Shuangwang Shuangwang, Yang Wang
Title: FedMPT: Federated Multi-Label Prompt Tuning of Vision-Language Models
Abstract: Multi-Label Recognition (MLR) based on Vision-Language Models (VLMs) aims to leverage their pre-trained knowledge to better adapt to complex recognition scenarios, thereby enhancing model robustness. However, for realistic decentralized applications requiring federated learning, adapting VLMs to each client that possesses private and heterogeneous data can cause the model to overfit spurious label correlations, consequently triggering irrelevant categories when encountering new samples. To tackle this problem, we reconsider federated learning for MLR with a causal model, in which we adopt a front-door adjustment and decouple the MLR modeling process by intermediate variables that magnify the oracle label co-occurrence. Guided by our analysis, we propose FedMPT, the first method specifically designed for federated MLR. The core idea of FedMPT is to leverage generalizable conditions to steer federated MLR to mitigate erroneous label activations. To achieve this, FedMPT introduces a Large Language Model (LLM)-driven pipeline to decipher the underlying conditions that govern label dependencies. Furthermore, we introduce an optimal transport between the condition-enriched prompts and the image patches to uncover multiple region-level semantics. Finally, we generate synergistic predictions from different conditions with a crafted gating mechanism. Experiments on multiple benchmark datasets show that our proposed approach achieves competitive results and outperforms SOTA methods under varied settings.
Paperid: 2824,   Poster  
Authors: Yansong Li, Zhongxi Qiu, Yun Tian, Zheng jinyu, Shuo Li
Title: CMR-RD: Long-Tailed Adaptive VLM for Explainable CMR Diagnosis
Abstract: Cardiac magnetic resonance (CMR) is the clinical gold standard for assessing cardiovascular diseases, but its interpretation relies on expert experience and remains challenging, particularly for identifying rare diseases. Existing automated methods lack interpretable reasoning processes, limiting clinical adoption. Although vision-language models (VLMs) possess basic visual understanding and text generation capabilities, they still lack verifiable reasoning chains in medical diagnosis and underperform on minority classes in long-tail distributions. To address these challenges, we propose CMR-RD, to our knowledge the first VLM for interpretable diagnosis in CMR, capable of generating explicit diagnostic chains aligned with imaging evidence. We construct a CMR dataset that reflects real-world clinical distributions, comprising five disease categories (including two rare conditions) plus normal controls. Building on this, the general-purpose VLM is aligned to medical and CMR semantics using large-scale medical vision–text data, and cold-start training is used to enhance its understanding of medical concepts and basic reasoning. To enhance reasoning and performance on rare samples, we propose Group Phase Policy Optimization (GPPO), which combines online multi-stage reinforcement learning (RL) with adaptive sampling. GPPO enables the model to proactively explore rare and underperforming classes, thereby effectively mitigating long-tail bias. Experiments demonstrate that CMR-RD achieves state-of-the-art accuracy and reasoning-chain correctness compared with medical and general VLM baselines, shows stronger recognition of rare categories, and exhibits higher data efficiency. These results provide an interpretable pathway for automated CMR diagnosis.
Paperid: 2825,   Poster  
Authors: Hesong Wang, Xin Jin, Lu Lu, Chenhaowen Li, Jian Chen, Qiang Liu, Huan Wang
Title: EarlyTom: Early Token Compression Completes Fast Video Understanding
Abstract: Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency introduced by processing massive amounts of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding contributes a large portion of the time-to-first-token (TTFT). Therefore, performing compression inside the encoder, rather than only after it, still leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65× and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.
Paperid: 2826,   Poster  
Authors: Zhiqiu Lin, Siyuan Cen, Chancharik Mitra, Isaac Li, Yuhan Huang, Yu Ling, Hewei Wang, Irene Pi, Shihang Zhu, Yili Han, Yilun Du, Deva Ramanan
Title: Building a Precise Video Language with Human–AI Oversight
Abstract: Video–language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, supported by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce a critique-based human–AI (CHAI) oversight framework, where trained human experts provide correctional critiques to revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for fine-tuning, improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through standard SFT, offline RL (DPO), online RL (GSPO), and inference-time scaling. With modest expert supervision, the resulting system outperforms even closed-source models such as Gemini-2.5-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of over 400 words, achieving finer control over camera motion, angle, lens, perspectives, and shot composition. Overall, our results show that precise specification and human–AI oversight are key to achieving professional-level video understanding and generation.
Paperid: 2827,   Poster  
Authors: Yifan Xu, Chao Zhang, Ruifei Ma, Fei Gao, Zhifei Yang, Jiaxing Qi, Zhipeng Chen
Title: MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models
Abstract: Recent work has shown a remarkable capability to extend Vision-Language Models (VLMs) to video understanding tasks. While current VLMs excel at event- or story-level understanding, their ability to capture fine-grained motion details remains limited, primarily due to their focus on high-level static semantic structures and macro-event logic. In contrast, Video Diffusion Models (VDMs) are adept at modeling dynamic motion patterns, benefiting from large-scale video data and the intrinsic requirement of temporal generation. In this paper, we introduce MotionEnhancer, a novel approach that leverages motion priors distilled from a powerful video diffusion model as auxiliary supervision to enhance the motion understanding capability of a VLM via attention alignment. MotionEnhancer comprises two simple parameter-free modules, Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), to directly extract and optimize motion-related attentions from the VDM in a computation-only manner. MotionEnhancer provides a scalable solution for motion understanding without additional training parameters, modifications to existing architectures, or tool calling. Extensive experiments demonstrate that MotionEnhancer achieves consistent improvements over state-of-the-art VLMs on two motion-level video understanding benchmarks, especially on motion-related metrics.
Paperid: 2828,   Poster  
Authors: Zechen Bai, Zhiheng Chen, Yiqi Lin, Kevin Qinghong Lin, Difei Gao, Xiangwu Guo, WANG XIN, Mike Zheng Shou
Title: Demo2Tutorial: From Human Experience to Multimodal Software Tutorials
Abstract: Human experience in digital environments offers a vast, underexplored resource of authentic, untrimmed interactions that contain rich procedural knowledge. We introduce Demo2Tutorial, a framework that transforms this experience, captured via screen recordings and interaction logs, into structured, multimodal software tutorials for teaching both humans and agents. Demo2Tutorial first collects human experience via a dedicated recorder, then parses raw experience using a multimodal Action Parser to reconstruct perception, action, and intent. A Step Planner then abstracts these steps into hierarchical task graphs representing goals and steps. Finally, a Tutorial Composer transforms the parsed experience into structured, reusable image-text instructions. We evaluate the tutorial generation quality on a new benchmark derived from official software documentation. We further demonstrate that this distilled representation benefits (i) human learning, by automatically generating multimodal tutorials, and (ii) agent learning, by improving downstream GUI-agent planning and generalization. Experiments show Demo2Tutorial produces high-quality tutorials that surpass human-authored ones and significantly outperform baseline methods, while enabling both faster human task completion and improved GUI agent planning, demonstrating that structured tutorials distilled from human experience can serve as effective knowledge representations for advancing both human learning and, ultimately, agent capabilities.
Paperid: 2829,   Poster  
Authors: Yuxi Xiao, longfei li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang
Title: SpatialTree: How Spatial Intelligence Branches Out in MLLMs
Abstract: Spatial Intelligence (SI) has emerged as a critical frontier for MLLMs, encompassing a hierarchy of skills from foundational perception to high-level spatial reasoning. However, how these abilities are acquired, emerge, and transferred remains largely unknown. To investigate this, we propose SpatialTree, a hierarchical taxonomy that organizes SI into a capability tree, ranging from low-level perception (L1) through mental mapping (L2) and mental simulation (L3) to agentic competence (L4). Building on this, we construct a hierarchical, capability-centric benchmark using our proposed Spatial Engine, annotating each ability according to its level. Guided by the benchmark's correlation analysis, we conduct targeted supervised fine-tuning (SFT) and prompting experiments on key abilities. The results confirm the independence of abilities at the same level, reveal cross-level transfer, and further demonstrate a multi-ability synergy when these abilities are trained jointly. Our work provides a novel framework for analyzing SI in MLLMs, offering a comprehensive methodology to study how foundational abilities emerge and support higher-level competencies.
Paperid: 2830,   Poster  
Authors: Qi Zang, Dong Zhao, Nan Pu, Wenjing Li, Zhun Zhong, Meng Wang
Title: GeCo: Geometry-Consistent Regularization for Domain Generalized Semantic Segmentation
Abstract: Vision Foundation Models (VFMs) provide rich and transferable representations through large-scale pretraining, yet their high-capacity representations remain underutilized when adapted to downstream tasks. In Domain Generalized Semantic Segmentation (DGSS), parameter-efficient fine-tuning (PEFT) often overfits adapters to source-domain statistics and seen-class boundaries, leading to representation degradation manifested as domain bias and semantic rigidity. Existing regularization strategies alleviate this through random perturbations, but such operations disrupt the pretrained geometric structure, causing semantic drift and unstable generalization. We propose Geometry-Consistent Regularization (GeCo), which extrapolates the pretrained representation space toward the target task under structure-respecting constraints, thereby preserving the inherent generalization of VFMs while enhancing their task-specific adaptation. GeCo introduces curvature-guided perturbation to modulate feature variation according to the local manifold complexity of the pretrained embedding space, enabling structure-aligned representation expansion. Complementarily, a geodesic-based regularization constrains prediction shifts along smooth, manifold-aligned trajectories, ensuring semantic continuity and stable decision behavior. Extensive experiments demonstrate that GeCo achieves superior generalization across both closed-set and open-set DGSS benchmarks.
Paperid: 2831,   Poster  
Authors: Mingwen Shao, Qiao Zhang, Xinyuan Chen, Xiang Lv, Lingzhuang Meng, Chang Liu, Qinglin Zhan, Ling Jian
Title: Wavelet-Driven 3D Anomaly Detection under Pose-Agnostic and Sparse-View
Abstract: Pose-agnostic anomaly detection (PAD) achieves strong performance in localizing anomalies from arbitrary viewpoints when trained on densely sampled normal data. However, under sparse-view conditions, existing methods face two key challenges: (1) sparse observations lead to overfitting and geometric detail loss in 3D reconstruction; (2) limited visual cues lead to inaccurate pose estimation, compromising the reliability of subsequent anomaly localization. To address these challenges, we propose Wave-Pose3D, a wavelet-driven 3D anomaly detection framework tailored for PAD under sparse-view conditions. First, we design a structure-aware and wavelet-optimized Gaussian modeling strategy that dynamically filters unreliable regions via structural priors to mitigate overfitting and leverages high-frequency supervision to restore fine-grained geometric details. Second, to improve pose estimation under sparse views, we develop a wavelet-based pose estimator that integrates low-frequency structural cues and high-frequency details to enhance both initialization and refinement accuracy. Finally, we introduce a wavelet difference-aware anomaly detector that computes frequency-domain anomaly scores, improving localization robustness against pose and geometric variations. By integrating these strategies, Wave-Pose3D achieves robust and accurate anomaly localization under sparse views. Extensive experiments validate that the proposed approach achieves state-of-the-art performance under 10% and 20% sparse-view configurations.
Paperid: 2832,   Poster  
Authors: Ning Han, Zhenyu Ge, Feng Han, Yuhua Sun, Chengqing Li, Jingjing Chen
Title: Graph-Guided Online Concept Erasure for Text-to-Image Diffusion Models
Abstract: Concept erasure aims to remove harmful, inappropriate, or copyrighted content from text-to-image diffusion models while preserving non-target semantics. However, existing methods either rely on costly fine-tuning or apply coarse semantic separation, often degrading unrelated concepts and lacking adaptability to evolving concept sets. To alleviate these issues, we propose Graph-Guided Online Concept Erasure (GrOCE), a training-free framework that performs precise and adaptive concept removal through graph-based semantic reasoning. GrOCE models concepts and their interrelations as a dynamic semantic graph, enabling principled reasoning over dependencies and fine-grained isolation of undesired content. It comprises three components: (1) Dynamic Topological Graph Construction for incremental graph building, (2) Adaptive Cluster Identification for multi-hop traversal with similarity-decay scoring, and (3) Selective Edge Severing for targeted edge removal while preserving global semantics. Extensive experiments demonstrate that GrOCE achieves state-of-the-art performance on Concept Similarity (CS) and Fréchet Inception Distance (FID) metrics, offering efficient, accurate, and stable concept erasure without retraining.
Paperid: 2833,   Poster  
Authors: Lingxiao Li, Dongwon Kim, Lingyan Ruan, Taesoo Kwon, Bin Chen, Taehyun Rhee
Title: SyncMos: Scalable Motion Synchronisation for Multi-Agent Scene Interaction
Abstract: Text-guided motion generation in 3D scenes has advanced the synthesis of human–scene interactions, contributing to embodied AI, scene understanding, and virtual agent simulation. While recent studies have begun exploring multi-agent scenarios, achieving temporally synchronised interactions among multiple agents remains an open challenge. Existing methods are often limited in flexibility and scalability when handling diverse interaction contexts. We present a method that enables synchronised multi-agent interaction using a single-agent motion synthesis model through two key components: a text-guided dependency-aware story planner and a temporal synchronisation module. The story planner interprets natural language instructions into structured event sequences with temporal dependencies. Our synchronisation module, built upon time-warping control and diffusion posterior sampling, aligns interaction timing across agents without retraining. Experimental results demonstrate that the proposed framework effectively models temporal dependencies and causal order between events. Evaluations across diverse interaction types show improved temporal alignment and coherent multi-agent motion generation consistent with textual instructions.
Paperid: 2834,   Poster  
Authors: Nan An, Long Ma, Tengyu Ma, Zhu Liu, Yingchi Liu, Risheng Liu
Title: BiProLoRA: Bilevel Prompt LoRA for Real Scene Recovery
Abstract: The emergence of large generative models has substantially advanced learning-based scene recovery in the synthetic domain. However, applying these models directly to real scenarios reveals sub-optimal performance stemming from the significant distribution gap, alongside poor adaptation to complex and unforeseen degradations. Consequently, it is imperative to develop a real scene adaptation strategy that yields faithful restorations with reliable generalizability. To this end, we propose Bilevel Prompt LoRA, a novel learning paradigm designed to effectively adapt pre-trained generative models for real scene recovery. First, we introduce a self-supervised distribution-fidelity learning scheme to calibrate the autoencoding pathway under task-irrelevant real distributions, thereby recovering high-fidelity textures. Subsequently, a bilevel joint modeling scheme via hyperparameter optimization is established, enabling robust synthetic-to-real adaptation for both seen and unseen scenes by exploiting the complementary advantages of LoRA and prompts so that each improves the other. Extensive evaluations on diverse real adverse scenarios demonstrate our method's superiority, with comprehensive algorithm analyses supporting its effectiveness. The code will be publicly released upon acceptance.
Paperid: 2835,   Poster  
Authors: Eric Li, Arijit Dasgupta, Yoni Friedman, Mathieu Huot, Vikash Mansinghka, Thomas O'Connell, William Freeman, Joshua B. Tenenbaum
Title: GenMatter: Perceiving Physical Objects with Generative Matter Models
Abstract: Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion and appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.
Paperid: 2836,   Poster  
Authors: Andong Lu, Ziyi Zha, Jiandong Jin, Shihao Li, Chenglong Li, Jin Tang, Bin Luo
Title: Spatio-Temporal Conditional Denoising Transformer for Modality-Missing RGBT Tracking
Abstract: Missing modalities in RGBT tracking often lead to incomplete and unstable multimodal feature representations that greatly degrade performance. Existing methods typically attempt to recover missing modalities from available ones, but the quality of data generated in challenging scenarios might be unsatisfactory. In addition, current approaches exhibit limited flexibility in processing both missing and complete data. To overcome these limitations, we propose a Spatio-Temporal Conditional Denoising Transformer (SCDT), which integrates spatial cues and temporal context to adaptively perform information reconstruction of missing modalities and feature enhancement of weak modalities in a unified framework, for robust modality-missing RGBT tracking. In particular, SCDT leverages short-term temporal cues from recent historical frames to capture fine-grained temporal correlations, and long-term temporal cues encoding modality evolution to capture the global context. By jointly exploiting long- and short-term temporal contexts as conditions, SCDT progressively guides noisy features of available modalities to learn reliable and temporally consistent multimodal representations. Furthermore, SCDT introduces a noise-modulated adaptation mechanism that dynamically adjusts its behavior according to modality availability, enabling a single framework to unify feature learning under both modality-missing and complete scenarios without changing the architecture or parameters. Extensive experiments on three public benchmark datasets demonstrate that our method consistently outperforms state-of-the-art methods.
Paperid: 2837,   Poster  
Authors: Yajing Liu, Yumeng Zhang, Yue Si, Baojie Fan, Jiandong Tian
Title: Decoupled and Reusable Adaptation for Efficient Cross-Modal Transfer
Abstract: Cross-modal transfer methods have achieved significant progress in extending RGB-based foundation models to non-RGB modalities. However, existing transfer paradigms are primarily task-oriented, meaning that changing tasks requires re-training and re-storing, leading to substantial redundancy in data, computation and storage. To address this limitation, we propose an efficient cross-modal transfer paradigm that decouples the process into a one-time general modality knowledge transfer and a flexible task knowledge transfer. In Stage 1, we propose a Progressive Self-Supervised Tuning strategy that integrates modality-aware structural reconstruction with semantic discriminative learning, which enables task-agnostic modality knowledge learning using only unlabeled data through a one-time training process, resulting in reusable target-modality LoRAs. In Stage 2, we incorporate the modality LoRAs and further propose a Task-Prompted Mixture-of-Modality Experts module. This design enables lightweight task knowledge injection while effectively balancing task-specific, modality-general and modality-specific knowledge in the multimodal fusion process for diverse downstream tasks. Extensive experiments across six cross-modal transfer scenarios, along with analyses of data, computation, and storage efficiency, demonstrate the superiority of our method.
Paperid: 2838,   Poster  
Authors: Yash Savani, Branislav Kveton, Yuchen Liu, Yilin Wang, Jing Shi, Subhojyoti Mukherjee, Nikos Vlassis, Krishna Kumar Singh
Title: Stepwise Credit Assignment for GRPO on Flow-Matching Models
Abstract: FlowGRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all timesteps. This ignores the temporal structure of diffusion generation: early timesteps determine composition and content (low-frequency structure), while late timesteps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step's reward improvement. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves visual quality while preserving stochasticity for policy gradients.
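Editor's illustration (not part of the abstract): the idea of crediting each step by its reward improvement can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `stepwise_gain_advantages`, the per-step group normalization, and the toy reward values are all hypothetical, and the intermediate reward estimates are taken as given rather than computed with Tweedie's formula.

```python
from statistics import mean, pstdev

def stepwise_gain_advantages(reward_estimates, eps=1e-6):
    """Turn per-step reward estimates into per-step advantages.

    reward_estimates[i][t] is the (assumed given) estimated reward of
    sample i after denoising step t, for t = 0..T. The gain of step t is
    the reward improvement it produced; gains are then normalized across
    the group of samples at each step (an assumed, GRPO-style choice).
    """
    # Gain of each step = estimate after the step minus estimate before it.
    gains = [
        [traj[t + 1] - traj[t] for t in range(len(traj) - 1)]
        for traj in reward_estimates
    ]
    num_steps = len(gains[0])
    advantages = [[0.0] * num_steps for _ in gains]
    for t in range(num_steps):
        column = [g[t] for g in gains]
        mu, sigma = mean(column), pstdev(column)
        for i, g in enumerate(column):
            advantages[i][t] = (g - mu) / (sigma + eps)
    return advantages

# Toy group of two samples with three reward estimates each (hypothetical values).
group = [
    [0.0, 0.5, 0.4],  # strong first step, slight regression afterwards
    [0.0, 0.1, 0.6],  # weak first step, corrected later
]
adv = stepwise_gain_advantages(group)
```

In this toy group the second sample has the higher final reward, yet its weak first step receives a negative advantage while its corrective second step receives a positive one, which is the behaviour that uniform credit based on the final image cannot express.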
Paperid: 2839,   Poster  
Authors: Fengyuan Yang, Tanuj Sur, Tze Ho Elden Tse, Angela Yao
Title: HumanBA: Human-Aware Bundle Adjustment via Global Human-Camera Decoupling
Abstract: Recovering global human and camera motion from monocular video is essential for world-coordinate human reconstruction but remains challenging due to entangled motions in image space. Traditional SLAM methods estimate monocular camera motion but fail in scenes dominated by foreground objects such as humans. A common workaround is to mask out dynamic objects, yet this approach becomes brittle when humans occupy most of the view or the background is too noisy, leading to unstable tracking and loss of constraints. This paper takes the opposite stance and reintegrates human motion as informative landmarks. We introduce HumanBA, a human-aware bundle adjustment framework that transforms dynamic humans into usable constraints via motion decoupling. HumanBA subtracts the human-induced component from observed joint trajectories, isolating a camera-induced (pseudo-static) component that can be safely incorporated into bundle adjustment alongside background features. To mitigate noise in global human estimates, HumanBA applies motion refinements and motion-aware reliability weighting. Across the EMDB and SLOPER4D benchmarks, we show consistent improvements in camera pose estimation and reduced global human reconstruction error, demonstrating the benefits of treating humans as dynamic yet informative landmarks.
Paperid: 2840,   Poster  
Authors: Mengting Xu, Shi Gu, Peng Lin, De Ma, Huajin Tang, Qian Zheng, Gang Pan
Title: Robust Spiking Neural Networks by Temporal Mutual Information
Abstract: Spiking Neural Networks (SNNs) have attracted increasing attention for their biologically inspired temporal dynamics. As their applications expand, understanding their robustness has become an important research focus. However, little is known about how the intrinsic temporal properties of SNNs affect robustness. In this work, we revisit SNN robustness from an information-theoretic perspective and reveal the pivotal role of temporal dynamics. We establish a theoretical link between robustness error and the mutual information (MI) between inputs and latent representations along the temporal dimension, grounded in the information bottleneck principle. Through an analysis of spike-based information transmission, we show that temporal dynamics inherently compress MI, thereby tightening the robustness error bound. Building on this insight, we propose a Temporal Mutual Information (TMI) regularizer that explicitly exploits temporal characteristics to enhance robustness. Extensive experiments on CIFAR-10, CIFAR-100, DVS-CIFAR10, and Tiny-ImageNet demonstrate that our method consistently improves SNN robustness across various architectures and attack settings.
Paperid: 2841,   Poster  
Authors: Jianwei Zhao, Fan Yang, XIN LI, Qiang Zhai, Ao Luo, Ziqi Ren, Zhicheng Jiao, Hong Cheng
Title: GeneVAR: Causal MeanFlow for Autoregressive Gene-to-WSI Tile Synthesis
Abstract: Understanding how transcriptomic programs shape tissue morphology remains a central challenge in computational pathology. Gene-to-WSI tile synthesis offers a principled generative framework to translate molecular profiles into histological images. However, most existing methods compress RNA-Seq into a single global embedding injected once at initialization, an oversimplified design that weakens transcriptomic signals and induces non-causal associations between gene expression and tissue morphology. We present GeneVAR, an Autoregressive Gene-to-WSI model that reformulates synthesis as an iterative, coarse-to-fine generative process. At its core is a novel Causal MeanFlow module that reinforces transcriptome-informed guidance at multiple stages and mitigates non-causal factors through counterfactual-style interventions, thereby ensuring biological fidelity throughout the generative trajectory. Combined with a β-VAE for compact gene embeddings and a multi-scale vector quantizer for discrete morphology representation, GeneVAR generates H&E-stained WSI tiles that are both visually realistic and transcriptomically faithful. Extensive experiments across five TCGA cancer benchmarks demonstrate consistent state-of-the-art performance, surpassing prior methods in both generative fidelity and downstream classification accuracy. All models and code will be released to facilitate reproducibility.
Paperid: 2842,   Poster  
Authors: Xiaoyang Lyu, Muxin Liu, Xiaoshan Wu, Ruicheng Wang, Yihua Huang, Yangtian Sun, Shaoshuai Shi, Xiaojuan Qi
Title: Stabilizing Streaming Video Geometry via Dynamic Feature Normalization
Abstract: Consistent 3D geometry estimation from streaming RGB input is crucial for real-world applications such as autonomous driving, embodied AI, and large-scale reconstruction. While modern monocular geometry foundation models achieve strong single-image accuracy, they exhibit severe temporal inconsistency on continuous input, notably dominated by scale–shift drifting. Through targeted empirical analysis, we trace this instability to its root cause: fluctuations in latent feature statistics, whose mean and variance directly determine the predicted depth’s scale and shift. Building on this insight, we introduce Dynamic Feature Normalization (DyFN), a lightweight, causal recurrent module that dynamically and robustly modulates feature statistics to maintain stable geometry over time. We adapt powerful pretrained monocular geometry models for streaming by finetuning only DyFN (a mere 2% additional parameters) while keeping the backbone frozen, thereby achieving temporal consistency without compromising single-image accuracy. Extensive experiments across four benchmarks show that DyFN effectively eliminates temporal artifacts such as disjointed layering and positional jitter, and achieves state-of-the-art temporal stability, improving over prior streaming methods by up to 14% and even outperforming heavier non-causal video baselines.
Paperid: 2843,   Poster  
Authors: Fang Li, Shihao Zou, Weixin Si, Yang Gao, Shuai Li, Aimin Hao
Title: TRCoRSurg: Temporal-Relational Co-Reasoning for Surgical Video Triplet Recognition
Abstract: Understanding complex surgical scenes requires recognizing multiple interdependent entities (such as instruments, actions, and targets) and maintaining their relational consistency across time. Existing surgical triplet recognition methods struggle to jointly model intra-frame label dependencies and inter-frame temporal semantics in a unified manner. To address these limitations, we propose a unified framework that integrates spatial, relational, and temporal cues for robust surgical triplet recognition. Specifically, class-specific spatial priors are first extracted through a multi-scale encoder. Then, these priors are refined by a Label Correlation Modeling module with multi-scale class activation map-guided relational extraction (MS-CAMRE), enabling the model to capture both static co-occurrence and dynamic contextual dependencies among triplet components. Furthermore, a Bidirectional Temporal–Relational Fusion Attention (BTRFA) module harmonizes temporal and relational representations to achieve coherent temporal reasoning. We also introduce a new evaluation metric, the Triplet Consistency Error Rate (TCER), which quantitatively measures the model’s capability to preserve causal and semantic consistency across triplets. Extensive experiments on the CholecT45 and ProStaTD datasets show that our method achieves state-of-the-art (SOTA) performance, improving AP_IVT by 5.1% and 7.8%, respectively. Moreover, on the TCER metric, our approach yields over 36% and 25% relative reductions on the two datasets, respectively, underscoring the effectiveness of our framework in temporal–relational co-reasoning.
Paperid: 2844,   Poster  
Authors: Tongtian Yue, Xuange Gao, Longteng Guo, Zijia Zhao, Zikang Liu, Jie Jiang, Hua Huang, Jing Liu
Title: ROSE: Rotate Your Large Language Model to See
Abstract: Recent advances in multimodal large language models (MLLMs) have shown impressive progress in integrating visual and linguistic understanding. However, most existing MLLMs inject visual information into the input space of large language models (LLMs), which substantially increases context length and computational overhead, while often disrupting pretrained linguistic priors by forcing the LLM to optimize on vision-dominant multimodal sequences. In this work, we propose a rotation-based vision injection paradigm that aligns visual information with the parameter space of LLMs. Visual semantics are encoded as rotation matrices and applied directly to the pretrained parameters. This parameter-space injection eliminates the need for long input sequences, thus avoiding the quadratic computational overhead inherent in input-space injection. In addition, it preserves the linguistic competence of the LLM by maintaining the intrinsic geometric structure of the pretrained parameters. Building upon this paradigm, we develop ROSE, a 7B MLLM that achieves fine-grained vision–language alignment with remarkable computational efficiency. Extensive experiments across 12 multimodal benchmarks show that ROSE delivers superior or competitive performance compared with leading models. At comparable accuracy, ROSE reduces FLOPs by 80.7% and inference latency by 56.4% relative to Qwen2.5-VL-7B, demonstrating its effectiveness and scalability. All training code, model weights and data will be publicly released.
Paperid: 2845,   Poster  
Authors: Ashshak Sharifdeen, Fahad Shamshad, Muhammad Akhtar Munir, Abhishek Basu, Mohamed Ismithdeen, Jeyapriyan Jeyamohan, Chathurika Silva, Karthik Nandakumar, Muhammad Haris Khan
Title: Towards Calibrating Prompt Tuning of Vision-Language Models
Abstract: Prompt tuning of large-scale vision-language models such as CLIP enables efficient task adaptation without updating model weights. However, it often leads to poor confidence calibration and unreliable predictive uncertainty. We address this problem by proposing a calibration framework that enhances predictive reliability while preserving the geometry of the pretrained CLIP embedding space, which is required for robust generalization. Our approach extends the standard cross-entropy loss with two complementary regularizers: (1) a mean–variance margin penalty that stabilizes inter-class logit margins by maximizing their average while minimizing dispersion, mitigating underconfidence and overconfidence spikes; and (2) a text moment-matching loss that aligns the first and second moments of tuned text embeddings with their frozen CLIP counterparts, preserving semantic dispersion crucial for generalization. Through extensive experiments across 7 prompt-tuning methods and 11 diverse datasets, we demonstrate that our approach significantly reduces the Expected Calibration Error (ECE) compared to competitive calibration techniques on both base and novel classes.
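Editor's illustration (not part of the abstract): the Expected Calibration Error named above is commonly computed by grouping predictions into equal-width confidence bins and taking the sample-weighted average of the gap between accuracy and mean confidence in each bin. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name, the 15-bin default, and the toy inputs are assumptions chosen for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Sample-weighted mean of |accuracy - confidence| over equal-width bins.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: truthy where the prediction matched the label.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    # bins[i] collects (confidence, hit) pairs with confidence in
    # [i / n_bins, (i + 1) / n_bins); confidence 1.0 falls in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if hit else 0.0))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Toy example: four predictions at 90% confidence, three of them correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

In the toy example all four predictions share one bin with mean confidence 0.9 and accuracy 0.75, so the resulting ECE is approximately 0.15; a perfectly calibrated model would score 0.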
Paperid: 2846,   Poster  
Authors: Jiansong Zhang, Xiaying Yang, Xiaoling Luo, Linlin Shen
Title: EchoVDiff: Cardiac-Cycle Echocardiography Video Generation from Arbitrary Frame
Abstract: Reconstructing a physiologically plausible cardiac video from a single image remains a fundamental challenge in generative modeling, owing to the complex and nonlinear periodic dynamics of echocardiography. Previous image-to-video (I2V) approaches primarily focus on temporal continuity, yet often struggle to capture the intrinsic periodicity of cardiac motion, leading to limited temporal coherence and semantic consistency. We present EchoVDiff, a novel phase-aware diffusion model that reconstructs a full cardiac cycle from any single frame. Instead of direct pixel synthesis, EchoVDiff integrates physiological priors into a diffusion paradigm, learning interpretable mappings between cardiac phase, anatomy, and motion. By jointly modeling temporal rhythm and spatial semantics within a disentangled latent space, it achieves controllable and physiologically consistent generation. Extensive experiments on EchoNet-Dynamic and EchoNet-Pediatric demonstrate that EchoVDiff consistently surpasses state-of-the-art methods in both fidelity and temporal coherence. Remarkably, it enables accurate reconstruction of complete cardiac cycles from arbitrary phases, marking the first demonstration of single-frame-driven echocardiographic video generation.
Paperid: 2847,   Poster  
Authors: Senyan Xu, Shuai Chen, Chuanfu Shen, Kean Liu, Zhijing Sun, Chengzhi Cao, Xueyang Fu
Title: EventGait: Towards Robust Gait Recognition with Event Streams
Abstract: Gait recognition enables non-intrusive, privacy-preserving identification but suffers in uncontrolled environments due to illumination and motion sensitivity in conventional cameras. In this work, we explore gait recognition using event cameras, which offer microsecond temporal resolution and high dynamic range, naturally capturing robust dynamic cues and suppressing static noise. Existing event-based approaches typically aggregate event streams into event images over long time windows, thereby discarding fine-grained motion dynamics critical for gait recognition. Therefore, we propose EventGait, an end-to-end dual-stream framework that separately models motion and shape while preserving the advantages of events. Our dynamic stream leverages a Mixture of Spiking Experts (MoSE) with diverse neuron constants for robust dynamic perception across complex motion and illumination scenes, while the static stream learns dense shape representations via Cross-modal Structural Alignment (CroSA) with large vision foundation models. To address the absence of large-scale event-based gait datasets, we introduce a synthesis pipeline and release two new benchmarks: SUSTech1K-E and CCGR-Mini-E. Extensive experiments show that event-based gait recognition not only achieves results comparable to camera-based gait recognition under normal conditions but also significantly outperforms it in low-light scenarios. Our approach sets a new state of the art on both synthesized and real-world event-based gait benchmarks, highlighting the robustness and potential of event-driven gait analysis. The code and datasets will be released.
Paperid: 2848,   Poster  
Authors: Peng Wang, Yongcai Wang, Wang Chen, Hualong Cao, Kang Yang, Chunxu Li, Wen Jie, Deying Li
Title: MOSAIC3D: Modular Scene Assembly for Real-Time 3D Segment Anything
Abstract: Online 3D instance segmentation is a critical capability for embodied agents navigating in dynamic environments. However, a fundamental challenge remains in adapting powerful 2D foundation models, like SAM, to 3D online segmentation. Naively lifting SAM's 2D masks to 3D results in severe spatial fragmentation, where a single object is shattered into multiple disconnected parts, especially under occlusion. Subsequent attempts to link these fragments over time via conventional 3D IoU-based tracking prove highly fragile: they struggle to handle occlusions or topological changes, ultimately causing catastrophic identity drift. Departing from such post-processing approaches, we reframe online segmentation as a learnable composition problem. We introduce MOSAIC3D, a differentiable framework that treats SAM-derived masks as "mosaic tiles" and learns to assemble them into temporally consistent 3D instances. MOSAIC3D comprises two key components: Fragment-to-Instance Adaptive Assembly, which aggregates fragments through soft-gated attention, and Instance-to-Scene Online Merging, which employs cascaded semantic-geometric matching to preserve object identities, replacing rigid IoU thresholds with learnable association guided by observation maturity. Evaluations on the ScanNet, ScanNet200, SceneNN and 3RScan datasets demonstrate state-of-the-art performance and zero-shot cross-dataset generalization. Extensive ablation studies validate the effectiveness of the designed modules. The code will be available.
Paperid: 2849,   Poster  
Authors: Jie-En Yao, Hong-En Chen, C.-C. Jay Kuo
Title: HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm
Abstract: Deep neural networks trained with backpropagation have achieved outstanding performance in vision tasks but remain biologically implausible, computationally demanding, and difficult to interpret. The Forward-Forward (FF) algorithm offers a promising alternative by training each layer independently through local goodness objectives. However, its purely local optimization lacks hierarchical coordination across layers, and the decoupling of goodness from features leaves the representations unconstrained and semantically ambiguous. We propose a Hierarchical and Contrastive Learning FF framework (HCL-FF) to address these limitations. HCL-FF introduces (1) a coarse-to-fine hierarchical learning strategy that guides representations from low-level cues to high-level semantics, and (2) a supervised contrastive objective that enforces class-discriminative alignment after goodness decoupling. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that HCL-FF achieves new state-of-the-art performance among FF-based methods, with notable accuracy gains of +5.46%, +17.00%, and +9.53%, respectively.
Paperid: 2850,   Poster  
Authors: Rui Lin, Chuanming Wang, Huadong Ma
Title: VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer
Abstract: With the rapid development of pretraining technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding, i.e., image-to-video transfer learning, has become a dominant paradigm. Among recent advances, employing Mixture-of-Experts (MoE) to enhance VLMs' temporal modeling capabilities has emerged as an effective strategy for achieving superior performance. However, conventional MoE designs suffer from expert homogenization, where all experts act as identical generalists, inefficiently learning spatio-temporal features from undifferentiated video streams. To overcome this problem, we propose VidPrism, a novel heterogeneous temporal Mixture-of-Experts framework. VidPrism pioneers a division of labor by deploying functionally specialized experts, each assuming a role ranging from spatial understanding to temporal modeling. To feed these specialists appropriately, we introduce a content-aware, multi-rate sampling module that dynamically generates streams ranging from semantically rich to motion-focused representations, providing specialized inputs for experts. Furthermore, a dynamic, bidirectional fusion mechanism enables synergistic information exchange between these pathways, leading to a comprehensive video representation. Extensive experiments on various video recognition benchmarks demonstrate that VidPrism achieves state-of-the-art performance and effectively fosters expert specialization.
Paperid: 2851,   Poster  
Authors: Matias Turkulainen, Akshay Krishnan, Filippo Aleotti, Mohamed Sayed, Guillermo Garcia-Hernando, Juho Kannala, Arno Solin, Gabriel Brostow, Daniyar Turmukhambetov
Title: CrossView Splatter: Feed-Forward View Synthesis with Georeferenced Images
Abstract: We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured both at ground level and by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird's-eye feature representations, our model improves scene coverage and novel-view synthesis compared to ground imagery alone. We train on curated georeferenced datasets and paired satellite–terrain data, mined from open mapping services.
Paperid: 2852,   Poster  
Authors: Xingyu Wang, Pengxiang Ding, Jingkai Xu, Donglin Wang, Zhaoxin Fan
Title: CUBic: Coordinated Unified Bimanual Perception and Control Framework
Abstract: Recent advances in visuomotor policy learning have enabled robots to perform control directly from visual inputs. Yet, extending such end-to-end learning from single-arm to bimanual manipulation remains challenging due to the need for both independent perception and coordinated interaction between arms. Existing methods typically favor one side, either decoupling the two arms to avoid interference or enforcing strong cross-arm coupling for coordination, and thus lack a unified treatment. We propose CUBic, a Coordinated and Unified framework for Bimanual perception and control that reformulates bimanual coordination as a unified perceptual modeling problem. CUBic learns a shared tokenized representation bridging perception and control, where independence and coordination emerge intrinsically from structure rather than from hand-crafted coupling. Our approach integrates three components: unidirectional perception aggregation, bidirectional perception coordination through two codebooks with shared mapping, and a unified perception-to-control diffusion policy. Extensive experiments on the RoboTwin benchmark show that CUBic consistently surpasses state-of-the-art visuomotor baselines, achieving marked improvements in coordination accuracy and task success rates.
Paperid: 2853,   Poster  
Authors: Jegyeong Cho
Title: Global Information Thresholding for Sufficient and Necessary Circuits
Abstract: We study the problem of extracting causal circuits, i.e., small edge-level subgraphs inside a trained network that are sufficient on their own and necessary to the model’s behavior, under explicit error control. Prior work largely optimizes observational rankings or applies ad-hoc sparsification, which can sever paths, ignore inhibitory edges, and admit "ghost" components that fail under intervention. We recast circuit discovery as information-constrained selection rather than ranking: a single global threshold chooses edges by their marginal contribution, combined with a null hypothesis-based statistical threshold to control family-wise errors. Edge scores are computed by rank-consistent attribution aligned to the task metric, stabilized with Fisher-diagonal variance normalization, projected to an edge coordinate system that preserves paths, and enforced with hard gates for interventional semantics. We propose an evaluation protocol that prioritizes sufficiency/necessity (CPR, CMD), editability, error rates, and standard ranking metrics. The result is a small, path-faithful circuit with reproducible selection criteria. Our motivation is to replace visually appealing heatmaps with interventional guarantees and explicit error control.
Paperid: 2854,   Poster  
Authors: Yukuan Min, Muli Yang, Jinhao Zhang, Yuxuan Wang, Yihang Zhu, Jiexi Yan, Cheng Deng
Title: Multimodal Semantic Bias Mitigation for Diverse Text-To-3D Generation
Abstract: Recent progress in text-to-3D generative models has made it possible to generate high-quality 3D content, and recent large text-to-3D models have achieved remarkable breakthroughs in multi-view consistency. However, their effectiveness is often limited by inherent biases, which make them sensitive to design settings such as prompt format and hinder their understanding of complex prompts. To help text-to-3D generative models understand more diverse prompts, we propose a framework to localize and mitigate the bias in current large text-to-3D models. Specifically, we first use an existing model to generate 3D content and use a quality evaluation model to identify the cross-modality bias. Then, we use the predicted quality score to quantify the contribution of the prompt text to the bias. Finally, to reduce these biases, we construct diverse pairwise examples to help the model build unbiased visual-text connections. Experiments show that our method achieves competitive results and provides higher-quality, more diverse 3D content than existing methods.
Paperid: 2855,   Poster  
Authors: Jingjing Zhang, Lei Zhang, Zheren Fu, Bo Hu, Zhendong Mao
Title: Self-guided Semantic Inspection for Zero-Shot Composed Image Retrieval
Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images using a composed query of a reference image and a textual modification, without relying on triplet-based supervision. As the two inputs describe related but semantically unaligned information, the key challenge lies in interpreting their cross-modal discrepancy to infer the user’s intended semantic modification. Existing ZS-CIR methods mainly adopt a consistency-driven paradigm, training on semantically aligned image–text pairs with alignment or reconstruction objectives. This paradigm enforces cross-modal agreement but overlooks the semantic discrepancies between modalities that naturally arise during inference. To address this issue, we propose DiffComp (Differentiate-then-Compose), a difference-driven self-supervised framework that actively induces and exploits cross-modal discrepancies during training. It stimulates the model to perceive and reconcile semantic differences across visual and textual modalities, thereby improving consistency between training and inference. The framework consists of three components: Contextual Semantic Super-patches that provide localized and coherent visual representations for downstream perception and composition; Phrase-guided Masking that selectively removes text-aligned visual cues to induce controlled cross-modal discrepancies; and Difference-aware Composition that adaptively integrates visual and textual features according to their degree of semantic difference. Extensive experiments on four ZS-CIR benchmarks show that DiffComp achieves state-of-the-art performance and strong generalization.
Paperid: 2856,   Poster  
Authors: Rundong Luo, Noah Snavely, Wei-Chiu Ma
Title: ShadowDraw: From Any Object to Shadow–Drawing Compositional Art
Abstract: We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow–drawing compositional art. Given a 3D object, our system predicts scene parameters, including object pose and lighting, together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ shadow contour to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that ShadowDraw produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and naturally extends to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow–drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling. For more results and an end-to-end real-world demonstration of our pipeline, please refer to the project page in the supplementary material.
Paperid: 2857,   Poster  
Authors: Han Li, Zehao Huang, Jiahui Fu, Naiyan Wang, Si Liu
Title: Geometry-Guided 3D Visual Token Pruning for Video-Language Models
Abstract: Multimodal large language models have demonstrated remarkable capabilities in 2D vision, motivating their extension to 3D scene understanding. Recent studies represent 3D scenes as 3D spatial videos composed of image sequences with depth and camera pose information, enabling pretrained video-language models to perform 3D reasoning tasks. However, the large number of visual tokens in spatial videos remains a major bottleneck for efficient inference. Existing pruning methods overlook the view consistency of spatial videos and the spatial diversity of the remaining tokens, which prevents them from effectively removing inter-frame redundancy and preserving scene completeness. In this paper, we propose Geo3DPruner, a Geometry-Guided 3D Visual Token Pruning framework. Geo3DPruner first models cross-frame relevance through geometry-aware global attention, and then performs a two-stage pruning process. The intra-voxel stage selects representative multi-view features within each voxel, while the inter-voxel stage preserves spatial diversity by selecting a globally distributed subset of voxels. Extensive experiments on multiple 3D scene understanding benchmarks demonstrate that Geo3DPruner retains over 90% of the original performance while pruning 90% of visual tokens, significantly outperforming existing text-guided and vision-guided pruning methods.
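Editorial illustration (not the authors' implementation): the two-stage pruning described in the abstract above can be pictured with a small sketch. It assumes each visual token comes with a 3D position and a scalar relevance score. The intra-voxel stage is approximated here by keeping the highest-scoring token per voxel, and the inter-voxel stage by greedy farthest-point sampling over the survivors. The function names, the score input, and both selection rules are stand-ins chosen for this sketch; the paper derives relevance from geometry-aware global attention, which is not modeled here.

```python
import math


def voxel_key(point, voxel_size):
    """Integer voxel index of a 3D point on a regular grid."""
    return tuple(math.floor(c / voxel_size) for c in point)


def prune_tokens(points, scores, voxel_size, num_keep):
    """Return sorted indices of the tokens kept after a two-stage pruning sketch."""
    if num_keep <= 0:
        return []

    # Stage 1 (intra-voxel, assumed rule): keep the best-scoring token in each voxel.
    best = {}
    for i, (p, s) in enumerate(zip(points, scores)):
        k = voxel_key(p, voxel_size)
        if k not in best or s > scores[best[k]]:
            best[k] = i
    candidates = sorted(best.values())
    if len(candidates) <= num_keep:
        return candidates

    # Stage 2 (inter-voxel, assumed rule): greedy farthest-point sampling,
    # seeded with the highest-scoring candidate, to spread the kept tokens in space.
    seed = max(candidates, key=lambda i: scores[i])
    selected = [seed]
    dist = {i: math.dist(points[i], points[seed]) for i in candidates}
    while len(selected) < num_keep:
        nxt = max(candidates, key=lambda i: dist[i])
        selected.append(nxt)
        for i in candidates:
            dist[i] = min(dist[i], math.dist(points[i], points[nxt]))
    return sorted(selected)


# Four tokens: two share a voxel near the origin, two lie farther along the x axis.
pts = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
kept = prune_tokens(pts, [0.9, 0.5, 0.3, 0.4], voxel_size=1.0, num_keep=2)
```

In this toy input the first stage removes the redundant token inside the shared voxel, and the second stage keeps the two survivors that are farthest apart.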
Paperid: 2858,   Poster  
Authors: Mingfang Zhang, Yunhong Wang, Lu Wang, Jiaxin Chen
Title: Parameter-Efficient Adaptation for MLLMs via Implicit Modality Decomposition
Abstract: Parameter-efficient fine-tuning (PEFT) has become a compelling approach for adapting large language models (LLMs) into multimodal large language models (MLLMs), enabling them to handle diverse modalities with substantially lower memory and computational costs. However, most existing PEFT methods neglect the issue of modality-imbalanced learning, in which the text modality excessively dominates parameter updates, causing insufficient learning of non-text modalities and degrading performance. To address this issue, we propose a novel parameter-efficient adaptation method for MLLMs based on LoRA, namely Implicit Modality Decomposition (IMoD). It first decomposes the learnable parameters into non-overlapping text-specific, non-text-specific and modality-sharing components, thereby alleviating modality imbalance. To further guide the optimization of these components toward specific modalities, we propose a Modality-Specific Decoupling Constraint that suppresses cross-modal interference among modality-specific parameters, and a Modality-Agnostic Alignment Constraint that encourages the modality-sharing component to capture well-aligned, modality-invariant semantics. Extensive experiments across diverse multimodal settings and LLM architectures demonstrate that our method consistently delivers significant performance gains, in particular achieving an average improvement of 3.3% on the audio-visual-text tasks without sacrificing parameter or inference efficiency. We will release the source code upon acceptance.
Paperid: 2859,   Poster  
Authors: Shaolin Su, Josep Rocafort, Danna Xue, David Serrano-Lozano, Lei Sun, Javier Vazquez-Corral
Title: Bridging the Perception Gap in Image Super-Resolution Evaluation
Abstract: As super-resolution (SR) techniques advance, we observe a growing distrust of evaluation metrics in recent SR research. An inconsistency often emerges between certain evaluation criteria and human perceptual preference. Although current SR research employs varying metrics to evaluate SR performance, it remains underexplored how robust and reliable these metrics actually are. To bridge this gap, we conduct a comprehensive analysis of widely used image quality metrics, examining their consistency with human perception when evaluating state-of-the-art SR models. We show that some metrics exhibit only limited, or even negative, correlation with human preferences. We further identify several intrinsic challenges in SR evaluation that compromise the effectiveness of both full-reference (FR) and no-reference (NR) image quality assessment (IQA) frameworks. To address these issues, we propose a simple yet effective Relative Quality Index (RQI) framework, which assesses the relative quality discrepancy between image pairs. Our framework enables easy integration and notable improvements for existing IQA metrics in SR evaluation. Moreover, it can be utilized as a valuable training guide for SR models, enabling the generation of images with more realistic details while maintaining structural fidelity.
Paperid: 2860,   Poster  
Authors: Yiming Cui, Liang Li, Haibing Yin, Yuhan Gao, Xichun Sheng, Chenggang Yan
Title: Expert-Teacher-Student Collaborative Learning for Domain Adaptive Object Detection
Abstract: Domain adaptive object detection (DAOD) aims to generalize an object detector trained on a source domain to a target domain, where the domain gap degrades the adaptability. Recently, large-scale vision foundation models (VFMs), pretrained on web-scale datasets, have exhibited such powerful generalization capabilities that many approaches leverage them to bridge the domain gap. However, their generalized knowledge is not tailored to the specific domain, which makes it difficult to offer precise guidance in the target domain. In this paper, we propose an Expert-Teacher-Student collaborative learning (ETS) framework to synergize the generalized knowledge from VFMs with the domain-specific knowledge from the teacher model. Concretely, we first design an Expert-Teacher Collaborative Teaching (ETCT) module, which leverages the complementary knowledge of expert and teacher models to collaboratively generate high-quality pseudo labels for supervising student model learning. Second, we devise an Expert-Teacher Joint Consolidating (ETJC) module, which introduces class-wise prototype alignment among expert, teacher, and student models, to jointly consolidate generalized and domain-specific knowledge within the student model. ETS leverages VFMs as the expert model in a free-lunch manner, thus avoiding significant additional training costs. Extensive experiments show that our method outperforms existing SOTA methods on three benchmarks. Our code is available in the supplementary materials.
Paperid: 2861,   Poster  
Authors: Kexin Shi, Hanwen Liu, Zeyang Song, Yang Liu, Jieyuan Zhang, Shuai Wang, Jibin Wu, Malu Zhang, Yang Yang
Title: Temporal Interaction in Spiking Transformers with Multi-Delay Mixer
Abstract: Spiking Neural Networks (SNNs) have gained significant attention due to their event-driven computational paradigm, making them promising for neuromorphic computing. In recent years, the integration of SNNs and Transformer architectures has made remarkable progress in various tasks. However, existing spiking self-attention mechanisms predominantly focus on spatial information while neglecting explicit temporal modelling, leading to suboptimal performance. In this paper, we introduce the Temporal Interaction Coefficient (TIC) to analyze temporal dependency patterns in these spatial-only attention mechanisms, revealing their limited temporal interactions and restricted pattern diversity. To overcome this issue, we propose the Multi-Delay Mixer (MD-Mixer), drawing inspiration from time delay mechanisms in the nervous system. Specifically, MD-Mixer introduces multiple temporal delays to perform effective time mixing and facilitate temporally enriched spatial attention. In addition, it can be integrated seamlessly into existing Spiking Transformers as a drop-in replacement while maintaining energy efficiency. Extensive evaluations on static and neuromorphic benchmarks demonstrate that MD-Mixer substantially improves the performance of Spiking Transformers, outperforming existing state-of-the-art (SOTA) methods. This work establishes MD-Mixer as an effective and general solution for temporal modelling in event-driven architectures.
Paperid: 2862,   Poster  
Authors: Di Feng, Kaixin Ma, Feng Nan, Haofeng Chen, Bohan Zhai, David Griffiths, Mingfei Gao, Zhe Gan, Eshan Verma, Yinfei Yang, Zhifeng Chen, Afshin Dehghan
Title: SO-Bench: A Structural Output Evaluation of Multimodal LLM
Abstract: Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to pre-defined data schemas. Despite recent progress in structured generation in the textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of visual structural output capabilities for MLLMs with our carefully designed SO-BENCH benchmark. Covering four visual domains, including UI screens, natural images, documents, and charts, SO-BENCH is built from over 6.5K diverse JSON schemas and 1.8K curated image–schema pairs with human-verified quality. Benchmarking experiments on open-sourced and frontier proprietary models reveal persistent gaps in predicting accurate, schema-compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments to substantially improve the model’s structured output capability. We plan to make the benchmark available to the community.
Paperid: 2863,   Poster  
Authors: Yongrui Ma, Shijie Zhao, Mingde Yao, Junlin Li, Li Zhang, Xiaohong Liu, Qi Dou, Jinwei Gu, Tianfan Xue
Title: Bi-directional Autoregressive Diffusion for Large Complex Motion Interpolation
Abstract: Despite recent progress, diffusion-based video frame interpolation methods still struggle with large complex motions, resulting in discontinuous motions and inconsistent object appearances across frames. We observe that these limitations arise from both the current full-sequence interpolation strategy and the pixel reconstruction training objective. To address these challenges, we propose ARVFI, a novel video diffusion-based interpolation method for large complex motion interpolation. Instead of generating all intermediate frames simultaneously, ARVFI interpolates in an autoregressive manner from the two input frames toward the middle ones. Thus, ARVFI interpolates a frame that is further away from the inputs based on all previous interpolation results, resulting in smoother motion transitions and better temporal consistency. Additionally, ARVFI uses DINOv3 features as motion representations, which provide high-level semantics for more accurate motion estimation than a simple pixel-level loss. With these designs, ARVFI first generates the intermediate DINOv3 features and then generates the frames with an effective conditional generation method. ARVFI consistently outperforms existing methods with superior interpolation accuracy and visual quality.
Paperid: 2864,   Poster  
Authors: Wenguan Zhang, Qirun Zhang, Tuo Sun, Jiajian He, Jiahui Xu, Huajun Feng, Qi Li
Title: Lens Component Deletion based on Differentiable Ray Tracing
Abstract: To achieve compactness or cost reduction for optical lens systems, designers typically rely on commercial software to design lens systems independently of post-processing algorithms, leading to excessive dependence on designers' expertise and often requiring significant time. Recently, joint optimization approaches utilizing differentiable ray tracing have emerged, demonstrating significant potential in lens design tasks. However, these existing pipelines fail to provide accurate and efficient diffraction modeling for complex refractive systems. In this work, we propose a novel lens component deletion pipeline for miniature optical systems, which automatically deletes a suitable lens component, and then optimizes both the lens system and the post-processing network to achieve joint aberration correction. Additionally, we introduce a novel metric for evaluating the contribution of each lens component within an optical system, aimed at identifying the lens component that has the least impact on the system. We also develop an efficient differentiable point spread function estimation method based on the Rayleigh-Sommerfeld diffraction model, significantly reducing GPU memory consumption. Our proposed pipeline does not rely on human design expertise, achieving lens component deletion while maintaining imaging quality comparable to the original lens system, thereby enabling more compact or cost-effective optical systems.
Paperid: 2865,   Poster  
Authors: Qiya Song, Yiqiang Xie, Yuan Sun, Renwei Dian, Xudong Kang
Title: Robust Remote Sensing Image–Text Retrieval with Noisy Correspondence
Abstract: As a pivotal task that bridges remote visual and linguistic understanding, Remote Sensing Image–Text Retrieval (RSITR) has attracted considerable research interest in recent years. However, almost all RSITR methods implicitly assume that image-text pairs are matched perfectly. In practice, acquiring a large set of well-aligned data pairs is often prohibitively expensive or even infeasible. Although several studies have acknowledged the presence of noisy pairs, little work has explored how to endow neural networks with robustness against such noise. Based on these observations, we reveal an important but untouched problem in RSITR, i.e., Noisy Correspondence (NC). To overcome this challenge, we propose a novel Robust Remote Sensing Image–Text Retrieval (RRSITR) paradigm that designs a self-paced learning strategy to mimic human cognitive learning patterns, thereby learning from easy to hard on multi-modal data with NC. Specifically, we first divide all training sample pairs into three categories based on the loss magnitude of each pair, i.e., clean sample pairs, ambiguous sample pairs, and noisy sample pairs. Then, we estimate the reliability of each training pair by assigning it a weight based on its loss value. Further, we design a new self-paced function to dynamically regulate the training sequence and weights of the samples, thus establishing a progressive learning process. Finally, for noisy sample pairs, we present an enhanced triplet loss that dynamically adjusts the soft margin based on semantic similarity, thereby enhancing the robustness against noise. Extensive experiments on three popular benchmark datasets demonstrate that the proposed RRSITR significantly outperforms the state-of-the-art methods, especially at high noise rates.
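Editorial illustration (not the authors' code): the loss-based grouping described in the abstract above can be pictured as a three-way split of per-pair losses, followed by a weight that shrinks as the loss grows. The two thresholds `low` and `high`, the function names, and the linear weighting rule are assumptions made for this sketch only; the abstract does not specify how the categories or weights are actually computed.

```python
def partition_by_loss(losses, low, high):
    """Split sample indices into clean / ambiguous / noisy by per-pair loss."""
    if low >= high:
        raise ValueError("low must be smaller than high")
    groups = {"clean": [], "ambiguous": [], "noisy": []}
    for i, loss in enumerate(losses):
        if loss < low:
            groups["clean"].append(i)
        elif loss < high:
            groups["ambiguous"].append(i)
        else:
            groups["noisy"].append(i)
    return groups


def reliability_weight(loss, low, high):
    """Assumed linear weight: 1 at or below `low`, 0 at or above `high`."""
    if loss <= low:
        return 1.0
    if loss >= high:
        return 0.0
    return (high - loss) / (high - low)


# Three image-text pairs with small, medium, and large training loss.
losses = [0.1, 0.5, 0.9]
groups = partition_by_loss(losses, low=0.3, high=0.7)
weights = [reliability_weight(l, low=0.3, high=0.7) for l in losses]
```

In a self-paced schedule the thresholds would typically move during training so that harder pairs are admitted gradually; that scheduling is left out of the sketch.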
Paperid: 2866,   Poster  
Authors: Jiale Xu, Wang Zhao, Ying Shan
Title: MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation
Abstract: Autoregressive mesh generation has gained attention by tokenizing meshes into sequences and training models in a language-modeling fashion. However, existing approaches suffer from two fundamental limitations: (i) low tokenization efficiency, which yields long token sequences and prevents scaling to high-poly meshes, and (ii) absence of geometry-aware guidance, as generation is conditioned only on global shape embeddings rather than local surface cues. We introduce MeshWeaver, an autoregressive framework that treats mesh generation as a surface weaving process by directly predicting the next vertex instead of independent coordinates. At its core is a multi-level sparse-voxel encoder that injects geometric context into the generative process in three complementary ways: providing voxel features as vertex representations, guiding token prediction via cross-attention to voxel features, and serving as a structural scaffold that constrains generation around the input surface. Our hierarchical design enables coarse-to-fine vertex prediction in a single decoding step while tightly coupling the generative model with 3D geometry. Extensive experiments demonstrate that MeshWeaver achieves a state-of-the-art compression ratio of 18%, can generate meshes with up to 16K faces, and significantly improves geometric fidelity over prior approaches.
Paperid: 2867,   Poster  
Authors: Junhao Hou, Chenqi Luo, PuFan Wang, Jiaying Lu, Yusheng Liu, Feiwei Qin, Meie Fang, Kun Zhou
Title: HiFi-Brep: High-Fidelity B-Rep Latent Representation and Robust Generation
Abstract: Boundary representation (B-rep) generation is a fundamental task in Computer-Aided Design (CAD), enabling automated modeling of 3D geometries. However, the direct synthesis of valid and high-quality B-reps remains a major challenge. Existing deep generative methods suffer from brittle representation and generation paradigms, due to: (1) representation noise from padding variable-length sequences and feature contamination between distant primitives, and (2) fragile generation pipelines marked by cascaded decoding error propagation and a train-inference mismatch from deferred validity enforcement. To address this, we propose HiFi-Brep. Our core insight is that robust, high-validity generation requires: first, building upon a compact and high-fidelity latent representation; and second, reformulating validity constraints as differentiable inductive biases within a single-stage generation process, enabling mutual guidance between geometry and topology. We implement this through a topology-aware encoder that yields a high-fidelity latent representation by eliminating padding noise via query-based pooling and preventing feature contamination with topology-guided attention. Our single-stage decoder then jointly generates geometry and topology, embedding core manifold constraints into a differentiable learning objective to ensure topological validity and sidestep cascaded errors. The resulting latent space supports both unconditional synthesis and conditional generation from various inputs, such as class labels, point clouds, or images. Experiments demonstrate that HiFi-Brep outperforms state-of-the-art approaches in both model validity and geometric quality.
Paperid: 2868,   Poster  
Authors: Zheda Mai, Ke Zhang, Fu-En Wang, Zixiao Ken Wang, Albert Chen, Lu Xia, Min Sun, Wei-Lun Chao, Cheng-Hao Kuo
Title: Revisiting Model Stitching In the Foundation Model Era
Abstract: Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning the stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model’s penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model at only a small inference overhead (for the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.
Paperid: 2869,   Poster  
Authors: Yuanxiang Huangfu, Chaochao wang, weilei wang
Title: Role-SynthCLIP: A Role-Play Driven Diverse Synthetic Data Approach
Abstract: The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, existing synthetic data generation methods primarily focus on increasing data volume, an emphasis that often leads to limited semantic diversity and redundant or shallow captions. To address this limitation, we propose Role-SynthCLIP, a novel data synthesis framework that leverages multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) to guide Multimodal Large Language Models (MLLMs) in generating semantically diverse captions from distinct viewpoints. This mechanism enhances the semantic diversity and fine-grained image-text alignment of synthetic pairs by enriching each image with multiple complementary captions, while keeping the number of training images fixed. Experimental results demonstrate the effectiveness and efficiency of our method. A CLIP-B/16 model trained on only 1 million images achieves a Recall@1 of 64.1% on the MS COCO validation set, surpassing the best existing synthetic data baseline (trained on 5M pairs) by 2.8 percentage points. The code and trained models will be made publicly available.
Paperid: 2870,   Poster  
Authors: Yiming Wu, Chenghao Chen, Wu kun, Chong Fu, Biru Zhu, Zhenyu Wen, Zhen Hong
Title: Towards Human-Imperceptible Backdoor Attacks on Text-to-Image Diffusion Models
Abstract: Deep learning models are well known to be susceptible to backdoor attacks, and text-to-image generation models are no exception. When a specific trigger is embedded in the input, a backdoored model can be manipulated to perform attacker-defined malicious behaviors, such as generating harmful or inappropriate images. Existing backdoor attacks on text-to-image generation models are largely limited to dirty-label attacks, where misaligned image-caption pairs are injected into the training data. While effective in controlled settings, such methods are often easily detectable, limiting their practicality in realistic applications. To address this limitation, we propose the first clean-label backdoor attack for text-to-image generative models, which preserves semantic consistency within poisoned image-caption pairs to evade detection. We design a dual-modality manipulation strategy that injects nearly imperceptible noise into images while embedding a composite semantic text trigger. The text trigger combines synonym substitution and syntactic restructuring, enabling stealthy yet effective backdoor implantation without compromising the visual–textual alignment. Experimental results demonstrate that our method achieves high attack success while effectively preserving model utility and evading mainstream defenses, including commercial content filters.
Paperid: 2871,   Poster  
Authors: Amr Sharafeldin, Aryan Mikaeili, Thomas Walker, Shrisudhan Govindarajan, Daniel Rebain, Kwang Moo Yi, Andrea Tagliasacchi
Title: Semantic Foam: Unifying Spatial and Semantic Scene Decomposition
Abstract: Current-generation scene reconstruction methods like 3D Gaussian Splatting are capable of producing photorealistic novel view synthesis at real-time speeds, yet see only limited adoption in many practical graphics applications. One significant contributing factor to this gap is the difficulty of interacting with and editing these representations in comparison to classic human-authored 3D assets. While work has been done to impose semantic decomposition onto these representations, there are still significant limitations in the quality and consistency of these segmentations. We address this by proposing a semantically decomposed variant of the recently introduced Radiant Foam method. Our approach, Semantic Foam, combines the natural spatial volumetric decomposition provided by Radiant Foam's Voronoi mesh with an explicit semantic feature field parameterized on the cells. The explicit mesh structure enables direct spatial regularization that prevents artifacts caused by inconsistent supervision across views or occlusion, which affect similar approaches for other point-based representations. We show that our method achieves superior performance on object-level segmentation compared to Gaussian Grouping and SAGA.
Paperid: 2872,   Poster  
Authors: Junpeng Shang, Feifei Shao, Jun Xiao, Lin Li, Hongwei Wang, Dongfang Ma
Title: PV-Ground: Text-Guided Point-Voxel Interaction for 3D Visual Grounding
Abstract: 3D visual grounding (VG) aims to localize target objects in 3D scenes based on free-form textual descriptions. Existing 3D VG models predominantly employ point-based backbones for point cloud feature extraction. Such methods require aggressive downsampling of the input point cloud, which sacrifices the fine-grained spatial details crucial for precise localization. This paper proposes PV-Ground, a novel 3D VG architecture based on effective text-guided point-voxel feature interaction. Our method leverages the complementary strengths of both voxels and keypoints: it employs a voxel-based feature extraction backbone to preserve high-resolution spatial details, while utilizing compact keypoints to aggregate these features for efficient, deep interaction with the textual query. Furthermore, we propose a text-guided keypoint sampling module to adaptively concentrate the keypoint distribution around the text-described object, enabling task-specific feature aggregation and significantly boosting model performance. Extensive qualitative and quantitative experiments demonstrate the superiority of our proposed method. Our method achieves a performance improvement of 5.1% on the ScanRefer dataset and 5.6% on the ReferIt3D dataset, while also achieving over 4% improvement in the segmentation task. The code will be made publicly available.
Paperid: 2873,   Poster  
Authors: Yaohou Fan, Qingzhong Wang, Yongsong Huang, Junyi Liu, Tomo Miyazaki, Shinichiro Omachi
Title: POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation
Abstract: Current visual text generation models struggle with the trade-off between text accuracy and overall image coherence, such as aesthetic appeal, during training. We find that achieving high text accuracy can reduce the aesthetic score and instruction-following capability. Although reinforcement learning approaches can alleviate the problem through aligning with multiple rewards, they are often unstable for text generation, as existing approaches normally optimize multiple rewards in a weighted-sum way. In addition, it is difficult to balance the weight of each reward. Moreover, reinforcement learning requires a set of training instructions. A large number of prompts require more training time and computing resources, while a small set leads to poor performance. Hence, how to select the prompts for efficient and effective training is an unsolved problem. In this study, we propose Pareto-Optimal Curriculum Alignment (POCA), a framework that addresses this issue as a multi-objective problem by: 1) identifying the Pareto-optimal set to avoid simple scalarization and 2) designing an adaptive curriculum alignment strategy to manage a learning sequence of a multi-reward dataset using automatic difficulty assessment, which is crucial for optimal convergence as RL methods explore in a limited data environment. In synergy, POCA finds the Pareto-optimal set in a unified reward space, which eliminates inconsistent signals to find the best trade-off solution from different rewards under an easy-to-hard optimization landscape. The experimental results show that POCA significantly improves all metrics such as CLIP, HPS scores and sentence accuracy.
Paperid: 2874,   Poster  
Authors: Xulin Li, Yan Lu, Bin Liu, Qinhong Yang, Qi Chu, Tao Gong, Nenghai Yu
Title: MFEN: Multi-Frequency Expert Network for Visible-Infrared Person Re-ID
Abstract: Visible-infrared person re-identification (VI-ReID) is a challenging task due to the significant modality discrepancy between visible and infrared images. We contend that the discrepancy primarily arises from the differing lighting conditions of the two modalities, including differences in the wavelengths of light and the types of light source. Recently, frequency-based VI-ReID approaches have achieved notable success, since frequency information can more effectively extract contours and details pertinent to identity while excluding irrelevant lighting and color. However, existing methods do not distinguish different frequency bands or focus solely on a particular frequency band, which is insufficient for capturing the inherent variations in frequency under diverse lighting conditions. To perform comprehensive frequency domain learning, we propose a Multi-Frequency Expert Network (MFEN) that enables multi-frequency modulation and adaptively combines different frequencies through a mixture-of-experts method. In addition, we introduce a Random Frequency Augmentation (RFA) and a Frequency Auxiliary Optimization (FAO) to effectively train the MFEN in mining frequency information. The proposed three frequency modules are complementary to each other and adaptively capture critical frequency domain details to achieve robust representations. Extensive experiments on three VI-ReID datasets demonstrate the effectiveness of our approach.
Paperid: 2875,   Poster  
Authors: Shuoyi Chen, Yurui Wu, Mang Ye
Title: Object-Generalized Re-Identification: A Step Towards Universal Instance Perception
Abstract: The object re-identification (ReID) task aims to recognize the same individual object across diverse viewpoints and sensing conditions. Although person and vehicle ReID have achieved remarkable success, most existing methods are built on the assumption that training and testing data come from the same object category. This constraint requires separate models for each category, which limits scalability and generalization. To address this limitation, we introduce Object-Generalized Re-Identification (OG-ReID), a new paradigm that learns unified identity representations transferable across different object categories. Unlike conventional domain generalization that focuses on appearance variations within a single category, OG-ReID deals with category shifts caused by intrinsic structural differences in identity cues. To achieve this goal, we introduce the Meta-Generalized Object Re-Identification (MGOR) framework, which treats meta-learning as semantic distributional regularization, exposing the model to controlled category shifts so that invariance emerges as an equilibrium between semantic diversity and identity discrimination. Extensive evaluations on more than 100 unseen object categories from multiple domains show that MGOR outperforms existing ReID approaches without any target-domain adaptation, advancing toward universal identity perception beyond domain and category boundaries.
Paperid: 2876,   Poster  
Authors: Zhong Muyan, Erfei Cui, Sen Xing, Weiyun Wang, Wen Wu, Yuchen Hu, Yanting Zhang, Xiaowei Hu, Wenhai Wang, Chao Zhang, Jifeng Dai
Title: HAVE-Bench: Hierarchical Audio-Visual Evaluation from Perception to Interaction
Abstract: Multimodal large language models (MLLMs) have expanded from vision–language systems to include audio, unlocking new capabilities in cross-modal reasoning and interaction. To address the limitation that existing benchmarks focus mainly on perception tasks and lack a unified cognitive evaluation framework, we propose Hierarchical Audio-Visual Evaluation Benchmark (HAVE-Bench). It systematically evaluates the audio-related capabilities of MLLMs along a three-level cognitive hierarchy: Perception, Reasoning, and Interaction, utilizing 2,451 curated samples and manually annotated multi-turn interaction-level tasks. Experiments using this unified framework reveal significant gaps in existing models at the reasoning and interaction levels, with speech-driven visual question answering (VQA) performance significantly lagging behind the text–image setting. These findings underscore the urgency of enhancing models’ handling of long and complex audio and facilitating the transfer of reasoning capabilities from the vision–text to the audio–visual domain.
Paperid: 2877,   Poster  
Authors: hongyuan chen, Xingyu Chen, Zexiang Xu, Anpei Chen
Title: Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis
Abstract: We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. While recent advances have significantly improved 2D, video, and 3D content generation, 4D synthesis remains difficult due to limited training data and the inherent ambiguity of recovering geometry and motion from a monocular viewpoint. Motion 3-to-4 addresses these challenges by decomposing 4D synthesis into static 3D shape generation and motion reconstruction. Using a canonical reference mesh, our model learns a compact motion latent representation and predicts per-frame vertex trajectories to recover complete, temporally coherent geometry. A scalable frame-wise transformer further enables robustness to varying sequence lengths. Evaluations on both standard benchmarks and a new dataset with accurate ground-truth geometry show that Motion 3-to-4 delivers superior fidelity and spatial consistency compared to prior work.
Paperid: 2878,   Poster  
Authors: hongyu peng, Xiang Yuan, Gong Cheng
Title: PGA: Prior-free Generative Attack for Practical No-box Scenario
Abstract: The unrealistic reliance on abundant prior information in traditional transferable attacks has spurred the Practical No-box Scenario (PNS), where attackers can access only limited unlabeled images. However, existing methods rely on iterative optimization to produce adversarial examples with inherently limited inference speed and transferability. Conversely, faster generative attacks fundamentally conflict with the PNS due to their critical dependence on abundant prior information that is explicitly absent in this scenario. To bridge this gap, we propose Prior-free Generative Attack (PGA), the first generative attack tailored for the PNS. Specifically, we introduce the Curriculum-Guided Micro-Robust Optimization that progressively incorporates more challenging discriminative tasks to mitigate the degenerate solutions common in self-supervised learning with limited data, yielding robust and transferable surrogates for downstream attacks. Furthermore, the Region-Aware Consistent Perturbation Learning guides the generator to produce fine-grained and spatially coherent perturbations, mitigating the common pitfall of generative attacks falling into local optima under insufficient supervision. Extensive experiments demonstrate that our PGA achieves remarkable transferability across various settings with high inference speed. This work provides a more practical benchmark for future research on transferable attacks, revealing the great potential of generative attacks under the PNS.
Paperid: 2879,   Poster  
Authors: Jun-hao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, Tianfan Xue
Title: FlashVSR: Towards Real-time Diffusion-Based Streaming Video Super Resolution
Abstract: Diffusion models have recently advanced video restoration, but applying them to real-world and AIGC-generated video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and near real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework for efficient video super-resolution. FlashVSR runs at approximately 17 FPS for 768×1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that reduces redundant computation while bridging the train–test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset containing 120K videos and 180K images. Extensive experiments demonstrate that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to approximately 12× speed-up over prior one-step diffusion-based VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based video super-resolution.
Paperid: 2880,   Poster  
Authors: Jinwoo Kim, Jihye Yoo, Seon Joo Kim
Title: Learning Personalized Photographic Style from Pairwise User Preferences
Abstract: Photographic style preferences are deeply personal, varying across individuals in color and tonal aesthetics. We introduce Personalized Photographic Style (PPS) learning, where the goal is to capture a user's implicit preferences from comparative judgments and apply them consistently across diverse images. To establish a foundation for this problem, we present three contributions. First, we introduce PPSD, a dataset containing pairwise preference judgments from 767 users, each providing an average of 70 comparisons. To capture diverse style signals, images are sourced from professional edits, device pipelines, and generative models. Second, we explore several baseline models demonstrating the feasibility of adapting style transfer and enhancement approaches for preference learning. Third, we develop a comparative evaluation framework suited to the implicit nature of personal preferences. We will make our dataset publicly available, and hope this work serves as a foundation for advancing research in personalized photographic style learning.
Paperid: 2881,   Poster  
Authors: Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, Zuxuan Wu
Title: FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding
Abstract: This paper presents FluxMem, a training-free and adaptive memory framework for efficient streaming video understanding. FluxMem progressively compresses visual memory through a hierarchical redundancy reduction process. Specifically, Temporal Adjacency Selection (TAS) removes redundant tokens across adjacent frames to alleviate temporal redundancy, while Spatial Domain Consolidation (SDC) further merges spatially repetitive regions within each frame into compact representations. To ensure robustness across diverse scene dynamics, both modules employ adaptive thresholds derived from intrinsic scene statistics, automatically adjusting the compression rate without manual tuning. Extensive experiments demonstrate that FluxMem establishes a new state of the art on online benchmarks, achieving 76.4 on StreamingBench and 66.3 on OVO-Bench in real time. Furthermore, it exhibits strong offline performance, attaining 73.1 on MLVU while using 65% fewer visual tokens.
Paperid: 2882,   Poster  
Authors: Sriram Narayanan, Ziyu Jiang, Srinivasa G. Narasimhan, Manmohan Chandraker
Title: PhyCo: Learning Controllable Physical Priors for Generative Motion
Abstract: Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision–language model evaluates generated videos with targeted physics queries and provides differentiable feedback. This combination enables a generative model to produce physically consistent and controllable outputs through variations in physical attributes, without any simulator or geometry reconstruction at inference. On the Physics-IQ benchmark, PhyCo significantly improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes. Our results demonstrate a scalable path toward physically consistent, controllable generative video models that generalize beyond synthetic training environments.
Paperid: 2883,   Poster  
Authors: Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, Jinlan Fu, Jingjing Gong, Xipeng Qiu
Title: LIBERO-Plus: A Progressive Robustness Benchmark for Visual-Language-Action Models
Abstract: Visual–Language–Action (VLA) models report impressive success rates exceeding 95% on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. Current simulation-based robustness evaluations suffer from narrow perturbation coverage, manual design constraints, and coarse-grained analysis that fails to reveal when and how models fail. To address this gap, we propose LIBERO-Plus, a comprehensive, automatic, and fine-grained evaluation framework with controlled perturbations across seven dimensions: object layouts, camera viewpoints, robot initial states, language instructions, lighting conditions, background textures, and sensor noise. Our systematic analysis of ten state-of-the-art models reveals consistent brittleness beneath apparent competence, with performance dropping from 95% to below 30% under modest perturbations. Our findings challenge the assumption that high benchmark scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.
Paperid: 2884,   Poster  
Authors: Zilai Zeng, Mingdeng Cao, Zijie Li, Xiaochen Lian, Yichun Shi, Peihao Zhu, Chen Sun, Peng Wang
Title: Towards Robust Sequential Decomposition for Complex Image Editing
Abstract: Recent advances in visual generative models have enabled high-fidelity image editing guided by human instructions. However, these models often struggle with complex instructions involving combinatorial editing operations or inter-step dependencies. This difficulty stems from the limitations of two canonical paradigms: (1) single-turn editing, which attempts to apply all instructed edits in one pass, often fails to parse the complex instruction accurately and causes undesired edits; and (2) sequential editing, which decomposes the task into simpler steps, suffers from compounding errors introduced by the sequential execution, leading to low-fidelity results. To derive a robust solution for complex image editing, we examine editing behaviors of different paradigms under a unified in-context editing framework, and study how the benefits of sequential decomposition can be balanced against its error-accumulation drawbacks. We further develop a synthetic data pipeline that constructs editing tasks of varying instruction complexity, allowing us to curate a large-scale editing dataset with high-quality decomposed sequences. By finetuning on synthetic data, we find that with properly designed editing paradigms, sequential decomposition yields robust improvements even as task complexity increases. Furthermore, the decomposition skills learned from synthetic tasks can transfer to real images by co-training with real-world editing data, demonstrating the promise of sim-to-real generalization for tackling complex image editing across broader domains.
Paperid: 2885,   Poster  
Authors: Ronggang Huang, FanSen Meng, Huaidong Zhang, Xuemiao Xu
Title: ORD: Object-Relation Decoupling for Generalized 3D Visual Grounding
Abstract: The 3D visual grounding task aims to accurately identify and ground target objects in 3D space based on natural language descriptions, where the effective exploitation of relative relations between the target and anchor is crucial. However, in existing methods, relative relations are often tightly entangled with entity semantics. This tight coupling encourages models to rely on semantic shortcuts from entity names, making it difficult to maintain good generalization under multi-view and complex multi-object scenarios. To address this, we propose an object–relation decoupling framework that treats target–anchor relations as first-class geometric and semantic primitives and models them explicitly. First, we construct a scene-level relative geometric representation that encodes the direction and distance between the target and anchor, and introduce a scene-level hyper-object token as a unified prior for scale and viewpoint. Second, we develop a predicate-decoupled cross-modal alignment strategy that preserves only predicates carrying spatial relational semantics while masking out all other tokens, thereby suppressing semantic leakage from entity names. Finally, we design an anchor-guided regression module that predicts auxiliary anchors and samples their features to guide the model in learning entity semantics from text, explicitly injecting target–anchor priors and effectively resolving ambiguities in complex multi-object scenes. Extensive experiments on multiple 3D visual grounding benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches and exhibits strong robustness and generalization under challenging multi-view and relation-intensive settings.
Paperid: 2886,   Poster  
Authors: Siyuan Liu, Chaoqun Zheng, Xin Zhou, Tianrui Feng, Dingkang Liang, Xiang Bai
Title: PointTPA: Test-Time Parameter Adaptation for 3D Scene Understanding
Abstract: Scene-level point cloud understanding remains challenging due to diverse geometries, imbalanced categories, and highly varied spatial layouts. Existing methods improve object-level performance but rely on static parameters during inference, limiting their adaptability to dynamic scene data. We propose Test-time Parameter Adaptation for Point Cloud Scene Perception (PointTPA), a test-time dynamic adaptation framework that constructs input-aware parameters for scene-level point clouds. PointTPA uses a Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches and a Dynamic Parameter Projector (DPP) to produce patch-wise adaptive weights, enabling the backbone to adjust its behavior according to scene-specific variations while keeping parameter cost low. Integrated into PTv3, PointTPA reduces trainable parameters by over 95% and achieves competitive or superior performance to full fine-tuning. It achieves 74.9% mIoU on S3DIS and consistently surpasses existing PEFT baselines across multiple benchmarks, highlighting the efficacy of test-time dynamic parameter generation in enhancing robust 3D scene understanding. The code will be available soon.
Paperid: 2887,   Poster  
Authors: Ye Liu, shouyiliu shouyiliu, Huiyu Yang, Jianghang gu, fanwenhao fanwenhao, Zhongxin Yang, Ding Wang, Simeng Chen, Zirun Jiang, Yuanwei Bin, Shiyi Chen, Yuntian Chen
Title: AeroAgent: A Vision–Physics–Decision Framework for Aerodynamic Vehicle Design
Abstract: Modern generative models can propose striking 3D vehicle shapes from text and images, but turning these sketches into aerodynamically efficient, regulation-compliant designs still requires weeks of high-fidelity computational fluid dynamics (CFD) and manual iteration. As a result, fast 3D generation without trustworthy physics in the loop does little to reduce end-to-end design time. We study how an AI agent can close this loop under a strict CFD budget. We introduce AeroAgent, a vision–physics–decision framework built around a single 3D, editable surface representation for vehicle shapes. A vision module turns text and 2D references into diverse, standardized 3D candidates and supports image-level edits. A physics module, AeroFormer, is a geometry-guided Transformer surrogate trained on a large-scale vehicle aerodynamics dataset of roughly 50k CFD simulations; three task-specific heads predict drag (C_d), surface pressure, and velocity fields on shared 3D grids. A decision module encodes regulatory size limits and aesthetic constraints as feasibility tests, uses prototype priors and surrogate sensitivities to guide free-form deformation edits, and runs a budget-aware propose–evaluate–refine loop in which only the final top-K shapes are confirmed by high-fidelity CFD. In extensive experiments across five common vehicle classes, running only five propose–evaluate–refine iterations per vehicle reduces drag by an average of 2–12% and cuts high-fidelity CFD calls by 50–80% compared to baseline workflows, while preserving or improving styling quality.
Paperid: 2888,   Poster  
Authors: Mohammadreza Salehi, Mehdi Noroozi, Luca Morreale, Ruchika Chavhan, Malcolm Chadwick, Alberto Gil Couto Pimentel Ramos, Abhinav Mehrotra
Title: RFDM: Residual Flow Diffusion Models for Video Editing
Abstract: Autoregressive video generative methods have recently become popular due to their flexibility for variable-length video generation and computational efficiency. However, their deployment in video editing remains relatively unexplored. This paper introduces an efficient causal video editing model that edits a video frame-by-frame. Specifically, we adapt an image-to-image (I2I) model to video-to-video (V2V), where editing at time frame t is conditioned on the model prediction at t-1. To make use of the past predictions more effectively, we condition the sampling noise on the past prediction during the diffusion forward process. Our forward process guides the model to explicitly compute the residual between the target and the previous prediction during denoising; we denote this formulation as the Residual-Flow Diffusion Model, RFDM. We initialize RFDM with the text-to-image SD1.5 model, and train on the Señorita dataset for global style transfer, local style transfer, and object removal. RFDM achieves competitive results with computationally heavy counterparts while being significantly more efficient. The latency of our method scales linearly with the number of frames, making it the most efficient diffusion-based video editing framework.
Paperid: 2889,   Poster  
Authors: Chang Liu, Tianjiao Jing, Chengcheng Ma, Xuanqi Zhou, Zhengxuan Lian, Qin Jin, Hongliang Yuan, Shi-Sheng Huang
Title: EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head
Abstract: Recent photorealistic 3D talking heads built on 3D Gaussian Splatting still have significant shortcomings in emotional expression manipulation, especially for fine-grained and expansive dynamic emotional editing using multi-modal control. This paper introduces a new editable 3D Gaussian talking head, i.e., EmoDiffTalk. Our key idea is a novel Emotion-aware Gaussian Diffusion, which includes an action unit (AU) prompt Gaussian diffusion process that serves as a fine-grained facial animator, together with an accurate text-to-AU emotion controller that provides accurate and expansive dynamic emotional editing from text input. Experiments on the public EmoTalk3D and RenderMe-360 datasets demonstrate superior emotional subtlety, lip-sync fidelity, and controllability of EmoDiffTalk over previous works, establishing a principled pathway toward high-quality, diffusion-driven, multimodal editable 3D talking-head synthesis. To the best of our knowledge, EmoDiffTalk is one of the first few 3D Gaussian Splatting talking-head generation frameworks, in particular one supporting continuous, multimodal emotional editing within the AU-based expression space.
Paperid: 2890,   Poster  
Authors: Jing Zuo, Lingzhou Mu, Fan Jiang, Chengcheng Ma, Mu Xu, Yonggang Qi
Title: FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-and-Language Navigation
Abstract: Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR) during CoT reasoning training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, our model performs direct instruction-to-action mapping while still enjoying reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.
Paperid: 2891,   Poster  
Authors: Meiqi Cao, Jiachao Zhang, Xin Jiang, Rui Yan, Yazhou Yao, Zechao Li, Xiangbo Shu
Title: Seeing Motion Through Polarity for Event-based Action Recognition
Abstract: Event-based Action Recognition (EAR) provides a promising pathway for understanding dynamic behaviors under challenging conditions. Recent progress in vision-language models has introduced a cross-modal learning paradigm into EAR, enabling models to associate event streams with textual semantics for enhancing conceptual understanding. However, existing methods typically overlook the intrinsic polarity-driven motion cues that are fundamental to event data, leading to suboptimal spatiotemporal representations. To address this limitation, we propose a POlarity Knowledge Enhanced framework (POKER), which explicitly incorporates event polarity-aware motion knowledge across visual and textual modalities. POKER consists of two synergistic components: Polarity Motion Capturer (PMC) and Polarity Motion Reasoner (PMR). Specifically, PMC decouples positive and negative polarities to capture polarity-sensitive motion cues, while PMR semantically analyzes polarity-induced motion dynamics via large language models. Through the polarity alignment, POKER couples semantic reasoning with visual dynamics, achieving more discriminative representations. Extensive experiments on multiple benchmarks demonstrate that POKER enhances performance across diverse event representations.
Paperid: 2892,   Poster  
Authors: Juncan Deng, Kejie Huang
Title: VLM-PTQ: Efficient Post-Training Quantization for Large Vision-Language Models
Abstract: Post-training quantization (PTQ) has emerged as a vital technique for efficiently compressing large-scale models, with weight-compensation methods like GPTQ (symmetric calibration) and GPTAQ (asymmetric calibration) showing remarkable success. However, directly applying these methods to Vision-Language Models (VLMs) reveals two critical limitations: 1) their reliance on standard rounding is suboptimal for the asymmetric objective, failing to account for residual-induced shifts in the optimal quantization target; and 2) they uniformly process input channels across modalities, overlooking the distinct information densities of vision and language tokens. In this paper, we introduce VLM-PTQ, a new asymmetric PTQ framework for VLMs. First, we derive a closed-form correction term for the quantization point, which explicitly accounts for the propagated residual and the corresponding inverse Hessian column, yielding a better local optimum than standard rounding. Second, we propose a modality-aware quantization that differentiates channel importance between vision and language tokens, enabling the quantizer to prioritize salient channels through a lightweight fusion coefficient search. Our method extends GPTAQ with minimal overhead while achieving significant performance improvements in low-bit scenarios. Extensive experiments demonstrate that VLM-PTQ achieves state-of-the-art results, effectively compressing models from 1B to 72B parameters on a single GPU.
Paperid: 2893,   Poster  
Authors: Yuyang Ji, Yixuan Shen, Shengjie Zhu, Yu Kong, Feng Liu
Title: From 3D Pose to Prose: Biomechanics-Grounded Vision–Language Coaching
Abstract: We present BioCoach, a biomechanics-grounded vision–language framework for fitness coaching from streaming video. BioCoach fuses two signals, visual appearance and 3D skeletal kinematics, through a novel three-stage pipeline: an exercise-specific degree-of-freedom selector that focuses analysis on salient joints; a structured biomechanical context that pairs individualized morphometrics with cycle and constraint analysis; and a vision–biomechanics conditioned feedback module that applies cross-attention to generate precise, actionable text. Using parameter-efficient training that freezes the vision and language backbones, BioCoach yields transparent, personalized reasoning rather than pattern matching. To enable learning and fair evaluation, we augment QEVD-fit-coach with biomechanics-oriented feedback to create QEVD-bio-fit-coach, and we introduce a biomechanics-aware LLM judge metric. BioCoach delivers clear gains on QEVD-bio-fit-coach across lexical and judgment metrics while maintaining temporal triggering; on the original QEVD-fit-coach, it improves text quality and correctness with near-parity timing, demonstrating that explicit kinematics and constraints are key to accurate, phase-aware coaching.
Paperid: 2894,   Poster  
Authors: Jun Wei, Hui Huang
Title: Rethinking Box Supervision: Bias-Free Weakly Supervised Medical Segmentation
Abstract: Pixel-level annotations for medical image segmentation are costly and labor-intensive, often requiring expert knowledge. Bounding box labels provide a more scalable alternative but introduce strong box-shaped bias that hampers segmentation quality. We propose WeakMed, a general-purpose weakly supervised segmentation framework that removes the dependence on pixel-level masks while overcoming the structural limitations of box supervision. WeakMed introduces two lightweight, plug-and-play training components: (1) a Mask-to-Box (M2B) transformation that aligns predicted masks with box annotations to reduce label mismatch and box-induced bias, and (2) a Scale Consistency (SC) loss that enforces multi-scale self-supervision to address the ambiguity and instability of weak labels. Both modules are used only during training and impose no inference overhead. Across 9 segmentation tasks, 10 datasets, and 6 imaging modalities, WeakMed consistently surpasses existing weakly supervised methods and achieves performance competitive with fully supervised baselines. These results demonstrate its practicality as a low-cost yet high-quality solution for medical image segmentation. Code will be released.
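Illustrative note (not from the paper): the abstract does not say how the Mask-to-Box transformation is computed. One plausible way to make a predicted mask comparable with a box label is to project it onto both image axes with a max and recombine the projections with a min, which yields a box-shaped soft mask. The function name `mask_to_box` and the toy prediction below are hypothetical, and the projection rule itself is an assumption rather than the authors' definition.

```python
def mask_to_box(mask):
    """mask: 2D list of probabilities in [0, 1]. Returns a box-shaped soft mask.

    Assumed rule (not taken from the paper): max-project onto each axis,
    then take the element-wise min of the two back-projections.
    """
    row_max = [max(row) for row in mask]        # projection onto the vertical axis
    col_max = [max(col) for col in zip(*mask)]  # projection onto the horizontal axis
    # A pixel is active only if both its row and its column contain foreground.
    return [[min(r, c) for c in col_max] for r in row_max]


# Toy prediction: an L-shaped blob whose bounding box spans rows 1-2, columns 1-2.
pred = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.8, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
box_like = mask_to_box(pred)
```

Under this rule the empty corner of the blob's bounding box (row 1, column 2) is filled in, so a loss against the box annotation constrains only the extent of the prediction and not its shape inside the box.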
Paperid: 2895,   Poster  
Authors: Zirui Zhang, Haoyu Dong, Kexin Pei, Chengzhi Mao
Title: C$^3$R: Cross-Modal Cycle Consistency Rewards Improve Multimodal Reasoning
Abstract: Multimodal Large Language Models (MLLMs) suffer from a fundamental "modality gap", contradicting themselves on visual versus text views of the same content. This paper argues that this inconsistency is not a failure, but a powerful resource for self-reward multimodal learning. Instead of relying on flawed voting mechanisms that amplify systematic errors when the majority is wrong, we introduce cross-modal cycle consistency as rewards (C^3R) to improve multimodal reasoning. C^3R performs backward inference from an answer to a query, switches modalities, and performs forward inference to verify the answer's consistency. This cycle serves as a dense, label-free reward that guides the model to resolve its own internal conflicts, while avoiding majority-is-wrong failures of standard voting methods. On standard benchmarks, C^3R mitigates modality-specific biases and improves reasoning accuracy by up to 7.6%. Our results show that robust reasoning emerges not just from scaling data, but from achieving a bidirectional understanding of the multimodal world.
Paperid: 2896,   Poster  
Authors: Chaodong XIAO, Zhengqiang ZHANG, Lei Zhang
Title: BinaryAttention: One-Bit Attention for Vision and Diffusion Transformers
Abstract: Transformers have achieved widespread and remarkable success, while the computational complexity of their attention modules remains a major bottleneck for vision tasks. Existing methods mainly employ 8-bit or 4-bit quantization to balance efficiency and accuracy. In this paper, with theoretical justification, we show that binarization of attention preserves the essential similarity relationships, and propose BinaryAttention, an effective method for fast and accurate 1-bit attention. Specifically, we retain only the sign of queries and keys in computing the attention, and replace the floating-point dot products with bit-wise operations, significantly reducing the computational cost. We mitigate the inherent information loss under 1-bit quantization by incorporating a learnable bias, and enable end-to-end acceleration. To maintain the accuracy of attention, we adopt quantization-aware training and self-distillation techniques, mitigating quantization errors while ensuring sign-aligned similarity. BinaryAttention is more than 2× faster than FlashAttention2 on A100 GPUs. Extensive experiments on vision transformer and diffusion transformer benchmarks demonstrate that BinaryAttention matches or even exceeds full-precision attention, validating its effectiveness. Our work provides a highly efficient and effective alternative to full-precision attention, pushing the frontier of low-bit transformers for vision tasks. The code and models will be made publicly available.
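Illustrative note (not from the paper): the step of keeping only the signs of queries and keys and replacing dot products with bit-wise operations can be sketched with a standard identity. For two vectors in {-1, +1}^d packed as bit strings, the dot product equals d minus twice the popcount of their XOR. The helper names below are hypothetical, and the sketch leaves out the learnable bias, softmax, quantization-aware training, and any GPU kernel details the abstract mentions.

```python
import random


def pack_signs(vec):
    """Pack the signs of a float vector into an int: bit i is 1 when vec[i] >= 0."""
    bits = 0
    for i, x in enumerate(vec):
        if x >= 0:
            bits |= 1 << i
    return bits


def binary_dot(q_bits, k_bits, dim):
    """Dot product of two {-1, +1} vectors computed from their packed sign bits.

    Matching signs contribute +1 and differing signs contribute -1, so the
    result is dim - 2 * popcount(q_bits XOR k_bits).
    """
    return dim - 2 * bin(q_bits ^ k_bits).count("1")


def sign_dot_reference(q, k):
    """Reference implementation: explicit dot product of the sign vectors."""
    def sign(x):
        return 1 if x >= 0 else -1
    return sum(sign(a) * sign(b) for a, b in zip(q, k))


rng = random.Random(0)
dim = 64
q = [rng.gauss(0, 1) for _ in range(dim)]
k = [rng.gauss(0, 1) for _ in range(dim)]
score = binary_dot(pack_signs(q), pack_signs(k), dim)
```

Here `score` is the binarized similarity logit for a single query-key pair; a full attention layer would compute it for every pair before normalization.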
Paperid: 2897,   Poster  
Authors: Haiwei Wu, Fengpeng Li, Zhilin Tu, Yuanman Li, Xiong Li, Jiantao Zhou
Title: Zero-shot Detection of AI-Generated Image via RAW-RGB Alignment
Abstract: Advances in generative AI (GenAI) have increasingly complicated the identification of synthetic images, prompting the proposal of numerous zero/few-shot detection methods to better counter unknown GenAI. However, we observe that existing detectors often misclassify synthetic images with physical transformations (e.g., print+scan) as real. This observation raises a basic question: should images remapped from the physical world to digital space still be categorized as "Synthetic"? Furthermore, the definition of what constitutes real and synthetic images urgently needs to be clarified. We propose that the authenticity of an image depends on whether it originates from the physical world, i.e., it is necessary to verify the original correlation between the digital image and the physical world. To this end, we first analyze the physical-to-digital mapping process: illumination signals are captured by camera sensors as RAW data, which is then converted into RGB data via camera internal parameters. This process embodies unique physical cues inherent to real scenes. Based on this, we propose a novel forensic feature termed alignment trace, which is constructed by modeling a shared RAW-RGB feature space. This trace captures the inherent parameter correlations of real images in the physical-to-digital conversion process, thereby indirectly verifying the physical origin of the image. Experiments demonstrate that our method achieves state-of-the-art zero-shot detection using only real RAW-RGB data pairs. When additional prior knowledge is provided, the method can be easily fine-tuned to achieve better cross-domain detection performance. We hope this work provides a new baseline for zero-shot synthetic detection and, more significantly, inspires the forensics community to explore the essential distinctions between real and synthetic images.
Paperid: 2898,   Poster  
Authors: Qingsong Xie, Zhenyi Liao, Chen Chen, Zhijie Deng, Haonan Lu
Title: LogCD: Local-to-global Consistency Distillation for Few-step Image Generation
Abstract: Distilling latent diffusion models (LDMs)/rectified flow models (RFMs) into models that sample quickly from given conditions is attracting huge interest. However, the majority of existing methods either need significant training resources or lead to quality degradation, especially in text-image alignment. To address these challenges, we propose Local-to-global Consistency Distillation (LogCD) to accelerate LDMs/RFMs via two-stage distillation. LogCD first performs local consistency distillation and then executes global consistency distillation to ensure consistency along the inference path. In addition, a Latent Learned Perceptual Image Patch Similarity model is exploited to enhance perceptual consistency. Notably, LogCD exhibits high flexibility, allowing a single unified model to operate with 2 to 4 sampling steps. The model's performance improves seamlessly as the number of steps increases within this range. With only 70 A100 GPU hours, LogCD accelerates SDXL to achieve a 33.5 CLIP score with just 3 sampling steps, surpassing state-of-the-art accelerated models using even more steps. FLUX.1-dev accelerated by LogCD with 4-step sampling achieves performance comparable to the 25-step teacher model, with a CLIP score of 32.6.
Paperid: 2899,   Poster  
Authors: Paul Mattes, Jan Schwab, Jens Bosch, Maximilian Li, Nils Blank, Minh-Trung Tang, Moritz Haberland, Rudolf Lioutikov
Title: SIR: Structured Image Representations for Explainable Robot Learning
Abstract: Existing robot policies based on learned visual embeddings lack explicit structure and are sensitive to visual distractions. Thus, the representations that drive their behaviour are often opaque, making their decision-making process difficult to interpret. To address this, we introduce Structured Image Representation, a method that leverages Scene Graphs as an intermediate representation for robot policy learning. Our approach first constructs a fully connected graph, using 2D or 3D image-derived features as initial node representations. Then, a module learns to sparsify this graph end-to-end, creating a minimal, task-relevant sub-graph that is passed to the action generation model. This process makes our model intrinsically explainable. Evaluations on RoboCasa show that our sparse graph policies outperform image-based baselines on average with 19.5% vs 14.81% success rate. We also demonstrate that our graph-based representations are significantly more robust to distractor objects, showing almost no performance degradation, as opposed to image representations. Most importantly, we show that the learned sparse graphs are a powerful tool for introspection. By analysing when the model's sub-graph deviates from human expectation, such as by including distractor nodes or omitting key objects, we successfully uncover dataset biases, including spurious correlations and positional biases.
Paperid: 2900,   Poster  
Authors: Jiayi Yang, Guancheng Wan, Man Zhang, Mang Ye
Title: RAAS: LLM Agentic System Architecture Search with GRPO
Abstract: Large Language Model (LLM) agentic systems solve complex tasks through coordinated workflows, but designing them remains labor-intensive. The Agentic Supernet paradigm automates this by optimizing a probabilistic architecture space, yet suffers from critical evaluation instabilities: absolute performance scores entangle architectural merit with query difficulty, while single-execution protocols capture execution randomness rather than true capability. These instabilities lead to unreliable search dynamics where simple queries inflate weak designs and challenging queries suppress strong ones. We introduce RAAS (Robust Architecture Adaptive Search), which establishes stable, fair evaluation through two synergistic mechanisms. Contextual Architecture Orchestration (CAO) disentangles quality from task difficulty by evaluating cohorts of candidate architectures on identical queries, deriving context-aware merit signals through peer-group comparison. Multi-Trial Assessment Synthesis (MTAS) eliminates execution variance by aggregating performance across multiple independent trials, producing statistically robust capability estimates. Together, these mechanisms isolate genuine architectural superiority and guide reliable architecture discovery. Extensive experiments across six benchmarks show RAAS significantly outperforms state-of-the-art methods, improving HumanEval pass@1 from 92.23% to 96.31% and MATH accuracy from 52.08% to 60.87%, while maintaining practical efficiency, demonstrating the effectiveness of robust evaluation for agentic architecture search.
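Illustrative note (not from the paper): the two evaluation ideas, averaging over independent trials and scoring each architecture relative to its peers on the same query, can be sketched as a group-relative normalization in the spirit of GRPO, which the title mentions. The function `cohort_advantages`, the architecture names, and the toy scores are hypothetical, and the exact aggregation and normalization used by RAAS are not stated in the abstract.

```python
from statistics import mean, pstdev


def cohort_advantages(trial_scores, eps=1e-8):
    """trial_scores: {architecture_name: [score per independent trial]} for ONE query.

    Returns {architecture_name: score standardized against the cohort}.
    Assumed scheme (not taken from the paper): mean over trials, then z-score.
    """
    # Multi-trial aggregation: average out execution randomness per architecture.
    avg = {name: mean(scores) for name, scores in trial_scores.items()}
    # Peer-group comparison: center and scale by the cohort's statistics on this query.
    values = list(avg.values())
    mu = mean(values)
    sigma = pstdev(values)
    return {name: (a - mu) / (sigma + eps) for name, a in avg.items()}


# The same two candidate architectures evaluated on an easy and a hard query.
easy = {"arch_a": [1.0, 1.0, 0.9], "arch_b": [0.9, 0.8, 0.85]}
hard = {"arch_a": [0.3, 0.2, 0.25], "arch_b": [0.1, 0.0, 0.05]}
adv_easy = cohort_advantages(easy)
adv_hard = cohort_advantages(hard)
```

Although the absolute scores differ widely between the two queries, `arch_a` receives nearly the same positive relative signal on both, which is the sense in which peer comparison separates architectural merit from query difficulty.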
Paperid: 2901,   Poster  
Authors: Jiaze Xu, Shiyu Xia, Jiaqi Lv, Xin Geng
Title: Unlocking Pre-trained Weights: Parameter Inheritance for Zero-Shot Initialization
Abstract: Appropriate parameter initialization is crucial for reducing the training cost of deep neural networks. Graph HyperNetworks (GHN) have emerged as a promising approach for initializing diverse architectures, with recent methods such as Task-Aware Learngene (TAL) further attempting to leverage pre-trained model knowledge via soft label supervision. However, such indirect supervision fails to fully exploit the rich information encoded in pre-trained weights. We propose Parameter InheriTance HyperNetwork (PITH), which introduces a novel parameter projection mechanism to directly inherit parameters from pre-trained models for initializing target networks of varying configurations. Our method enables initialized networks to directly achieve competitive performance on downstream tasks without any further training, which we term zero-shot initialization. Extensive experiments demonstrate the superiority of PITH: ViT-Base initialized by PITH achieves 53.35% zero-shot accuracy on ImageNet-1K, surpassing the previous state-of-the-art by 6.54%, with consistent improvements across multiple downstream tasks.
Paperid: 2902,   Poster  
Authors: Jiayao Tan, Fan Lyu, Tianle Liu, Fuyuan Hu, Wei Feng
Title: Towards Dynamic Modality Alignment in Multimodal Continual Learning
Abstract: Multimodal Continual Learning (MMCL) aims to enable models to continuously accumulate knowledge across multiple tasks and modalities without forgetting prior information. MMCL presents more challenges than single-modal continual learning, as it requires effective cooperation and complementarity between modalities. Existing methods often treat modality alignment as a static process, assuming once alignment is established, it remains fixed. However, we argue that modality alignment is inherently dynamic, evolving with task learning and feature propagation across layers. To address this, we introduce Dynamic Alignment Graph Regularization (DAGR), a novel approach that explicitly models the evolving alignment across layers. By incorporating multi-level graph regularization, our method stabilizes the alignment process and mitigates catastrophic forgetting. Extensive experiments on benchmarks, such as MTIL, show that DAGR outperforms static alignment-based methods and other continual learning techniques, achieving superior stability.
Paperid: 2903,   Poster  
Authors: Yitong Jiang, Collin McCarthy, Hongjun Wang, Hanrong Ye, Qi Dou, Tianfan Xue, Jinwei Gu, Jan Kautz, Danny Yin, Pavlo Molchanov, Sifei Liu
Title: Scaling Parallel Sequence Models to Vision Foundation Models
Abstract: Scaling vision foundation models is constrained by the quadratic complexity of self-attention. Although subquadratic attention alternatives like linear attention variants and state-space models successfully reduce the model complexity, they typically serialize images into 1D token sequences, compromising spatial coherence and efficiency. Generalized Spatial Propagation Networks (GSPN) offer a linear-time alternative that propagates context directly on the 2D grid via line-scan propagation and removes positional embeddings, yet the original design hits GPU-scaling limits: growing batch/channels saturate SM concurrency, serializing scans and spiking latency. We introduce Compact GSPN (C-GSPN), a ViT block that compresses the propagation space to preserve accuracy while cutting propagation latency by nearly 10×. We further improve efficiency with lightweight projections and fused CUDA kernels. To enable large-scale pretraining, we adopt a two-stage cross-operator distillation strategy that combines layer-wise supervision with end-to-end alignment. In a representative 1K configuration (batch 32, C=1152), C-GSPN achieves up to 2× speedup, maintains competitive zero-shot accuracy, and improves segmentation by +2.1%. Extensive experiments and ablations show that the proposed compression and two-stage distillation are critical for strong transfer while substantially reducing compute, enabling the first extension of a subquadratic operator to foundation-scale (CLIP-style) vision pretraining.
Paperid: 2904,   Poster  
Authors: Yichao Liu, Huawen Shen, Liu Yu, Shiyu Liu, Zeyu Chen, Yu ZHOU
Title: DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding
Abstract: GUI agents powered by Multimodal Large Language Models (MLLMs) have demonstrated impressive capability in understanding and executing user instructions. However, accurately grounding instruction-relevant elements from high-resolution screenshots cluttered with irrelevant UI components remains challenging for existing approaches. Inspired by how humans dynamically adjust their perceptual scope to locate task-related regions on complex screens, we propose DRS-GUI, a training-free dynamic region search framework for GUI grounding that can be seamlessly integrated into existing MLLMs. DRS-GUI introduces a lightweight UI Perceptor that performs three human-like perceptual actions (Focus, Shift, and Scatter) to progressively explore the interface and generate region proposals. To dynamically schedule these actions, we further design an Action Planner based on Monte Carlo Tree Search (MCTS). A region quality reward is employed to evaluate and select the highly instruction-relevant region, efficiently pruning redundant UI elements. Experiments demonstrate that DRS-GUI yields a 14% improvement on ScreenSpot-Pro for general and GUI-specific MLLMs (Qwen2.5-VL-7B and UGround-V1-7B), significantly enhancing grounding performance and generalization.
Paperid: 2905,   Poster  
Authors: Mengzhu xu, Hanzhi Liu, Ningkang Peng, qianyu Chen, Canran Xiao
Title: Affordance-First Decomposition for Continual Learning in Video–Language Understanding
Abstract: Continual learning for video-language understanding is increasingly important as models face non-stationary data, domains, and query styles, yet prevailing solutions blur what should stay stable versus what should adapt, rely on static routing/capacity, or require replaying past videos. We aim to explicitly specify where stability lives and where plasticity should be focused under realistic memory and privacy constraints. We introduce Affordance-First Decomposition (AFD): videos are mapped to slowly varying affordance tokens that form a shared, time-aligned substrate, while a lightweight, query-routed, conflict-aware scheduler concentrates adaptation and grows capacity only when needed. The substrate is stabilized via weak alignment and teacher consistency, and training uses question-only replay. AFD achieves state-of-the-art results across protocols: 51.6% average accuracy with -1.8% forgetting on domain-incremental VideoQA, ViLCo R@1@0.5 of 29.6% (MQ) and 20.7% (NLQ) with 18.4% stAP@0.25 (VQ), and 39.5% accuracy with -1.6% forgetting on time-incremental iVQA. Overall, AFD offers an explicit, interpretable split between a stable interaction-centered substrate and targeted adaptation.
Paperid: 2906,   Poster  
Authors: Zhihao Zheng, Jinglun Feng, Nirav Savaliya, Zheng-Hang Yeh, Bo Lang, Mooi Chuah
Title: TextFM: Robust Semi-dense Feature Matching with Language Guidance
Abstract: Feature matching is a critical task in geometric perception, yet existing methods often struggle under domain shifts and illumination changes due to reliance on visual-only learning and expensive 3D supervision. In this paper, we present TextFM, the first language-guided feature matching framework that incorporates domain-invariant semantic information from vision-language models (VLMs). Built upon a detector-free architecture, TextFM leverages textual embeddings as instance-level queries to provide global semantic context during coarse-level matching, enhancing robustness in challenging scenarios such as textureless surfaces and cross-domain shifts. Additionally, we integrate illumination-invariant physical priors and apply Low-Rank Adaptation (LoRA) to efficiently fine-tune Vision Foundation Models (VFMs) for more robust visual feature extraction. Extensive experiments on outdoor and indoor datasets show that our method outperforms other state-of-the-art methods. In addition, we contribute a synthetic day-night matching benchmark for rigorous evaluation under extreme lighting conditions. Together, our method and dataset establish a strong foundation for robust and generalizable feature matching under real-world constraints.
Paperid: 2907,   Poster  
Authors: Zhenyu Li, Sai Kumar Dwivedi, Filip Maric, Carlos Chacón, Nadine Bertsch, Filippo Arcadu, Tomas Hodan, Michael Ramamonjisoa, Peter Wonka, Amy Zhao, Robin Kips, Cem Keskin, Anastasia Tkach, Chenhongyi Yang
Title: XR-Poser: Accurate Egocentric Human Motion Estimation for AR/VR
Abstract: Egocentric 3D human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present XR-Poser, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training. The proposed model is fully differentiable, introduces identity-conditioned queries, multi-view spatial refinement, causal temporal attention, and supports both keypoints and parametric body representations under a constant compute budget. The proposed auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training. The system follows a teacher–student schema to generate pseudo-labels and guide training with uncertainty distillation, enabling the model to generalize to different environments. In experiments on the EgoBody3M benchmark, XR-Poser outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy, and reduces temporal jitter by 22.2% and 51.7%, respectively. Furthermore, our auto-labeling system additionally improves the wrist MPJPE by 13.1%.
Paperid: 2908,   Poster  
Authors: Lei Wang, Yang Cheng, Senmao Li, Ge Wu, Yaxing Wang, Jian Yang
Title: WaDi: Weight Direction-aware Distillation for One-step Image Synthesis
Abstract: Despite the impressive performance of diffusion models such as Stable Diffusion (SD) in image generation, their slow inference limits practical deployment. Recent works accelerate inference by distilling multi-step diffusion into one-step generators. To better understand the distillation mechanism, we analyze U-Net/DiT weight changes between one-step students and their multi-step teacher counterparts. Our analysis reveals that changes in weight direction significantly exceed those in weight norm, highlighting weight direction as the key factor during distillation. Motivated by this insight, we propose the Low-rank Rotation of weight Direction (LoRaD), a parameter-efficient adapter tailored to one-step diffusion distillation. LoRaD is designed to model these structured directional changes using learnable low-rank rotation matrices. We further integrate LoRaD into Variational Score Distillation (VSD), resulting in Weight Direction-aware Distillation (WaDi), a novel one-step distillation framework. WaDi achieves state-of-the-art FID scores on COCO 2014 and COCO 2017 while using only approximately 10% of the trainable parameters of the U-Net/DiT. Furthermore, the distilled one-step model demonstrates strong versatility and scalability, generalizing well to various downstream tasks such as controllable generation, relation inversion, and high-resolution synthesis.
Paperid: 2909,   Poster  
Authors: Shaowu Xu, Xibin Jia, Chao Fan, Junyu Gao, Jing Chang, Qianmei Sun
Title: Progressive Cross-Modal Causal Intervention for Long-Term Action Recognition
Abstract: Intricate correlations among atomic actions and inherent visual confounders in long-term action recognition (LTAR) contribute to the persistent challenges in this domain. While methods based on vision-language models that employ label text for supervision offer potential for handling visual confounders, their reliance on statistical correlations rather than causal mechanisms introduces two vulnerabilities: (1) spurious alignments with non-causal co-occurring visual features during cross-modal interaction, and (2) misinterpretation of codependencies among actions. To address these limitations, this paper introduces Progressive Cross-Modal Causal Intervention (PCMCI). PCMCI first mitigates co-occurrence hallucination via causal intervention grounded in optimal transport theory. Subsequently, an action relation-aware mechanism counters the backdoor path induced by codependency illusion, enabling the derivation of deconfounded text embeddings. Finally, these deconfounded embeddings serve as a mediator to implement front-door adjustment and remove visual confounders. This progressive causal intervention framework facilitates learning robust representations for LTAR. Experiments on three long-term action benchmarks demonstrate the effectiveness of the proposed model.
Paperid: 2910,   Poster  
Authors: Hee Min Choi, Hyoa Kang, Suji Kim, Dokwan Oh, Nam Ik Cho
Title: Time Without Time: Pseudo-Temporal Representation for Space-Time Super-Resolution
Abstract: Space-time video super-resolution (STVSR) is a task aimed at simultaneously upsampling a video in both spatial and temporal dimensions. Previous studies on STVSR have primarily focused on task-specific architectures and modeling paradigms, while effective pretraining strategies remain underexplored. In this paper, we propose a pseudo-temporal space–time reconstruction pretraining framework for STVSR networks that enables effective use of image datasets, which naturally provide strong spatial cues. Each training sample is constructed by duplicating a single image into a pseudo-temporal video and independently zero-filling random pixel regions across its frames. Instead of designing a separate pretraining module, we pretrain the STVSR network on a task aligned with its core objectives of spatial restoration and cross-frame aggregation. The model learns to reconstruct clean, higher-spatio-temporal-resolution outputs from degraded, pseudo-temporal inputs, with a modulation factor encouraging greater focus on difficult regions. Extensive experiments show that our simple pretraining significantly improves STVSR performance and outperforms existing video representation learning approaches. We note our method is effective even when pretraining and finetuning with a limited quantity of data.
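The sample construction described above (duplicate one image into several frames, then zero-fill random regions independently per frame) can be sketched as follows. The patch size, drop probability, frame count, and the function name `make_pseudo_temporal_sample` are assumptions for illustration; the abstract does not give these details, and the sketch omits any downsampling of the input.

```python
import random


def make_pseudo_temporal_sample(image, num_frames=4, patch=2,
                                drop_prob=0.3, seed=0):
    """Build a pseudo-temporal clip from one image.

    image: 2D list of floats (H x W). Returns a list of num_frames copies,
    each with its own randomly chosen patches set to zero.
    All hyperparameters here are hypothetical.
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    frames = []
    for _ in range(num_frames):
        frame = [row[:] for row in image]  # duplicate the still image
        # Decide independently for this frame which patches to zero-fill.
        for top in range(0, h, patch):
            for left in range(0, w, patch):
                if rng.random() < drop_prob:
                    for y in range(top, min(top + patch, h)):
                        for x in range(left, min(left + patch, w)):
                            frame[y][x] = 0.0
        frames.append(frame)
    return frames
```

Because each frame hides different regions, a network reconstructing the clean image has to aggregate information across frames, which is the cross-frame behaviour the pretraining task is meant to encourage.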
Paperid: 2911,   Poster  
Authors: Xinyao Liao, QIYUAN HE, Kai Xu, Xiaoye Qu, Yicong Li, Wei Wei, Angela Yao
Title: VA-$\boldsymbol{\pi}$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
Abstract: Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment leads to generated token sequences that may decode into low-quality images, without direct supervision from the pixel space. We propose VA-$\boldsymbol{\pi}$, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-$\pi$ formulates the generator–tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize under the discrete token space, VA-$\pi$ introduces a reinforcement-based alignment strategy that treats the AR generator as a policy and uses pixel-space reconstruction quality as its intrinsic reward. The reward is measured by how well the predicted token sequences can reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The regularization term of the ELBO serves as a natural regularizer, maintaining distributional consistency. VA-$\pi$ enables rapid adaptation of existing AR generators, with neither tokenizer retraining nor external reward models. With only 1% ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains in the text-to-image task on GenEval for both a visual generation model (LlamaGen: from 0.306 to 0.339) and a unified multi-modal model (Janus-Pro: from 0.725 to 0.744).
Paperid: 2912,   Poster  
Authors: Xu Wang, Zihan Lin, Yixin Zhang, Zilei Wang
Title: Boosting Vision-Language Models Towards Cross-Domain Incremental Object Detection
Abstract: Incremental Object Detection (IOD) aims to equip detectors with the ability to handle dynamic environments and emerging object categories, and the rise of vision-language models has substantially advanced this goal. However, existing studies often oversimplify real-world scenarios by assuming the incremental tasks come from a single general domain. To better investigate vision-language models under IOD, it is necessary to explore more generalized scenarios that encompass both novel categories and domains. To this end, we propose Cross-Domain Incremental Object Detection (CDIOD), a new benchmark that assesses the ability to continuously adapt to diverse object detection tasks across domains. CDIOD reveals that existing methods struggle to balance adaptivity and stability under substantial domain shifts. To tackle this challenge, we propose Dynamic Group Subspace (DGS), a novel framework that dynamically groups tasks by distribution to promote knowledge sharing and prevent task collisions; progressively consolidates adapters to build shared subspaces and control parameter growth; and implements a dynamic training pipeline to maintain a proper stability-adaptivity balance. DGS enables vision-language models to effectively handle task streams of various distribution shifts. Extensive experiments across three benchmarks demonstrate that DGS achieves state-of-the-art performance, highlighting its robustness in diverse incremental learning scenarios.
Paperid: 2913,   Poster  
Authors: Yingqian Min, Kun Zhou, Yifan Li, Yuhuan Wu, Han Peng, Yifan Du, Xin Zhao, Min Yang, Ji-Rong Wen
Title: Improving Vision-language Models with Perception-centric Process Reward Models
Abstract: Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding, which can extract image-related claims from the response and compare them one by one with the visual evidence in the image, ultimately returning claims that contain perceptual errors. Perceval is trained with perception-intensive supervised training data. We then integrate Perceval into the RL training process to train the policy models. Specifically, compared to traditional GRPO, which applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified by Perceval, thus enabling fine-grained supervision signals. In addition to augmenting the training process, Perceval can also assist VLMs during the inference stage. Using Perceval, we can truncate the erroneous portions of the model’s response, and then either have the model regenerate the response directly or induce the model to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL. For test-time scaling, it also demonstrates consistent performance gains over other strategies, such as majority voting. Our code and data will be publicly released.
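The contrast between sequence-level and token-level advantages can be sketched in a few lines. The group-normalized advantage follows the usual GRPO recipe; the way the penalty is applied to flagged spans (a fixed subtraction per token) and the names `token_advantages`, `flagged_spans`, and `penalty` are assumptions, since the abstract does not give the exact formula.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Sequence-level advantages: rewards normalized within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(seq_advantage, num_tokens, flagged_spans, penalty=1.0):
    """Broadcast a sequence-level advantage to tokens, then penalize spans.

    flagged_spans: list of half-open (start, end) token index pairs that a
    process reward model marked as hallucinated (hypothetical interface).
    """
    adv = [seq_advantage] * num_tokens
    for start, end in flagged_spans:
        for i in range(start, min(end, num_tokens)):
            adv[i] -= penalty
    return adv


# Two sampled responses to the same prompt, one correct and one wrong.
seq_adv = group_advantages([1.0, 0.0])
# Tokens 1 and 2 of a 5-token response were flagged as a perceptual error.
per_token = token_advantages(0.5, 5, [(1, 3)], penalty=1.0)
```

With sequence-level advantages every token in a response receives the same signal; here only the flagged tokens are pushed down, so a response with a correct final answer can still be corrected on the part that misdescribes the image.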
Paperid: 2914,   Poster  
Authors: Yue Xu, Chenyu Hu, Pengyu An, Yonglu Li
Title: Mitigating The Distribution Shift of Diffusion-based Dataset Distillation
Abstract: Dataset Distillation (DD) seeks to create small, synthetic datasets for efficient model training. While diffusion models are powerful generators, their use in DD is hampered by distribution shifts between synthetic and ideal distilled data, leading to suboptimal performance. We identify two critical shifts. First, considering the small capacity of the synthetic data, an optimal synthetic distribution for DD should be a simplification of the real data distribution, rather than replicating the original data's complexity. Second, there is a hazardous empirical deviation in the synthetic dataset from this learned distribution due to the data sampling process. To address these, we introduce a two-stage approach. During diffusion training time, we mitigate the distribution shift by employing an L1 sparsity regularizer, compelling the diffusion model to learn a compact and semantically sparse manifold. Then, during sampling time, we abandon the flawed sequential sampling paradigm and instead synchronously denoise the entire synthetic dataset with distribution regularizers. This framework systematically mitigates both identified distribution shifts. Experiments show our method achieves state-of-the-art performance with superior computational efficiency.
Paperid: 2915,   Poster  
Authors: Delong Liu, Haotian Hou, Zhaohui Hou, Zhiyuan Huang, Shihao Han, Mingjie Zhan, Zhicheng Zhao, Fei Su
Title: Inter-Edit: First Benchmark for Interactive Instruction-Based Image Editing
Abstract: Precise and controllable image editing remains a significant challenge. Current methods often rely on text prompts, but achieving accurate spatial localization solely through descriptions is inherently difficult. Mask-based approaches, though offering better control, typically require overly precise user annotations, thus increasing user burden and leading to unnatural results. To bridge this gap, we introduce the Interactive Instruction-based Image Editing (I_3E) task, which generates high-quality edits from a more intuitive combination: concise text instructions and imprecise spatial guidance. To address the critical lack of suitable data, we propose an efficient pipeline to generate Inter-Edit, a new million-scale training dataset that simulates realistic user masks that are not strictly segment-aligned. We also present a comprehensive benchmark, featuring a meticulously human-annotated test set that captures diverse, localization-dependent editing scenarios and realistic user interaction patterns. To evaluate this task, we introduce a new suite of position-aware metrics that strongly correlate with human perceptual judgments. Finally, we develop three baseline models trained on Inter-Edit. Extensive experiments demonstrate that our methods significantly enhance I_3E performance, achieving substantial improvements in localization and edit quality, and outperforming existing state-of-the-art models. The Inter-Edit dataset and all related code will be made publicly available.
Paperid: 2916,   Poster  
Authors: Jiaqi Han, Juntong Shi, Puheng Li, Haotian Ye, Qiushan Guo, Stefano Ermon
Title: Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration
Abstract: Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps. However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79× speedup on FLUX.1 and 4.67× speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.
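The idea of fitting Chebyshev coefficients by ridge regression and then forecasting a feature at a later step can be sketched for a single scalar feature. This is a minimal illustration under stated assumptions: diffusion time is mapped to [-1, 1], the degree and regularization strength are arbitrary, and the helper names (`cheb_basis`, `fit_ridge`, `forecast`) are hypothetical rather than taken from the paper.

```python
def cheb_basis(t, degree):
    """Chebyshev polynomials T_0..T_degree at t, via T_k = 2t*T_{k-1} - T_{k-2}."""
    vals = [1.0, t]
    for _ in range(2, degree + 1):
        vals.append(2.0 * t * vals[-1] - vals[-2])
    return vals[: degree + 1]


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def fit_ridge(ts, ys, degree, lam):
    """Ridge regression: solve (Phi^T Phi + lam * I) c = Phi^T y."""
    phi = [cheb_basis(t, degree) for t in ts]
    n = degree + 1
    gram = [[sum(row[i] * row[j] for row in phi) + (lam if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    rhs = [sum(row[i] * y for row, y in zip(phi, ys)) for i in range(n)]
    return solve_linear(gram, rhs)


def forecast(coeffs, t):
    """Evaluate the fitted Chebyshev expansion at time t."""
    return sum(c * b for c, b in zip(coeffs, cheb_basis(t, len(coeffs) - 1)))


# Cached feature values at four past steps (toy quadratic trajectory).
cached_ts = [-1.0, -0.6, -0.2, 0.2]
cached_feats = [1.0 + 2.0 * t - 0.5 * t * t for t in cached_ts]
coeffs = fit_ridge(cached_ts, cached_feats, degree=2, lam=1e-9)
predicted = forecast(coeffs, 0.8)  # forecast at a skipped future step
```

In a real denoiser the same fit would be applied to every feature dimension, and the forecast would replace a network evaluation at the skipped step; how steps are chosen for skipping is not covered by this sketch.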
Paperid: 2917,   Poster  
Authors: Jianshi Wu, Minghang Zhu, dq Liu, Wen Li, Sheng Ao, Siqi Shen, Chenglu Wen, Cheng Wang
Title: LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization
Abstract: LiDAR relocalization has attracted increasing attention as it can deliver accurate 6DoF pose estimation in complex 3D environments. Recent learning-based regression methods offer efficient solutions by directly predicting global poses without the need for explicit map storage. However, these methods often struggle in challenging scenes due to their equal treatment of all predicted points, which is vulnerable to noise and outliers. In this paper, we propose LEADER, a robust LiDAR-based localization framework enhanced by a simple yet effective geometric encoder. Specifically, a Robust Projection-based Geometric Encoder architecture that captures multi-scale geometric features is first presented to enhance descriptiveness in geometric representation. A Truncated Relative Reliability loss is then formulated to model point-wise ambiguity and mitigate the influence of unreliable predictions. Extensive experiments on the Oxford RobotCar and NCLT datasets demonstrate that LEADER outperforms state-of-the-art methods, achieving 24.1% and 73.9% relative reductions in position error over existing techniques, respectively. Source code will be released soon.
Paperid: 2918,   Poster  
Authors: Qiuhui Chen, Xuancheng Yao, Zhenglei Zhou, Xinyue Hu, Yi Hong
Title: EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer’s Disease
Abstract: Deep learning models for medical image analysis often act as “black boxes,” seldom aligning with clinical guidelines or explicitly linking decisions to supporting evidence. This is especially critical in Alzheimer's disease (AD), where predictions should be grounded in both anatomical and clinical findings. We present EMAD, a vision–language framework that generates structured AD diagnostic reports with each claim explicitly grounded in multimodal evidence. EMAD uses a hierarchical Sentence–Evidence–Anatomy (SEA) Grounding mechanism: (i) sentence-to-evidence grounding links generated sentences to clinical evidence phrases, and (ii) evidence-to-anatomy grounding localizes corresponding structures on 3D brain MRI. To reduce dense annotation requirements, we propose GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports. We further introduce Executable-Rule GRPO, a reinforcement fine-tuning scheme with verifiable rewards that enforces clinical consistency, protocol adherence, and reasoning–diagnosis coherence. On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods; we will release code and grounding annotations to support future research in trustworthy medical vision–language models.
Paperid: 2919,   Poster  
Authors: Xin Jiang, Hao Tang, Meiqi Cao, Junyao Gao, Fei Shen, Zechao Li
Title: DiT-Distill: Open-Set Fine-Grained Retrieval via Generative Curriculum Knowledge
Abstract: Open-set fine-grained retrieval (OSFR) is a challenging task where models must generalize to unseen subcategories. Existing methods often fail at this, as they embed category-specific semantics from closed-set training labels. Recently, diffusion transformers (DiT) have shown promise by encoding attribute-centric, generative curriculum knowledge that is agnostic to these labels. However, the vanilla DiT is not optimized for fine-grained visual discrepancies and its massive size makes deployment infeasible. To solve this, we propose DiT-Distill, a framework to first refine and then distill this knowledge. We introduce a conditional discrepancy refinement strategy to fine-tune the DiT, forcing it to focus on discrepancy-aware, attribute-centric details rather than holistic context. Subsequently, a generative curriculum distillation mechanism transfers this refined, hierarchical knowledge from multiple diffusion timesteps of the DiT into a lightweight backbone using a generative infusion module and a curriculum alignment loss. This process results in an efficient retrieval model that enables DiT-free inference. Extensive experiments show DiT-Distill achieves state-of-the-art performance on open-set fine-grained datasets.
Paperid: 2920,   Poster  
Authors: Dongyu Wang, Dar-Yen Chen, Yi-Zhe Song
Title: CaricHarmony: Contrastive Diffusion Paths for Identity-Preserving Caricature Synthesis
Abstract: Sketch-based caricature synthesis suffers from a fundamental failure mode: when identity and shape conditions are combined in diffusion models, they create destructive interference that causes inevitable collapse toward either bland portraits or unrecognisable distortions. We identify the root cause as condition signal contamination: competing probability distributions in the denoising trajectory that make balanced generation impossible. We present CaricHarmony, the first training-free method that explicitly resolves this contamination through parallel uncontaminated diffusion paths. During inference, we maintain three paths: $\mathcal{P}^{\mathrm{i}}$ (pure identity), $\mathcal{P}^{\mathrm{s}}$ (pure shape), and $\mathcal{P}^{\mathrm{i+s}}$ (harmonized output). Novel energy functions operating on cross-attention features provide gradient guidance that steers $\mathcal{P}^{\mathrm{i+s}}$ toward optimal balance: $\mathcal{E}_{\mathrm{shape}}$ ensures sketch fidelity through layout and semantic alignment, while $\mathcal{E}_{\mathrm{id}}$ employs token-level correspondence matching robust to extreme distortions. Unlike DemoCaricature, which requires 70 seconds of per-identity fine-tuning, or CaricatureBooth, which is constrained to Bezier curves, CaricHarmony accepts any sketch format and generates in under 16 seconds. Experiments demonstrate state-of-the-art performance: 0.8615 shape CLIP score (vs. 0.8450) under comparable identity consistency score, with 7.81 overall user preference score (vs. 6.06). Our method fundamentally reconceptualises the ID-shape conflict as conditioning signal contamination for diffusion models, enabling unprecedented creative control while preserving recognition.
Paperid: 2921,   Poster  
Authors: Yang Wu, Zhaojiang Liu, Qiang Meng, Youquan Liu, renliang Weng, Jianjun Qian, Jian Yang, Jin Xie
Title: GEM: Generating LiDAR World Model via Deformable Mamba
Abstract: World models, which simulate environmental dynamics and generate sensor observations, are gaining increasing attention in autonomous driving. However, progress in LiDAR-based world models has lagged behind that of models built on camera videos or occupancy data, primarily due to two core challenges: the inherent disorder of point clouds and the difficulty of distinguishing dynamic objects from static structures. To address these issues, we propose GEM: a Generative LiDAR world model that leverages a dEformable Mamba architecture, significantly improving fidelity and imaginative capability. Specifically, leveraging the structural similarity between sequential laser scanning and Mamba's processing mechanism, we first tokenize LiDAR sweeps into compact representations via a custom LiDAR scene tokenizer. After unsupervised disentanglement of tokenized features via a dynamic-static separator, a tri-path deformable Mamba is introduced to perform selective scanning and adaptive gating fusion over the disentangled features, leading to enhanced spatial-temporal understanding of the world evolution. Optionally, a planner and a BEV layout controller can be integrated to explore the model's capability for autonomous rollout and its potential to generate "what-if" scenarios. Extensive experiments show that GEM achieves state-of-the-art performance across diverse benchmarks and evaluation settings, demonstrating its superiority and effectiveness.
Paperid: 2922,   Poster  
Authors: Yan Shen, Feng Jiang, Zichen He, Xiaoqi Li, Yuchen Liu, Zhiyu Li, Ruihai Wu, Hao Dong
Title: BiPreManip: Learning Affordance-Based Bimanual Pre-Manipulation through Anticipatory Collaboration
Abstract: Many everyday objects are difficult to directly grasp (e.g., a flat iPad) or manipulate functionally (e.g., opening the cap of a pen lying on a desk). Such tasks require sequential, asymmetric coordination between two arms, where one arm performs preparatory manipulation that enables the other’s goal-directed action, for instance pushing the iPad to the table’s edge before picking it up, or lifting the pen body to allow the other hand to remove its cap. In this work, we introduce Collaborative Preparatory Manipulation, a class of bimanual manipulation tasks that demand understanding object semantics and geometry, anticipating spatial relationships, and planning long-horizon coordinated actions between the two arms. To tackle this challenge, we propose a visual affordance-based framework that first envisions the final goal-directed action and then guides one arm to perform a sequence of preparatory manipulations that facilitate the other arm’s subsequent operation. This affordance-centric representation enables anticipatory inter-arm reasoning and coordination, generalizing effectively across various objects spanning diverse categories. Extensive experiments in both simulation and the real world demonstrate that our approach substantially improves task success rates and generalization compared to competitive baselines.
Paperid: 2923,   Poster  
Authors: Zeyu Hua, HUI LI, Yu Wang, Song Wang, Congchao Zhu, Caixia Zheng
Title: Bi-Bridge: Bidirectional Diffusion Bridges for Low-Light Image Enhancement
Abstract: Low-Light Image Enhancement (LLIE) is a challenging task, as severe information loss means a single input can correspond to multiple plausible restorations. This inherent ambiguity causes conventional regression-based models to produce overly smooth results that lack detail. While recent generative models can create richer details, their common unidirectional design often compromises content fidelity by distorting original structures. We introduce Bi-Bridge, a unified framework that models both enhancement and its inverse degradation within a single symmetric diffusion bridge. By compelling the network to preserve essential content structures across both transformations, this bidirectional learning acts as a powerful constraint, leading to significantly more faithful and realistic restorations. Extensive experiments show that Bi-Bridge outperforms state-of-the-art (SOTA) methods across multiple benchmarks, establishing a new standard for fidelity and perceptual quality.
Paperid: 2924,   Poster  
Authors: Joo Hyung OH, Minyoung Oh, Sung Whan Yoon, Jae-Young Sim
Title: Pose-guided Enriched Feature Learning for Federated-by-camera Person Re-identification
Abstract: Extending person re-identification (ReID) to a federated scenario has recently drawn attention due to privacy concerns of individuals, but existing methods mostly assume sufficient diversity in pose variations even within a decentralized client. We focus on a more realistic federated-by-camera scenario, where each client corresponds to a single camera and thus captures only a sparse set of poses. To enrich pose variety, we propose Pose-guided Enriched Feature Learning (PEFL) that explicitly augments pose-diverse samples in the federated ReID scenario. Specifically, a Pose-Extraction Module (PEM) disentangles pose-relevant and pose-irrelevant feature components, where a Pose-Relationship Knowledge Distillation (PKD) method helps identify the correct pose and a Semantic Consistency Maintenance (SCM) method preserves semantics even with pose changes. In addition, a Compatibility Regularization method ensures that the PEM is compatible with the feature space of the global model. By recombining pose-relevant and pose-irrelevant components across identities via PEM, our PEFL synthesizes pose-swapped features, thereby largely facilitating contrastive learning of ReID models. Extensive experiments on Market1501 and MSMT17 under the federated-by-camera setting demonstrate that PEFL consistently outperforms federated ReID baselines and their conjunctions with the existing feature augmentation methods, thus achieving state-of-the-art federated ReID performance.
Paperid: 2925,   Poster  
Authors: Xinyang Wang, Kecheng Zheng, Minfeng Zhu, Wei Wu, Fan Lu, Wei Zhai, Wei Chen
Title: Diffusion Guided Chain-of-Vision for Large Autoregressive Vision Models
Abstract: Chain-of-Thought (CoT) has recently shown encouraging progress in vision-language models. However, the pure-vision CoT (i.e., chain-of-vision) has been underexplored in visual in-context learning. In this paper, we introduce Diffusion Guided Chain-of-Vision, which integrates an explicit chain-of-thought process into autoregressive vision models through a vision prior from pre-trained diffusion models. Concretely, we find that pre-trained diffusion models induce a reliable probability flow in image space, where intermediate images sampled along this flow exhibit visual coherence and serve as task-free, chain-of-vision supervision for pure-vision autoregressive models. Extensive experiments on diverse vision tasks and multi-scale models validate the effectiveness of our proposed method for visual in-context learning. Code and dataset will be publicly available.
Paperid: 2926,   Poster  
Authors: Jin-Cheng Jhang, Fu-En Wang, Xin Yang, Nan Qiao, Lu Xia, Min Sun, Cheng-Hao Kuo
Title: Enhancing Part-Level Point Grounding for Any Open-Source MLLMs
Abstract: Visual grounding aims to associate free-form textual queries with specific regions in an image. While recent Multimodal Large Language Models (MLLMs) have demonstrated promising capabilities in this domain, they primarily excel at object-level grounding and often struggle with part-level grounding, an essential requirement for fine-grained tasks such as robotic manipulation. In this work, we introduce a general approach that equips any open-source MLLM with accurate 2D part-level point grounding, offering a more flexible alternative to conventional grounding representations. Our method leverages the attention mechanisms inherently present in MLLMs. By synthesizing text-conditioned, grounding-aware queries within intermediate layers via the proposed Q-Synth Module, we extract target-relevant attention patterns and refine them using a lightweight Attention-to-Point Decoder that converts these patterns into a point-centric heatmap for final prediction. Notably, all original MLLM parameters are frozen, ensuring full preservation of their pre-trained capabilities. Experiments show that our design consistently improves part-level grounding accuracy across datasets and can be seamlessly integrated into any open-source MLLM.
Paperid: 2927,   Poster  
Authors: Longzhao Guo, shuo zhang, Chen Gao, Qian Tian, Youfang Lin
Title: LF-BVN: Blind-View Network for Self-Supervised Light Field Denoising
Abstract: Recent advances in learning-based Light Field (LF) image denoising have achieved impressive results. However, these methods rely heavily on large-scale noisy-clean image pairs and often fail to generalize to unseen or complex noise. In this work, we observe that the inherent multi-view consistency of LF images makes it highly unlikely for noise to be coherent across views, offering a more reliable supervisory signal for self-supervised denoising. Building on this insight, we extend the blind-spot principle to the LF domain and propose a novel LF Blind-View denoising Network (LF-BVN). We first introduce a geometric invariance mask that leverages angular redundancy for efficient full-view supervision. To enforce cross-view photometric consistency, we further introduce latent representation volumes and enforce consistency between them. Additionally, we exploit focus stacks to extract latent depth cues from noisy observations, providing further guidance. Extensive experiments show that LF-BVN achieves competitive denoising performance while maintaining strong cross-view consistency without requiring clean data or external supervision.
Paperid: 2928,   Poster  
Authors: Ruiyu Mao, Baoming Zhang, Nicholas Ruozzi, Yunhui Guo
Title: Learnability-Driven Submodular Optimization for Active Roadside BEV Perception
Abstract: Roadside perception datasets are typically constructed via cooperative labeling between synchronized vehicle and roadside frame pairs. However, real deployment often requires annotation of roadside-only data due to hardware and privacy constraints. Even human experts struggle to produce accurate labels without vehicle-side data (image, LiDAR), which not only increases annotation difficulty and cost, but also reveals a fundamental learnability problem: many roadside-only scenes contain distant, blurred, or occluded objects whose 3D properties are ambiguous from a single view and can only be reliably annotated by cross-checking paired vehicle-roadside frames. We refer to such cases as inherently ambiguous samples. To reduce wasted annotation effort on inherently ambiguous samples while still obtaining high-performing models, we turn to active learning. This work focuses on active learning for roadside monocular 3D object detection and proposes a learnability-driven framework that selects scenes which are both informative and reliably labelable, suppressing inherently ambiguous samples while ensuring coverage. Experiments demonstrate that our method, LH3D, achieves 86.06%, 67.32%, and 78.67% of full performance for vehicles, pedestrians, and cyclists respectively, using only 25% of the annotation budget on DAIR-V2X-I, significantly outperforming uncertainty-based baselines. This confirms that learnability, not uncertainty, matters for roadside 3D perception.
Paperid: 2929,   Poster  
Authors: Chengjie Huang, GUILE WU, Dongfeng Bai, Bingbing Liu
Title: VGGTracker: Fast Spatial Tracking with Visual Geometry Transformer
Abstract: Existing 3D point tracking methods mostly rely on heuristic designs or scene reconstruction, which incurs significant computational overhead and makes it difficult to meet the demands of real-time applications. To address this problem, in this work, we present VGGTracker, a novel spatial tracker that leverages a feed-forward visual geometry transformer to predict the trajectories of arbitrary query points from monocular videos in real time. Specifically, we employ a query initialization mechanism to maintain and update a global feature vector and a set of frame-level feature vectors for each query point. Then, we propose a new spatial tracking framework, which consists of a visual geometry transformer backbone, a global embedding branch, a frame-level embedding branch, and a tracking head. The key innovation lies in the dual-branch embedding design, where the global embedding branch integrates geometry-grounded features of the entire video into global query features to optimize track information across the entire sequence and the frame-level branch combines geometry-grounded features of each respective frame into frame-level query features to refine fine-grained track coordinate predictions. Furthermore, to facilitate collaboration between the global branch and the frame-level branch, we introduce an interaction module which enables unidirectional or bidirectional information exchange between the global query features and frame-level query features. Extensive experiments on various point tracking benchmark datasets show that our approach achieves significantly faster spatial tracking than state-of-the-art methods, while maintaining comparable tracking accuracy.
Paperid: 2930,   Poster  
Authors: Yiyao Wang, Sixian Zhang, Keming Zhang, Xinhang Song, Songjie Du, Shuqiang Jiang
Title: TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation
Abstract: Existing zero-shot Object Goal Navigation (ObjectNav) methods often exploit commonsense knowledge from large language or vision-language models to guide navigation. However, such knowledge arises from internet-scale text rather than embodied 3D experience, and episodic observations collected during navigation are typically discarded, preventing the accumulation of lifelong experience. To this end, we propose Trajectory RAG (TrajRAG), a retrieval-augmented generation framework that enhances large-model reasoning by retrieving geometric–semantic experiences. TrajRAG incrementally accumulates episodic observations from past navigation episodes. To structure these observations, we propose a topological-polar (topo-polar) trajectory representation that compactly encodes spatial layouts and semantic contexts, effectively removing redundancies in raw episodic observations. A hierarchical chunking structure further organizes similar topo-polar trajectories into unified summaries, enabling coarse-to-fine retrieval. During navigation, candidate frontiers generate multiple trajectory hypotheses that query TrajRAG for similar past trajectories, guiding large-model reasoning for waypoint selection. New experiences are continually consolidated into TrajRAG, enabling the accumulation of lifelong navigation experience. Experiments on MP3D, HM3D-v1, and HM3D-v2 show that TrajRAG effectively retrieves relevant geometric–semantic experiences and improves zero-shot ObjectNav performance.
Paperid: 2931,   Poster  
Authors: Shigeng Xie, Hongming Xu, Guiyang Jiang, Tuomo Rossi, Tommi Kärkkäinen, Fengyu Cong
Title: Virtual Immunohistochemistry Staining with Dual-Aligned Multi-Task Feature Guidance
Abstract: In hematoxylin-eosin (H&E) to virtual immunohistochemistry (IHC) staining, paired images enable supervised learning but suffer from inherent spatial dislocation, limiting pixel-level constraints. Thus, auxiliary tasks have been increasingly employed with paired data to provide complementary supervision. However, existing methods largely overlook the rich semantic information embedded in auxiliary task models. This paper proposes a novel framework for virtual IHC staining guided by dual-aligned multi-task features, which fully explores semantic cues from auxiliary tasks. To realize effective guidance, we address two obstacles: (1) the spatial mismatch between paired H&E and IHC feature representations; (2) the task gap between auxiliary task features and virtual staining features. To resolve the spatial mismatch, we generate an alignment matrix that aligns H&E and IHC features. Specifically, we first introduce structure-enhanced learning to restore semantic consistency in regions affected by inaccurate staining in virtual IHC images. Then, we separately cluster features from virtual IHC and real IHC images, and establish semantic correspondences using an active-passive matching mechanism. This ensures that only semantically aligned regions are matched, reducing the impact of staining variability on the alignment matrix. To bridge the task gap, we introduce a task-gap alignment module trained under the principle that auxiliary features are considered aligned if they improve the performance of the virtual IHC staining model. Extensive experiments on two public datasets with four biomarkers demonstrate the effectiveness of our framework. Our code will be publicly available.
Paperid: 2932,   Poster  
Authors: Shiyi Zhang, YIJI CHENG, Tiankai Hang, Zijin Yin, Runze He, Yu Xu, Wenxun Dai, yunlong lin, Chunyu Wang, qinglin lu, Yansong Tang
Title: Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Abstract: Unified multimodal understanding/generative models have shown improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: what forms of CoT and training strategy can jointly enhance both the understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation with two key properties: (1) Decomposability. We observe that any editing intention can be represented as a triplet: (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing editing operations on all targets. This decomposition enhances the model's understanding granularity of editing operations and guides it to learn each element of the triplet during training, substantially improving the editing capability. (2) Generalizability. In the second decomposition level, we further break down editing tasks into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model's editing behavior with its CoT reasoning, we introduce the CoT–Editing Consistency Reward, which encourages more accurate and effective utilization of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks, and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model will be released publicly.
Paperid: 2933,   Poster  
Authors: Haiwen Li, Zining Chen, Delong Liu, Zhaohui Hou, Zhicheng Zhao, Fei Su
Title: Adapting In-context Generation for Enhanced Composed Image Retrieval
Abstract: As a challenging vision-language task, Composed Image Retrieval (CIR) aims to integrate information from a bi-modal query (image + text) to retrieve target images. While supervised CIR has achieved notable success in domain-specific scenarios, its reliance on manually annotated triplets restricts its scalability and application. Zero-shot CIR alleviates this by leveraging unlabeled data or automatically collected triplets, yet it often suffers from an intractable domain gap. To this end, we shift the focus to developing robust CIR models under limited labeled data and propose Domain-Adaptive In-context Generation (DAIG), which adapts the in-context capability of a pretrained Text-to-Image (T2I) model to the target domain and the CIR task using few-shot samples and then transforms the LLM-generated textual triplets into unbiased CIR triplets as additional training data. After that, we present a two-stage framework applicable to any supervised CIR approach. The first stage, Distributionally Robust Synthetic Pretraining (DRSP), perturbs visual features to expand the distribution of synthetic data and improve training robustness on it. The second stage, Fine-grained Real-world Adaptation (FRA), fine-tunes on manually annotated triplets by imposing an angular margin on matching pairs to facilitate fine-grained learning. Experiments on two benchmarks validate the effectiveness of our method: under both few-shot and fully supervised CIR settings, DAIG yields substantial performance gains over CLIP4CIR, BLIP4CIR, and SPRC. The code and data will be released as open source.
Paperid: 2934,   Poster  
Authors: Shuxuan Li, Zhilin Zhao, Quyu Kong, Wei-Shi Zheng
Title: Bridging Domain Expertise and Generalization for Performance Estimation
Abstract: Performance estimation under distribution shift aims to predict how a model behaves on an unlabeled test set whose distribution differs from the training data, a scenario that requires reliable indicators that can faithfully reflect model behavior without ground-truth labels. Existing approaches rely solely on the outputs of the given model, whose biases are amplified once the distribution shifts, weakening the correlation with the true performance. Motivated by this limitation, we propose Fused Reference Alignment Prediction (FRAP), which leverages the complementary strengths of an external foundation model and the base model to construct a more reliable surrogate of the ground-truth labels. FRAP aligns the prediction distribution of the foundation model with that of the base model by applying temperature-scaled calibration that minimizes their divergence. The aligned predictions are fused through confidence-based weighting into a refined reference distribution that integrates robustness from the foundation model and domain-specific expertise from the base model, and performance estimation is obtained by measuring how closely the base model predictions agree with this reference. Extensive experiments across diverse datasets and architectures show that FRAP provides consistent and substantial improvements over representative performance-estimation methods under distribution shift.
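Note (illustrative, not from the paper): the three steps named in the FRAP abstract (temperature-scaled alignment, confidence-weighted fusion, agreement-based estimation) can be sketched as follows. This is only one plausible reading of the abstract. The use of KL divergence, the grid search over temperatures, the max-probability confidence weight, the argmax agreement rule, all function names, and the toy logits are assumptions introduced here for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with a temperature (assumed form of the calibration).
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    # KL(p || q); the abstract only says "divergence", so KL is an assumption.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def fit_temperature(found_logits, base_probs, grid):
    # Pick the temperature whose scaled foundation predictions are closest
    # (in mean KL) to the base model's predictions.
    def mean_kl(t):
        pairs = zip(found_logits, base_probs)
        return sum(kl(b, softmax(f, t)) for f, b in pairs) / len(base_probs)
    return min(grid, key=mean_kl)

def fuse(p_found, p_base):
    # Confidence-weighted mixture; confidence = max probability (assumption).
    wf, wb = max(p_found), max(p_base)
    return [(wf * a + wb * b) / (wf + wb) for a, b in zip(p_found, p_base)]

def estimate_accuracy(found_logits, base_logits, grid=(0.5, 1.0, 2.0, 4.0)):
    # Estimated accuracy = fraction of samples where the base model's argmax
    # matches the argmax of the fused reference distribution.
    base_probs = [softmax(l) for l in base_logits]
    t = fit_temperature(found_logits, base_probs, grid)
    agree = 0
    for f, b in zip(found_logits, base_probs):
        ref = fuse(softmax(f, t), b)
        agree += ref.index(max(ref)) == b.index(max(b))
    return agree / len(base_probs), t

# Toy logits for three unlabeled samples (made-up values).
found_logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
base_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 0.0, 0.5]]
acc, temperature = estimate_accuracy(found_logits, base_logits)
```

In this toy case the two models disagree on the third sample, so the sketch scores the base model on two of three samples as agreeing with the reference.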
Paperid: 2935,   Poster  
Authors: Xinyue Liu, Jin Liu, Hongbo Wang, Ran He, Huaibo Huang
Title: Think-Then-Generate: Structural Chain-of-Thought Reasoning for Consistent 3D Generation
Abstract: Recently, generating 3D assets using visual priors from pretrained diffusion models has shown remarkable results. However, due to the inherent lack of 3D geometric priors in 2D diffusion, the synthesized results often suffer from spatial hallucination and multi-view inconsistency. To address this limitation, we propose Thoughtful3D, a novel framework that enhances 3D content generation quality by introducing structural chain-of-thought (CoT) reasoning to alleviate inconsistency issues and mitigate hallucinations. Specifically, we design a dual-phase structural CoT strategy: (1) 3DBlueprint-CoT explicitly plans the 3D generation process through textual semantic parsing and logical deduction during the initialization phase. (2) 3DRefine-CoT dynamically evaluates latent inconsistencies by analyzing multiple renderings, employing a multi-round iterative refinement mechanism to suppress hallucinations and enhance cross-view consistency. To further promote consistency across views, we propose a Cross-view Semantic Appearance Alignment strategy that enhances multi-view consistency by establishing dynamic geometric associations between the same features from different viewpoints. Extensive experiments demonstrate that Thoughtful3D significantly improves the quality and consistency of generated 3D assets.
Paperid: 2936,   Poster  
Authors: Hongkun Pan, Yuwei Wu, Wanyi Hong, ShengHui Hu, Qitong Yan, Yi Yang, Rufei Han, Changju Zhou, Minfeng Zhu, Dongming Han, Wei Chen
Title: Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
Abstract: Multimodal large language models (MLLMs) have shown considerable potential in chart understanding and reasoning tasks. However, they still struggle with high information density (HID) charts characterized by multiple subplots, legends, and dense annotations due to three major challenges: (1) limited fine-grained perception results in the omission of critical visual cues; (2) redundant or noisy visual information undermines the performance of multimodal reasoning; (3) lack of adaptive deep reasoning relative to the amount of visual information. To tackle these challenges, we present a novel focus-driven fine-grained chart reasoning model, Chart-FR1, to improve perception, focusing efficiency, and adaptive deep reasoning on HID charts. Specifically, we propose Focus-CoT, a visual focusing chain-of-thought that enhances fine-grained perception by explicitly linking reasoning steps to key visual cues, such as local image regions and OCR signals. Building on this, we introduce Focus-GRPO, a focus-driven reinforcement learning algorithm with an information-efficiency reward that compresses redundant visual information for efficient focusing, and an adaptive KL penalty mechanism that enables flexible control over reasoning depth when more visual cues are discovered. Furthermore, to fill the gap in benchmarks for HID charts, we release HID-Chart, a challenging benchmark with an information-density metric designed to evaluate fine-grained chart reasoning capabilities. Extensive experiments on multiple chart benchmarks demonstrate that our Chart-FR1 outperforms state-of-the-art MLLMs in chart understanding and reasoning.
Paperid: 2937,   Poster  
Authors: Bin Liu, Bin Liu, Qianqian Wang, Wei Feng, yijiechen yijiechen, Haixi Zhang
Title: Rethinking Cross-Modal Anchor Alignment for Mitigating Error Accumulation
Abstract: Mitigating noisy correspondence in cross-modal matching poses a serious challenge due to the problem of error accumulation. Existing methods primarily attribute this accumulation to errors caused by noisy sample pairs. However, a novel source of error from clean sample pairs (also termed anchor pairs) is discovered in this paper. Such error accumulation is considered to arise from modality-inconsistent correlations. To address this issue, a novel method termed Geometric-Semantic Learning (GSL) is proposed. First, GSL leverages the Fourier transform to emphasize semantic representations and reduce cross-modal inconsistencies caused by perturbations in non-critical fine-grained features, thereby alleviating the error accumulation problem. After that, a Geometry-Aware Label Correction (GALC) method is introduced to re-estimate soft correspondence labels by leveraging angular consistency between noisy sample pairs and anchor pairs across different modalities. Finally, a semantically constrained triplet loss is employed to regulate sample distances using semantic information, enabling robust separation of clean and noisy pairs during the training process. Extensive experiments on three benchmark datasets demonstrate that GSL consistently outperforms existing methods in retrieval accuracy.
Paperid: 2938,   Poster  
Authors: Jiahao Yang, Zihan Wang, Xiangyang Li, Xing Zhu, Yujun Shen, Yinghao Xu, Shuqiang Jiang
Title: GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
Abstract: Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV), a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)-based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-scale 3D reconstruction tasks. Together, these complementary cues (explicit depth-based projection and implicit learned priors) yield compact yet spatially expressive representations that substantially improve navigation efficiency and performance. Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training, demonstrating the robustness and data efficiency of the proposed GA-VLN framework.
Paperid: 2939,   Poster  
Authors: Jiangang Ding, Yiquan Du, Pengxiang Li, Lili Pei, Yuanlin Zhao, Wei Li
Title: SynthRGB-T: Language-Vision Guided Image Translation for Diversity Synthesis
Abstract: Bridging the modality gap between infrared and visible imagery is critical for cross-modal understanding and for enriching multimodal benchmarks. However, existing approaches remain confined to one-to-one mappings and are typically evaluated on unidirectional or closed-set scenarios. To address this challenge, we present SynthRGB-T, a unified framework for diverse and bidirectional image translation. Specifically, we formulate image translation as a vision-language guided denoising diffusion process, enabling flexible conditioning and open-world generalization. To enhance semantic alignment, a Visual Grounding Pipeline (VGP) is introduced to exploit the world knowledge of foundation models for fine-grained translation guidance. During the diffusion process, we adopt a decoupling injection strategy to alleviate interference among multiple guidance signals. In addition, a Dual Conditional Cross-Attention (DCCA) module is designed to facilitate collaborative representation learning in latent space. SynthRGB-T is simple and versatile, capable of synthesizing diverse, high-fidelity data that substantially extends multimodal resources within the community. Comprehensive evaluations on multiple real-world benchmarks confirm that SynthRGB-T delivers superior performance and enhanced visual diversity over existing approaches. All code, models, and large-scale synthetic datasets will be released with the camera-ready version.
Paperid: 2940,   Poster  
Authors: Wonhyeok Choi, Kyumin Hwang, Jihun Park, Kyoungmin Lee, Seunghun Lee, Jaeyeul Kim, Minwoo Choi, Sunghoon Im
Title: TaskForce: Cooperative Multi-agent Reinforcement Learning for Multi-task Optimization
Abstract: Multi-task learning (MTL) involves the simultaneous optimization of multiple task-specific losses, often leading to gradient conflicts and scale imbalances that result in negative transfer. While existing multi-task optimization methods attempt to mitigate these challenges, they either lack the stochasticity needed to escape poor local minima or fail to explicitly address conflicts at the gradient level. In this work, we propose TaskForce, a novel multi-task optimization framework incorporating cooperative multi-agent reinforcement learning (MARL), where agents learn to find an effective joint optimization strategy based on their respective task gradients and losses. To keep the optimization process compact yet informative, agents observe a summary of the training dynamics that consists of the gradient Gram matrix (capturing both gradient magnitudes and pairwise alignments) and task loss values. Each agent then predicts the balancing parameters that determine the weight of its contribution to the final gradient update. Crucially, we design a hybrid reward function that incorporates both gradient-based signals and loss improvement dynamics, enabling agents to effectively resolve gradient conflicts and avoid poor convergence by considering both direct gradient information and the resulting impact on loss reduction. TaskForce achieves consistent improvements over state-of-the-art MTL baselines on NYU-v2, Cityscapes, and QM9, demonstrating the promise of cooperative MARL in complex multi-task scenarios.
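Note (illustrative, not from the paper): the abstract says the gradient Gram matrix captures both gradient magnitudes and pairwise alignments. The minimal sketch below shows why that holds in general: the diagonal holds squared gradient norms, and the off-diagonal entries, once normalized, give cosine similarities whose negative values indicate conflicting tasks. The function names and the toy gradients are hypothetical, and nothing here reflects how TaskForce itself builds its observations.

```python
import math

def gram_matrix(grads):
    # G[i][j] = <g_i, g_j> for a list of flattened per-task gradients.
    return [[sum(a * b for a, b in zip(gi, gj)) for gj in grads] for gi in grads]

def cosine_from_gram(G, i, j):
    # Pairwise alignment recovered from the Gram matrix alone.
    return G[i][j] / math.sqrt(G[i][i] * G[j][j])

# Toy gradients for three hypothetical tasks (made-up values).
grads = [
    [1.0, 0.0],   # task 0
    [0.0, 2.0],   # task 1: orthogonal to task 0
    [-3.0, 0.0],  # task 2: directly opposed to task 0
]

G = gram_matrix(grads)
norms = [math.sqrt(G[i][i]) for i in range(len(grads))]  # gradient magnitudes
conflict_02 = cosine_from_gram(G, 0, 2)                  # negative means conflict
aligned_01 = cosine_from_gram(G, 0, 1)                   # zero means orthogonal
```

Because the matrix has only one entry per task pair, its size depends on the number of tasks rather than the number of model parameters, which is consistent with the abstract's description of the observation as compact.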
Paperid: 2941,   Poster  
Authors: Tengyu Ma, Zhilong Dai, Yubo Diao, Guanming An, Long Ma, Jinyuan Liu, Risheng Liu
Title: Taming Generative Diffusion Model for Task-Oriented Infrared Imaging
Abstract: Infrared (IR) imaging is indispensable for perception in adverse environments, yet real-world data is often corrupted by dynamically coupled degradations that impair both visual quality and downstream semantic understanding. Although diffusion models offer powerful generative priors, existing approaches remain ill-suited to this setting. Their slow multi-step sampling, reliance on RGB-driven statistics misaligned with IR physics, and the necessity for costly fine-tuning of all model parameters render them impractical for dynamic IR perception. We present a unified diffusion framework that re-formulates IR restoration as a single-step generative process. The core idea is to associate each degraded input with a specific intermediate latent state in the diffusion trajectory, enabling the model to reconstruct the clean image via a single, direct reverse step. Physical realism is further reinforced through an IR-specific spectral regularization that preserves the characteristic energy distribution of thermal emissions. Addressing the diverse and rapidly shifting demands of dynamic IR perception, we further develop a task-aware low-rank adaptation mechanism. This mechanism employs a lightweight prompting hypernetwork to generate compact modulation parameters, facilitating rapid and scalable adaptation ability without retraining the entire network. Comprehensive evaluations demonstrate that our framework attains state-of-the-art restoration performance, preserves reliable semantic structures, and supports rapid adaptation that generalizes effectively across diverse tasks and conditions.
Paperid: 2942,   Poster  
Authors: runsen liu, Aizemaitijiang Baoerhan, Zhangyu Wang, Jie Wang, Jinghao Cui, GuizhenYu GuizhenYu, Songyue Yang, WanCheng Sun, Mingjun Tang, Zhanbo Hua, Wenwen Luo
Title: URScenes: A Multi-scenario Dataset for Unstructured Road Environments
Abstract: As autonomous driving technology transitions from small-scale validation to large-scale deployment, its development in unstructured road environments has become a critical and inevitable trend. Autonomous vehicles increasingly rely on high-quality and diverse datasets for perception systems. However, existing public datasets predominantly focus on clear-weather and urban-road scenarios, leaving a significant gap in the coverage of unstructured road environments. To bridge this gap, we construct URScenes, the first multi-scenario, open-source perception dataset for unstructured road environments. The dataset consists of 472 scenes, each lasting 30 seconds, and provides over 28K annotated samples and 119K sweeps. URScenes, for the first time, covers eight typical scenarios: rainy, snowy, foggy, dusty, glare, night, cloudy, and sunny conditions. Additionally, URScenes supports multi-task perception for 3D object detection, multi-object tracking, and 3D occupancy in unstructured road environments. URScenes also provides a unified annotation system and format conversion tools, enabling easy conversion to popular formats such as NuScenes, KITTI, and Waymo. Finally, this study presents comparative experimental results to assess the performance of state-of-the-art algorithms on the URScenes dataset. The data, development toolkit, and additional information are available online.
Paperid: 2943,   Poster  
Authors: Jiachen Tu, Guanghui Qin, Theodore Zhao, Jeya Maria Jose Valanarasu, Sheng Zhang, Tristan Naumann, Fan Lam, Sheng Wang, Hoifung Poon
Title: Masked-Diffusion Autoencoders for 3D Medical Vision Representation Learning
Abstract: Effective medical image analysis requires representations that capture both global anatomical structure and fine-grained tissue texture. Current self-supervised approaches exhibit limited capacity to address both requirements simultaneously. Invariance-based methods learn through augmentation consistency but face challenges in medical imaging where common augmentations may discard diagnostically relevant intensity patterns. Masked image modeling approaches employ high masking ratios to enforce holistic reasoning, yet inherently limit exposure to fine-grained texture. Recent work in general-domain vision demonstrates that generative and semantic objectives can mutually benefit each other, yet this paradigm remains unexplored for 3D medical imaging. We introduce Masked-Diffusion Autoencoders (MDAE), a self-supervised framework that imposes concurrent spatial masking and diffusion corruption, encouraging the model to learn complementary objectives: masked region reconstruction for structural coherence and visible region denoising for textural characteristics. This dual corruption enables the network to learn structure-texture representations within a unified time-conditioned objective. Evaluated on brain MRI across tumor classification, molecular marker detection, and dense segmentation benchmarks, MDAE consistently outperforms state-of-the-art baselines, with improvements most pronounced in cross-modal generalization tasks.
Paperid: 2944,   Poster  
Authors: Junlin Xie, Quanlong Zheng, Ruifei Zhang, Kuo Wang, Yanhao Zhang, Jinguo Luo, Haonan Lu, Xiang Wan, Guanbin Li
Title: StreamRAG: Enhancing Real-Time Video Understanding with Retrieval Augmentation
Abstract: The transition of Retrieval-Augmented Generation (RAG) from offline video analysis to online, streaming scenarios presents a set of critical, unexplored challenges. These include the need for on-the-fly semantic segmentation of continuous video, the inherent tension between low-latency processing and high-quality knowledge extraction, and the demand for query-specific temporal reasoning. We propose StreamRAG, a novel framework designed to overcome these hurdles. StreamRAG is built upon three core technical pillars: (1) a Stream Event Segmentation (SES) module that performs real-time boundary detection to chunk the stream into meaningful units; (2) a Token-Reusing Accelerator that drastically cuts down captioning latency by leveraging computational overlap between consecutive frames; and (3) a Dynamic Retrieval Gate that modulates the retrieval scope and strategy based on the query's temporal sensitivity and contextual similarity. Empirical evaluation confirms that StreamRAG establishes a new state-of-the-art, delivering superior accuracy with minimal latency in streaming video comprehension.
Paperid: 2945,   Poster  
Authors: Chuanjin Fan, Lifan Wu, Wenjie Chang, Hanzhi Chang, Wenfei Yang, Tianzhu Zhang
Title: EMR-SM: Explicit Mesh Reconstruction with Dynamic Topology Adaptation
Abstract: Reconstructing surface meshes from multi-view images has remained a core challenge in recent years. Most existing methods, whether implicit or explicit, depend on intermediate representations and post-processing steps like Marching Cubes or TSDF fusion, often resulting in artifacts and fragmented geometry. Directly optimizing explicit meshes is a promising approach. However, it presents two critical challenges. The first is how to adaptively refine mesh topology to capture detail without introducing degenerate faces. The second is how to maintain consistent UV coordinates for high-fidelity texturing as the mesh structure evolves. To overcome these, we propose EMR-SM, a novel framework that directly optimizes explicit meshes by integrating differentiable optimization with discrete topology updates. Specifically, we introduce an adaptive vertex splitting and merging strategy, along with real-time UV maintenance, to enable coarse-to-fine optimization while preserving geometric integrity. To our knowledge, EMR-SM is the first framework that directly optimizes meshes with real-time adaptive topology refinement. Extensive experiments demonstrate that EMR-SM achieves a balance among accuracy, computational efficiency, and mesh conciseness.
Paperid: 2946,   Poster  
Authors: Gong Chen, Chaokun Zhang, Pengcheng Lyu
Title: CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions
Abstract: Cooperative perception lets agents share information to expand coverage and improve scene understanding. However, in real-world scenarios, diverse and unpredictable corruptions undermine its robustness and generalization. To address these challenges, we introduce CoopDiff, a diffusion-based cooperative perception framework that mitigates corruptions via a denoising mechanism. CoopDiff adopts a teacher-student paradigm: the Quality-Aware Teacher performs voxel-level early fusion with Quality of Interest weighting and semantic guidance, then produces clean supervision features via a diffusion denoiser. The Dual-Branch Diffusion Student first separates ego and cooperative streams in encoding to reconstruct the teacher's clean targets. Then, an Ego-Guided Cross-Attention mechanism facilitates balanced decoding under degradation by adaptively integrating ego and cooperative features. We evaluate CoopDiff on two constructed multi-degradation benchmarks, OPV2Vn and DAIR-V2Xn, each incorporating six corruption types, including environmental and sensor-level distortions. Benefiting from the inherent denoising properties of diffusion, CoopDiff consistently outperforms prior methods across all degradation types and lowers the relative corruption error. Furthermore, it offers a tunable balance between precision and inference efficiency.
Paperid: 2947,   Poster  
Authors: Yinghao Chen, Yeying Jin, Xiang Chen, Yanyan Wei, Ziyang Yan, Yaowen Fu
Title: Unpaired Deep Image Deraining Using Reward-Guided Self-Reinforcement Learning
Abstract: Unsupervised deraining has attracted increasing attention due to its flexible data requirements during model training. Lacking paired supervision makes it challenging for the network to achieve a compact optimization space within complex and diverse rain degradation data. Additionally, some high-quality deraining results produced during the network's training process are overlooked, despite their potential to constrain the optimization space. To overcome these issues, we introduce a Reward-Guided Self-reinforcement Unsupervised Image Deraining framework, RGSUD. Our RGSUD consists of two stages: reward recycling and self-reinforcement (SR) strategy training. For the former, we propose a Vision Language Model (VLM) based dynamic reward recycling mechanism to select the optimal deraining results from outputs during model training. In this way, we can robustly collect high-quality deraining results. For the latter, reward-driven optimization is adopted to construct the connection between the rewards and the current deraining result, which constrains the optimization space of RGSUD. Thus, the network can learn deraining knowledge within a more compact optimization space, further enhancing deraining performance. The proposed SR strategy achieves over 1 dB improvement on Rain100L and the real-world dataset RealRain1K-L, compared to the baseline. Extensive experiments on multiple datasets demonstrate that our proposed framework performs favorably against state-of-the-art unsupervised deraining methods.
Paperid: 2948,   Poster  
Authors: jusheng zhang, Xiaoyang Guo, Kaitong Cai, Qinhan Lyu, Yijia Fan, Wenhao Chai, Jian Wang, Keze Wang
Title: HTC-VLM: Disentangled Hybrid Token Compression for Vision-Language Models
Abstract: Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens to LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics like object identities, while discrete quantization loses granular details such as textures. We challenge this by introducing HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed to one token via a disentanglement attention mask and a bottleneck, ensuring efficient, grounded representations. HTC-VLM achieves an average performance retention of 87.2% across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0% with a 580-to-1 compression ratio. Attention analyses show the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid can resolve the efficiency–fidelity dilemma, advancing scalable VLMs.
Paperid: 2949,   Poster  
Authors: Ouyangzi Ye, Feifei Shao, Kexin Li, Yawei Luo, Zikai Song, Ping Liu, Fengda Zhang, Hongwei Wang, Jun Xiao
Title: Low-Rank Test-Time Training for Pre-Trained Point Cloud Models
Abstract: Test-time training (TTT) enhances the robustness of pretrained models to out-of-distribution (OOD) data through auxiliary self-supervised tasks, without requiring labeled samples. However, existing TTT methods predominantly rely on decoder-based auxiliary objectives, which suffer from inefficient adaptation and weak coupling with the primary task. To address these limitations, we revisit the mechanism of test-time training by analyzing masking-based pretrained models to uncover the fundamental source of their OOD robustness. Our investigation reveals that their generalization capability stems from a latent feature-level structural invariance: the consistency of encoded representations under masked perturbations. Building on this insight, we introduce LoTT-PC, a lightweight LoRA-based framework that operationalizes this invariance-preserving principle for 3D point cloud classification. LoTT-PC consists of two main components: (1) low-rank modulation units for parameter-efficient adaptation, and (2) a permutation-invariant alignment mechanism that enforces representation consistency through masked feature alignment. Extensive experiments on multiple benchmarks demonstrate that this unified design enables pretrained point cloud models to self-tune rapidly and reliably across diverse OOD scenarios, outperforming state-of-the-art methods by an average of 2.7% in accuracy under various corruption types.
Paperid: 2950,   Poster  
Authors: XingYu Yang, Yu Zhang, Siya Mi, Xiu-Shen Wei
Title: TAR: Token-Aware Refinement for Fine-grained Generalized Category Discovery
Abstract: For an unlabelled dataset containing known and unknown categories, Generalized Category Discovery (GCD) aims to classify the known categories exactly while simultaneously discovering the unknown categories. Current GCD methods have achieved significant progress on coarse-grained datasets but still struggle to generalize to fine-grained scenarios. We observe that attention artifacts, a phenomenon where the attention map exhibits abnormally high responses concentrated on a few tokens, significantly interfere with fine-grained GCD. In this paper, we argue that attention artifacts compel the model to overemphasize global semantics, consequently overlooking fine-grained local cues that are crucial for category discrimination. We propose the Token-Aware Refinement (TAR) framework, which introduces a plug-and-play module to mitigate the impact of attention artifacts and enhance the concentration of local information. TAR departs from the conventional classification paradigm that relies solely on the first token as input to the classifier. Instead, it fully exploits the entire token sequence, thereby significantly enhancing the model's focus on fine-grained local information. Extensive experiments demonstrate the superior performance of TAR across various benchmarks.
Paperid: 2951,   Poster  
Authors: Souymodip Chakraborty, Ankur Singh, Amit Vikram Singh, Vineet Batra, Ankit Phogat
Title: ShapeAR: Generating Editable Shape Layers via Autoregressive Diffusion
Abstract: We present ShapeAR, a novel autoregressive latent diffusion framework that decomposes raster images into editable, artist-like vector shape layers. Unlike conventional raster-to-SVG methods that rely on boundary tracing or joint path optimization, ShapeAR generates non-overlapping RGBA shape layers directly in latent space via flow-matching diffusion. To scale generation to complex scenes with many shapes, we formulate the process autoregressively, conditioning each step on both the input image (global context) and the partial composition of previously generated layers (local context). In addition, we propose geometry-aware evaluation metrics that quantify the aesthetic and structural quality of the generated shapes, enabling more rigorous assessment beyond pixel-level reconstruction. ShapeAR achieves cleaner decompositions and more coherent vector layers.
Paperid: 2952,   Poster  
Authors: Hao jiacheng, Chunying Liu, Hao Guo, RuohanWang ruohan, Hongping Gan, Yilei Shi
Title: PP-Brep: Few-Shot B-rep Classification with Hybrid Graph Representation
Abstract: In industrial settings, classification of 3D CAD models is critical for efficient manufacturing. However, the limited availability of annotated CAD models presents an obstacle to achieving rapid adaptation in few-shot part classification scenarios. In this paper, we propose a hybrid graph representation and a pre-training and graph prompt framework for B-rep few-shot classification. Specifically, the hybrid graph representation captures comprehensive and multi-level structural information of B-rep models by constructing a local topology graph, a global parallel graph, and a regional association hypergraph. A hierarchical graph network then fuses component-level structures with topological details in the hybrid graph. Reinforcement-augmented contrastive pre-training produces robust universal representations while in-place perturbation reduces training time. Structure-aware graph prompts finally produce node-specific cues, enabling few-shot B-rep part classification without heavy fine-tuning. Experiments on the TraceParts-11 and FabWave-31 datasets show that our method outperforms existing general-purpose approaches. This work provides an efficient and state-of-the-art solution for few-shot B-rep part classification.
Paperid: 2953,   Poster  
Authors: Jiashi Lin, Changhong Jiang, Xiangru Lin, Ruifei Zhang, Xinyi Zhu, Jiyao Liu, Cheng Tang, Ye Du, Shujian Gao, Junzhi Ning, Lihao Liu, Ziyan Huang, Tianbin Li, Jin Ye, Junjun He
Title: EvoGraph-R1: Self-Evolving Multimodal Knowledge Hypergraphs for Agentic Retrieval
Abstract: Retrieval-augmented generation (RAG) has emerged as a critical paradigm for grounding Multimodal Large Language Models (MLLMs) in external knowledge. Recent GraphRAG methods introduce structured entity-relation graphs to improve retrieval and reasoning. However, they remain limited by treating knowledge graphs as static data structures built offline and queried in a single pass. This static paradigm misaligns with the interactive, iterative nature of knowledge-intensive reasoning, creating three bottlenecks: (i) text-centric fragmentation that impedes cross-modal reasoning, (ii) frozen structures unable to incorporate new evidence or correct errors, and (iii) rigid single-pass retrieval without adaptive refinement. To overcome these limitations, we introduce EvoGraph-R1, a self-evolving GraphRAG framework that reconceptualizes knowledge graphs as dynamic environments shaped through agent interactions. We formulate retrieval as a Markov Decision Process (MDP) where the agent observes the graph state and executes actions to query (GraphRetrieve), expand (WebSearch), refine (GraphEdit), or terminate (Answer) the reasoning. These actions reshape the hypergraph structure and generate feedback signals that guide subsequent evolution. Through this closed loop, the hypergraph evolves by integrating new evidence, correcting errors, and refining structure to support multi-hop reasoning. Experiments on multimodal VQA and text QA benchmarks demonstrate substantial improvements over existing RAG baselines in accuracy, coverage, and traceability, establishing self-evolving knowledge graphs as a fundamental paradigm across modalities.
Paperid: 2954,   Poster  
Authors: Dongqian Guo, Haoran Wei, Wencheng Han, Runzhou Tao, Zhongying Qiu, Jianfei Yang, Jianbing Shen
Title: DriveVLN: Towards Mapless Vision-and-Language Navigation in Autonomous Driving
Abstract: Autonomous driving has made substantial progress recently, achieving reliable performance in most real-world environments. However, existing algorithms still depend heavily on high-definition maps, making them ineffective in mapless scenarios such as indoor parking lots. These limitations hinder seamless point-to-point navigation and restrict the broader deployment of the autonomous driving system. To address this challenge, we propose DriveVLN, a new task that extends Vision-and-Language Navigation (VLN) to autonomous driving. DriveVLN employs visual and linguistic priors to guide vehicles toward destinations based solely on concise natural-language descriptions, without access to predefined maps or routes. Unlike conventional VLN, which relies on detailed step-wise instructions in indoor environments, DriveVLN requires models to produce navigation information based on diverse visual cues and history, including signs, landmarks, and textual indicators. We further develop a CARLA-based simulation engine comprising over 200 realistic scenes reconstructed from real road scans, enabling large-scale training and closed-loop evaluation. A baseline model is established through supervised fine-tuning on real data, followed by reinforcement learning in simulation. Comprehensive experiments show that DriveVLN effectively bridges map-based and mapless driving, providing a new foundation for unified, language-driven autonomous navigation in complex real-world environments.
Paperid: 2955,   Poster  
Authors: Yang Fu, Yuliang Zou, Hao Xiang, Xin Huang, Yijing Bai, Chen Song, Weijing Shi, Govind Thattai, Dragomir Anguelov, Mingxing Tan, Yingwei Li
Title: Scene Reconstruction as Mapping Priors for 3D Detection
Abstract: In autonomous driving, mapping is critical for motion planning but remains an underutilized resource for perception tasks like 3D object detection. Maps can provide robust structural priors of the static environment, suited to resolving ambiguities and correcting for sensor data sparsity or noise, issues especially prevalent for distant objects or during adverse weather conditions. However, conventional High-Definition (HD) maps are resource-intensive to obtain and maintain, which presents a challenge for achieving efficient, large-scale deployment. In this paper, we propose a scalable solution to systemically leverage mapping to improve 3D detection by overcoming two primary challenges. First, we introduce a pipeline to automatically build dense mapping priors from aggregated sensor data, eliminating the need for human labeling. Second, we design a novel Mapping Prior Augmented 3D detection (MPA3D) framework to effectively integrate the mapping priors with the distinct modalities of sensor data. Our extensive experiments on the Waymo Open Dataset demonstrate that our approach achieves new state-of-the-art results, proving the effectiveness of using scalable, reconstructed scene priors to enhance 3D detection.
Paperid: 2956,   Poster  
Authors: Zhi-Wei SHI, Yu-Bang Zheng, Heng-Chao Li
Title: Self-Attention Driven Tensor Representation for High-Order Data Recovery
Abstract: Low-rank tensor representation (LRTR) is an effective tool for compactly modeling high-order data. While nonlinear LRTR models can better capture real-world nonlinear dependencies, most existing methods rely on fixed mappings of multilayer perceptrons (MLPs) or convolutional neural networks (CNNs), limiting their ability to model complex global dependencies. To overcome this limitation, we construct a novel paradigm called Self-Attention Driven Tensor Representation (SADTR), which is the first framework that models nonlinearity from the perspective of self-attention. Specifically, we design a factor self-representation mechanism to establish dynamic global mapping, thereby adaptively capturing both local and non-local nonlinear dependencies. Moreover, we introduce an implicit sparse representation to impose a sparsity constraint while avoiding additional optimization problems. As a result, the proposed SADTR can achieve a more accurate low-rank representation. In theory, we provide a detailed analysis to demonstrate the recoverability of SADTR. To validate the effectiveness of SADTR, we apply it to three representative high-order data recovery tasks. Experimental results demonstrate that SADTR consistently outperforms existing state-of-the-art LRTR methods.
Paperid: 2957,   Poster  
Authors: Xicheng Gong, Qiwei Li, Peiran Xu, Yadong Mu
Title: Extending Embodied Question Answering from Perception to Decision
Abstract: Embodied Question Answering (EQA) connects perception, reasoning, and interaction within embodied environments. However, existing datasets and benchmarks remain fragmented, each focusing on a limited subset of reasoning skills such as spatial understanding or procedural reasoning, without offering a unified large-scale framework for comprehensive evaluation. We present EQA-Decision, a large-scale embodied QA dataset that systematically covers four complementary dimensions of embodied reasoning: static scene construction, spatial understanding, task dynamics reasoning, and instant decision. The dataset contains over four million question–answer pairs with hierarchical annotations across diverse embodied scenarios. In addition, we develop RoboDecision, a strong baseline model aligned with the EQA-Decision Benchmark, providing a unified framework that jointly evaluates perception, reasoning, and action-level decision-making in embodied environments. Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities in spatial and interaction reasoning, providing a solid foundation for advancing embodied intelligence research.
Paperid: 2958,   Poster  
Authors: Siyu Luan, Yan Li, Zhong Chen, Zhenyi Wang
Title: Adaptive Bayesian Early-Exit Networks for Efficient Non-Transferable Learning
Abstract: Non-transferable learning (NTL) aims to enforce usage restrictions by limiting a model’s generalization on target-domain data while maintaining its utility on the source domain. Current approaches face three major challenges: (1) low training efficiency due to retraining of the backbone network, (2) low inference efficiency, and (3) a rigid reliance on a shared, non-adaptive backbone network spanning both source and target domains. This shared setup, which aims to maximize source-domain performance and minimize target-domain performance, often introduces optimization conflicts due to overlapping class categories across source and target domains. In this paper, we propose a novel and efficient NTL approach using a dynamic Early-Exit Network, named ENL-DEE, which leverages Bayesian theory and dynamic neural networks to address these limitations. Our custom loss function guides source-domain data to exit at later stages of the network, maximizing model utility, while target-domain data exits earlier with non-semantic features, ensuring limited transferability. ENL-DEE offers three key advantages: (1) it enhances training efficiency by optimizing only the parameters of dynamic exit classifiers, bypassing the need to retrain the backbone; (2) it improves inference efficiency as data exits at various exit classifiers in the network; and (3) it resolves optimization conflicts by using distinct parameter sets for source and target domains, achieving higher performance on the source domain and lower performance on the target domain, thereby strengthening NTL. Extensive experiments across diverse datasets and model architectures validate the scalability, efficiency, and effectiveness of our approach.
Paperid: 2959,   Poster  
Authors: Hongtao Yang, Bineng Zhong, Qihua Liang, Yaozong Zheng, Xiantao Hu, Yuanliang Xue, Shuxiang Song
Title: Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking
Abstract: Given the real-time demands of UAV tracking, many methods simplify the backbone to reduce computation, but this often weakens feature representation and degrades performance in complex scenarios. To alleviate this issue, we propose EATrack, an efficient and asymmetric UAV tracking framework centered around a teacher-guided dual-branch distillation strategy that enhances the feature expressiveness of the lightweight student model. Specifically, EATrack investigates two complementary perspectives of knowledge transfer: a spatially focused feature-level distillation that compensates for weakened representations by guiding the student to learn strong target representations, and a prediction-level distillation that enhances spatial localization by learning the teacher’s capability of accurate target localization. Furthermore, to enhance robustness against appearance variations, we introduce a fine-grained target-aware distillation strategy that selectively transfers the teacher’s target modeling capacity to the student. While the asymmetric architecture improves efficiency, it limits temporal adaptability. To mitigate this, a temporal adaptation module is incorporated at inference to enhance robustness over time. Experiments on five UAV benchmarks demonstrate that EATrack achieves a favorable balance between accuracy and speed, with EATrack-DeiT improving average success rate by 1.2% over the previous SOTA while running at 241.9 FPS on GPU.
Paperid: 2960,   Poster  
Authors: YiZhou Li, Jinyi Xu, Mingyu Yin, Xianyi Zhao
Title: Edge-RecViT: Efficient Vision Transformer via Semantic-Refined Dynamic Recursion
Abstract: Vision Transformers (ViTs) have achieved remarkable progress in visual and multimodal tasks, yet their deployment remains costly. Token-adaptive methods reduce FLOPs through dynamic depth computation, but face two limitations: (1) Global attention overemphasizes highly similar foreground regions, causing token-adaptive modules to assign the deepest computation to semantically weak foreground tokens while prematurely exiting edge tokens rich in structural cues; (2) Although token adaptation lowers FLOPs, it still relies on large parameter sets, and deep-layer weights remain underutilized due to early token exit. Parameter sharing could address redundancy but is difficult to apply in ViTs, where hierarchical abstraction typically requires diverse transformations. To address these issues, we propose Edge-RecViT, an Edge-Adaptive Dynamic Recursive Vision Transformer that integrates an edge-aware token-adaptive ranker with a recursive transformer using fully shared parameters in its hidden layers. Edge-RecViT dynamically allocates computation based on semantic richness: structurally informative edge tokens receive deeper refinement, whereas redundant low-information tokens exit early. Extensive experiments show that Edge-RecViT provides an excellent trade-off among accuracy, FLOPs, and parameter efficiency. On ImageNet-1K, it matches DeiT within 0.3% Top-1 accuracy while reducing FLOPs by 30.5% (35.1 → 24.39 GFLOPs). At the Base level, the parameter count drops from 86M to 23.21M with higher accuracy than ViT-Base; compared with ViT-Large, parameters are reduced by 93% while maintaining superior accuracy.
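Editorial sketch (not from the paper): the efficiency figures quoted in the abstract above can be cross-checked with a few lines of arithmetic. The helper name below is hypothetical. The inputs are the FLOPs and Base-level parameter counts stated in the abstract. The ViT-Large comparison is omitted because the abstract does not state the baseline parameter count.

```python
def relative_reduction(before, after):
    """Fraction by which a quantity shrinks going from `before` to `after`."""
    return (before - after) / before


# Figures quoted in the abstract: GFLOPs on ImageNet-1K and Base-level parameters (millions).
flops_cut = relative_reduction(35.1, 24.39)
param_cut = relative_reduction(86.0, 23.21)

print(f"FLOPs reduction: {flops_cut:.1%}")
print(f"Base-level parameter reduction: {param_cut:.1%}")
```

The FLOPs figure agrees with the stated 30.5%. The Base-level parameter drop works out to roughly 73%, a percentage the abstract does not itself quote.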
Paperid: 2961,   Poster  
Authors: Huajie Tan, Sixiang Chen, Yijie Xu, Zixiao Wang, Cheng Chi, Yuheng Ji, Yaoxu Lyu, Zhongxia Zhao, Xiansheng Chen, Peterson Co, Shaoxuan Xie, Guocai Yao, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Title: General Process Reward Modeling for Robotic Reinforcement Learning
Abstract: The primary obstacle for applying reinforcement learning (RL) to real-world robotics is the design of effective reward functions. While recent learning-based Process Reward Models (PRMs) are a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization. To address these, we introduce Robo-Dopamine, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a vast 3,400+ hour dataset, which leverages Step-wise Reward Discretization for structural understanding and Multi-Perspective Reward Fusion to overcome perceptual limitations. Building upon Robo-Dopamine, we propose Dopamine-RL, a robust policy learning framework that employs a theoretically sound Policy-Invariant Reward Shaping method, which enables the agent to leverage dense rewards for efficient self-improvement without altering the optimal policy, thereby fundamentally avoiding the semantic trap. Extensive experiments across 10 simulated and 8 real-world tasks validate our approach. GRM achieves state-of-the-art accuracy in reward assessment, and Dopamine-RL built on GRM significantly improves policy learning efficiency. For instance, after GRM is adapted to a new task in a one-shot manner from a single expert trajectory, the resulting reward model enables Dopamine-RL to improve the policy from near-zero to 95% success with only 150 online rollouts (approximately 1 hour of real robot interaction), while retaining strong generalization across tasks.
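Editorial sketch (not from the paper): the abstract above does not spell out its Policy-Invariant Reward Shaping method. The classic construction with that property is potential-based shaping, r' = r + gamma * phi(s') - phi(s), and the toy example below illustrates only that textbook form. The potentials standing in for a progress estimate, the reward sequence, and all names are hypothetical.

```python
def shaped_rewards(rewards, potentials, gamma):
    """Add gamma * phi(s_{t+1}) - phi(s_t) to each reward.

    potentials[t] is phi(s_t) for the non-terminal states; the terminal
    state's potential is fixed to zero.
    """
    shaped = []
    for t, r in enumerate(rewards):
        phi_next = potentials[t + 1] if t + 1 < len(potentials) else 0.0
        shaped.append(r + gamma * phi_next - potentials[t])
    return shaped


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


gamma = 0.9
env_rewards = [0.0, 0.0, 0.0, 1.0]   # sparse task reward (hypothetical)
potentials = [0.1, 0.2, 0.5, 0.9]    # hypothetical progress estimates phi(s_0..s_3)

dense = shaped_rewards(env_rewards, potentials, gamma)
gap = discounted_return(env_rewards, gamma) - discounted_return(dense, gamma)
print(dense, gap)
```

The shaping terms telescope, so the shaped return differs from the original only by phi(s_0), a constant that does not depend on the actions taken. That is why this form of dense reward leaves the optimal policy unchanged.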
Paperid: 2962,   Poster  
Authors: Chen Xiaodong, Qian Bao, Xudong Liu, Jianping Fang, Jintao Fang, Yongdong Zhang, Tao Mei, Wu Liu
Title: Multi-level Causal LLM-based Text-to-Motion Generation with Human Alignment
Abstract: Although progress has been made in LLM-based text-driven motion generation, it still struggles to generate fine-grained and semantically consistent motions. These limitations stem from: 1) fine-grained motion quantization errors; 2) mismatches between causal reasoning language and non-causal motion representation; and 3) lack of human preference alignment. To solve them, this paper proposes MoTiGA, a multi-level causal LLM-based text-to-motion generation framework with human alignment. Firstly, MoTiGA employs Causal RVQ-VAE for multi-level causal fine-grained motion representation, then explores iterative residual quantization and causal convolutions to reduce fine-grained motion quantization errors, while preserving the same causality as the language representation. Furthermore, the framework incorporates a time-lagged causal prediction strategy, enabling parallel prediction across motion token levels while maintaining temporal dependencies. Finally, to enhance human alignment, we propose Multi-level Hybrid-weighted Preference Optimization (MHPO), which dynamically adjusts semantic similarity weighting and continuous similarity scores. For MHPO, we also release the HumanML3D-R dataset, the first large-scale preference dataset for motion generation, with 101,490 human preference pairs. Evaluations show MoTiGA's superior performance, with an 82.3% FID improvement on HumanML3D and a 64.7% improvement on KIT-ML over other LLM-based methods.
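Editorial sketch (not from the paper): the abstract above mentions iterative residual quantization. The scalar toy below shows the generic mechanism only: each level quantizes what the previous levels left over, so the reconstruction error shrinks level by level. The codebooks, the input value, and the function names are hypothetical, and real RVQ-VAE models quantize learned feature vectors rather than scalars.

```python
def nearest(codebook, x):
    """Index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))


def residual_quantize(x, codebooks):
    """Quantize x level by level, each level encoding the remaining residual."""
    residual, indices, recon, errors = x, [], 0.0, []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        recon += codebook[i]
        residual -= codebook[i]
        errors.append(abs(x - recon))
    return indices, recon, errors


# Hypothetical three-level codebooks, coarse to fine.
codebooks = [
    [-1.0, 0.0, 1.0],
    [-0.25, 0.0, 0.25],
    [-0.05, 0.0, 0.05],
]
indices, recon, errors = residual_quantize(0.8, codebooks)
print(indices, recon, errors)
```

In this toy run the per-level error falls from about 0.2 to 0.05 to roughly zero, which is the sense in which additional quantization levels reduce fine-grained quantization error.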
Paperid: 2963,   Poster  
Authors: Andranik Sargsyan, Shant Navasardyan
Title: FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching
Abstract: Accurate image segmentation is essential for modern computer vision applications such as image editing, autonomous driving, and medical analysis. In recent years, Dichotomous Image Segmentation (DIS) has become the standard task for training and evaluating highly accurate segmentation models. Existing DIS approaches often fail to preserve fine-grained details or fully capture the semantic structure of the foreground. To address these challenges, we present FlowDIS, a novel dichotomous image segmentation method built upon the flow matching framework, which learns a time-dependent vector field to transport the image distribution into the corresponding mask distribution under optional textual guidance. Moreover, with our Position-Aware Instance Pairing (PAIP) training strategy, FlowDIS offers strong controllability through textual prompts, enabling precise, pixel-level object segmentation. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches both with and without language guidance. Compared to the second best DIS method, FlowDIS achieves 5.5% higher F_\beta^\omega measure and 43% better MAE (\mathcal{M}) on the DIS-TE test set. The code will be released upon publication.
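Editorial sketch (not from the paper): the abstract above describes learning a time-dependent vector field that transports images to masks. The toy below assumes the common straight-line flow matching path x_t = (1 - t) * x_0 + t * x_1 with target velocity x_1 - x_0; the abstract does not say which path FlowDIS uses. An oracle velocity stands in for the trained network, and the three-element "latents" and all names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path between x0 (t=0) and x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the vector field on the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_transport(x0, velocity_fn, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with forward Euler."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


image_latent = [0.2, 0.7, 0.4]   # hypothetical source sample
mask_latent = [0.0, 1.0, 0.0]    # hypothetical paired target sample


def oracle(x, t):
    """Stand-in for a trained network: returns the exact target velocity."""
    return target_velocity(image_latent, mask_latent)


midpoint = interpolate(image_latent, mask_latent, 0.5)
result = euler_transport(image_latent, oracle)
print(midpoint, result)
```

With the oracle field the integration lands on the paired mask sample. A trained model would replace the oracle with a network conditioned on x, t, and optionally a text prompt.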
Paperid: 2964,   Poster  
Authors: Ido Sobol, Kihyuk Sohn, Yoav Blum, Egor Zakharov, Max Bluvstein, Andrea Vedaldi, Or Litany
Title: Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning
Abstract: We often aim to generate images that are both photorealistic and 3D-consistent, adhering to precise geometry, material, and viewpoint controls. Typically, this is achieved by fine-tuning an image generator, pre-trained on billions of real images, using renders of synthetic 3D assets, where annotations for control signals are available. While this approach can learn the desired controls, it often compromises the realism of the images due to the domain gap between photographs and renders. We observe that this issue largely arises from the model learning an unintended association between the presence of control signals and the synthetic appearance of the images. To address this, we introduce Realiz3D, a lightweight framework that decouples controls from the visual domain. The key idea is to explicitly learn the visual domain, real or synthetic, separately from other control signals by introducing a co-variate that, fed into small residual adapters, shifts the domain. Then, the generator can be trained to gain controllability without fitting to a specific visual domain. In this way, the model can be guided to produce realistic images even when controls are applied. We enhance control transferability to the real domain by leveraging insights about the roles of different layers and denoising steps in diffusion-based generators, informing new training and inference strategies that further mitigate the gap. We demonstrate the advantages of Realiz3D in tasks such as text-to-multiview generation and texturing from 3D inputs, producing outputs that are 3D-consistent and photorealistic.
Paperid: 2965,   Poster  
Authors: Wenjie Zhang, Chen Yang, Xin Lu, Zhen Wang, Yue Liu, Bobo Xi, Pengbo Zhang
Title: SMAP: Semantic Route Planning with Map-Grounded Multimodal Alignment
Abstract: Semantic route planning involves generating itineraries that align with user intent while respecting real-world spatial constraints. However, text-only large language models (LLMs) often hallucinate geographically implausible routes due to poor spatial grounding. Inspired by how humans use maps for route planning, we propose SMAP, the first multimodal framework combining user queries, POI metadata, and map tiles to produce spatially coherent, preference-aware routes. To enhance spatial consistency, SMAP features a two-stage anti-hallucination mechanism: (1) a map-grounded self-editing pipeline where a multimodal LLM (MLLM) drafts routes and a second MLLM verifies and refines them using geographic evidence; and (2) hallucination-penalized Direct Preference Optimization (HDPO) that steers the route generator toward spatially plausible routes by using verified routes as accepted responses and hallucinated drafts as rejected ones. Additionally, we introduce MM-Route, the first multimodal dataset for semantic route planning, with 3,000 diverse queries annotated with POI metadata and map tiles, covering a broad spectrum of geographic granularities and user intents. Experimental results demonstrate that SMAP significantly reduces geographical hallucinations and outperforms strong baselines in spatial plausibility and user alignment. The code and dataset will be made publicly available.
Paperid: 2966,   Poster  
Authors: Yusong Wang, Zheyuan Gu, MAO KEYU, Minghao Shao, Mingkun Xu, Prayag Tiwari, Jiawei Shao, qingsong zhao
Title: Streaming Video Crime Anticipation with Spatio-Temporal Causal Reasoning
Abstract: Crime anticipation enables proactive public safety interventions, yet existing video security systems remain largely reactive, unable to detect precursors of crime. While current visual language model (VLM)-based video understanding methods show promise in high-level reasoning, they are not designed to explicitly model the spatio-temporal causal relationships essential for anticipating crimes. We address this limitation with two causality-driven components. First, we develop the Spatio-Temporal Causal Reasoning Crime (STCRC) dataset, a hierarchical dataset comprising 73K samples across five progressive causal reasoning tasks, facilitating the learning of criminal precursors. Second, we propose the Spatio-Temporal Causal Hypergraph (STCH), a streaming module that transforms implicit entity dynamics into explicit causal structures to enhance causal reasoning for crime in VLMs. By combining these two components, our framework advances real-time crime anticipation, achieving improvements in anticipatory tasks: a 70.7% relative improvement in crime classification, a 10.1% improvement in crime detection, and a 3.7% reduction in temporal prediction error.
Paperid: 2967,   Poster  
Authors: Mengmeng Liu, Jiuming Liu, Michael Ying Yang, Chaokang Jiang, Jiangtao Li, Yunpeng Zhang, Hesheng Wang, Francesco Nex, Hao Cheng
Title: StreamVLO: Streaming Visual–LiDAR Odometry with Cumulative Drift Compensation
Abstract: We propose StreamVLO, a streaming visual–LiDAR odometry framework that performs unified spatiotemporal correlation with Mamba models and tackles the long-standing cumulative drift problem via an online Cumulative Drift Compensation scheme for localization in 4D dynamic environments. Specifically, StreamVLO introduces a unified spatio-temporal correlation module built on Mamba to fuse heterogeneous visual and LiDAR cues across multi-frame clips, overcoming the limited temporal exploration of prior pairwise methods. Furthermore, a Cumulative Drift Compensation module minimizes cumulative drift by iteratively learning residual corrections from multiple historical frames in a causal manner. To strengthen spatial feature representation on salient regions, we adopt a Keypoint-Aware Auxiliary Loss with a winner-takes-all strategy. StreamVLO achieves state-of-the-art performance on two commonly used autonomous driving datasets, reducing errors by 19% (t_\text{rel}) and 22% (r_\text{rel}) on KITTI, and by 18% ATE and 16% RPE on Argoverse, while remaining suitable for real-time deployment.
Paperid: 2968,   Poster  
Authors: Mengyi Shan, Shouchieh Chang, Ziqian Bai, Shichen Liu, Yinda Zhang, Luchuan Song, Rohit Pandey, Sean Fanello, Zeng Huang
Title: Talking Together: Synthesizing Co-Located 3D Conversations from Audio
Abstract: We tackle the challenging task of generating complete 3D facial animations for two interacting, co-located participants from a mixed audio stream. While existing methods often produce disembodied "talking heads" akin to a video conference call, our work is the first to explicitly model the dynamic 3D spatial relationship (including relative position, orientation, and mutual gaze) that is crucial for realistic in-person dialogues. Our system synthesizes the full performance of both individuals, including precise lip-sync, and uniquely allows their relative head poses to be controlled via textual descriptions. To achieve this, we propose a dual-stream architecture where each stream is responsible for one participant's output. Speaker role embeddings and inter-speaker cross-attention mechanisms are designed to disentangle the mixed audio and model the interaction. Furthermore, we introduce a novel eye gaze loss to promote natural, mutual eye contact. To power our data-hungry approach, we introduce a novel pipeline to curate a large-scale conversational dataset consisting of over 2 million dyadic pairs from in-the-wild videos. Our method generates fluid, controllable, and spatially aware dyadic animations suitable for immersive applications in VR and telepresence, significantly outperforming existing baselines in perceived realism and interaction coherence.
Paperid: 2969,   Poster  
Authors: Yinuo Jing, Jinyan Wu, Zixi Yang, Kongming Liang, Xiatian Zhu, Zhanyu Ma
Title: EthoCLIP: Ontology-Enhanced Video-Language Pretraining for Animal Behavior Understanding
Abstract: Vision-language models (VLMs) have achieved remarkable success across numerous domains, yet they lag significantly in animal behavior understanding due to severe data scarcity. Annotated animal behavior videos are prohibitively expensive and time-consuming to collect, requiring domain expertise and controlled observation conditions. To address this challenge, we leverage structured domain knowledge as an inductive bias from the Neuro Behavior Ontology (NBO), which provides professional annotations, hierarchical behavior structures, and comprehensive semantic coverage. We present EthoCLIP, an ontology-enhanced vision–language contrastive learning framework that explicitly embeds ontology semantics through an ontology-aware graph module to capture hierarchical relationships among behaviors and learn structured semantic dependencies. Incorporating ontological information reduces the burden of learning purely from data, thereby alleviating requirements for large-scale datasets. To enhance EthoCLIP training, we construct AnimalBand, an NBO-consistent dataset integrating 74,671 videos across multiple species and behaviors with semantic standardization and extended knowledge coverage. Extensive experiments validate both our method and dataset. Results demonstrate that EthoCLIP pretrained on AnimalBand substantially improves behavior recognition accuracy and transfer learning performance across diverse benchmarks, confirming that ontology-driven semantic enrichment effectively addresses data scarcity in animal behavior understanding.
Paperid: 2970,   Poster  
Authors: Jiawei Fan, Shigeng Wang, Chao Li, Xiaolong Liu, Anbang Yao
Title: Chain-of-Models Pre-training: Rethinking Training Acceleration of CLIP Models
Abstract: In this paper, we present Chain-of-Models Pre-training (CoM-PT), a novel performance-lossless training acceleration method for vision transformer models. This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, which scales efficiently with the family size. Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of parameter size, called a model chain. In this chain, only the smallest model undergoes standard individual pre-training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance comparable to standard individual training while significantly reducing the training cost. Thanks to the properties of the model chain, we empirically find two compelling phenomena: i) adding smaller models can even decrease the total training cost, and ii) adding medium-sized models incurs only marginal additional training cost. In light of this, CoM-PT is the first to unlock pre-training efficiency that scales favorably with family size, providing large deployment flexibility across various devices. We plan to open-source the code and encourage the community to extend it to more pre-training paradigms.
Paperid: 2971,   Poster  
Authors: Chenfeng Yin, De Cheng, Wenlong Luo, Mingyue Zeng, Shizhou Zhang, Nannan Wang, Xinbo Gao
Title: Incremental Object Detection via Future-Aware Decoupled Cross-Head Distillation
Abstract: Incremental Object Detection (IOD) enables AI systems to continuously acquire new object classes while preserving knowledge of previously learned ones, an ability essential for deployment in dynamic, real-world environments. Existing IOD methods typically rely on knowledge distillation to mitigate catastrophic forgetting. However, the tight coupling between the student model’s detection head and backbone causes distillation gradients to conflict with new-class supervision at the head, injecting head-specific bias into the backbone and ultimately weakening distillation effectiveness. To address this issue, we propose a decoupled training mechanism for the model’s backbone and classification head. Specifically, we introduce the Future-aware Cross-head Distillation (FaCHD) method, which utilizes two frozen complementary teachers (historical and intermediate teachers) to decode the student’s ROI features for cross-head distillation. This strategy implicitly alleviates prediction conflicts caused by detection-head bias and provides richer task-relevant guidance, thereby improving distillation efficiency. To further address the detection head bias and model recency problem, we propose a Prototype Semantic Drift Compensation module, which recalibrates multi-granularity prototypes of old classes, effectively correcting semantic drift and enhancing the stability of the detection head. Extensive experiments on two standard IOD benchmarks demonstrate the effectiveness and superiority of the proposed method. Code is available in the supplementary materials.
Paperid: 2972,   Poster  
Authors: Li Yang, Boyu Cai, Wei Liu, Yan Wang, Chunfeng Yuan, Bing Li, Weiming Hu
Title: SRA-Det: Learning Omni-Grained Open-Vocabulary Detection Beyond Category Names
Abstract: Open-vocabulary object detection (OVD) aims to detect objects described by arbitrary text, but most existing methods operate at a coarse category level and struggle with fine-grained, attribute-sensitive queries. We address this from both model and data perspectives. We propose a Semantic-Retrieval-Augmented Detector (SRA-Det) that uses an attention-based module to retrieve multiple semantic facets from token-level text features, and a soft-min matching rule that behaves like a differentiable logical AND over these facets, ensuring that all key attributes are satisfied. In parallel, we introduce an automatic attribute-augmented data pipeline that uses an LLM to generate category-specific visual attributes and a dual CLIP-based similarity check to verify them at the instance level. With a Swin-T backbone, our approach achieves 54.9 mAP in the zero-shot setting on FG-OVD and 40.4 AP on LVIS, establishing strong fine-grained and general OVD performance.
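Illustrative sketch (not from the paper): the abstract does not give the soft-min formula, so the snippet below assumes one common smooth minimum, the negative log-sum-exp with temperature `tau`. The function name `soft_min`, the temperature value, and the example scores are hypothetical. It only shows why such a rule acts like a soft logical AND: the aggregate score stays low whenever any single facet score is low.

```python
import math


def soft_min(scores, tau=0.1):
    """Smooth lower bound on min(scores); approaches the hard min as tau -> 0.

    Assumed form: -tau * log(sum(exp(-s / tau))), computed in a numerically
    stable way by factoring out the smallest score.
    """
    m = min(scores)
    # Every exponent is <= 0, so exp() cannot overflow.
    total = sum(math.exp(-(s - m) / tau) for s in scores)
    return m - tau * math.log(total)


# Hypothetical per-facet similarities between one region and a query such as
# "a red wooden chair": the color and material facets match, one facet does not.
facet_scores = [0.9, 0.9, 0.1]
and_like = soft_min(facet_scores, tau=0.01)          # dominated by the weakest facet
average = sum(facet_scores) / len(facet_scores)      # would hide the mismatch
```

Under this assumed form, averaging rewards a region that satisfies most facets, whereas the soft-min penalizes it for the single unmet attribute while remaining differentiable.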
Paperid: 2973,   Poster  
Authors: Jinwon Ko, Keunsoo Ko, Chang-Su Kim
Title: CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation
Abstract: Reference-based color grading aims to reproduce the tonal mood and color harmony of a reference while preserving scene structure. Existing photorealistic and filter-based methods often produce unstable tone mappings (over-shifting or inconsistently retaining colors), leading to unnatural results. We propose CanonCGT, a two-stage framework built on a canonical pivot, a style-neutral intermediate representation for stable color mapping. The first stage canonicalizes the input by removing intrinsic tonal bias, and the second color-grades it to match the reference style. A dual-phase training scheme, DP-CGT, combines supervised preset learning with self-supervised refinement on unpaired photographs. CanonCGT delivers photorealistic and tonally consistent results across diverse datasets, surpassing state-of-the-art methods in stability and visual fidelity.
Paperid: 2974,   Poster  
Authors: Huimin Li, Boxuan Hu, Yulin Zhang, Xiuzhuang Zhou, Junlin Hu
Title: Hyperbolic Defect Feature Synthesis for Few-Shot Defect Classification
Abstract: Defect synthesis, as a core technology for addressing the problem of few-shot defect classification, has been widely adopted in industrial scenarios. It helps alleviate the problem of insufficient model generalization capability owing to data scarcity by establishing a data augmentation pipeline. Recently, remarkable progress has been achieved in both explicit defect image generation and implicit defect feature synthesis approaches. However, existing methods are always conducted in Euclidean space. Constrained by the flatness of Euclidean space, it is difficult to synthesize defect data containing complex structures. In this paper, we explore defect generation in hyperbolic space and propose a hyperbolic defect feature synthesis (HypDFS) method. By modeling the potential defect distribution via a small number of hyperbolic defect prototypes and further optimizing the synthetic defect features with a hierarchical defect contrastive loss in hyperbolic space, our HypDFS method obtains a better-generalized defect representation that is more conducive to downstream few-shot defect classification tasks. Extensive experiments conducted on the MVTec-FS benchmark and the standard MTD dataset under few-shot settings demonstrate that the proposed HypDFS surpasses the Euclidean baseline by a large margin, showing the promising prospects for defect synthesis in hyperbolic space.
Paperid: 2975,   Poster  
Authors: Haoming Song, Delin Qu, Yuanqi Yao, Qizhi Chen, Jiarui Li, Qi Lv, Yiwen Tang, Li Kang, Heng Zhou, Xianqiang Gao, Yuhang Tang, Xiaofan Li, Modi Shi, Guanghui Ren, Maoqing Yao, Bin Zhao, Dong Wang, Xuelong Li
Title: FM-Steer: Enhance Generalist Policies with Value-Guided Cascaded Denoising
Abstract: Humans naturally allocate more time before performing actual actions when handling complex tasks in the physical world. This paradigm has recently achieved remarkable advancement in boosting Large Language Models (LLMs) to solve complex tasks in digital domains. However, the potential of test-time computing remains largely unexplored for robotic foundation models interacting with the physical world. In this work, we propose FM-Steer: a test-time computing framework that augments flow-based Vision-Language-Action (VLA) generalist policies with value-guided sampling and cascaded action denoising, enabling higher control performance and real-time action rates for dexterous robot manipulation. FM-Steer first incorporates a flow-based intermediate verifier to estimate state–action values for candidate actions. At test time, the policy iteratively samples multiple noisy action proposals and retains the one with the highest predicted value, yielding value-aligned, high-quality actions without retraining. To satisfy the stringent frequency demands of robot control, FM-Steer further introduces cascaded action denoising, decoupling expensive value-guided sampling from fast action refinement. A lightweight flow denoiser asynchronously takes the selected high-value noisy action and rapidly denoises it to produce the final control signal, enabling fluid, high-rate execution. During deployment, the intermediate verifier operates at a low frequency to provide value-guided sampling, while the lightweight flow denoiser continually processes selected candidates to maintain real-time control. Extensive experiments demonstrate that FM-Steer scales flow-based VLA models effectively at test time, and achieves state-of-the-art performance across diverse simulation benchmarks and real-world dexterous robotic tasks.
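Illustrative sketch (not from the paper): the value-guided sampling step described in the abstract, "sample multiple noisy action proposals and retain the one with the highest predicted value", amounts to best-of-N selection. The snippet below shows only that selection step under simplifying assumptions: `sample_action` and `value_fn` are hypothetical stand-ins for the flow-based policy and the intermediate verifier, actions are plain floats, and the cascaded denoiser is omitted.

```python
import random


def select_best_action(sample_action, value_fn, state, num_candidates, rng):
    """Best-of-N selection: draw candidates, score each, keep the top one.

    sample_action(state, rng) -> candidate action (hypothetical policy sampler)
    value_fn(state, action)   -> scalar value estimate (hypothetical verifier)
    """
    candidates = [sample_action(state, rng) for _ in range(num_candidates)]
    values = [value_fn(state, action) for action in candidates]
    best = max(range(num_candidates), key=lambda i: values[i])
    return candidates[best], values[best]


# Toy stand-ins: 1-D actions, with value peaking at an action of 0.5.
def toy_sampler(state, rng):
    return rng.uniform(-1.0, 1.0)


def toy_value(state, action):
    return -(action - 0.5) ** 2


action, value = select_best_action(toy_sampler, toy_value, None, 8, random.Random(0))
```

In the framework as described, the selected candidate would then be handed to a separate fast denoiser, so the expensive scoring loop can run at a lower rate than the control loop.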
Paperid: 2976,   Poster  
Authors: Qunce Xu, Jiahui Li, Yan-Pei Cao, Weihao Cheng, Tai-Jiang Mu, Ying Shan, Chuan Li, Da Chen, Yong-Liang Yang, Shi-Min Hu
Title: Beyond Reassembly: Fractured Object Recovery with Missing Parts
Abstract: We propose a novel learning-based task named fractured object recovery. Unlike the previous fractured object reassembly task, which only targets aligning existing parts with overlaps, our task aims to not only reassemble irrelevant parts but also predict missing parts, resulting in a complete shape recovery immediately. Our task coincides with practical experience, where prior knowledge of similar shapes can be leveraged in the reassembly process, such that even non-overlapping parts can be reasoned into adequate locations. We also present the first learning model for the proposed task by correlating features of both existing and missing parts using a transformer, where the latter are naturally represented as missing tokens. Hence, our model can jointly estimate the poses of the existing parts and predict the shapes of the missing parts. To facilitate the task, we introduce a new dataset based on the existing fractured object benchmark by imposing different configurations of missing parts. We perform extensive evaluations to demonstrate the performance of the proposed model over baselines. The results show that joint part reassembly and prediction is possible and that the two are mutually beneficial, which we believe can inspire future research and favor real applications.
Paperid: 2977,   Poster  
Authors: Ming Wang, Haoxuan Qu, Qiuhong Ke, Wei Zhou, Hossein Rahmani, Jun Liu
Title: Translating Signals to Languages for sEMG-Based Activity Recognition
Abstract: Surface electromyography (sEMG) signal-based activity recognition has attracted increasing research attention in recent years. To develop accurate sEMG signal-based activity recognizers, numerous approaches have been proposed. Some studies focus on designing larger and more expressive model architectures to enhance the representational capacity of sEMG signals, while others aim to enrich model priors through large-scale pretraining, thereby improving recognition performance. Recently, large language models (LLMs) have shown remarkable generalization and reasoning capabilities in natural language processing, whose implicit knowledge, learned from extensive linguistic descriptions of actions, opens new possibilities for interpreting sEMG signals and inferring activity intentions. Motivated by this, we propose LLM-sEMG, a novel framework that leverages LLMs as sEMG activity recognizers. Within this framework, we design a language-oriented mapping mechanism that converts continuous sEMG sequences into “sEMG language,” integrating several strategies to further facilitate the signal-to-language mapping process. Extensive experiments demonstrate that the proposed framework achieves highly accurate sEMG signal-based activity recognition using large language models.
Paperid: 2978,   Poster  
Authors: Tianxiao Li, Zhenglin Huang, Haiquan Wen, Yiwei He, Xinze Li, Bingyu Zhu, Wuhui Duan, Congang Chen, Zeyu Fu, Yi Dong, Baoyuan Wu, Xiangtai Li, Guangliang Cheng
Title: Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection
Abstract: Multimodal deepfakes proliferating on social media threaten authenticity, information integrity, and digital forensics. Existing benchmarks are constrained by their single-modality scope, simplified manipulations, or unrealistic distributions, which limit their ability to assess real-world robustness. We present Omni-Fake, a unified omni-dataset for comprehensive multimodal deepfake detection in social-media settings. It comprises Omni-Fake-Set, a large-scale, high-quality dataset with 1M+ samples, and Omni-Fake-OOD, an out-of-distribution benchmark with 100k+ samples intentionally excluded from training to evaluate generalization. Omni-Fake spans four modalities (image, audio, video, and audio-video talking head) and supports a joint detection–localization–explanation protocol. For images, audio, and videos, we define a ternary task (real / partially manipulated / fully synthetic) with spatial or temporal localization masks for fine-grained reasoning. Talking heads are formulated as an audio-video fusion binary task targeting speaking digital humans and lip-synced avatar forgeries. On top of Omni-Fake, we further propose Omni-Fake-R1, a reinforcement-learning-driven multimodal detector that adaptively integrates visual and auditory cues and outputs structured decisions, localization, and natural-language explanations. Extensive experiments show significant gains in detection accuracy, cross-modal generalization, and explainability over state-of-the-art baselines. Code will be released.
Paperid: 2979,   Poster  
Authors: Zhongan Wang, Xiaoyu Wen, Lingxiao Du, Kun Li, Zhiliang Wu, Xingcheng Xu, Qiaosheng Zhang, Chaochao Lu, Hehe Fan
Title: VAST: Video Ability‑Stratified Taxonomy for Data‑Efficient Video Reasoning
Abstract: Reinforcement learning (RL) enhances reasoning capabilities in multimodal large language models (MLLMs) for video understanding. However, current methods face two coupled challenges. First, existing methods organize datasets by task types rather than reasoning capabilities. This creates a many-to-many mismatch where models learn task patterns instead of transferable reasoning abilities. Consequently, achieving ability generalization requires broad coverage across ability-task combinations, making RL training costly. Second, these methods compensate for this inefficiency through complex algorithmic modifications (e.g., specialized temporal architectures or multi-objective reward frameworks), which increase the complexity of training. To address these issues, we take a joint perspective from both the data and method sides. On the data side, we propose VAST, an ability-stratified framework that reorganizes video understanding tasks into a three-layer cognitive taxonomy spanning Perception, Reasoning, and Cognition. We further construct VAST-15K for training and VAST-Bench for evaluation. On the method side, we introduce VideoVAST, employing RL with consistency rewards for reasoning-answer alignment without architectural modifications. Experiments show that VideoVAST achieves 66.3% accuracy on MVBench and 57.4% on VAST-Bench, compared with 62.7% and 54.3% respectively for Video-R1. Under the same training settings, VideoVAST uses 72% fewer GPU hours and 96% fewer training samples. The code will be made publicly available.
Paperid: 2980,   Poster  
Authors: Qin Liu, Lavisha Aggarwal, Saptarashmi Bandyopadhyay, Vikas Bahirwani, Marc Niethammer, Ehsan Adeli, Andrea Colaco
Title: ESAM++: Efficient Online 3D Perception on the Edge
Abstract: Online 3D scene perception in real time is essential for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited and privacy is crucial. Recent state-of-the-art methods like EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging the Segment Anything Model (SAM) for real-time, fine-grained, and generalized 3D instance segmentation. However, ESAM still relies on a computationally expensive 3D sparse UNet for point cloud feature extraction, which accounts for the majority of the 3D inference time, hindering its practicality on resource-constrained devices. In this paper, we propose ESAM++, a lightweight and scalable alternative for online 3D scene perception tailored to edge devices without GPU acceleration. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently captures multi-scale geometric features from streaming 3D point clouds while significantly reducing computational overhead and model size. We evaluate our approach on four challenging segmentation benchmarks, namely ScanNet, ScanNet200, SceneNN, and 3RScan, demonstrating that our model achieves competitive accuracy with up to 3× faster inference and a 2× smaller model size compared to ESAM, enabling practical deployment in real-world edge scenarios. Code and models will be publicly available.
Paperid: 2981,   Poster  
Authors: Tianchen Guo, Chen Liu, Xin Yu
Title: Beyond Single-View Sufficiency: CVBench for Cross-View Human Understanding
Abstract: Human perception of social environments is inherently a multi-view synthesis problem, requiring the integration of complementary and often occluded information across space and time. However, existing benchmarks for Multimodal Large Language Models (MLLMs) are overwhelmingly predicated on a "sufficient-view" assumption, rewarding single-view pattern recognition while failing to evaluate cross-view fusion. To address this critical gap, we introduce CVBench, a large-scale, multi-task benchmark for cross-view human understanding. CVBench comprises 3,000 challenging questions across 12 spatial and temporal tasks, where every item is designed with verifiable single-view insufficiency, mandating that models synthesize disparate evidence to resolve ambiguities. Our comprehensive evaluation of state-of-the-art open and closed-source MLLMs (from InternVL to Gemini 2.5 Pro) reveals a substantial performance gap, with the best models (e.g., Gemini 2.5 Pro, ~42% spatial accuracy) falling nearly 50 points behind human performance (~94%). We identify a systemic failure mechanism across all models: a dominant "Single-View Bias," whereby models ignore conflicting evidence and default to the most confident but incorrect single-view prediction. This demonstrates that current MLLMs lack the fundamental mechanisms for geometric grounding, identity persistence, and true spatio-temporal fusion. CVBench provides a rigorous diagnostic framework to catalyze the development of next-generation, cross-view–aware architectures.
Paperid: 2982,   Poster  
Authors: Tianxiao Gao, Shanwei Zhao, Shuo Fang, Shiai Zhu, Chenguang Ma
Title: QuietPrune: Query-Guided Early Token Pruning for Vision-Language Models
Abstract: Vision-language models (VLMs) demonstrate powerful capabilities in multimodal tasks. However, the large number of visual tokens imposes a significant computational cost. In this paper, we propose QuietPrune, a QUery-guIded Early Token Pruning method to remove redundant visual tokens within VLMs, thereby enhancing computational efficiency. Unlike previous late pruning methods, we recognize that implementing early pruning within the vision transformer (ViT) can achieve benefits in both latency reduction and accuracy maintenance. To address the semantic loss problem in early pruning, we design a lightweight adapter by performing an inverse transformation of the projector in VLMs. The proposed adapter converts the contextual query into a visual-domain [Q-CLS] (Query [CLS]) token, providing textual guidance for ViT pruning. During pruning, we further introduce a semi-structured pruning scheme based on visual-textual relevance. Specifically, we group spatially adjacent 2 × 2 tokens to accommodate the visual token merging operation prevalent in mainstream VLMs. We use the mean attention scores between the [Q-CLS] token and the visual tokens as the relevance metric for each group, avoiding additional computation. Pruning is then applied at the group level based on the relevance score, preserving positional continuity. After pruning, we aggregate the redundant tokens into a single token to maintain context cues. Our method achieves up to a 19.0% reduction in prefill latency while improving accuracy by 4.2% over existing late pruning methods on the recent Qwen3-VL and InternVL3 series.
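Illustrative sketch (not from the paper): the group-level pruning described in the abstract can be pictured on a small token grid. The snippet assumes the per-token relevance scores (mean attention to the [Q-CLS] token) are already given as an H × W list with even H and W. The function names, the top-k selection rule, and the use of a plain mean to merge pruned tokens are all assumptions made for illustration.

```python
def group_prune(scores, keep_groups):
    """Rank 2x2 token groups by mean relevance and keep the top `keep_groups`.

    scores: H x W nested list of per-token relevance values (H and W even).
    Returns (kept_cells, pruned_cells) as lists of (row, col) in raster order
    of their groups, so kept tokens retain their relative positions.
    """
    h, w = len(scores), len(scores[0])
    groups = []
    for top in range(0, h, 2):
        for left in range(0, w, 2):
            cells = [(top + dr, left + dc) for dr in (0, 1) for dc in (0, 1)]
            relevance = sum(scores[r][c] for r, c in cells) / 4.0
            groups.append((relevance, cells))
    ranked = sorted(range(len(groups)), key=lambda i: groups[i][0], reverse=True)
    kept_ids = sorted(ranked[:keep_groups])      # restore raster order
    pruned_ids = sorted(ranked[keep_groups:])
    kept = [cell for i in kept_ids for cell in groups[i][1]]
    pruned = [cell for i in pruned_ids for cell in groups[i][1]]
    return kept, pruned


def merge_tokens(features, cells):
    """Collapse the given tokens into one vector by averaging (an assumption)."""
    dim = len(features[0][0])
    return [sum(features[r][c][d] for r, c in cells) / len(cells) for d in range(dim)]


# Toy 4x4 grid: the top-left and bottom-right groups are the most relevant.
toy_scores = [
    [0.9, 0.8, 0.1, 0.1],
    [0.7, 0.9, 0.1, 0.2],
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.8, 0.5],
]
kept, pruned = group_prune(toy_scores, keep_groups=2)
```

Pruning whole 2 × 2 blocks rather than individual tokens keeps the surviving tokens compatible with a later 2 × 2 token-merging step, which is the motivation the abstract gives for the grouping.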
Paperid: 2983,   Poster  
Authors: Zhongxiao Cong, Qitao Zhao, Minsik Jeon, Shubham Tulsiani
Title: Flow3r: Factored Flow Prediction for Visual Geometry Learning
Abstract: We propose Flow3r, a scalable framework for visual geometry learning that leverages flow prediction to guide learning using unlabeled monocular videos. Current 3D/4D reconstruction systems primarily rely on dense geometry and pose supervision, and cannot easily generalize to diverse dynamic real-world scenes. In this work, we propose a mechanism to augment training directly from unlabeled videos, leveraging dense 2D correspondences (or ‘flow’) between arbitrary image pairs as supervision. Our key insight is that a factored flow prediction module, which computes flow between two images using ‘geometry latents’ from one image and the ‘pose latent’ from the other, can guide visual geometry learning. We first highlight the benefits and scalability of flow supervision in controlled settings and then leverage large-scale unlabeled data to improve off-the-shelf visual geometry models. We evaluate Flow3r across diverse 3D benchmarks and demonstrate competitive or state-of-the-art performance, even surpassing supervised models trained with more labeled data.
Paperid: 2984,   Poster  
Authors: Xingzu Zhan, Runmin Jiang, Vatsal Gupta, Tanush Swaminathan, Yanwen Wang, Genpei Zhang, Haili Wang, Min Xu
Title: MicroFM: Physics-guided Flow Matching for Isotropic Microscopy Reconstruction
Abstract: Isotropic microscopy reconstruction remains challenging because the anisotropic point spread function (PSF) in optical systems yields much poorer axial resolution and hampers accurate 3D analysis. Hardware strategies can approach isotropy, yet they are complex, costly, susceptible to sidelobes, and introduce phototoxicity. Deep-learning-based approaches reduce acquisition burden, but common synthetic pipelines blur with Gaussian kernels that do not match the physical degradation, and many methods lack explicit volumetric geometry constraints since they process 2D slices independently. These gaps lead to low-fidelity reconstructions. To address these challenges, we present MicroFM, which synthesizes realistic training data using physical PSFs matched to the target microscope. MicroFM also introduces the first flow-matching framework for 3D microscopy reconstruction, guided by a continuous implicit geometry prior to achieve high-fidelity isotropic recovery. Across four fluorescence microscopy systems and datasets, MicroFM achieves state-of-the-art performance, producing sharper structures, more isotropic spectra, and substantial gains in both full-reference and no-reference metrics.
Paperid: 2985,   Poster  
Authors: Tao Xie, Peishan Yang, Yudong Jin, Yingfeng Cai, Wei Yin, Weiqiang Ren, Qian Zhang, Wei Hua, Sida Peng, Xiaoyang Guo, Xiaowei Zhou
Title: Scal3R: Scalable Test-Time Training for Feed-forward Large-Scale 3D Reconstruction
Abstract: This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lightweight neural sub-networks that are rapidly adapted during test time via self-supervised objectives, which substantially increases memory capacity without incurring significant computational overhead. Experiments on multiple large-scale benchmarks, including the KITTI Odometry and Oxford Spires datasets, demonstrate the effectiveness of our approach in handling ultra-large scenes, achieving state-of-the-art performance in both pose estimation and 3D reconstruction accuracy while maintaining efficiency.
Paperid: 2986,   Poster  
Authors: Yaozong Zheng, Bineng Zhong, Qihua Liang, Shuimu Zeng, Haiying Xia, Shuxiang Song
Title: Learning to Track Instance from Single Nature Language Description
Abstract: How can vision-language (VL) tracking be achieved using natural language descriptions from a video sequence without relying on any bounding-box ground truth? In this work, we achieve this goal by tackling self-supervised VL tracking, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce a novel self-supervised VL tracker that is capable of tracking any object referred to by a language description. Unlike traditional methods that equally fuse all language and visual tokens, we propose an efficient Dynamic Token Aggregation Module, which treats each visual token unequally. The module consists of three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens, thereby eliminating redundant visual token noise and enhancing semantic alignment. iii) Finally, the fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, enhancing temporal prompts and encouraging the tracker to autonomously learn instance tracking from unlabeled videos. This new modeling approach enables the effective self-supervised learning of language-guided tracking representations without the need for large-scale bounding box annotations. Extensive experiments on VL tracking benchmarks show that our tracker surpasses SOTA self-supervised methods, achieving an improvement of more than 11.2%, 5%, and 3.3% in AUC score on the OTB99, LaSOT, and TNL2K datasets, respectively.
Paperid: 2987,   Poster  
Authors: Liwen Wang, Xingbo Dong, Iman Liao, Zhe Jin
Title: Towards Stable Federated Continual Test-Time Adaptation in Wild World
Abstract: Federated Learning (FL) enables collaborative model training while preserving privacy, but faces challenges with client data heterogeneity and domain shifts during deployment. Although Personalized Federated Learning (PFL) mitigates heterogeneity, it typically requires labelled data from target clients, which is an impractical assumption. Test-Time Adaptation (TTA) offers label-free adaptation, yet its direct use in a continual federated setting risks destabilizing the global model and causing catastrophic forgetting. To address this, we consider the Federated Continual Test-Time Adaptation (FedCTTA) setting, where unlabeled clients arrive sequentially, requiring online adaptation and continuous global model updates. We propose BPFedCTTA, a framework that employs Bayesian Prior-guided Adaptation (BPA) for stable local adaptation via Maximum a Posteriori estimation, and Uncertainty-Gated Single-client Aggregation (UGSA) to selectively integrate updates based on client uncertainty. This approach balances adaptation with knowledge retention, thereby mitigating forgetting. Extensive experiments on cross-domain classification and segmentation show BPFedCTTA outperforms existing FL, PFL, and TTA methods in sequential adaptation and global model improvement. The source code will be made public upon acceptance.
Paperid: 2988,   Poster  
Authors: Qinzheng Zhou, Zaychik Liu, Lijing Lu, Zhihang Li
Title: Confidence-Guided Multi-Scale Aggregation for Sparse-View High-Resolution 3D Gaussian Splatting
Abstract: Sparse-view 3D Gaussian Splatting (3DGS) reconstructs scenes using 3D Gaussians from sparse input views. Yet, this method is prone to overfitting, which is exacerbated at higher resolutions as the expanded dimensionality amplifies floating artifacts and reconstruction ambiguities. In this paper, we present a systematic study of 3DGS under sparse-view conditions and varying input resolutions. While prior work has overlooked resolution as a key factor in sparse-view performance, we identify and quantify a trade-off: lower-resolution inputs facilitate stable global geometry reconstruction, whereas higher-resolution inputs enable finer detail recovery but introduce high-frequency artifacts and instability. Building on this insight, we further propose CAGS, a Confidence-Guided Multi-Scale Aggregation method that reconstructs scenes through a coarse-to-fine hierarchical optimization process. Our approach employs a matching-based weighted aggregation strategy to anchor high-resolution reconstructions to stable structural priors and filter noise through cross-scale consistency, and a multi-scale pseudo-view regularization to refine local details without amplifying noise. Extensive experiments on the LLFF and Mip-NeRF360 datasets demonstrate that CAGS significantly outperforms existing methods, particularly under demanding high-resolution conditions. Moreover, our paradigm can be seamlessly integrated into other 3DGS-based pipelines, thereby extending the field from low-resolution reconstructions to high-fidelity outputs under real-world sparse-view constraints.
Paperid: 2989,   Poster  
Authors: Feiyang Pan, Shenghe Zheng, Chunyan Yin, Guangbin Dou
Title: Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual–Inertial Odometry
Abstract: Visual-Inertial Odometry (VIO) is a critical component for robust ego-motion estimation, enabling foundational capabilities such as autonomous navigation in robotics and real-time 6-DoF tracking for augmented reality. Existing methods face a well-known trade-off: filter-based approaches are efficient but prone to drift, while optimization-based methods, though accurate, rely on computationally prohibitive Visual-Inertial Bundle Adjustment (VIBA) that is difficult to run on resource-constrained platforms. Rather than removing VIBA altogether, we aim to reduce how often and how heavily it must be invoked. To this end, we cast two key design choices in modern VIO, when to run the visual frontend and how strongly to trust its output, as sequential decision problems, and solve them with lightweight reinforcement learning (RL) agents. Our framework introduces a lightweight, dual-pronged RL policy that serves as our core contribution: (1) a Select Agent that intelligently gates the entire VO pipeline based only on high-frequency IMU data; and (2) a composite Fusion Agent that first estimates a robust velocity state via a supervised network, before an RL policy adaptively fuses the full (p, v, q) state. Experiments on the EuRoC MAV and TUM-VI datasets show that, in our unified evaluation, the proposed method achieves a more favorable accuracy–efficiency–memory trade-off than prior GPU-based VO/VIO systems: it attains the best average ATE while running up to 1.77× faster and using less GPU memory. Compared to classical optimization-based VIO systems, our approach maintains competitive trajectory accuracy while substantially reducing computational load.
Paperid: 2990,   Poster  
Authors: Jiankang Hong, Ye Luo, Yinan Liu, Junsong Yuan
Title: From Infusion to Assimilation Distillation for Medical Image Segmentation
Abstract: Although foundation models (e.g., SAM) perform remarkably in medical image segmentation, their high computational complexity limits deployment. Knowledge distillation (KD) allows lightweight models to inherit the representational capabilities of large models, thereby mitigating this issue. Existing KD methods enhance student performance, but because teacher and student have different feature advantages, they neglect to adaptively internalize and integrate the student's semantic information after knowledge transfer, causing poor knowledge assimilation and limiting gains and generalization. To address this limitation, we propose a novel medical image segmentation framework, injection-to-assimilation distillation (IAD). In the Knowledge Injection Stage (KIS), to semantically align teacher-student prediction distributions, soft-label distillation is combined with a class-weighted prototype alignment strategy. In the Knowledge Assimilation Stage (KAS), to promote adaptive semantic assimilation, a contrastive semantic self-optimization strategy refines student predictions through positive and negative sample pairs and imposes reverse constraints on encoder features to enhance semantic consistency. IAD achieves DICE gains of 4.32% on Synapse, 1.85% on ACDC, and 2.42% on Polyp datasets, and delivers an average 4.16% generalization gain on ISIC2018, PH2, BUSI, and STU datasets, outperforming mainstream KD methods.
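As a point of reference for the soft-label distillation mentioned above, the following is a minimal sketch of the generic temperature-scaled distillation loss. It is not IAD's implementation: the class-weighted prototype alignment and the assimilation stage are not modeled, and the function names and the default temperature are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual convention that keeps gradient magnitudes
    comparable across temperatures. Hypothetical helper, not from the paper.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft labels
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge.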
Paperid: 2991,   Poster  
Authors: LI XIANG, Yali Li, Yuan Wang, Shengjin Wang
Title: TRM-VLA: Temporal-Aware Chain-of-Thought Reasoning and Memorization for Vision-Language-Action Models
Abstract: Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general robotic manipulation. However, existing approaches typically omit intermediate reasoning steps and directly regress actions, limiting reasoning interpretability and performance in long-horizon or compositional tasks. Although recent studies introduce Chain-of-Thought (CoT) reasoning into VLA models, their effectiveness remains suboptimal due to two key issues: (1) generating a full reasoning trajectory at every timestep introduces substantial redundancy, thereby hindering real-time deployment; and (2) reasoning is performed independently at each timestep, neglecting temporal consistency, which leads to planning conflicts. We propose TRM-VLA, a temporal-aware reasoning and memorization framework that integrates explicit temporal modeling into the VLA reasoning process. TRM-VLA consists of two core components: (1) Keyframe-Triggered Reasoning (KTR), which identifies task progress and performs hierarchical CoT reasoning only at key decision points to reduce redundant inference; and (2) Granularity-adaptable Context Memory (GCM), which dynamically stores and retrieves historical reasoning trajectories to maintain inter-frame coherence and global context. Built upon a dual-system architecture that combines a multimodal foundation model for slow reasoning (System 2) with a diffusion-based policy for fast execution (System 1), TRM-VLA learns to plan and act efficiently in a unified manner. Extensive experiments on LIBERO-90, SIMPLER, and four real-world robotic tasks demonstrate that TRM-VLA achieves state-of-the-art performance while improving reasoning efficiency.
Paperid: 2992,   Poster  
Authors: Sanghyeon Lee, Minwoo Lee, Euijin Shin, Kangyeol Kim, Seunghwan Choi, Jaegul Choo
Title: ORPO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation
Abstract: We introduce the Orthogonal Panel-Relative Operator (OPRO), a novel parameter-efficient adaptation method for tiled-panel In-Context Generation (ICG) that utilizes pre-trained Diffusion Transformers (DiTs). OPRO works by composing learnable, panel-specific orthogonal operators onto the backbone's frozen positional encodings. This design provides two properties: 1) Isometry, which maintains feature geometry to promote stable fine-tuning, and 2) Same-Panel Invariance, which perfectly preserves the model's powerful pre-trained intra-panel synthesis capabilities. We conduct a controlled analysis demonstrating that OPRO's effectiveness is not limited to RoPE but consistently enhances performance across various positional encodings that satisfy orthogonality. By enabling effective panel-relative learning while simultaneously protecting the backbone's core synthesis power, OPRO consistently improves ICG-based instructional image editing methods, including state-of-the-art methods such as ICEdit.
Paperid: 2993,   Poster  
Authors: Yutian Chen, Yuheng Qiu, Ruogu Li, Jay Patrikar, Sebastian Scherer
Title: Co-Me: Confidence Guided Token Merging for Visual Geometric Transformers
Abstract: We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me employs a light-weight distilled confidence predictor to rank tokens and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to 11.3× and 7.2× speedup, respectively, making visual geometric transformers practical for real-time 3D perception and reconstruction.
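The rank-then-merge idea described above can be illustrated with a small sketch. The abstract does not specify the merge rule, so everything here is an assumption: tokens are plain lists of floats, the most confident fraction (`keep_ratio`) is kept untouched, and runs of adjacent low-confidence tokens are averaged in groups of `group_size` so that every region still contributes a token.

```python
def merge_low_confidence_tokens(tokens, confidence, keep_ratio=0.5, group_size=2):
    """Keep high-confidence tokens; average neighbouring low-confidence ones.

    Hypothetical helper for illustration only, not the Co-Me implementation.
    tokens: list of equal-length float lists; confidence: one score per token.
    """
    n = len(tokens)
    n_keep = max(1, int(round(n * keep_ratio)))
    # Rank token indices from most to least confident.
    order = sorted(range(n), key=lambda i: confidence[i], reverse=True)
    keep = set(order[:n_keep])

    merged, group = [], []

    def flush():
        # Replace the pending low-confidence group with its mean token.
        if group:
            dim = len(group[0])
            merged.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
            group.clear()

    for i in range(n):
        if i in keep:
            flush()
            merged.append(list(tokens[i]))
        else:
            group.append(tokens[i])
            if len(group) == group_size:
                flush()
    flush()
    return merged
```

With four tokens and `keep_ratio=0.5`, the two most confident tokens pass through and the two remaining neighbours collapse into one mean token, shortening the sequence from four to three.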
Paperid: 2994,   Poster  
Authors: Chia-Hsiang Kao, Cong Phuoc Huynh, Chien-Yi Wang, Noranart Vesdapunt, Stefan Stojanov, Bharath Hariharan, Oleksandr Obiednikov, Ning Zhou
Title: Δynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
Abstract: Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, and are therefore unable to generalize to complex real-world settings. We introduce Δynamics, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, Δynamics generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, Δynamics achieves a segmentation IoU of 0.30, a 7× improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Further, test-time sampling and evolutionary search boost performance by 27% and 120% in segmentation IoU, respectively. Finally, we demonstrate strong transfer to a new dataset of 235 real-world rigid-body videos, highlighting the potential of language-driven physics inference for bridging perception and simulation.
Paperid: 2995,   Poster  
Authors: Weijing Wu, Qihua Liang, Bineng Zhong, Haiying Xia, Zhiyi Mo, Shuxiang Song
Title: An Efficient Token Compression Framework for Visual Object Tracking
Abstract: Refining visual representations by eliminating their internal feature-level redundancy is crucial for simultaneously optimizing the performance and computational cost of models in visual tracking. To enhance their performance, many contemporary Transformer-based trackers leverage a larger number of historical template frames to capture richer spatio-temporal cues. However, this strategy leads to a massive number of input visual tokens. This creates two critical issues: it imposes a quadratic computational burden and can also degrade the tracker's overall performance. To bridge this gap, we propose a compress-then-interact tracking framework, ETCTrack, that learns to efficiently compress template tokens from historical template frames into a robust target representation, moving beyond handcrafted rules. Our method first employs the Adaptive Token Compressor to dynamically construct compact yet highly discriminative template tokens by filtering out redundant visual tokens. These refined tokens are then processed by our Hierarchical Interaction Encoder to achieve a deep, adaptive interaction with the search features. This fusion is performed through a cascade of collaborative stages, where each stage executes a structured process of template enrichment via search context, unified feature learning, and search feature refinement to ensure precise target localization. Experiments on seven benchmarks demonstrate that our method significantly outperforms current state-of-the-art trackers.
Paperid: 2996,   Poster  
Authors: Xinyue Liang, Zhiyuan Ma, Lingchen Sun, Yanjun Guo, Lei Zhang
Title: Photo3D: Advancing Photorealistic 3D Generation through Structure‑Aligned Detail Enhancement
Abstract: Although recent 3D-native generators have made great progress in synthesizing reliable geometry, they still fall short in achieving realistic appearances. A key obstacle lies in the lack of diverse and high-quality real-world 3D assets with rich surface details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, non-rigid motions of objects, and the limited precision of scanners. We introduce Photo3D, a framework for advancing photorealistic 3D generation, which is driven by the image data generated by the GPT-4o-Image model. Considering that the generated images can distort 3D structures due to their lack of multi-view consistency, we design a structure-aligned multi-view synthesis pipeline and construct a detail-enhanced multi-view dataset paired with 3D geometry. Building on it, we present a realistic detail enhancement scheme that leverages perceptual feature adaptation and semantic structure matching to enforce appearance consistency with the realistic detail priors while preserving the structural consistency with the 3D-native geometry. While our scheme is general to different 3D-native generators, we present dedicated training strategies to facilitate the optimization of geometry-texture coupled and decoupled 3D-native generation paradigms. Experiments demonstrate that Photo3D generalizes well across diverse 3D-native generation paradigms and achieves state-of-the-art photorealistic 3D generation performance. Codes, models and datasets will be released.
Paperid: 2997,   Poster  
Authors: Zijun Deng, Yuxin Peng
Title: NS-Diff: Fluid Navier–Stokes Guided Video Diffusion via Reinforcement Learning
Abstract: While recent video generation models achieve impressive visual quality, generating physically plausible videos remains challenging, especially for fluid dynamics and rigid-body motions. To address this, we present NS-Diff, a physics-guided reinforcement learning framework for video diffusion. First, we design a noise-robust physical dynamics detector that distinguishes rigid and fluid regions by analyzing motion in noisy latent frames. Second, we introduce a Physics-Conditioned Latent Injection module, which encodes velocity fields, deformation gradients, and material masks, and injects them into the DiT denoiser via cross-attention. Third, we introduce a reinforcement learning optimization module that enforces simplified Navier-Stokes constraints on fluid dynamics and minimum-jerk principles on rigid bodies through policy gradients. Experiments on PhysVideoBench, UCF, and MSR-VTT show that our approach reduces jerk errors by 43%, decreases fluid divergence by 33%, and improves FVD by 22.7%, achieving higher physical plausibility and visual quality.
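The two physical quantities reported above, fluid divergence and jerk, can be made concrete with a short sketch. This is not the NS-Diff reward: it only shows, under stated assumptions, how a divergence penalty (the incompressibility part of the Navier-Stokes equations) and a jerk penalty might be measured with finite differences. The grid layout, unit spacing, and function names are hypothetical.

```python
def mean_abs_divergence(u, v, dx=1.0, dy=1.0):
    """Mean |du/dx + dv/dy| over interior cells of a 2D velocity field.

    u, v: nested lists indexed [row][col]; u is the x (column) component and
    v the y (row) component. Central differences; boundary cells are skipped.
    """
    rows, cols = len(u), len(u[0])
    total, count = 0.0, 0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            du_dx = (u[r][c + 1] - u[r][c - 1]) / (2 * dx)
            dv_dy = (v[r + 1][c] - v[r - 1][c]) / (2 * dy)
            total += abs(du_dx + dv_dy)
            count += 1
    return total / count if count else 0.0


def mean_squared_jerk(positions, dt=1.0):
    """Mean squared third finite difference of a 1D position trajectory."""
    jerks = [
        (positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i]) / dt ** 3
        for i in range(len(positions) - 3)
    ]
    return sum(j * j for j in jerks) / len(jerks) if jerks else 0.0
```

An incompressible field such as u = x, v = -y has zero divergence, and a constant-acceleration trajectory has zero jerk, so both penalties vanish for physically smooth motion.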
Paperid: 2998,   Poster  
Authors: Yucheng Wang, Zedong Wang, Yuetong Wu, Yue Ma, Dan Xu
Title: CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
Abstract: Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, suffering from task interference and poor adaptation to heterogeneous demands (e.g., local vs global, semantic vs photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioning signals (e.g., text, mask, reference) via static concatenation or additive adapters, which cannot dynamically prioritize or suppress conflicting modalities, thus resulting in artifacts like color bleeding across mask boundaries, identity or style drift, and unpredictable behavior under multi-condition inputs. To address this, we propose Condition-Aware Routing of Experts (CARE-Edit) that aligns model computation with specific editing competencies. At its core, a lightweight latent-attention router assigns encoded diffusion tokens to four specialized experts (Text, Mask, Reference, and Base) based on multi-modal conditions and diffusion timesteps: (i) a Mask Repaint module first refines coarse user-defined masks for precise spatial guidance; (ii) the router applies sparse top-K selection to dynamically allocate computation to the most relevant experts; (iii) a Latent Mixture module subsequently fuses expert outputs, coherently integrating semantic, spatial, and stylistic information into the base images. Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer. Empirical analysis further reveals task-specific behavior of specialized experts, showcasing the importance of dynamic, condition-aware processing to mitigate multi-condition conflicts. The source code, models, and dataset will be publicly available.
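Step (ii), sparse top-K routing, followed by a weighted fusion of expert outputs, can be sketched as follows. This is a generic mixture-of-experts gate, not CARE-Edit's latent-attention router or Latent Mixture module: the per-token router logits are taken as given, and the function names and the choice K = 2 are assumptions made for illustration.

```python
import math

# Expert order assumed for this sketch only.
EXPERTS = ("text", "mask", "reference", "base")


def top_k_gates(router_logits, k=2):
    """Softmax over the k largest logits; all other experts get weight 0."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(router_logits))]


def mix_expert_outputs(gates, expert_outputs):
    """Gate-weighted sum of per-expert output vectors for one token."""
    dim = len(expert_outputs[0])
    return [sum(g * out[d] for g, out in zip(gates, expert_outputs)) for d in range(dim)]
```

Because unselected experts receive a gate of exactly zero, their outputs need not be computed at all in a real system, which is where the compute savings of sparse routing come from.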
Paperid: 2999,   Poster  
Authors: Weile Chen, Bingchen Miao, Qifan Yu, Wendong Bu, Guoming Wang, Wenqiao Zhang, Shengyu Zhang, Juncheng Li, Siliang Tang
Title: Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have led to promising progress in web agents. However, existing web agents often rely on handcrafted execution pipelines or expensive expert trajectories, limiting their adaptability to complex, dynamic environments. To address these challenges, we propose SCALE (Self-Cognitive-Aware Learning and Exploration), which leverages three advertise roles (selector, predictor, and judger) to autonomously discover the agent's limitations and expand its cognitive boundaries through environment exploration. Moreover, we propose SCALE-Hop, a graph exploration strategy that facilitates global planning and helps agents avoid local exploration traps. To further support learning, we construct SCALE-20k, a large-scale dataset collected from 19 real-world websites, containing diverse task types and structured demonstrations generated from SCALE’s exploration traces. Experimental results show that our approach significantly improves the performance and generalization of multiple MLLMs in various web environments. Our framework offers a scalable and generalizable solution for building truly autonomous and adaptive web agents.
Paperid: 3000,   Poster  
Authors: Xuanang Gao, Ning Zhiwei, Gengming Zhang, Jiaxi Cao, Runze Yang, Zhonglong Zheng, JIE YANG, Rong Xiao, Wei Liu
Title: RoSAMDepth: Robust Self-supervised Depth Estimation Leveraging Segment Anything Model
Abstract: Robust depth estimation aims to maintain high-quality depth predictions across diverse conditions. However, most existing methods estimate depth without taking object-level information into account. As a result, the predicted depth may easily deviate within objects and become blurred under adverse conditions. To overcome this weakness, we propose RoSAMDepth, a novel framework that can assist robust self-supervised depth estimation in leveraging rich and diverse object-level priors from the Segment Anything Model (SAM). We focus on incorporating object-level information across three key aspects: a segment-guided representation contrasting method that injects object-level awareness into the feature representation space; an adaptive regional outlier masking strategy combined with a regional Gaussian likelihood loss that enforces regional depth smoothness; and an object-level reliability estimation strategy that mitigates the influence of unreliable supervision. Extensive experiments across multiple datasets and diverse weather conditions demonstrate that our method produces sharper, more accurate depth predictions, consistently outperforming state-of-the-art methods.
Paperid: 3001,   Poster  
Authors: Zhuohan Liu, Wujian Peng, Yitong Chen, Zuxuan Wu
Title: Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization
Abstract: Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts (covering attribute bindings, object relationships, counting) remains challenging. To address this, we propose BIDPO, a framework to enhance T2I models' capability for compositional text-to-image generation. We begin by introducing a carefully designed pipeline to construct a large-scale preference dataset, BICOMP, with strict quality control. Then, we extend Diffusion DPO to jointly optimize image and text preferences, which is shown to be highly effective in improving the models' ability to follow complex text prompts during generation. To further enhance the models for fine-grained alignment, we employ a region-level guidance method to focus on regions relevant to compositional concepts. Experimental results demonstrate that our BIDPO substantially improves compositional fidelity, consistently outperforming prior methods across multiple benchmarks. Our approach highlights the potential of preference-based fine-tuning for complex text-to-image tasks, offering a flexible and scalable alternative to existing techniques.
Paperid: 3002,   Poster  
Authors: Sohwi Lim, Lee Hyoseok, Jungjoon Park, Tae-Hyun Oh
Title: CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
Abstract: Human perception of visual similarity is inherently adaptive and subjective, depending on the users’ interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process and visual feature extraction, allowing highly efficient and multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset, CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves state-of-the-art retrieval accuracy and notable computational efficiency compared to previous works.
Paperid: 3003,   Poster  
Authors: Linfei Li, Pei Tan, Siqi Li, Changqing Zou, Yue Gao
Title: Hyper-PCN: Hypergraph-based Point Cloud Completion via High-order Correlation Modeling
Abstract: Point cloud completion is an important yet challenging problem in 3D computer vision, which aims to reconstruct complete and dense 3D shapes from partial point clouds. Although transformer-based and geometry-based approaches have made significant progress, they often struggle to capture the complex, high-order correlations inherent in point clouds. To address this limitation, we propose Hyper-PCN, a point cloud completion framework that leverages hypergraphs to explicitly model complex, higher-order correlations within incomplete inputs for more accurate completion. It comprises two key modules: Hyper Refinement Stack, designed to progressively capture coarse-to-fine high-order correlations through a series of hypergraph learning stages, and Anchor-based Hypergraph Neural Network, which employs a two-stage sampling strategy to construct collaborative hypergraphs, ensuring robust modeling of global structures. Extensive experiments on multiple datasets demonstrate that our approach consistently outperforms state-of-the-art methods.
Paperid: 3004,   Poster  
Authors: Biplab Das, Shouvik Das, Viswanath Gopalakrishnan
Title: Grid Distillation: Compositional Image Distillation via Structured Generative Grids
Abstract: We present Grid Distillation, a generative dataset distillation framework that compresses large-scale datasets into a compact set of informative synthetic samples. Our method constructs high-resolution compositional grids via spectral submodular optimization, which injects world knowledge from CLIP representations to maximize semantic coverage and diversity. These grids are then downsampled into low-resolution distilled images optimized for diversity and representational efficiency. During training, a single-step diffusion reconstruction (based on Stable Diffusion Turbo) restores fine-grained spatial details from diffusion priors, bridging the gap between compact representations and natural image statistics. A grid-aware cropping strategy further enhances discriminability by probabilistically aligning crops with grid boundaries, maintaining compatibility with standard 224×224 inference inputs. Experiments on ImageWoof, ImageNette, ImageIDC, and ImageNet-1K demonstrate consistent improvements over existing dataset distillation methods across multiple IPC settings.
Paperid: 3005,   Poster  
Authors: S Divakar Bhat, Amit More, Mudit Soni, Bhuvan Aggarwal
Title: AdaPrior: Bayesian-Inspired Adaptive Prior Correction for Long-Tailed Continual Learning
Abstract: Long-Tail Class Incremental Learning (LTCIL) combines two fundamental challenges: catastrophic forgetting of past tasks and severe class imbalance. Existing approaches mitigate one challenge at a time, through rehearsal, reweighting, or classifier alignment, but they typically assume static priors and rely on multi-stage training. In contrast, we propose AdaPrior, a simple Bayesian framework that treats LTCIL as a problem of dynamic prior misalignment. Our key idea is to estimate model-induced priors online via an exponential moving average and use them for (i) debiasing during training (AdaPrior Loss), and (ii) lightweight post-hoc correction at inference. The combined approach unifies loss-level and inference-level debiasing without additional stages or heavy computation. We provide theoretical analysis showing that AdaPrior’s prior estimator converges to the true model prior and that its logit adjustment yields well-calibrated posteriors under mild assumptions. Extensive experiments on CIFAR100-LT, Food-101-LT, ImageNet-LT-subset, and iNaturalist18-subset demonstrate consistent gains over recent LTCIL baselines. Beyond accuracy, AdaPrior improves calibration and forgetting curves, making it a practical and scalable solution for long-tail continual learning.
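The key idea above, an online prior estimate via an exponential moving average followed by a logit adjustment, can be sketched as below. The abstract does not give the exact estimator or loss, so this assumes the "model-induced prior" is the EMA of the batch-mean softmax output and that the correction subtracts the log-prior from the logits (the standard logit-adjustment form). The class name, momentum value, and `eps` are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class EmaPrior:
    """Tracks an EMA of the model's average predicted class distribution.

    Hypothetical sketch, not the AdaPrior implementation.
    """

    def __init__(self, num_classes, momentum=0.9):
        self.momentum = momentum
        self.prior = [1.0 / num_classes] * num_classes  # start from uniform

    def update(self, batch_logits):
        # Batch-mean predicted probability per class.
        probs = [softmax(logits) for logits in batch_logits]
        batch_mean = [sum(p[c] for p in probs) / len(probs) for c in range(len(self.prior))]
        self.prior = [
            self.momentum * old + (1.0 - self.momentum) * new
            for old, new in zip(self.prior, batch_mean)
        ]

    def adjust(self, logits, eps=1e-12):
        # Post-hoc correction: subtract the log of the estimated prior.
        return [z - math.log(p + eps) for z, p in zip(logits, self.prior)]
```

Classes the model over-predicts accumulate a larger estimated prior and are therefore penalized more by `adjust`, which shifts predictions back toward under-represented classes.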
Paperid: 3006,   Poster  
Authors: Dongjun Liu, Weichen Dai, Jingsheng Qian, Honggang Liu, Hangjie Yi, Wanzeng Kong
Title: Linguistic Priors for Visual Decoupling: Towards Symmetric Vision-Brain Alignment
Abstract: Brain visual decoding aims to recognize and reconstruct perceptual visual content from neural activity, representing a promising avenue for developing brain-computer interfaces and building brain-inspired artificial intelligence. However, this task faces a fundamental challenge of information asymmetry: while natural images contain complex visual scenes with objects and backgrounds, the corresponding brain signals reflect focused attention on central objects while being contaminated by various neural noise. Previous methods that directly align visual and brain representations often overlook this inherent asymmetry, resulting in suboptimal decoding performance. To address this, we propose a linguistic-prior-guided visual decoupling method, which introduces object-oriented textual descriptions as semantic guidance to explicitly decouple foreground objects from complex backgrounds in natural images, thereby establishing symmetric vision-brain alignment. This design enables the model to automatically focus on task-relevant visual concepts while effectively filtering out irrelevant neural noise in brain signals, achieving a transition from asymmetric feature alignment to semantic symmetric alignment. Extensive experiments on the THINGS-EEG and THINGS-MEG datasets demonstrate that our method achieves new state-of-the-art performance in the zero-shot brain-to-image retrieval task.
Paperid: 3007,   Poster  
Authors: zhi zhu, YaoQi Fan, Zhe Chen, Yue Cao, Yangzhou Liu, Tong Lu
Title: Will Multimodal Models Be Dazzled by Multi-Image Visual Puzzles?
Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has revealed the limitations of existing benchmarks in evaluating complex reasoning over multiple images. To address this gap, we introduce MIRACLE, a novel benchmark for Multi-Image complex Reasoning And Comprehension Logic Evaluation, featuring 4,000 questions across diverse reasoning types such as visual comparison, temporal sequencing, and spatial relations, with each question involving an average of seven tightly correlated images. MIRACLE emphasizes strong inter-image dependencies through a systematic data collection process, followed by delicate instance grouping and question design that enforce cross-image reasoning. Evaluation on leading MLLMs shows that even top-performing models like Gemini-2.5-Pro achieve only 55.91%, highlighting the significant challenges of multi-image reasoning. Moreover, in scenarios characterized by high visual information density, such as puzzle tasks and ultra multi-image input conditions, all models exhibit a significant drop in performance, which highlights the limitations of MLLMs in handling complex structural relations and collaborative reasoning, revealing deficiencies in their cognitive capabilities under high-load visual reasoning settings. We hope MIRACLE will inspire the community to push the boundaries of multi-image reasoning. The benchmark will be released.
Paperid: 3008,   Poster  
Authors: Dimitrios Katsikas, Nikolaos Passalis, Anastasios Tefas
Title: Your Dissimilarities Define You: Complementary Learning Exploiting Class Diversities
Abstract: In this work, we exploit class dissimilarities in a novel way that provides complementary learning information beyond correct classification, information that is not fully utilized in existing learning paradigms. To model these dissimilarities, we introduce the concept of an opposite-class, which consists of everything that is not part of a corresponding class, i.e., all samples from non-target classes or samples from unknown classes. By setting appropriately encoded target distributions over the non-target classes, we explicitly optimize the model’s activation distributions across all non-target classes, which enhances class dissimilarity information and enables better control over the geometry of the learned representations. We analyze the convergence dynamics of our proposed approach, both theoretically and empirically, showing that it naturally pushes the representations towards neural collapse, leading to more discriminative and robust features. Our extensive evaluation across multiple classification settings demonstrates consistent improvements of our method on closed-set, open-set, few-shot classification, and domain generalization benchmarks. Our code is available at: (withheld for review, demo in supplementary material).
Paperid: 3009,   Poster  
Authors: Xingyang Li, Samuel Tesfai, Zhekai Zhang, Haocheng Xi, Shuo Yang, Lvmin Zhang, Yufei Sun, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Jun-Yan Zhu, Song Han, Yujun Lin, Muyang Li
Title: DeltaQuant: 4-bit Video Diffusion Models with Spatiotemporal Delta Smoothing
Abstract: Video diffusion models have achieved remarkable generative performance, but their substantial computational and memory costs pose significant challenges for deployment, especially on consumer GPUs. As recent advances in attention optimization mitigate previous computational bottlenecks, linear layers now dominate both computational cost and inference memory. In this work, we focus on quantizing both weights and activations to 4 bits to accelerate these layers. Previous methods, such as SVDQuant, overlook the highly dynamic nature of activations across denoising timesteps, where outlier channels and magnitudes vary dramatically. However, video data inherently exhibits strong activation similarity among neighboring tokens in space and time, which we term spatiotemporal activation similarity, analogous to how video codecs exploit intra- and inter-frame redundancy. Leveraging this property, we introduce DeltaQuant, which partitions activations into local 3D spatiotemporal cubes and uses each cube's mean token as a core token, quantizing only the small differences (delta tokens) to 4 bits while keeping core tokens in FP8. This decomposition substantially reduces quantization error with minimal overhead. For weight quantization, DeltaQuant incorporates SVDQuant's low-rank decomposition to further reduce quantization error. We also implement an efficient kernel that translates DeltaQuant's computational benefits into real-world speedups. Extensive experiments on Wan 2.2 I2V, Wan 2.2 T2V, and LTX-Video T2V demonstrate that DeltaQuant maintains high generation fidelity. On Wan 2.2, it compresses model size by 2.9× and reduces memory footprint by 2.3×. DeltaQuant is compatible with efficient attention mechanisms and few-step distillation. When integrated with these techniques, it achieves an additional 3.0× acceleration, for a total 111.8× end-to-end speedup. Code and models will be released upon publication.
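The core-plus-delta decomposition described above can be illustrated for a single cube. This is a simplified sketch and not the DeltaQuant kernel: the cube is a flat list of token vectors, the core token is kept as a Python float rather than FP8, and the deltas use symmetric signed 4-bit codes with one per-cube scale. All function names and the scaling scheme are assumptions.

```python
def quantize_int4(values):
    """Symmetric 4-bit quantization: integer codes in [-8, 7] plus one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def delta_quantize_cube(cube):
    """Split a cube of tokens into a mean (core) token and 4-bit deltas."""
    n, dim = len(cube), len(cube[0])
    core = [sum(tok[d] for tok in cube) / n for d in range(dim)]
    # Deltas are small when neighbouring tokens are similar.
    flat_delta = [tok[d] - core[d] for tok in cube for d in range(dim)]
    codes, scale = quantize_int4(flat_delta)
    return core, codes, scale


def dequantize_cube(core, codes, scale):
    """Reconstruct tokens as core + dequantized delta."""
    dim = len(core)
    return [
        [core[d] + codes[t * dim + d] * scale for d in range(dim)]
        for t in range(len(codes) // dim)
    ]
```

Because the deltas span a much smaller range than the raw activations, the same 16 quantization levels cover them at a finer step size, which is the intuition behind the reduced quantization error.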
Paperid: 3010,   Poster  
Authors: Sirui Xu, Samuel Schulter, Morteza Ziyadi, Xialin He, Xiaohan Fei, Yu-Xiong Wang, Liangyan Gui
Title: InterPrior: A Scalable Motion Prior for Physics-Based Human-Object Interactions
Abstract: Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified control policy, i.e., an interaction motion prior, through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multi-modal and partially specified goal cues. A targeted diversity process, combining data augmentation and physical perturbations, broadens exposure to varied contact and object conditions, producing a motion prior that generalizes beyond the training data. To address the vast configuration space of large-scale human-object interaction, reinforcement learning finetuning enhances competence on unseen goals, enabling recovery from unsuccessful grasps. The resulting policy acts as a reusable motion prior that can absorb new behaviors, including interactions with unseen objects. We also show its effectiveness in user-interactive control and across different embodiments.
Paperid: 3011,   Poster  
Authors: Yufei Li, Long Tian, Yuyang Dai, Wenchao Chen, Liang Bao, Xiyang Liu
Title: FastRef: Fast Prototype Refinement for Few-shot Industrial Anomaly Detection
Abstract: Few-shot industrial anomaly detection (FS-IAD) presents a critical challenge for practical automated inspection systems operating in data-scarce environments. While existing approaches predominantly focus on obtaining prototypes from limited normal images, they neglect to systematically incorporate statistics of the query image to enhance prototype representativeness. To address this issue, we propose FastRef, a novel and efficient prototype refinement framework for FS-IAD. Our method operates through an iterative two-stage process during inference: (1) characteristic transfer from query features to the enhanced prototypes, and (2) anomaly suppression by aligning the enhanced prototypes with their normal counterparts. The characteristic transfer is achieved through linear reconstruction of query features from prototypes with an optimizable transport matrix, while the anomaly suppression addresses a key observation in FS-IAD: unlike conventional IAD with abundant normal prototypes, the limited-sample setting makes anomaly reconstruction more probable in characteristic transfer. Therefore, we employ optimal transport to measure and minimize the gap between prototypes and their enhanced counterparts for anomaly suppression. For comprehensive evaluation, we integrate FastRef with three competitive prototype-based FS-IAD methods: PatchCore, WinCLIP, and AnomalyDINO. Extensive experiments across four benchmark datasets (MVTec, ViSA, MPDD, and RealIAD) demonstrate both the effectiveness and efficiency of our approach under 1-, 2-, and 4-shot settings.
Paperid: 3012,   Poster  
Authors: Jiahao Zhou, Chenghao Xu, Wei Wang, Erkun Yang, Cheng Deng
Title: EEGiT: Teaching Vision Transformers to Understand the EEG signal
Abstract: Decoding visual stimuli from electroencephalography (EEG) signals is a crucial step toward practical brain–computer interfaces (BCIs). However, this task requires large-scale and high-quality EEG–image paired datasets. Compared with abundant image data, the limited EEG recordings restrict the decoding models’ performance. To address this challenge, we propose EEGiT, a framework that converts sequential EEG signals into image-like EEG patches and enables the direct use of a pretrained Vision Transformer (ViT) as the EEG encoder. To preserve the spatial topology of brain regions and minimize distributional differences across channels, we group EEG electrodes according to anatomical structures and apply linear interpolation along the spatial dimension. We then resample the EEG signals to align the structure of EEG patches with that of image patches in ViT. This design encourages effective transfer of visual priors learned from large-scale image datasets to EEG representation learning. Experiments on the THINGS-EEG and EEG-3D datasets show that fine-tuning pretrained ViTs improves EEG-to-image retrieval and EEG-based visual classification, while maintaining robustness and strong cross-subject generalization. These results demonstrate a promising direction for leveraging powerful vision models to mitigate data scarcity in EEG decoding.
Paperid: 3013,   Poster  
Authors: Kaiwen Huang, Yi Zhou, Yizhe Zhang, Jingxiong Li, Tao Zhou
Title: SemiGDA: Generative Dual-distribution Alignment for Semi-Supervised Medical Image Segmentation
Abstract: Semi-supervised learning addresses label scarcity and high annotation costs in medical image segmentation by exploiting the latent information in unlabeled data to enhance model performance. Traditional discriminative segmentation relies on segmentation masks, neglecting feature-level distribution constraints. This limits robust semantic representation learning and adaptive modeling of unlabeled data in scenarios with few labels. To address these limitations, we propose SemiGDA, a novel Generative Dual-distribution Alignment framework for semi-supervised medical image segmentation. Our SemiGDA overcomes the reliance of discriminative methods on large labeled datasets by aligning feature and semantic distributions to boost semantic learning and scene adaptability. Specifically, we propose a Dual-distribution Alignment Module (DAM), which employs two structurally distinct encoders to model image and mask feature distributions. It enforces their alignment in the latent space via distributional constraints, establishing structured feature consistency. Moreover, we design a Consistency-Driven Skip Adapter (CDSA) strategy, which introduces dual skip adapters (Image and Mask) to fuse multi-scale features via skip connections. Using a consistency loss, CDSA enhances cross-branch semantic alignment and reinforces fine-grained semantic consistency. Experimental results on diverse medical datasets show that our method outperforms other state-of-the-art semi-supervised segmentation methods.
Paperid: 3014,   Poster  
Authors: Jianxun Mi, Lu Pan, Weisheng Li
Title: Making the Classification Explanation Faithful to the Confidence Score
Abstract: Deep Neural Networks (DNNs) have revolutionized numerous industries, yet their decision-making processes remain largely opaque. Most existing explanation methods visualize the importance of image regions that influence a classifier's decisions, but they predominantly focus on identifying regions with positive contributions, often overlooking those with negative impacts. In this paper, we introduce a novel black-box explanation method, the Metropolis-Hastings Explainer (MHE), designed to provide confidence-faithful explanations. MHE enhances the fidelity of explanations by ensuring that the explained regions closely align with the original confidence score, sampling instances that best match the classifier’s confidence. Furthermore, MHE improves sampling efficiency by utilizing existing valid samples to explore more potential valid ones, reducing computational overhead. To enhance the clarity of explanations, MHE prioritizes valid samples with smaller areas when other factors are equal, thereby reducing the explanation area. Building upon the MHE framework, we propose two extensions: MHE-e, which focuses exclusively on regions with positive contributions, and MHE-pro, which refines explanation quality by integrating multi-scale information. MHE-pro progressively regions, optimizing both sampling efficiency and explanation quality. Experimental results demonstrate that MHE delivers superior and stable explanation quality across various models, including ResNet50, VGG16, ViT DINO, and CLIP, on datasets such as ImageNet, CUB-200-2011, and VOC2012, providing explanations that closely approximate the original classification confidence.
Paperid: 3015,   Poster  
Authors: Yuhao Qing, Yueying Wang, Chaoyang Chen, Weidong Zhang, Jie Wen, Xin Xu
Title: S2C2Seg: Semantic-Spatial Consistency and Category Optimization for Open-Vocabulary Segmentation
Abstract: Open-vocabulary semantic segmentation extends pixel-level recognition to arbitrary text-described categories. Despite strong global semantic understanding, vision-language models such as CLIP exhibit limited spatial precision and semantic ambiguity across large vocabularies, constraining their effectiveness for dense prediction. We present S2C2Seg, a training-free framework that integrates with existing methods through Category Subset Selection (CSS) and Consistent Semantic Guidance (CSG). CSS employs three complementary scoring functions to filter category candidates: CLIP-based global semantic similarity, spatial presence from dense prediction models, and multi-view consistency via alignment and conditional entropy. This joint exploitation of semantic, spatial, and consistency cues reduces category redundancy and semantic ambiguity. CSG adaptively fuses CLIP global features with local spatial predictions through category-specific confidence weighting, applying stronger regularization to high-similarity categories for correcting prediction biases while preserving spatial precision for low-confidence categories. Extensive experiments across eight benchmarks demonstrate broad applicability: when integrated with SCLIP, ProxyCLIP, and CorrCLIP, S2C2Seg achieves consistent improvements of 3.4 to 9.7 percentage points in mIoU, establishing a new state-of-the-art of 51.2% average mIoU.
Paperid: 3016,   Poster  
Authors: Zhao-Min Chen, Xinjian Huang, Yisu Ge, Yu Li
Title: SFR-Net: Steering-Fusion-Refining Network in Multi-label Zero-Shot Sewer Defect Detection
Abstract: Due to the prohibitive cost of data annotation and the impossibility of exhaustively enumerating all defect categories, municipal sewer pipe defect detection poses significant generalization challenges for traditional models. Multi-Label Zero-Shot Learning (ML-ZSL) offers a viable solution to this challenge. However, existing methods struggle to establish robust and fine-grained visual-semantic alignment between the complex visual environment inside the pipes and the often sparse semantic descriptions, leading to a critical issue: Alignment Ambiguity. To mitigate this, we propose a novel Steering-Fusion-Refining Network (SFR-Net) that follows a three-stage paradigm to progressively dissolve this ambiguity. Specifically, the Representation Steering (RS) module first integrates a parameter-efficient feature steering mechanism to continuously adapt the representation to the pipe scene; the Multi-Granularity Evidence Fusion (MEF) module subsequently aggregates unambiguous multi-granularity visual evidence through decoupled global and local paths; and the Generalized Relational Score Refining (GR) module ultimately learns and transfers relational logic from seen defects to gain a universal score correction ability, directly refining preliminary prediction scores and significantly boosting the model’s zero-shot generalization and prediction consistency. Extensive experiments on the public Sewer-ML dataset and our private WZ-Pipe dataset demonstrate that the proposed SFR-Net achieves state-of-the-art (SOTA) performance on the multi-label zero-shot learning task.
Paperid: 3017,   Poster  
Authors: Youngho Yoon, Wonjune Cho, Hyunho Ha, Sujung Kim, Kuk-Jin Yoon
Title: ExPose: Reinforcing Video Generation Models for Extreme Pose Estimation
Abstract: Pose estimation remains challenging under sparse views, especially when visual overlap across images is extremely limited. Recent advances in video generation models offer a promising solution by enabling keyframe interpolation, which can enrich contextual cues and improve pose estimation performance. However, existing video generation models often lack 3D consistency, producing temporally plausible but spatially inconsistent frames that degrade downstream pose estimation. In this paper, we propose ExPose, a framework that directly addresses 3D inconsistency when applying video generation to pose estimation in extreme-view settings. Specifically, we fine-tune a video generation model using Group Relative Preference Optimization (GRPO), aligning its outputs with 3D-consistent supervisory signals derived from pose estimation objectives. Our approach not only enhances the quality of temporal interpolation, but also ensures spatial coherence across views, significantly improving pose estimation accuracy. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines, highlighting the potential of preference-optimized video generation as a powerful tool for pose estimation in extreme-view scenarios.
Paperid: 3018,   Poster  
Authors: Seongyu Kim, Seungwoo Lee, Hyeonggon Ryu, Joon Chung, Arda Senocak
Title: Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions
Abstract: We address the problem of tactile localization, where the goal is to identify image regions that share the same material properties as a tactile input. Existing visuo-tactile methods rely on global alignment and thus fail to capture the fine-grained local correspondences required for this task. The challenge is amplified by existing datasets, which predominantly contain close-up, low-diversity images. We propose a model that learns local tactile–visual alignment via dense cross-modal feature interactions, producing tactile saliency maps for touch-conditioned material segmentation. To overcome dataset constraints, we introduce: (i) in-the-wild multi-material scene images that expand visual diversity, and (ii) a material-diversity pairing strategy that aligns each tactile sample with visually varied yet tactilely consistent images, improving contextual localization and robustness to weak signals. We also construct two new tactile-grounded material segmentation datasets for quantitative evaluation. Experiments on both new and existing benchmarks show that our approach substantially outperforms prior visuo-tactile methods in tactile localization.
Paperid: 3019,   Poster  
Authors: Zhimeng Huang, Rongao Yuan, Junlong Gao, Qi Mao, Siwei Ma, Wen Gao, Chuanmin Jia
Title: Discovering Adaptive Task Dependencies for Efficient Multi-Task Representation Compression
Abstract: Traditional image compression prioritizes pixel fidelity but often preserves details irrelevant to downstream vision tasks. Compressing task-specific representations instead better aligns with task semantics, yet redundant information persists across correlated tasks. Existing multi-task compression methods typically rely on static dependency structures, leading to redundant bit allocation across correlated tasks and suboptimal rate-distortion performance. We present Adaptive Task Dependency Compression (ATDC), a framework that models per-image task relationships and encodes representations following an adaptive directed acyclic graph (DAG). ATDC infers pairwise task predictability via a learned correlation matrix, constructs a dynamic DAG to determine the optimal compression order, and encodes each task conditionally on its predecessors, achieving predictive redundancy removal and asymmetric information sharing across tasks. Experiments on the Taskonomy dataset demonstrate consistent gains in rate–distortion efficiency and task accuracy over both human-oriented codecs and state-of-the-art multi-task compression methods. The learned DAGs reveal interpretable, content-dependent task hierarchies, establishing adaptive dependency modeling as a principled paradigm for multi-task representation compression.
Paperid: 3020,   Poster  
Authors: Zehua Zang, Xi Wang, Fuchun Sun, Xiao Xu, Lixiang Liu, Jiahuan Zhou, Jiangmeng Li
Title: Test-Time Perturbation Tuning with Delayed Feedback for Vision-Language-Action Models
Abstract: Vision-Language-Action models (VLAs) achieve strong performance in sequential decision-making but remain fragile to subtle environment shifts, such as small changes in object pose. We attribute this brittleness to trajectory overfitting, where VLAs over-attend to spurious cues and replicate memorized actions. We propose Perturbation learning with Delayed Feedback (PDF), a verifier-free test-time adaptation framework that improves decision performance without fine-tuning the base model. PDF mitigates spurious correlations through uncertainty-based data augmentation and action voting, while an adaptive scheduler allocates augmentation budgets to balance performance and efficiency. To further improve stability, PDF learns a lightweight perturbation module that retrospectively adjusts action logits guided by delayed feedback, correcting high-confidence errors. Experiments on LIBERO (+7.4% success rate) and Atari (+10.3 human-normalized score) demonstrate consistent gains in task success over strong VLA, test-time adaptation, and even fine-tuned baselines with minimal overhead, establishing a practical path toward reliable test-time adaptation in multimodal decision-making agents.
Paperid: 3021,   Poster  
Authors: Mingju Gao, Kaisen Yang, Huan-ang Gao, Bohan Li, Ao Ding, Wenyi Li, Yangcheng Yu, Jinkun Liu, Shaocong Xu, Yike Niu, Haohan Chi, Hao Chen, Hao Tang, Yu Zhang, Li Yi, Hao Zhao
Title: PAM: A Pose–Appearance–Motion Engine for Sim-to-Real HOI Video Generation
Abstract: Hand–object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of previous work, we argue that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. We therefore introduce PAM, a Pose–Appearance–Motion Engine for controllable HOI video generation. We validate the engine in four ways: (1) on DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn) and an MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480×720 videos compared to the 256×256/256×384 baselines; (2) on OAKINK2, our full multi-condition model improves FVD from 68.76 to 46.31; (3) an ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results; and (4) for a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real data plus our synthetic data to match the 100% real baseline.
Paperid: 3022,   Poster  
Authors: Hao Zhong, Muzhi Zhu, Shenyan Zeng, Anzhou Li, Cong Chen, Hua Geng, Duochao Shi, Wentao Ye, Tao Lin, Hao Chen, Chunhua Shen
Title: Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
Abstract: Employing multimodal large language models (MLLMs) in 3D physical environments demands complex spatial reasoning capabilities that integrate geometric understanding, viewpoint synthesis, fine-grained perception, and robust depth estimation. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We address this gap through the lens of wide-baseline matching (WBM), i.e., determining whether two views with large viewpoint changes, appearance shifts, and occlusions depict the same scene element. We introduce ReasonMatch-Bench, a comprehensive benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios. Our evaluation reveals substantial gaps between human performance and state-of-the-art MLLMs, particularly for smaller models, highlighting critical deficiencies in spatial reasoning. To bridge this gap, we propose a scalable data generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora (RGB-D videos and SfM reconstructions), providing diverse, verifiable supervision. Leveraging verifiable matching accuracy as rewards, we introduce Dynamic Correspondence Reinforcement Learning (DCRL), combining Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to enable progressive acquisition of sophisticated spatial reasoning without explicit supervision. Extensive experiments demonstrate that our approach significantly enhances MLLMs' spatial reasoning capabilities, narrowing the gap with human performance on complex 3D understanding tasks.
Paperid: 3023,   Poster  
Authors: Zuyan Zhao, Zhenliang He, Meina Kan, Shiguang Shan, Xilin Chen
Title: UniPercept: A Unified Diffusion Model for Generalizable Visual Perception
Abstract: Diffusion models have shown impressive performance in generative tasks, demonstrating their ability to capture detailed structural and semantic information. Recently, these capabilities have been extended to visual understanding, with studies employing diffusion models as the backbone for various perception tasks. However, existing diffusion-based perception models are generally restricted to a single task or a fixed set of predefined tasks, lacking an efficient mechanism to generalize to novel tasks. To overcome this limitation, we propose a unified DiT-based perception framework called UniPercept, which introduces a novel foundation–adapter paradigm for general visual perception. In this framework, a shared diffusion-based foundation model is trained to capture common and generalizable visual knowledge across diverse perception tasks, with task-specific adapters integrated for each individual task. Leveraging its superior generalization ability, the foundation model can be efficiently adapted to novel domains through lightweight adapters, requiring as few as 1,000 training samples and less than 1% of trainable parameters. Furthermore, UniPercept demonstrates strong performance across various perception tasks, outperforming state-of-the-art generalist models in most cases and achieving accuracy comparable to specialist models.
Paperid: 3024,   Poster  
Authors: Panwang Pan, Chenguo Lin, Chenxin Li, Jingjing Zhao, Yuchen Lin, Haopeng Li, Yunlong Lin, Kairun Wen, Yixuan Yuan, Yadong Mu
Title: Diff4Splat: Repurposing Video Diffusion Models for Dynamic Scene Generation
Abstract: We introduce Diff4Splat, a feedforward framework for dynamic scene generation from a single image. Our method synergizes the powerful generative priors of video diffusion models with geometric and motion constraints learned from a large-scale 4D dataset. Given a single image, a camera trajectory, and an optional text prompt, our model directly predicts a dynamic scene represented by a deformable 3D Gaussian field. This approach captures appearance, geometry, and motion in a single pass, eliminating the need for test-time optimization or post-hoc processing. At the core of our framework is a video latent transformer that enhances existing video diffusion models, enabling them to jointly model spatio-temporal dependencies and predict 3D Gaussian Primitives over time. Supervised by objectives targeting appearance fidelity, geometric accuracy, and motion consistency, Diff4Splat generates high-fidelity dynamic scenes within 30 seconds. We demonstrate the effectiveness of Diff4Splat across video generation, novel view synthesis, and geometry extraction, where it matches or surpasses optimization-based methods for dynamic scene synthesis while being significantly more efficient.
Paperid: 3025,   Poster  
Authors: Haozhe Chen, Rui Li, 正宝 王, Xinhao Zhu, Linjie Li, Tianyu Xiong, Xuan Ouyang, Jiaqi Yang
Title: Topology-aware Feature Propagation for Unsupervised Non-rigid Point Cloud Correspondence
Abstract: Unsupervised non-rigid point cloud correspondence aims to predict point-to-point correspondences without annotations. Existing methods rely on a spatial-relation-based feature propagation strategy that includes non-physical connections, which are sensitive to non-rigid deformation. To address this issue, we advocate learning shape topology that is robust to non-rigid deformation, and propose a topology-aware feature propagation module integrated into a coarse-to-fine propagation and optimization pipeline. To extract point features robust to non-rigid deformation, we estimate keypoints as superpoints and encode superpoint features with topology weights, which learn reasonable topologies under non-rigid deformation. A vector quantization codebook is leveraged to enhance the original superpoint features with stored representative features across the dataset, improving feature robustness against shape variance. Robust point-wise correspondence is obtained after coarse-to-fine feature fusion and efficient test-time optimization. Extensive experiments on multiple benchmarks demonstrate the state-of-the-art performance of our method.
Paperid: 3026,   Poster  
Authors: Chaoyue Xing, Wei Mao, Miaomiao Liu
Title: InterPhys: Physics-aware Human Motion Synthesis in a Dynamic Scene
Abstract: This paper tackles the problem of physics-aware human motion synthesis in a dynamic scene. Existing works tend to generate physically unrealistic motions due to limited contact modeling, which is typically restricted to hands. In contrast, we introduce a physics-aware human motion generation framework that explicitly models the full spectrum of human-related forces, including human-object, human-scene, and internal body dynamics. Our method imposes soft physical constraints to maintain force and torque balance, ensuring physically grounded motion synthesis. We further propose a novel continuous distance-based force model that generalizes contact modeling to arbitrary surfaces, capturing interactions not only with static environments but also with dynamic, moving objects. Extensive experiments show that our approach significantly improves physical plausibility and generalizes well to complex scenes, setting a new benchmark for physically consistent human motion generation.
Paperid: 3027,   Poster  
Authors: Yiming Cao, Dong Wang, Xinqi Lyu, Bin Xiao
Title: PureProof: Diffusion-Resistant Black-box Targeted Attack on Large Vision-Language Models
Abstract: Large Vision-Language Models (VLMs) exhibit impressive multimodal capabilities and are widely deployed, yet remain vulnerable to targeted adversarial attacks. However, the practical robustness of such attacks often remains unclear, as they are rarely evaluated under defenses. Diffusion-based purification (DBP), a widely adopted black-box defense for VLMs, effectively blocks current attacks by removing adversarial perturbations via generative diffusion. Prior DBP evasion methods are designed for white-box image classifiers and are ill-suited for attacking VLMs. Even when adapted, they face high computational costs and potentially vanishing or exploding gradients from backpropagating through deep diffusion steps, as well as gradient instability due to diffusion’s stochasticity. To address these challenges, we present PureProof, a black-box targeted attack on VLMs resilient to DBP. PureProof introduces Stochastic Reverse Alignment, using a single-step reverse prediction to efficiently guide adversarial optimization while avoiding costly and unstable full-trajectory backpropagation. To mitigate diffusion stochasticity, we employ Adaptive Re-noising Augmentation, which re-noises intermediate predictions in a timestep-adaptive manner to smooth the optimization landscape, complemented by Self-Consistency Regularization to promote local temporal coherence. Extensive experiments on open-source and commercial VLMs show that PureProof consistently outperforms prior attacks against DBP, achieves strong noise resilience, and remains highly effective without defenses, revealing critical vulnerabilities of VLMs and offering implications for future model safety.
Paperid: 3028,   Poster  
Authors: Zhi-Yi Lin, Thomas Markhorst, Jouh Yeong Chew, Xucong Zhang
Title: PolySLGen: Online Multimodal Speaking–Listening Reaction Generation in Polyadic Interaction
Abstract: Human-like multimodal reaction generation is essential for natural group interactions between humans and intelligent embodied AI. However, existing approaches are often limited to single-modality or speaking-only responses in dyadic interactions, making them unsuitable for realistic social scenarios. Many also overlook nonverbal cues and the complex dynamics of polyadic interactions, both critical for engagement and conversational coherence. In this work, we present PolySLGen, an online framework for Polyadic multimodal Speaking and Listening reaction Generation. Given past conversation and motion from all participants, PolySLGen generates a future speaking or listening reaction for a target participant, including speech, body motion, and speaking-state score. To model group interactions effectively, we propose a pose fusion module and a social cue encoder that jointly aggregate motion and social signals from the group. Extensive experiments, along with quantitative and qualitative evaluations, show that PolySLGen produces contextually appropriate and temporally coherent multimodal reactions, outperforming several adapted and state-of-the-art baselines in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism. Source code will be made publicly available.
Paperid: 3029,   Poster  
Authors: Bingyi Cao, Koert Chen, Kevis-Kokitsi Maninis, Kaifeng Chen, Arjun Karpur, Ye Xia, Sahil Dua, Tanmaya Shekhar Dabral, Guangxing Han, Bohyung Han, Joshua Ainslie, Alex Bewley, Mithun Jacob, René Wagner, Washington Ramos, Krzysztof Choromanski, Mojtaba Seyedhosseini, Howard Zhou, André Araujo
Title: PANDA: Pretraining for vision ANd language with Dense Alignment
Abstract: Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment; surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop PANDA (Pretraining for vision ANd language with Dense Alignment), a new family of image-text encoder models suitable for a wide range of downstream applications. Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models.
Paperid: 3030,   Poster  
Authors: Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, Wenbing Huang
Title: STAR-R1: Multi-View Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Abstract: Multimodal Large Language Models (MLLMs) remain far from human-level performance in multi-view spatial reasoning, where models must establish object correspondences across views and infer coherent scene semantics. We analyze this limitation through the Transformation-Driven Visual Reasoning (TVR) task and find that Supervised Fine-Tuning (SFT) fails to capture cross-view consistency, whereas reinforcement learning (RL) fails to reliably identify key referential objects. To bridge this gap, we introduce multi-View Spatial TrAnsformation Reasoning (STAR-R1), a two-stage framework that combines process-supervised SFT with a referential-aware RL paradigm. STAR-R1 first learns structured spatial reasoning trajectories from high-quality CoTs and then uses fine-grained rewards on referential selection and answer correctness to encourage effective exploration and robust scene interpretation. Despite using only a small amount of high-quality training data, STAR-R1 surpasses state-of-the-art models trained with far more data on the multi-view spatial understanding benchmarks TVR, MMSI-Bench, MindCube-Bench, and SPAR-Bench. Our study reveals the overlooked potential of RL in multi-view spatial understanding and points a way toward more human-like spatial reasoning in MLLMs.
Paperid: 3031,   Poster  
Authors: Shenghai Yuan, Wei Yihan, Jason Yee, Zhuoran Qiao, Boyang Lou, Enwen Hu
Title: Adaptive 3D Perception Under Sparse Sampling via Reinforcement Learning
Abstract: Detecting small aerial targets (SATs) from long-range LiDAR is challenging because point density changes dramatically with motion: fast flights produce ultra-sparse returns, while hovering or slow motion yields dense local clusters, breaking fixed-voxel and static-threshold assumptions in standard 3D detectors and trackers. We introduce A3PRL, an RL-driven adaptive perception framework that closes the loop between LiDAR sensing and tracking. A3PRL builds on a sparsity-aware proposal stage with Temporal Dispersion Signatures and velocity-change cues, and deploys a lightweight 5D policy that jointly adjusts voxel resolution, detection sensitivity, and association gating based on purely label-free statistics summarizing spatio–temporal sparsity, foreground acceptance, and tracking continuity. The policy is trained with privileged supervision from ground-truth trajectories to shape a reward that balances geometric accuracy, temporal stability, and regularized acceptance, but runs fully label-free at test time. On the public MMAUD benchmark, training on V1 and evaluating on unseen V2/V3 domains, A3PRL reduces 3D localization error by about 19% compared to its non-RL counterpart and consistently outperforms LiDAR-only and multimodal baselines under both day and night conditions. We further show that the same policy transfers to an in-house LiDAR–RTK setup and a public multi-LiDAR SAT dataset with heterogeneous scan patterns, where it maintains accurate trajectories and stable tracks under varying sparsity, while adding less than 2 ms per frame on a 10 Hz LiDAR budget.
Paperid: 3032,   Poster  
Authors: Yuyao Zhang, Alexander Huang-Menders, Yu-Wing Tai
Title: HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing
Abstract: High-resolution image editing is essential for professional and creative applications, yet existing multimodal diffusion-based editors remain computationally inefficient and constrained to relatively low resolutions. Current approaches redundantly process the entire image canvas or rely on large-scale high-resolution datasets, resulting in substantial training and inference costs. We introduce HierEdit, a region-aware hierarchical diffusion framework designed for efficient and scalable high-resolution image editing. Our method first performs edits on a low-resolution proxy using an off-the-shelf editing model to generate a reference and to localize the modified regions. A hierarchical local-window diffusion model (Local-Window MMDiT) then refines only the edited regions within the original high-resolution image, while reusing the unaltered regions as conditioning inputs. The low-resolution proxy further provides structural guidance and intermediate denoising supervision (Inference Acceleration), ensuring consistent global semantics and stable generation without the need for full-resolution attention computation. This targeted and hierarchical design enables fast, high-fidelity editing of images up to 4K resolution without requiring any specialized high-resolution training data. Extensive experiments demonstrate that HierEdit achieves competitive visual quality on commodity-resolution datasets while significantly accelerating inference and extending seamlessly to ultra-high-resolution 4K editing.
Paperid: 3033,   Poster  
Authors: Shengkai Sun, Zhiyong Cheng, Zefan Zhang, Jianfeng Dong, Zhihui Li, Meng Wang
Title: Exploring Adaptive Masked Reconstruction for Self-Supervised Skeleton-Based Action Recognition
Abstract: Recently, masked skeleton reconstruction models have emerged as strong action representation learners, driving significant progress in self-supervised skeleton-based action recognition. However, existing state-of-the-art methods must predict an exceedingly large number of spatiotemporal patches, significantly prolonging training time. Besides, by treating all spatiotemporal regions equally during reconstruction, these models are distracted from learning the critical motion patterns that underlie action semantics. To address these challenges, we propose Adaptive Masked Reconstruction (AMR), a faster and stronger pre-training framework. We first decouple the decoder from the encoder, enabling flexible prediction of larger spatiotemporal patches and dramatically reducing reconstruction complexity. Given that larger patches contain more complex information, which is challenging to predict and consequently degrades performance, we accordingly introduce an adaptive guidance module. This module identifies regions of high motion informativeness, guiding the model to focus on the most discriminative parts of each patch and alleviating reconstruction difficulty. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that AMR not only accelerates pre-training substantially but also improves downstream recognition accuracy, surpassing current state-of-the-art approaches.
Paperid: 3034,   Poster  
Authors: Lingjing Kong, Shaoan Xie, Yang Jiao, Yetian Chen, Yanhui Guo, Simone Shao, Yan Gao, Guangyi Chen, Kun Zhang
Title: Learning by Analogy: A Causal Framework for Compositional Generalization
Abstract: Compositional generalization -- the ability to understand and generate novel combinations of learned concepts -- enables models to extend their capabilities beyond limited experiences. While effective, the data structures and principles that enable this crucial capability remain poorly understood. We propose that compositional generalization fundamentally requires decomposing high-level concepts into basic, low-level concepts that can be recombined across similar contexts, similar to how humans draw analogies between concepts. For example, someone who has never seen a peacock eating rice can envision this scene by relating it to their previous observations of a chicken eating rice. In this work, we formalize these intuitive processes using principles of causal modularity and minimal changes. We introduce a hierarchical data-generating process that naturally encodes different levels of concepts and their interaction mechanisms. Theoretically, we demonstrate that this approach enables compositional generalization supporting complex relations between composed concepts, advancing beyond prior work that assumes simpler interactions like additive effects. Critically, we also prove that this latent hierarchical structure is recoverable (identifiable) from observable data like text-image pairs, a necessary step for learning such a generative process. To validate our theory, we apply insights from our theoretical framework and achieve significant improvements on benchmark datasets.
Paperid: 3035,   Poster  
Authors: Qile Su, Jing Tang, Rui Chen, Lei Sun, Xiangxiang Chu
Title: Video-CoE: Reinforcing Video Event Prediction via Chain of Events
Abstract: Despite advances in the application of MLLMs for various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and establish logical relationships between videos and future events, which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including a lack of logical reasoning ability for future event prediction and insufficient utilization of visual information. To address these challenges, we propose the Chain of Events (CoE) paradigm, which constructs temporal event chains to implicitly steer the MLLM to focus on the visual content and the logical connections between videos and future events, incentivizing the model's reasoning capability with multiple training protocols. Experimental results on public benchmarks demonstrate that our method outperforms both leading open-source and commercial MLLMs, establishing a new state-of-the-art on the VEP task. Code and models will be released soon.
Paperid: 3036,   Poster  
Authors: Wentao Yang, FanZhen KONG, Zejian Kang, Xiangru Huang
Title: SparseOIT: Improving Order-Independent Transparency 3DGS via Active Set Method
Abstract: 3D Gaussian Splatting (3DGS) has received tremendous popularity over the past few years due to its photorealistic visual appearance. However, 3DGS uses volumetric rendering that is not suitable for objects with non-Lambertian or transparent materials. To remedy this issue, a family of Order-Independent Transparency (OIT) rendering methods proposes to remove or modify the depth sorting step in the 3DGS rendering equation. However, the potential of OIT-based methods is still underexplored. In this paper, we observe that the OIT modifications to the rendering equation significantly reduce the interdependence among individual Gaussian splats, resulting in very sparse variable dependencies that can be harnessed by specific optimization techniques such as the active set method. To this end, we propose SparseOIT, an OIT-based 3DGS reconstruction algorithm that maintains an active set of Gaussian splats and enjoys an acceleration ratio that is proportional to the potential sparsity. SparseOIT is designed by jointly considering the OIT rendering equation, the reconstruction algorithm and the geometric regularization. Through extensive experiments, we demonstrate that SparseOIT outperforms existing methods in the OIT family by a large margin and also achieves comparable performance to the state-of-the-art 3DGS reconstruction methods based on volumetric rendering.
Paperid: 3037,   Poster  
Authors: Jiaqi Yang, Wenting Chen, Xiangjian He, Yuanbai Li, Sen Yang, Linlin Shen, Xiaohan Xing
Title: H2-Surv: Hierarchical Hyperbolic Multimodal Representation Learning for Survival Prediction
Abstract: Cancer survival prediction through multimodal learning that combines histopathology images with genomic data represents a promising research direction. However, current approaches still suffer from two key limitations. First, most methods operate in a Euclidean feature space, which makes it difficult to capture the intrinsic hierarchies in histopathology, where information is organized from patches to whole-slide images to patients, and in genomics, where it progresses from genes to pathways to patients. Second, they typically discretize survival times into coarse risk intervals, neglecting fine-grained ordinal relationships among samples within the same interval and thus failing to capture the continuous ranking characteristics of survival outcomes. To address these issues, we propose H2-SurvNet, a hyperbolic hierarchical multimodal learning framework for survival prediction. H2-SurvNet first employs a hyperbolic hierarchical information modeling (H2IM) module that maps multimodal features into a shared hyperbolic space and explicitly encodes intra-modal and inter-modal hierarchies across patches, WSIs, patients, genes, and pathways. On top of this representation, we design a Temporal Ordinal Contrastive learning (TOCL) module that models the temporal progression of survival outcomes by enforcing ordinal risk ordering through contrastive objectives, thereby promoting continuity in the learned risk scores. Extensive experiments on heterogeneous cohorts from TCGA, CPTAC, and NLST demonstrate that H2-SurvNet consistently outperforms state-of-the-art multimodal survival prediction methods and exhibits strong robustness and generalization across diverse data distributions. Source code will be released upon acceptance.
Paperid: 3038,   Poster  
Authors: Kaizhao Zhang, Tian Niu, Tianyu Liu, Chenen Guo, Zijun Xu, Qingda Hu, Wenchao Ding
Title: DiffuView: Multi-View Diffusion Pretraining for 3D Aware Robotic Manipulation
Abstract: Robotic manipulation from visual observations remains challenging due to the lack of 3D-consistent representations that can generalize across diverse viewpoints and sensor configurations. Existing approaches often rely on masked autoencoders or neural scene representations, which fail to capture cross-view correspondences. Meanwhile, multi-view diffusion models have recently shown tremendous success in 3D aware generative synthesis, and their powerful representations offer a promising direction for achieving viewpoint-robust visuomotor control. In this paper, we introduce DiffuView, a novel framework that learns unified 3D aware representations through multi-view diffusion pretraining and deploys them for imitation learning. Specifically, DiffuView models the conditional generation of target views given source observations within a diffusion framework, enabling the network to implicitly recover scene geometry and enforce view consistency. The pretrained diffusion network is then utilized as a powerful visual backbone for an action policy, allowing robust control under varying viewpoints and visual conditions. We evaluate DiffuView on two challenging benchmarks, MetaWorld and Libero. Extensive experiments in both simulation and real-world scenarios demonstrate that DiffuView achieves superior generalization, improving success rates under viewpoint shifts by nearly 20% compared with existing methods.
Paperid: 3039,   Poster  
Authors: Chenhui Gou, Ziyu Ma, Zicheng Duan, Haoyu He, Feng Chen, Liyang Liu, Bohan Zhuang, Jianfei Cai, Hamid Rezatofighi
Title: An Empirical Study on How Video-LLMs Answer Video Questions
Abstract: Taking advantage of large-scale data and pretrained language models, Video Large Language Models (Video-LLMs) have shown strong capabilities in answering video questions. However, most existing efforts focus on improving performance, with limited attention to understanding their internal mechanisms. This paper aims to bridge this gap through a systematic empirical study. To interpret existing Video-LLMs, we adopt attention knockouts as our primary analytical tool and design three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. Then, we apply these three knockouts on different numbers of layers (window of layers). By carefully controlling the window of layers and types of knockouts, we provide two settings: a global setting and a fine-grained setting. Our study reveals three key findings: (1) The global setting indicates that video information extraction primarily occurs in early layers, forming a clear two-stage process: lower layers focus on perceptual encoding, while higher layers handle abstract reasoning; (2) In the fine-grained setting, certain intermediate layers exert an outsized impact on video question answering, acting as critical outliers, whereas most other layers contribute minimally; (3) In both settings, we observe that spatial-temporal modeling relies more on language-guided retrieval than on intra- and inter-frame self-attention among video tokens, despite the latter’s high computational cost. Finally, we demonstrate that these insights can be leveraged to reduce attention computation in Video-LLMs. To our knowledge, this is the first work to systematically uncover how Video-LLMs internally process and understand video content, offering interpretability and efficiency perspectives for future research.
Paperid: 3040,   Poster  
Authors: Zhengdong Hu, Chao Wang, Fengyun Rao, Jing LYU, Hehe Fan, Yi Yang
Title: PointThinker: Point-Incentivized Parallel Thinking for Multimodal Large Language Model
Abstract: This paper explores parallel thinking for Multimodal Large Language Models (MLLMs), aiming to improve Chain-of-Thought (CoT) through multiple diverse reasoning paths. We guide the model to list multiple visual key points and develop an independent reasoning path for each. We therefore term this method PointThinker, since each thinking path starts with a point. PointThinker offers two key advantages. (1) It amplifies the benefits of parallel thinking. While parallel thinking naturally benefits from multiple reasoning paths, explicitly listing key points further amplifies these benefits by eliminating redundancy and promoting path diversity, enabling the model to explore problems from more varied perspectives. (2) It uses a novel dense (point-wise) reward for reinforcement learning. We observe that during parallel thinking, some points are helpful while others are invalid, yet popular methods assign them the same rewards. Therefore, we propose allocating differentiated rewards to different points within the same chain-of-thought. This is implemented via a self-verification mechanism called Group Points Policy Optimization (GPPO), which combines rollout-level and point-level validation for reward assignment. On challenging benchmarks such as HallusionBench, PointThinker achieves 58.7% accuracy, improving reasoning quality and answer accuracy. Experimental results demonstrate that parallel thinking with points improves performance, and GPPO further contributes non-trivial gains.
Paperid: 3041,   Poster  
Authors: Jian Zhang, Shijie Zhou, Bangya LIU, Achuta Kadambi, Zhiwen Fan
Title: SpatialStack: Layered Geometry-Semantic Fusion for 3D VLM Spatial Reasoning
Abstract: Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.
Paperid: 3042,   Poster  
Authors: Qian Jiayu, Zongxian Yang, Guanxing Chen, Pengwei Hu, KC Tan, Yan Wang, Yu-An Huang, Zhi-An Huang
Title: Hyperbolic Relational Prompts for Intersectional Fairness in Medical VLMs
Abstract: Ensuring fairness in medical vision-language models (VLMs) is essential for equitable healthcare, yet existing models amplify biases across demographic subgroups such as race and gender. Traditional fairness mitigation approaches, which rely on broad distribution alignment, fall short in addressing these nuanced intersectional disparities. We propose fairness-aware relational prompting (FRP), a novel framework that reformulates prompt generation as a dynamic, fairness-aware reasoning process. FRP constructs a relational graph to capture fine-grained, sample-level similarities and employs a hyperbolic graph layer to explicitly model the hierarchical structure of intersectional identities. Leveraging hyperbolic geometry enables reasoning over complex attribute combinations, effectively reducing entrenched biases. Evaluations on the FairVLMed and Harvard-GF datasets demonstrate that FRP achieves state-of-the-art diagnostic performance, with an area under the curve of 77.50% and 85.94%, respectively, while substantially improving the demographic parity difference and equalized odds difference. This work moves beyond post-hoc bias correction toward inherently fair VLM architectures, offering a scalable solution for high-stakes medical applications.
Paperid: 3043,   Poster  
Authors: Huidong Ma, Xinyan Shi, Sun Hui, Xiaofei Yue, xiaoguang Liu, Gang Wang, Wentong Cai
Title: Learned Image Compression via Sparse Attention and Adaptive Frequency
Abstract: Learned image compression (LIC) methods surpass traditional algorithms in rate-distortion (RD) performance, but still struggle to optimally balance effectiveness and efficiency. Moreover, many methods overlook the importance of frequency-domain information. Even the few recent methods that incorporate fixed frequency transforms lack content-adaptive capabilities. Therefore, we propose an efficient spatial-frequency dual-path LIC method. Specifically, for the spatial path, we introduce Cross-Sparse Window Attention, leveraging sparse, window-conditioned global tokens to efficiently model long-range dependencies. It achieves lower computational cost and higher effectiveness than standard Window-based Multi-head Self-attention. For the frequency path, we design a content-adaptive frequency transform, employing a decomposition weight generator and learnable global weights to adaptively process multi-scale frequency components. Furthermore, we propose Denoising-as-Regularizer, a training-only module that structures and smooths the latent representation via a denoising task, enhancing reconstruction quality at zero inference cost. Experiments on the Kodak, CLIC, and Tecnick datasets demonstrate that the proposed method significantly outperforms existing state-of-the-art methods in both RD performance and latency.
Paperid: 3044,   Poster  
Authors: Jiawen Li, Fei Jiang, Dandan Zhu, Aimin Zhou
Title: HamiPose: Hamiltonian Optimization for Unsupervised Domain Adaptive Pose Estimation
Abstract: Unsupervised domain adaptation (UDA) for pose estimation promises transfer from synthetic to real domains but often suffers instability under domain shift. Prior work attributes this deterioration to gradient interference between source supervision and target consistency. This conflict is distinct in pose estimation, where sparse and heterogeneous supervision signals cause gradients to be highly sensitive to small localization errors and lead to unstable updates. To address these challenges, we propose HamiPose, a Hamiltonian optimization framework that transports decoupled and confidence-calibrated gradients within a unified geometry to mitigate instability. HamiPose first refines gradient interaction through keypoint-wise geometry decomposition, orthogonally projecting target gradients to preserve the non-conflicting component. Channel-wise gated alignment then calibrates the parallel component with confidence and alignment, producing decoupled, confidence-calibrated gradients. These gradients are advanced by a Hamiltonian optimizer with a symplectic integrator, providing controlled momentum that stabilizes updates. Extensive experiments demonstrate that HamiPose achieves state-of-the-art performance in UDA pose estimation while maintaining strong performance under domain generalization settings.
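The orthogonal projection step mentioned in the abstract above can be illustrated with a generic gradient-surgery sketch. This is not the HamiPose implementation: the function names (`split_gradient`, `project_if_conflicting`) and the rule of projecting only when the dot product is negative are assumptions chosen for illustration, and gradients are plain Python lists rather than per-keypoint tensors.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def split_gradient(target_grad, source_grad, eps=1e-12):
    """Split target_grad into parts parallel and orthogonal to source_grad."""
    scale = dot(target_grad, source_grad) / (dot(source_grad, source_grad) + eps)
    parallel = [scale * s for s in source_grad]
    orthogonal = [t - p for t, p in zip(target_grad, parallel)]
    return parallel, orthogonal


def project_if_conflicting(target_grad, source_grad):
    """Assumed rule: drop the parallel part only when the gradients conflict."""
    if dot(target_grad, source_grad) >= 0:
        return list(target_grad)
    _, orthogonal = split_gradient(target_grad, source_grad)
    return orthogonal


source = [1.0, 0.0]
target = [-1.0, 2.0]  # points against the source gradient along the first axis
projected = project_if_conflicting(target, source)
```

After projection, the target gradient no longer opposes the source gradient, while its component along the second axis is kept.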
Paperid: 3045,   Poster  
Authors: Yuru Wang, Yue Zhou
Title: Breaking Spurious Correlations: Uncertainty-Driven Causal Transformers for AU Detection
Abstract: Facial Action Unit (AU) detection suffers from limited annotated data, severe class imbalance, and label noise, which often result in overfitting and degraded performance. We propose a novel framework that synergizes Uncertainty-aware Transformers with Causal Intervention to address these challenges. By modeling attention weights as Gaussian distributions, our probabilistic Transformer captures inter-AU dependencies and epistemic uncertainty. An uncertainty-guided loss weighting strategy further mitigates data imbalance by adaptively emphasizing reliable predictions. Moreover, a causal intervention module is introduced to eliminate spurious correlations caused by confounders, ensuring that the learned AU relationships reflect true causality. Extensive experiments on BP4D and DISFA demonstrate that our framework achieves state-of-the-art performance with superior robustness and generalization.
Paperid: 3046,   Poster  
Authors: Shibin Mei, Hang Wang, Bingbing Ni
Title: Semantic Derivative Flow: Graph-Guided Diffusion for Controllable Instance Interactions
Abstract: Despite remarkable progress in text-to-image diffusion models, controlling the semantic and spatial relationships between interacting instances remains a fundamental challenge. Current methods that inject spatial constraints often fail to model the intrinsic functional dependencies between entities, leading to implausible interactions. In this paper, we introduce Semantic Derivative Flow (SDF), a novel graph-guided framework that structures the diffusion process within a directed acyclic interaction graph. Our core innovation is a theoretically motivated derivative attention mechanism, which explicitly enforces the semantic representation of a predicate to be derived from its subject, and the object from the predicate, formalizing a differentiable semantic graph. This principled approach compels the generative process to adhere to the logical chain of interaction. We further integrate a global context node and a real-time regional refinement module to ground the graph in the visual domain holistically. Extensive experiments demonstrate that our model, an instantiation of SDF, establishes a new state-of-the-art in fidelity and controllability on the HICO-DET benchmark. We complement our empirical results with a theoretical analysis, framing our method as structured message passing on interaction graphs, which provides a rigorous justification for its efficacy and generalization benefits.
Paperid: 3047,   Poster  
Authors: Xinguo He, Yixin Shen, Rahul Chaudhari
Title: TokenHand: Discrete Token Representation for Efficient Hand Mesh Reconstruction
Abstract: Hand mesh reconstruction has attracted growing attention in recent years. Despite significant progress, existing methods often struggle to balance reconstruction quality and inference efficiency. In this work, we propose TokenHand, a novel framework for single-view 3D hand mesh reconstruction that achieves both high accuracy and real-time inference. Our method represents a 3D hand model using M discrete tokens, each describing a specific sub-structure of the hand. This compositional representation enables efficient modeling with minimal reconstruction error. Furthermore, we reformulate hand mesh reconstruction as a classification problem rather than a regression task. Specifically, a classifier predicts the categories of the M tokens from an input image, and a pre-trained decoder network subsequently reconstructs the 3D hand mesh from the predicted tokens without any post-processing. Extensive experiments demonstrate that TokenHand achieves comparable or superior performance to existing methods across standard benchmarks, while maintaining high efficiency in practical scenarios.
Paperid: 3048,   Poster  
Authors: xulun ye, Benyu Wu, Jie Hong, Kun Zhou
Title: Nonparametric Deep Fine-grained Clustering with Low-Rank Guided Vision-Language Model
Abstract: The scarcity of labeled fine-grained data presents a significant challenge for deep clustering. Vision-Language Models (VLMs) trained on existing coarse-grained datasets (characterized by high inter-class and low intra-class variance) struggle to capture the subtle distinctions essential for fine-grained categorization, leading to suboptimal clustering performance. To address this, we propose a novel framework that adapts VLMs for fine-grained clustering without requiring fine-grained labels. Our method steers the model to focus on discriminative fine-grained features by integrating a Bayesian nonparametric process with a tailored representation learning objective, which includes low-rank guidance and orthogonal guidance. This allows our model to dynamically discover clusters that reflect fine-grained categories. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple fine-grained benchmarks.
Paperid: 3049,   Poster  
Authors: Mingbo Hong, Feng Liu, Caroline Gevaert, George Vosselman, Hao Cheng
Title: Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization
Abstract: Detectors often suffer from degraded performance, primarily due to the distributional gap between the source and target domains. This issue is especially evident in single-source domains with limited data, as models tend to rely on confounders (e.g., illumination, co-occurrence, and style) from the source domain, leading to spurious correlations that hinder generalization. To this end, this paper proposes a novel Basis-driven framework for domain generalization, namely Bridge, that incorporates causal inference into object detection. By learning the low-rank bases for front-door adjustment, Bridge blocks confounders' effects to mitigate spurious correlations, while simultaneously refining representations by filtering redundant and task-irrelevant components. Bridge can be seamlessly integrated with both discriminative (e.g., DINOv2/3, SAM) and generative (e.g., Stable Diffusion) Vision Foundation Models (VFMs). Extensive experiments across multiple domain generalization object detection datasets, i.e., Cross-Camera, Adverse Weather, Real-to-Artistic, Diverse Weather Datasets, and Diverse Weather DroneVehicle (our newly augmented real-world UAV-based benchmark), underscore the superiority of our proposed method over previous state-of-the-art approaches. Code, models, and data will be publicly released.
Paperid: 3050,   Poster  
Authors: Fengyuan Zuo, Haiyan Jin, Yuanlin Zhang, Zhaolin Xiao, Bin Wang, MU YUERONG
Title: FlowFM: Advancing Dark Optical Flow Estimation with Flow Matching
Abstract: Dark optical flow estimation (DOFE) faces critical challenges: discriminative models are less robust to noise and struggle with weakened motion patterns, while diffusion models suffer from discontinuous flow fields and low efficiency. Flow matching (FM), though efficient, remains underexplored for conditional generation in DOFE. In this paper, we propose FlowFM, the first flow matching model tailored to DOFE tasks. Instead of conventional vector field regression, FlowFM estimates the global transformation path constrained by the ground truth optical flow. It generates noisy flow by mixing Gaussian noise with ground truth, then performs a one-step denoising process conditioned on the initial flow field, cost volume, and contextual features for optimal accuracy and efficiency. FlowFM incorporates an implicit Fourier denoising decoder (IFDD) for reliable motion understanding. By leveraging the Fourier transform, IFDD uses amplitude to characterize motion intensity and phase to encode target spatial relationships within flow fields, then directly enhances amplitude to restore motion information lost in dark conditions. Experiments show that FlowFM significantly outperforms state-of-the-art methods on the FCDN and VBOF benchmarks, setting a new performance record for DOFE.
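The noise-mixing and one-step denoising steps described in the abstract above can be made concrete with the standard linear-path flow matching construction. This sketch is not FlowFM itself, which the abstract says departs from conventional vector field regression. The linear interpolation schedule, the oracle velocity (a trained network would predict it from the conditioning inputs), and all names here are assumptions.

```python
import random


def mix(noise, target, t):
    """Linear path: x_t = (1 - t) * noise + t * target, with t in [0, 1]."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, target)]


def target_velocity(noise, target):
    """Velocity of the linear path, constant in t: target - noise."""
    return [x - n for n, x in zip(noise, target)]


def one_step(x_t, velocity, t):
    """Jump from time t to time 1 in a single Euler step."""
    return [x + (1.0 - t) * v for x, v in zip(x_t, velocity)]


rng = random.Random(0)
gt_flow = [1.5, -0.5, 0.25]  # toy stand-in for ground-truth flow values
noise = [rng.gauss(0.0, 1.0) for _ in gt_flow]
t = 0.3
x_t = mix(noise, gt_flow, t)
# With the exact velocity, one step recovers the ground truth.
recovered = one_step(x_t, target_velocity(noise, gt_flow), t)
```

Because the path is linear, a single step with an accurate velocity estimate reaches the target, which is why one-step denoising is plausible in this setting.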
Paperid: 3051,   Poster  
Authors: Xin Duan, Xiabi Liu, Liyuan Pan
Title: GeoFree-CoSeg: Unsupervised Point Cloud-Image Cross-Modal Co-Segmentation Without Geometric Alignment
Abstract: Co-segmentation aims to identify and segment common objects across a set of point clouds or images. Existing methods focus on single-modal co-segmentation. However, the limited semantics of a single modality restrict the discovery of common objects, leading to a reliance on costly and labor-intensive segmentation masks. In contrast, cross-modal co-segmentation leverages both modalities, offering two key advantages: (i) additional semantic cues compensate for the absence of segmentation masks; and (ii) complementary modalities provide richer common semantics beyond the limitations of single-modality approaches. Motivated by these observations, we introduce a novel task: unsupervised point cloud-image cross-modal co-segmentation. We tackle this problem using a coarse-to-fine approach. First, the 3D and 2D branches extract coarse common semantics from each modality, respectively. Then, a cross-modal common semantic graph purifies these features into fine-grained common semantics. Finally, 3D and 2D common semantic features are fused and mutually enhanced, without requiring geometric alignment. Experiments on two standard point cloud benchmarks and two corresponding image co-segmentation datasets demonstrate our superior performance compared to existing unsupervised state-of-the-art methods.
Paperid: 3052,   Poster  
Authors: Shuo Han, Xu Tang, Jingjing Ma, Xiangrong Zhang
Title: Vision-Language Model Guided Source-Free Domain Adaptation via Optimal Transport
Abstract: Unsupervised domain adaptation transfers knowledge from a labeled source domain to an unlabeled target domain. When source data cannot be accessed, source-free domain adaptation (SFDA) becomes a practical alternative. However, existing SFDA methods mainly rely on pseudo-label based self-training, which often accumulates noise and bias under large domain gaps. We propose VSFOT, a framework that leverages a pretrained Vision-Language Model (VLM) to guide optimal transport (OT) alignment between target features and source prototypes. Instead of relying on unreliable pseudo-labels, VSFOT employs VLM-derived semantic priors and an OT-based matching strategy to achieve stable and reliable adaptation. To further enhance domain alignment, VSFOT incorporates a bidirectional distillation mechanism in which the model learns semantic consistency from the VLM, while the VLM is refined using task-specific cues from the model. These two stages alternate during training. By combining the generalization ability of the VLM with the discriminative power of the task model, VSFOT achieves robust, source-free adaptation and consistently outperforms existing SFDA methods on four benchmark datasets.
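As background for the OT alignment between target features and source prototypes mentioned above, the following is a minimal sketch of entropy-regularized optimal transport solved with Sinkhorn iterations. It is a generic illustration, not the VSFOT method: the uniform marginals, the squared Euclidean cost, the regularization value, and the toy data are all assumptions, and no VLM-derived prior is modeled.

```python
import math


def sinkhorn(cost, reg=0.1, n_iters=200):
    """Return a transport plan with (assumed) uniform row and column marginals."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    a = [1.0 / n] * n  # mass per target feature
    b = [1.0 / m] * m  # mass per prototype
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


# Toy data: four target features and two class prototypes (hypothetical values).
features = [[0.0, 0.1], [0.9, 1.0], [0.1, 0.0], [1.0, 0.9]]
prototypes = [[0.0, 0.0], [1.0, 1.0]]
cost = [[sq_dist(f, p) for p in prototypes] for f in features]
plan = sinkhorn(cost)
# Assign each feature to the prototype that receives most of its mass.
assignments = [max(range(len(row)), key=row.__getitem__) for row in plan]
```

Unlike independent nearest-prototype pseudo-labeling, the column marginal constrains how much total mass each prototype may receive, which discourages all samples from collapsing onto one class.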
Paperid: 3053,   Poster  
Authors: Fei Zhou, Xiwen Zhang, Qingqing Qiu, Lei Zhang, Wei Wei, Chen Ding, Yi Zhang, Liang Li, Xiangyu Yue, Yanning Zhang
Title: Language Does Matter for Cross-Domain Few-Shot Visual Feature Enhancement
Abstract: Cross-domain few-shot image interpretation (CD-FSII) has been significantly advanced by fine-tuning pre-trained visual feature models using limited labeled samples in target domains. However, profound cross-domain distribution discrepancies, along with inherent conflicts between extensive variations in object visual appearance and limited annotations, trap existing purely visual feature representations in non-transferable shortcut patterns, thus degrading their cross-domain generalization capacity. To mitigate this problem, we present a simple yet effective cross-modal visual feature enhancement framework that contributes in the following three aspects. 1) We make the first attempt to introduce linguistic descriptions of image attributes to regulate the pre-trained visual feature model for specific target image adaptation. Specifically, image-level attributes (e.g., object appearance in individual images) and domain-level attributes (e.g., overall style and background characteristics of the dataset) are extracted using a pre-trained image captioning model and a large language model (LLM), respectively, to construct comprehensive linguistic characterizations. 2) A lightweight residual cross-attention scheme is developed to seamlessly embed linguistic descriptions of image attributes into visual feature representations, thereby compensating for the limitations of purely visual cues in capturing cross-domain transferable high-level semantic characteristics. 3) The proposed framework is task-agnostic and can be seamlessly integrated with off-the-shelf pre-trained visual feature models. It demonstrates superior generalization performance compared to several state-of-the-art methods across multiple CD-FSII benchmarks, including image classification, semantic segmentation, and object detection. We will release all code and data to facilitate further research.
Paperid: 3054,   Poster  
Authors: Tianle Shen, Fang Yan, Xiaofan Zhang
Title: SAR2Net: Learning Spatially Anchored Representations for Retrieval-Guided Cross-Stain Alignment
Abstract: Achieving spatial alignment between whole-slide images (WSIs) across stains remains highly challenging due to extreme resolution, tissue fragmentation, and large nonlinear deformations. Conventional registration pipelines depend on global pre-alignment and spatial consistency, which often collapse under such distortions. We present SAR2Net, a framework that learns spatially anchored representations and reformulates cross-stain alignment as a region-level feature retrieval problem. Instead of estimating explicit transformations, SAR2Net learns pointwise representations encoding the relative spatial relationships to tissue landmarks. Given landmarks and arbitrary coordinates, it predicts spatially anchored features that serve as deformation-invariant descriptors of tissue topology. A multi-stage retrieval framework then establishes correspondences between slides, even when global alignment is infeasible. Experiments on biopsy-oriented HE-IHC datasets show that SAR2Net achieves robust region-level alignment under severe tissue distortions, outperforming previous registration methods.
Paperid: 3055,   Poster  
Authors: Jingjie Shang, Tengyu Ma, Heng Zhang, Jinyuan Liu, Risheng Liu, Yuan Wang, Xiaochen Bo
Title: Human-Centric Multi-Exposure Fusion: Benchmark and Bi-level Cognition Distillation Framework
Abstract: Multi-Exposure Fusion (MEF) seeks to generate a single high-quality image from multiple inputs captured at different exposure levels. Despite substantial progress, most existing approaches depend on statistical metrics that poorly reflect human perceptual preferences. Electroencephalography (EEG) provides a direct physiological window into human cognition, yet its use in low-level vision remains limited due to scarce paired data and the absence of bio-signals during inference. We address these challenges through two key contributions. First, we introduce Cog-Expo, the first dataset capturing human cognitive responses to multi-exposure stimuli, establishing a bridge between neuroscience and computational photography. Second, we propose a bi-level coupled learning framework that leverages this cognitive information without requiring it during inference. A Mental Integrated Transformer serves as the Teacher, incorporating cognitive priors to guide visual feature learning, while a lightweight Student is trained to approximate these cues using only image inputs. Through bi-level optimization, the Teacher learns inherently distillable representations, enabling the Student to emulate cognitive guidance efficiently. Extensive experiments confirm that our method achieves state-of-the-art fusion performance and aligns more closely with human perception.
Paperid: 3056,   Poster  
Authors: Haokun GUI, Senqiao Yang, Mingkang Zhu, Meng Chu, WU Sitong, Changsheng Lu, Zihao Wang, Zhuotao Tian, Jiaya Jia
Title: VisionLeaf: Entropy-Guided Leaf-First Reasoning for Efficient and Accurate Think-with-Image
Abstract: The "think-with-image" paradigm has recently gained traction for complex visual reasoning tasks. However, existing approaches often struggle with inference inefficiency due to a fixed number of redundant reasoning steps, as well as training instability. This challenge primarily arises from the direct use of standard reinforcement learning policies, which do not incorporate improvements for the think-with-image multi-turn conversational scenario. To address this challenge, we propose VisionLeaf, an entropy-guided, tree-based reasoning framework. Unlike conventional GRPO, where all nodes expand from the root and each leaf has only a single branch, our method grows the reasoning tree from the leaf nodes and selects the most valuable nodes based on entropy for thorough rollout exploration. This leaf-first expansion naturally aligns with the hierarchical nature of multi-step image analysis. Without modifying any model or training data, our VisionLeaf achieves a 4.2% performance improvement on benchmarks such as VSTAR and HRBench, while reducing the number of inference rounds by nearly half, demonstrating significant gains in both accuracy and speed. All our code will be released.
Paperid: 3057,   Poster  
Authors: Qitong Yang, Mingtao Feng, Zijie Wu, Huixin Zhu, Weisheng Dong, Yaonan Wang, Ajmal Mian
Title: Learning Hierarchical Hyperbolic Mixture Model for Part-aware 3D Generation
Abstract: 3D shape generation has become increasingly important for graphics and vision applications. Current part-aware 3D generation usually overlooks hierarchical part relations or inefficiently encodes multi-level semantics in Euclidean space. We therefore propose a novel framework for hierarchical and efficient part-aware 3D generation in hyperbolic space. Our contributions are three-fold: (1) Hierarchical Hyperbolic Mixture Model (H^2MM): We propose a part-aware semantic representation of objects within a hyperbolic manifold, providing a high-fidelity hierarchical part-aware representation of object details and semantics. (2) Hyperbolic Semantically Consistent Diffusion Model: We design a geodesic diffusion process that preserves the hierarchical and semantic structure of H^2MM, progressively generates semantics from conditions, and generates objects under their joint guidance. We use an adaptive tree-structured neural network to loosen the constraint of jointly generating nodes and edges in previous hyperbolic diffusion. (3) Hyperbolic Diffusion Model Solver: We leverage higher-order Riemannian gradients on hyperbolic manifolds to design a fast, dedicated high-order solver for diffusion ODEs with a convergence-order guarantee. Extensive experiments demonstrate that our method achieves superior quality and efficiency. Code will be public.
Paperid: 3058,   Poster  
Authors: YUANSHEN GUAN, Ruikang Xu, Chang Chen, Yinuo Liao, Dehua Song, Fenglong Song, Zhiwei Xiong
Title: FastGaMer: Efficient GainMap Learning for Practical Inverse Tone Mapping
Abstract: Inverse tone mapping (ITM) becomes significantly harder when the SDR input is produced by local tone mapping, which jointly applies global radiometric compression and spatially varying adaptations that distort dynamic range, contrast, and channel-wise color ratios. Existing ITM methods ignore this degradation structure and either regress HDR values directly or rely on a single-channel gain map, which scales luminance only and cannot restore the compressed dynamic range and wide color gamut. We introduce FastGaMer, a structured and resolution-agnostic ITM framework that explicitly mirrors this degradation process. Instead of regressing HDR values, we reconstruct a color gain map, which preserves per-channel amplification, simplifies learning, and enables proper gamut extension. Local and global degradations are inverted separately using dynamic bilateral grids and learnable 3D LUTs, followed by a lightweight neural modulator for global refinement and coherence. All high-resolution operations are network-free, yielding exceptional efficiency. To support color-GM supervision under realistic local TMO degradations, we create a dataset of over 8,000 4K SDR–GM pairs with an additional real-captured test set. FastGaMer outperforms prior lightweight ITM methods by +1.4 dB PQ-PSNR, reduces runtime by 70%, and processes 4K images in only 6.2 ms, achieving both high accuracy and real-time performance.
Paperid: 3059,   Poster  
Authors: Guanjie Wang, Chen Chen
Title: GR-Gauge: Cost-efficient Training Configuration By Gauging the Gradient Redundancy
Abstract: The recent success of artificial intelligence motivates many non-professional users to train their own models. Those users often resort to cloud training services, seeking to obtain a sufficiently accurate model at a modest cost, for which properly setting the learning rate and batch size is crucial. While various Hyper-parameter Optimization (HPO) methods have been proposed in that regard, they largely act on heavy-weight validation signals, which makes them inefficient in overall cost. We find that the model training process can be viewed as a two-dimensional voting process, with gradients from different iterations and from different samples; moreover, attaining cost-efficient training amounts to ensuring that the gradient redundancy stays within a proper range, which is similar across diverse models. We further introduce GR-Gauge, a general method that gauges the gradient redundancy to guide HPO decisions such as configuration searching and trial termination. Extensive experiments demonstrate that GR-Gauge can help attain near-optimal accuracy in much less time than existing methods.
Paperid: 3060,   Poster  
Authors: Derong Jin, Xiyi Chen, Ming Lin, Ruohan Gao
Title: SonoWorld: From One Image to a 3D Audio-Visual Scene
Abstract: Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation.
Paperid: 3061,   Poster  
Authors: yongjian liao, Xu Zou, Wenjun Chen, Huixuan Li, Xiaoen Xie, Chunxi Li, Shixiang Huang, Gang Zhang, Jiahuan Zhou, Sheng Zhong, Luxin Yan
Title: MSCD-GS: Motion-Separated Cooperative Deblurring Dynamic Reconstruction via Gaussian Splatting
Abstract: Although 4D reconstruction based on Gaussian Splatting has achieved many impressive results, reconstructing real-world images captured by a casual monocular camera remains a significant challenge. In dynamic scenes, because the camera and objects move during the exposure time, the input images inevitably contain a considerable amount of motion blur, which severely compromises the quality of reconstruction and novel view synthesis. Existing deblurring 3D Gaussian models still cannot handle motion blur in real dynamic scenes. To address these challenges, we propose MSCD-GS, a novel method for motion-separated collaborative deblurring 4D reconstruction via Gaussian Splatting, capable of effectively handling motion-blurred inputs. Specifically, because static and dynamic Gaussians have distinct motion characteristics, we perform separate motion modeling to achieve dynamic scene reconstruction. To predict Gaussian changes during the exposure time, we design motion-aware networks for static and dynamic Gaussians, thereby synthesizing virtual blurred images. Finally, we use the results from the deblurring network and the synthesized images to supervise 4D reconstruction collaboratively. Extensive experiments demonstrate that MSCD-GS can effectively reconstruct high-quality dynamic scenes from blurred image inputs, with performance surpassing existing methods.
Paperid: 3062,   Poster  
Authors: Wenbin Yin, Junkang Zhang, Sunzhe Yang, Faming Fang, Guixu Zhang
Title: LightRR: A Lightweight Network for Single Image Reflection Removal
Abstract: Single-image reflection removal (SIRR) is a highly ill-posed and computationally demanding problem. Existing CNN or Transformer-based methods often rely on large receptive fields and heavy computation, limiting their deployment on resource-constrained devices. To address this, we propose LightRR, a lightweight yet effective reflection removal network that unifies a wavelet-based mechanism and State Space Modeling (SSM). Specifically, we introduce an Asymmetric Frequency Mamba Block (AFM), which leverages the Discrete Wavelet Transform (DWT) to decompose features into low- and high-frequency components. This allows for targeted modeling of frequency-specific dependencies via Mamba-based state space dynamics. This design not only captures long-range context efficiently but also reduces spatial resolution and computation while preserving critical details. Furthermore, a knowledge distillation-enhanced encoder allows the network to inherit the representational power of large pre-trained models during training, enabling lightweight inference. Extensive experiments on multiple real-world benchmarks demonstrate that LightRR achieves performance comparable to state-of-the-art methods, while using only 3.01% of the parameters and 5.22% of the FLOPs (vs. RDNet), highlighting its superior balance between accuracy and efficiency.
Paperid: 3063,   Poster  
Authors: Ran Zuo, Haoxiang Hu, Chenxi Pei, Yanxuan Liu, Wenwen Qiang, Fang Liu, Xiaoming Deng, Cuixia Ma, Yong-Jin Liu
Title: SketchRevive: Fine-Grained Pixel-to-Vector Sketch Completion with Diffusion-Prior–Guided Multimodal LLMs
Abstract: Transforming sparse, partial pixel sketches from diverse media into complete, editable vector drawings is essential yet underexplored in digital creation. Prior methods either generate from scratch or inpaint local gaps without predicting global structure, leading to coarse contours and limited detail. To address this, we introduce SketchRevive, a two-stage framework for fine-grained pixel-to-vector sketch completion that couples diffusion-based pixel completion with MLLM-driven refinement and vectorization to produce coherent, detail-faithful SVG results. Specifically, we first construct a practical benchmark by augmenting stroke-annotated sketches from paper and whiteboards. Stage I trains a diffusion model with a line-distribution head to predict per-pixel stroke presence, producing completions that are consistent in structure and appearance. Stage II finetunes an MLLM for structure-aware SVG vectorization with iterative refinement, optimized by instance-level stroke attribute similarities. To align key cues (e.g., spatial structure and appearance details) across both stages, we introduce a diffusion-prior aggregated encoding module that injects multi-scale UNet features from Stage I into the MLLM’s visual embeddings and uses line prediction logits for token compression to prioritize informative tokens. Experiments indicate that SketchRevive produces topology-coherent vector outputs with high fidelity and recognizability while preserving user intent, making it suitable for interactive creation and artistic design.
Paperid: 3064,   Poster  
Authors: Xiaole Zhao, Qingsong Pang, Xiaobo Zhang, Xun Xu, Xun Gong, Yan Yang, Tianrui Li
Title: Zero-Shot Image Denoising via Hybrid Prior-Guided Pseudo Sample Generation
Abstract: Zero-shot image denoising has gained prominence in recent years, as it inherently relies on the intrinsic priors of images rather than learning from external data. Nevertheless, most existing methods either fail to fully exploit global priors or do not properly preserve the fine-grained details governed by local priors. In this work, we propose a novel framework of pseudo sample generation for zero-shot denoising guided by local and global image priors. Specifically, we propose a well-crafted down-sampler based on gradient merging and grouping within a small window to generate down-sampled samples by exploiting spatial locality. Meanwhile, a global random sampler conditioned on a Gaussian distribution is devised to incorporate the nonlocal self-similarity of natural images. These two samplers build a new paradigm of pseudo sample generation powered by both local and global priors, which we term Zero-Shot Hybrid Prior-guided Denoising (ZS-HPD). Considering that noise is more likely to affect high-frequency details, we also present a simple yet effective loss that works in the Fourier domain and applies discriminative weights to distinct spectral bands. Numerous experiments on benchmark datasets demonstrate the superiority of our ZS-HPD over existing advanced methods.
Paperid: 3065,   Poster  
Authors: Xin Li, Shujun Tian, Tao Lu, Han Bao, Zonghui Wang, Wenzhi CHEN
Title: Otil: Accelerating Diffusion Model Inference via Communication-Efficient Multi-GPU Parallelism
Abstract: Diffusion models (DMs) have recently achieved remarkable success across diverse modalities, including high-fidelity image and video synthesis. However, their inherently step-sequential denoising process introduces substantial cumulative latency, which significantly degrades user experience. While existing multi-GPU parallelization methods can alleviate latency, they often incur prohibitive GPU–GPU communication overhead, offsetting much of the performance gain. We present Otil (Only Transmit Informative Latents), a communication-efficient parallel framework for accelerating diffusion inference. Our key insight is that latent activations change only marginally between consecutive denoising steps. Leveraging this property, Otil identifies and synchronizes only the most informative latent sub-blocks and introduces a dynamic polling mechanism that periodically revisits all spatial regions, ensuring complete coverage without unnecessary communication. The framework is fully plug-and-play and remains compatible with fast-sampling and architectural acceleration algorithms, without requiring any retraining or architectural modification. Otil reduces GPU–GPU communication by up to 87.5% compared with SOTA parallelism methods, achieving 1.8× speedup on two GPUs with Stable Diffusion v1.5 and 2.6× on four GPUs with Stable Diffusion XL. When combined with few-step samplers (30 steps) and LoRA models, the acceleration further increases to 2.46×–2.84× on 2 GPUs. These results demonstrate the strong potential of Otil for scalable and efficient multi-GPU diffusion inference while preserving generation fidelity.
Paperid: 3066,   Poster  
Authors: Wei-Jin Huang, Yue-Yi Zhang, Yi-Lin Wei, Zhi-Wei Xia, Juantao Tan, Yuan-Ming Li, Zhilin Zhao, Wei-Shi Zheng
Title: Beyond Mimicry: Learning Whole-Body Human-Humanoid Interaction from Human-Human Demonstrations
Abstract: Enabling humanoid robots to physically interact with humans is a critical frontier, but progress is hindered by the scarcity of high-quality Human-Humanoid Interaction (HHoI) data. While leveraging abundant Human-Human Interaction (HHI) data presents a scalable alternative, we first demonstrate that standard retargeting fails by breaking the essential contacts. We address this with PAIR (Physics-Aware Interaction Retargeting), a contact-centric, two-stage pipeline that preserves contact semantics across morphology differences to generate physically consistent HHoI data. This high-quality data, however, exposes a second failure: conventional imitation learning policies merely mimic trajectories and lack interactive understanding. We therefore introduce D-STAR (Decoupled Spatio-Temporal Action Reasoner), a hierarchical policy that disentangles when to act from where to act. In D-STAR, Phase Attention (when) and a Multi-Scale Spatial module (where) are fused by the diffusion head to produce synchronized whole-body behaviors beyond mimicry. By decoupling these reasoning streams, our model learns robust temporal phases without being distracted by spatial noise, leading to responsive, synchronized collaboration. We validate our framework through extensive and rigorous simulations, demonstrating significant performance gains over baseline approaches and a complete, effective pipeline for learning complex whole-body interactions from HHI data.
Paperid: 3067,   Poster  
Authors: Hakyeong Kim, Ruicheng Wang, Chengtang Yao, Jiaolong Yang, Min H. Kim
Title: Dense Metric Depth Completion from Sparse Direct Time-of-Flight Sensors
Abstract: Direct Time-of-Flight (dToF) sensors provide highly accurate metric depth and are more robust than indirect ToF systems in challenging real-world conditions. However, their high manufacturing cost and limited photodiode array size produce depth maps that are extremely sparse, low-resolution, and noisy, making them unsuitable for VR/XR, robotics, and 3D perception tasks that require dense metric depth. Existing monocular and depth completion methods struggle to handle the unique sampling patterns and hardware artifacts of dToF devices, and their performance often deteriorates significantly under severe sparsity or noise. We present a generalizable framework for dense metric depth completion from sparse dToF measurements, capable of operating across diverse sensor types, sparsity levels, and noise conditions. Our model employs a depth-guided dual-branch Vision Transformer encoder that processes RGB images and sparse dToF measurements separately, while a masked joint attention module allows depth tokens to reliably guide image features without being overwritten by them. A lightweight decoder reconstructs dense metric depth efficiently, without diffusion-based or refinement-heavy post-processing. To address the scarcity of paired training data, we introduce a comprehensive dToF simulation pipeline that reproduces the characteristics of flash, sub-VGA flash, and rotating sensors, including hardware-induced degradation, irregular sparsity, and realistic noise distributions. Trained entirely on synthetic data, our model achieves strong zero-shot generalization across 6 datasets and 2 real dToF devices, outperforming state-of-the-art approaches in both accuracy and computational efficiency. This establishes a robust and practical solution for dense metric depth completion from sparse direct ToF sensors.
Paperid: 3068,   Poster  
Authors: Zhi Tu, Liangkun Niu, Tianyi Zhang
Title: TrafficAlign: Aligning Large Language Models for Traffic Scenario Generation
Abstract: Recent research has investigated the use of large language models (LLMs) to generate traffic scenarios for autonomous driving. However, pretrained LLMs often fail to align with real-world traffic distributions. In this work, we present TrafficAlign, an automated framework that synthesizes traffic scenarios based on real-world driving videos, performs data validation, and aligns LLMs with the synthesized scenarios. The evaluation shows that traffic scenarios generated by TrafficAlign are highly effective, revealing up to 10.8% more collisions on average across three autonomous driving models than state-of-the-art methods. Furthermore, fine-tuning these driving models with TrafficAlign-generated scenarios reduces collision rates by 36.1% compared with the original models. A qualitative study using traffic datasets from six geographically diverse regions shows that TrafficAlign-generated scenarios exhibit strong alignment with the corresponding traffic distributions in these regions.
Paperid: 3069,   Poster  
Authors: Karlis Martins Briedis, Markus Gross, Christopher Schroers
Title: Efficient All-Pairs Correlation Volume Sampling for Optical Flow Estimation
Abstract: Recent optical flow estimation methods often employ local cost sampling from a dense all-pairs correlation volume. This results in quadratic computational and memory complexity in the number of pixels. Although an alternative memory-efficient implementation with on-demand cost computation exists, it is significantly slower in practice, and therefore many prior methods process images at downsampled resolutions, missing fine-grained details. To address this, we propose an algorithm for a both memory- and compute-efficient implementation of all-pairs correlation volume sampling that still matches the exact mathematical operator as defined by RAFT. Our approach outperforms on-demand sampling by up to 92% while maintaining equally low memory usage, and performs at least on par with the default implementation with up to 99% lower memory usage. As cost sampling makes up a significant portion of the overall runtime, this can translate to up to 63% savings for the total end-to-end model inference on high-resolution inputs. Our evaluation of existing methods includes an 8K ultra-high-resolution dataset and an inference-time extension of the SEA-RAFT method. With this, we achieve state-of-the-art results at high resolutions both in accuracy and runtime.
Paperid: 3070,   Poster  
Authors: Jingzhou Shen, Tianya Zhao, Xuyu Wang
Title: A Geometric Algebra-Informed 3DGS Framework for Wireless Channel Prediction
Abstract: In this paper, we introduce Geometric Algebra–Informed 3D Gaussian Splatting (GAI-GS), a framework for wireless modeling that couples 3D Gaussian splatting with a geometric-algebra–based attention mechanism to explicitly model ray–object interactions in complex propagation environments. GAI-GS encodes joint spatial–electromagnetic (EM) relations into token representations, enabling scene-level aggregation within a unified, end-to-end neural architecture. This design renders ray tracing for wireless propagation physically grounded, with token interactions that respect EM constraints including multipath, path-dependent attenuation, and reflection/diffraction. Through extensive evaluations on multiple real-world indoor datasets, GAI-GS consistently surpasses current baselines across various wireless tasks.
Paperid: 3071,   Poster  
Authors: Yue Ma, Frederick W. B. Li, Xiaohui Liang
Title: Gaussian-Mixture Latent Flow for Stochastic 3D Human Motion Prediction
Abstract: Stochastic human motion prediction aims to forecast future motion distributions. Although recent studies have achieved strong performance in terms of accuracy and diversity, they often overlook plausibility (e.g., resulting in physically unrealistic predictions) and uncertainty quantification, which is essential for real-world applications and downstream tasks. To address these issues, we propose a latent flow-based model equipped with a data-driven Gaussian mixture prior that more effectively disentangles diverse human behaviors than conventional single-modal priors. This prior is derived from patterns in the training data without requiring additional annotations. Furthermore, the fully invertible nature of our model enables natural uncertainty quantification through tractable likelihood computation. Experiments on the Human3.6M and AMASS datasets demonstrate that our approach achieves state-of-the-art performance in both accuracy and plausibility, while also providing reliable uncertainty estimates.
Paperid: 3072,   Poster  
Authors: Qinbo Zhang, Yanhang Shi, Ziyi Zhang, Hao Wang, Sai Qian Zhang, Jian Li
Title: Stealing Split Learning Bottom Models by Recovering Embedding Geometry
Abstract: Vertical federated learning (VFL) trains models by splitting computation across clients and a server that only exchange intermediate embeddings. Recent work shows that a server, even if honest-but-curious, can steal a client’s bottom model by querying the system and regressing on the returned embeddings; in response, defenses perturb or decouple the embedding channel. We show these defenses remain vulnerable. We propose VENOM, a geometry-aware stealing attack. VENOM first learns a contrastive space over server-observed embeddings, then builds a neighborhood graph and trains a surrogate bottom model to match targets and respect local geometry via a neighbor-matching loss alongside pointwise and feature-shape alignment. This strategy preserves the relational structure that defenses fail to erase, effectively recoupling the embeddings produced by multi-branch and noise-based defenses. Across six datasets, VENOM consistently outperforms standard stealing methods under no defense and multiple defenses, and remains effective with out-of-distribution (OOD) auxiliary data.
Paperid: 3073,   Poster  
Authors: Bin Cao, Sipeng Zheng, Hao Luo, Boyuan Li, Jing Liu, Zongqing Lu
Title: OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data
Abstract: Text-to-motion (T2M) generation aims to create realistic human movements from text descriptions, with promising applications in animation and robotics. Despite recent progress, current T2M models perform poorly on unseen text descriptions due to the small scale and limited diversity of existing motion datasets. To address this problem, we introduce OpenT2M, a million-level, high-quality, and open-source motion dataset containing over 2800 hours of human motion. Each sequence undergoes rigorous quality control through physical feasibility validation and multi-granularity filtering, with detailed second-wise text annotations. We also develop an automated pipeline for creating long-horizon sequences, enabling complex motion generation. Building upon OpenT2M, we introduce MonoFrill, a pretrained motion model that achieves compelling T2M results without complicated designs or technical tricks ("frills"). Its core component is 2D-PRQ, a novel motion tokenizer that captures spatiotemporal dependencies by dividing the human body into biological parts. Experiments show that OpenT2M significantly improves generalization of existing T2M models, while 2D-PRQ achieves superior reconstruction and strong zero-shot performance. We expect OpenT2M and MonoFrill to advance the T2M field by addressing longstanding data quality and benchmarking challenges.
Paperid: 3074,   Poster  
Authors: Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Ángel Bautista, David Berthelot, Joshua Susskind, Shuangfei Zhai
Title: STARFlow-V: End-to-End Video Generative Modeling with Autoregressive Normalizing Flows
Abstract: Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models.
Paperid: 3075,   Poster  
Authors: Beichen Zhang, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
Title: Think Visually, Reason Textually: Vision-Language Synergy in Abstract Reasoning
Abstract: Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models.
Paperid: 3076,   Poster  
Authors: Enhuai Liu, Yunke Wang, Changming Sun, Chang Xu
Title: D2Cache: Second-Order Delta Caching for Higher Video Diffusion Acceleration
Abstract: Video diffusion models achieve impressive visual fidelity but remain computationally prohibitive for real-time or interactive generation due to their sequential denoising process. Recent caching methods accelerate inference by reusing outputs across timesteps, typically estimating each new output from the first-order residual, which is the difference between adjacent model predictions. To mitigate the accumulated error in caching methods, we propose D2Cache, a training-free method that leverages the smoothness of the second-order residual delta, i.e., the temporal difference between consecutive first-order residuals, to predict future timesteps more accurately. We theoretically show that this second-order correction improves prediction accuracy and effectively suppresses cumulative errors. Moreover, D2Cache adaptively scales second-order deltas using error estimates derived from timestep embeddings, maintaining accuracy across varying cache intervals. Empirically, D2Cache outperforms the state-of-the-art TeaCache across four video diffusion models (Latte, Open-Sora, LTX-video, and Wan2.1) at comparable acceleration rates, showing even larger gains under higher acceleration settings.
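Illustrative note on the D2Cache abstract above: a minimal sketch of first-order versus second-order extrapolation, using scalars in place of model outputs. The function names and the `scale` parameter are hypothetical; the abstract states that the scaling is derived from timestep embeddings, which this sketch does not model.

```python
def predict_first_order(outputs):
    """Extrapolate the next output by reusing the latest first-order residual."""
    residual = outputs[-1] - outputs[-2]
    return outputs[-1] + residual


def predict_second_order(outputs, scale=1.0):
    """Extrapolate with a second-order correction (hypothetical sketch).

    The second-order delta is the difference between the two most recent
    first-order residuals; `scale` stands in for an adaptive weight.
    """
    prev_residual = outputs[-2] - outputs[-3]
    last_residual = outputs[-1] - outputs[-2]
    delta2 = last_residual - prev_residual
    return outputs[-1] + last_residual + scale * delta2


# A smooth but curved trajectory: f(t) = t^2 at t = 0, 1, 2 (true next value is 9).
history = [0.0, 1.0, 4.0]
first = predict_first_order(history)
second = predict_second_order(history)
```

On this quadratic toy trajectory the first-order guess undershoots the true value, while the second-order guess recovers it, which is the intuition behind correcting a cached residual with its own rate of change.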
Paperid: 3077,   Poster  
Authors: Rahul Mysore Venkatesh, Klemen Kotar, Lilian Chen, Wanhee Lee, Gia Ancone, Seungwoo Kim, Luca Wheeler, Jared Watrous, Honglin Chen, Daniel Bear, Stefan Stojanov, Daniel LK Yamins
Title: Physical Object Understanding with a Physically Controllable World Model
Abstract: A central challenge in visual intelligence is learning the physical structure of scenes from raw videos: how regions form objects and the laws that govern their interactions. Solving these tasks requires world models capable of inferring distributional states of the world from partial observations, capabilities that current architectures do not provide. We introduce a new class of probabilistic world models that support estimation of the probability of any visual variable, such as appearance and dynamics, conditioned on any other variables. Here, we identify that these models can be trained efficiently with autoregressive sequence modeling, yielding world models from which rich object understanding emerges. First, we demonstrate that our model captures the physical laws governing how objects move by generating multiple plausible future states of the world through sequential inference. Then, by analyzing motion correlations across these futures, we extract coherent physical objects and articulated object subparts, achieving state-of-the-art results on SpelkeBench and DragAMove. Having discovered these objects, our world model can manipulate them in 3D, emerging as the strongest performer on 3DEditBench. Finally, we demonstrate that physical relationships between objects can be computed from the world model, enabling applications such as Visual Jenga.
Paperid: 3078,   Poster  
Authors: Zixun Wang
Title: Revisiting F-measure Optimization in Multi-Label Classification: A Sampling-based Approach
Abstract: The F-measure is a widely used metric in multi-label classification, where multiple labels are predicted simultaneously for a single instance. The optimal prediction rule for F-measure requires estimating q^2+1 probabilities, where q is the number of labels. Existing approaches train q multinomial estimators (multi-class classifiers) to directly estimate these probabilities, followed by a matrix multiplication for making predictions. However, this method has two major drawbacks. First, the matrix multiplication incurs a time complexity of \mathcal{O}(q^3), which becomes computationally expensive for large q. Second, training multinomial estimators is challenging due to the sparsity of the underlying distributions, which results from the inherent imbalance in multi-label datasets and is further exacerbated by the label transformation required by the method itself. In this paper, we first demonstrate that matrix multiplication can be reformulated as a series of convolutions by exploiting a special structure in the matrix. These convolutions can then be efficiently computed using the Fast Fourier Transform (FFT), reducing the time complexity to \mathcal{O}(q^2\log q). For example, on the COCO dataset, matrix multiplication requires 27 seconds, while FFT takes only 1 second, resulting in a 27x speedup. To avoid multinomial label transformation, we propose an indirect sampling-then-estimation approach to estimate the required probabilities. This method trains only q binary estimators instead of multinomial ones, thereby alleviating the sparsity issue, simplifying the training process, and improving performance. We provide theoretical guarantees for the consistency of the proposed sampling-based method and demonstrate its effectiveness through extensive experiments on diverse datasets.
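Illustrative note on the F-measure abstract above: the sketch below shows only the general principle that a convolution can be computed through the FFT instead of a direct double loop. It does not reproduce the paper's specific matrix structure, which the abstract does not spell out; all function names are hypothetical and the FFT is a plain radix-2 recursion written for readability.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + twiddle * odd[k]
        out[k + n // 2] = even[k] - twiddle * odd[k]
    return out


def convolve_fft(x, y):
    """Linear convolution via zero-padded FFT, O(n log n)."""
    size = len(x) + len(y) - 1
    n = 1
    while n < size:
        n *= 2
    fx = fft([complex(v) for v in x] + [0j] * (n - len(x)))
    fy = fft([complex(v) for v in y] + [0j] * (n - len(y)))
    product = [a * b for a, b in zip(fx, fy)]
    inverse = fft(product, invert=True)
    # The unnormalized inverse transform must be divided by n.
    return [(v / n).real for v in inverse[:size]]


def convolve_direct(x, y):
    """Reference linear convolution with a double loop, O(n^2)."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out


fast = convolve_fft([1, 2, 3], [4, 5])
slow = convolve_direct([1, 2, 3], [4, 5])
```

If each of q rows of a structured product reduces to one length-q convolution, replacing the direct O(q^2) loop per row with an O(q log q) FFT is what turns a cubic cost into the q^2 log q figure quoted in the abstract.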
Paperid: 3079,   Poster  
Authors: Yi Wang, Ningze Zhong, Zhiheng Fu, Longguang Wang, Ye Zhang, Yulan Guo
Title: MangoBench: A Benchmark for Multi-Agent Goal-Conditioned Offline Reinforcement Learning
Abstract: Offline Multi-Agent Reinforcement Learning (MARL) is critical for coordinating multiple agents in costly and unsafe environments, yet existing methods suffer from high sensitivity to reward functions and weak generalization to new goals, limiting their practical impact. Inspired by single-agent Offline Goal-Conditioned RL (OGCRL), we propose the first goal-conditioned offline MARL framework, extending OGCRL to multi-agent settings under both fully decentralized and centralized training with decentralized execution (CTDE) paradigms. To systematically evaluate this setting, we introduce MangoBench, the first fully cooperative multi-goal benchmark for MARL, covering 3 environments, 4 agent types, and 47 tasks, designed to assess joint-control locomotion, synchronous and asynchronous bimanual manipulation, and robustness to high-dimensional inputs. Extensive experiments demonstrate that our baselines achieve strong multi-goal generalization under sparse rewards, yet no method dominates all tasks, revealing both the intrinsic complexity and the unexplored potential of goal-conditioned offline MARL.
Paperid: 3080,   Poster  
Authors: Jinfu Fan, Jiangnan Li, Xiaohui Zhong, Kangrui Ren, Zhencun Jiang, 福建话 赣方言, Tianhao Gu, Linqing Huang
Title: Evidential Deep Partial Label Learning to Quantify Disambiguation Uncertainty
Abstract: Partial label learning (PLL) is a weakly supervised learning paradigm in which each instance is assigned a set of candidate labels, only one of which is true. However, due to potentially inaccurate annotations, existing PLL algorithms disambiguate labeling by minimizing the prediction loss, which leaves the model unaware of its prediction credibility. To address this issue, this paper proposes evidential deep partial label learning (ED-PLL) to quantify disambiguation uncertainty, aiming to achieve candidate label disambiguation and reliability prediction. Firstly, we extend the evidence modeling mechanism to PLL, treating the candidate label set as the source of evidence for the label hypothesis, and using belief and credibility to model classification uncertainty, thereby guiding a more reliable disambiguation process. Meanwhile, we propose the expectation calculation under the Dirichlet distribution of non-candidate labels, which suppresses the output of non-candidate labels by using consistency regularization to further improve the accuracy of disambiguation. Furthermore, a conflict-aware regularization is proposed to evaluate the degree of conflict, which measures the consistency between instances within the class by combining the differences in the distribution of prediction results and model uncertainty, and thus improves the robustness of the model. In addition, this paper theoretically analyzes our method from the perspective of the Expectation-Maximization (EM) algorithm, and ED-PLL is compatible with any deep network or stochastic optimizer. Experiments on benchmark and real datasets verify the effectiveness of the proposed algorithm.
Paperid: 3081,   Poster  
Authors: Guangchen Shi, Yirui Wu, Zhu Wei, Tao Wang, Hao Zhang, Bo Li, Tong Lu
Title: Bayesian Decomposition and Semantic Completion for Few-shot Semantic Segmentation
Abstract: Few-shot Semantic Segmentation (FSS) aims to segment objects of novel categories given only a handful of labeled examples. However, existing methods often rely on complex category-specific modeling, resulting in high computational cost and limited generalization under low-data regimes. To address these challenges, we propose a Bayesian Probabilistic Network (BPNet) that reformulates FSS as a composition of three interpretable components: a prior, a likelihood, and a class-consistency term. Specifically, an efficient Segment Anything Model (SAM) is employed to generate fragmented prior regions for the query image, while both the likelihood and the consistency terms are estimated by a lightweight Class-Agnostic Localization Model (CALM). CALM simultaneously predicts the class consistency between support-query pairs through a binary classification head and estimates the likelihood by localizing the target region in the support image. By evaluating SAM-generated regions in parallel, CALM can efficiently identify the core region, thereby transforming the segmentation problem into a simple binary classification task. Furthermore, to mitigate the semantic incompleteness of SAM proposals, we introduce an attention-based Semantic Completion Module (SCM), which leverages local and global context cues to integrate fragmented regions into semantically complete masks. Extensive experiments demonstrate that BPNet achieves state-of-the-art performance while maintaining high efficiency.
Paperid: 3082,   Poster  
Authors: Xilin He, Xiaole Xian, Xiangyu Yue, Muhammad Haris Khan
Title: StyleDoctor: Towards Specialist Reward Model for Style-centric Generation Tasks
Abstract: Style generation has made significant progress through diffusion models. Recent efforts have explored reinforcement learning with human-preference reward models to enhance diffusion models for general downstream applications. However, we identify a critical limitation: existing human-preference reward models struggle to effectively perceive image style, resulting in suboptimal performance after reinforcement fine-tuning. To address this, we first introduce a large-scale style reward modeling dataset comprising 400K paired samples spanning 1,000 diverse style categories, augmented with textual instructions and style reward annotations. We then propose StyleDoctor, a novel style perception reward model capable of jointly evaluating style consistency between paired images and style-text alignment. StyleDoctor outperforms existing style perception models in both style retrieval and generation tasks. Extensive quantitative and qualitative experiments demonstrate the superiority of StyleDoctor over competing approaches, showcasing its efficiency and versatility in style-conditioned generation. Our dataset and code will be made public upon acceptance.
Paperid: 3083,   Poster  
Authors: Chenshuang Zhang, Kyeongseon Kim, Chengxin Liu, Tae-Hyun Oh
Title: SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models
Abstract: Unlike environmental sounds that mainly indicate event occurrence (e.g., dog barking), human speech carries rich semantics and temporal structures. Despite the advancement of audio-visual large language models (LLMs) in video understanding, it remains unexplored whether current models can accurately align speech contents with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs, where models generate inaccurate or misleading outputs. To systematically study this, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech–vision hallucination in audio-visual LLMs. Our benchmark diagnoses speech–vision hallucinations from two complementary perspectives: semantic and temporal. Experimental results demonstrate that most advanced audio-visual LLMs struggle with aligning speech content with corresponding visual signals. Our work uncovers a fundamental limitation of current audio-visual LLMs and highlights the need for speech-aware and grounded speech-video perception and comprehension. Code will be released upon acceptance.
Paperid: 3084,   Poster  
Authors: Honggyu An, Jaewoo Jung, Mungyeom Kim, Chaehyun Kim, Minkyeong Jeon, Jisang Han, Kazumi Fukuda, Takuya Narihira, HYUNAH KO, Junsu Kim, Sunghwan Hong, Yuki Mitsufuji, Seungryong Kim
Title: Learning Compact 3D Representations from Feed-Forward Novel View Synthesis
Abstract: Reconstructing and understanding 3D scenes from sparse views in a feed-forward manner remains challenging. While recent approaches use per-pixel 3D Gaussian Splatting for reconstruction and 2D-to-3D feature lifting for scene understanding, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation. We propose a feed-forward framework that estimates compact Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns to efficiently lift features. Extensive experiments on 3D open-vocabulary segmentation and view-invariant feature generation demonstrate our approach's effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction, achieving superior memory efficiency and feature fidelity compared to existing methods. All of our code will be made publicly available.
Paperid: 3085,   Poster  
Authors: Zheng Zhang, Qinchuan Zhang, Yuteng Ye, Zhi Chen, Penglei Ji, Mengfei Li, Wenxiao ZHANG, Yuan Liu
Title: MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts
Abstract: Generating high-quality textures for 3D assets is a challenging task. Existing multiview texture generation methods suffer from multiview inconsistency and missing textures on unseen parts, while UV inpainting texture methods do not generalize well due to insufficient UV data and cannot fully exploit 2D image diffusion priors. In this paper, we propose a new method called MV2UV that combines 2D generative priors from multiview generation and the inpainting ability of UV refinement to get high-quality texture maps. Our key idea is to adopt a UV space generative model that simultaneously inpaints unseen parts of multiview images while resolving the inconsistency of multiview images. Experiments show that our method achieves better texture generation quality than existing methods, especially in unseen occluded and multiview-inconsistent parts.
Paperid: 3086,   Poster  
Authors: Buzhen Huang, Chongyang Xu, Wentao Tang, Yuan Shu, Jingyi Ju, Binghui Zuo, Yangang Wang
Title: Occluded Human Body Capture with Frequency Domain Denoising Prior
Abstract: Monocular human motion capture in occlusion scenarios presents significant challenges. Although a few works have explicitly considered the occlusion problem, image-based methods are unreliable due to the lack of temporal constraints, while video-based approaches cannot gain sufficient knowledge from time-domain motion priors to address long-term occlusions. However, occluded human motion typically exhibits periodic patterns and consistent momentum. Inspired by this observation, we exploit reliable image observations in the frequency domain and formulate the motion capture task as a wavelet coefficient selection process. Specifically, we first construct probabilistic distributions for the occluded 2D keypoints, and then introduce a frequency domain diffusion model to refine the distributions by learning long-term periodic information and physical momentum with Discrete Wavelet Transform (DWT). Consequently, the learned denoising prior can select valid wavelet components to facilitate the 3D motion capture with a 3D decoder. By employing a joint reprojection strategy, we can also use the same diffusion process to train the 3D decoder. To further promote human occlusion-related tasks, we also present the first 3D occluded motion dataset, OcMotion, which serves as a new benchmark for both training and evaluation. Experimental results demonstrate that our method can produce accurate and coherent human motions from occluded videos. The dataset and code will be publicly available.
Paperid: 3087,   Poster  
Authors: Seunghwan Choi, Jooyeol Yun, Youngdo Lee, Jaegul Choo
Title: Selectively Extracting and Injecting Visual Attributes into Text-to-Image Models
Abstract: Text-to-image models are increasingly utilized in design workflows, but articulating nuanced design intentions through text remains a challenge. This work proposes a method that extracts a visual attribute from a reference image and injects it directly into the generation pipeline. The method optimizes a text token to exclusively represent the target attribute using a custom training prompt and two novel embeddings: distilled embedding and residual embedding. Through this approach, a wide range of attributes can be extracted, including the shape, material, or color of an object, as well as the camera angle of the image. The method is validated on various target attributes and text prompts drawn from a newly constructed dataset. The results show that it outperforms existing approaches in selectively extracting and applying target attributes across diverse contexts. Ultimately, the proposed method enables intuitive and controllable text-to-image generation, streamlining the design process.
Paperid: 3088,   Poster  
Authors: XUSHENG LIANG, Lihua Zhou, Nianxin Li, miao xu, Ziyang Song, Dong Yi, Jinlin Wu, Jiawei Ma, Hongbin Liu, Zhen Lei, Jiebo Luo
Title: Multimodal Causality-Driven Representation Learning for Generalizable Medical Image Segmentation
Abstract: Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP's cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary through text prompts, specifically designed to represent domain-specific variations; second, it trains a causal intervention network that utilizes this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical for segmentation tasks. Extensive experiments demonstrate that MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
Paperid: 3089,   Poster  
Authors: Shihao Shan, Hongying Liu, Fanhua Shang, Liang Wan, Jingjing Deng
Title: Beyond the Static-World: Lifelong Learning for All-in-One Medical Image Restoration
Abstract: All-in-One Medical Image Restoration (MedIR) models offer a promising path towards generalized medical imaging intelligence but face two critical spatiotemporal challenges: 1) Spatial Modality Interference, where conflicting gradients from diverse modalities (e.g., MRI, CT, PET) degrade performance; and 2) a Temporal Static-World Assumption that ignores the continual data streams in real-world clinical settings, leading to catastrophic forgetting. To address this dual challenge, we propose Resilient On-the-fly Medical Enhancement (ROME), a novel lifelong learning framework governed by a "Disentangle-Optimize-Consolidate" paradigm. ROME first resolves the foundational modality conflict via the Modality-Invariant Disentanglement via Adversarial Balancing (MIDAB) module. It establishes a strategic "adversarial balance" between a "content preservation force" and a "modality erasure force" to optimize a disentangled, unified feature manifold. Building on this stable foundation, the Adaptive Feature Consolidation (AFC) module combats forgetting. AFC dynamically locates an optimal feature consolidation point via a prediction network, enforced by a novel Diversity Loss to ensure robust continuous learning. Experiments demonstrate that ROME not only achieves SOTA performance in static settings but also exhibits superior resilience in rigorous domain-incremental benchmarks, reducing the average catastrophic performance degradation by over 10%.
Paperid: 3090,   Poster  
Authors: Wenliang Zhong, Rob Barton, Lucas Goncalves, Kushal Kumar, Feng Jiang, Hehuan Ma, Yuzhi Guo, Vidit Bansal, Karim Bouyarmane, Junzhou Huang
Title: Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent
Abstract: Unifying image clustering across different clustering scenarios remains challenging due to fundamental gaps among tasks. We introduce a Guideline-Driven Image Clustering Agent, the first universal framework that bridges these gaps through textual guidelines. To incorporate complex guidelines without task-specific training, we propose Generative Concept Proxy Modeling, which generates guideline-aware embeddings via concept proxy extraction. For scenarios requiring automatic cluster discovery, we introduce MST-based LLM Traversal that selectively applies LLM reasoning for complex semantic judgments, reducing computational costs. Our method generalizes across diverse clustering scenarios spanning from general to fine-grained categorization, from global to local criteria, and from balanced to long-tail distributions. We demonstrate superior performance across various clustering tasks, consistently outperforming specialized state-of-the-art methods.
Paperid: 3091,   Poster  
Authors: Yuan Gao, Tianle Ding, Yuqing Zhu, Tianzhu Zhang
Title: EV-CGNet: Co-visible Focused 3D-guided 2D Event Keypoint Detection Network
Abstract: Event keypoint detection has garnered significant attention due to its crucial role in extracting spatial relationships between matched keypoints, which are fundamental for various computer vision tasks. However, achieving robust event keypoint detection remains challenging because of the difficulty of balancing the exploitation of event information with compatibility with established algorithms. Moreover, the limited use of co-visible information often results in excessive keypoint detection in non-matching regions, leading to incorrect matches. To address these challenges, we propose a novel Co-visible Focused 3D-guided 2D Event Keypoint Detection Network (EV-CGNet), which mainly consists of a 3D-guided 2D feature prototype learning (G2PL) module and a co-visible region-focused detector and descriptor learning (CDDL) module. The proposed method enjoys several merits. First, the proposed G2PL module can enhance event frame feature prototypes by recovering motion information with guidance from event points. Second, the proposed CDDL module can direct keypoint detection toward co-visible regions and ensure accurate matches. Comprehensive experimental evaluations on six challenging benchmarks show that our method significantly surpasses state-of-the-art event keypoint detection methods.
Paperid: 3092,   Poster  
Authors: Zheng Fang, Lichuan Xiang, Xu Cai, Bing Wang, Bo Yang, Hongkai Wen
Title: DynFusion: Rethinking Condition Fusion for Adaptive Multi-condition Text-to-Image Generation
Abstract: Text-to-image diffusion models have achieved remarkable progress, generating visually realistic and semantically coherent images from textual prompts. However, natural language alone lacks the precision required for design-centric applications that demand strict spatial and structural fidelity, particularly when representing complex concepts that integrate multi-level information, such as product or scene design. To address this limitation, controllable diffusion frameworks introduce auxiliary conditions (e.g., depth, edge, or reference images) to guide the generative process. Models like ControlNet and IP-Adapter effectively inject such priors, improving structural or appearance alignment. Yet, real-world design tasks rarely depend on a single type of condition. They often require simultaneous integration of multiple heterogeneous cues, for instance, preserving spatial layout from depth maps, structural outlines from edge maps, and stylistic attributes from reference images. Current approaches either handle only one condition or naively stack multiple ones, resulting in computational inefficiency and conflicting guidance that degrade generation quality. This multi-condition inconsistency forms a critical bottleneck for applying diffusion models to real-world design workflows, motivating our proposed framework. We propose a data-driven adaptive condition fusion mechanism for multi-conditional diffusion. Our method introduces a novel condition adaptation module that dynamically selects and fuses subsets of conditions based on the diffusion timestep, task characteristics, and feature injection position. This adaptive strategy harmonizes diverse structural and appearance priors, achieving controllable yet flexible generation in complex design scenarios. Experiments demonstrate significant improvements in fidelity, consistency, and controllability across multi-condition tasks, establishing a new direction for practical, detail-preserving diffusion-based design generation.
Paperid: 3093,   Poster  
Authors: Gengze Zhou, Tianyu Wang, Soo Ye Kim, ZHIXIN SHU, Xin Yu, Yannick Hold-Geoffroy, Sumit Chaturvedi, Qi Wu, Zhe Lin, Scott Cohen
Title: Lightmover: Towards Precise and Efficient Control for Light Movement
Abstract: We present LightMover, a framework for controllable light manipulation in single images that leverages video diffusion priors to produce physically plausible illumination changes without re-rendering the scene. We formulate light editing as a sequence-to-sequence prediction problem in visual token space: given an image and light-control tokens, the model adjusts light position, color, and intensity together with resulting reflections, shadows, and falloff from a single view. This unified treatment of spatial (movement) and appearance (color, intensity) controls improves both manipulation and illumination understanding. We further introduce an adaptive token-pruning mechanism that preserves spatially informative tokens while compactly encoding non-spatial attributes, reducing control sequence length by 41% while maintaining editing fidelity. For training our framework, we construct a scalable rendering pipeline that can generate large numbers of image pairs across varied light positions, colors, and intensities while keeping the scene content consistent with the original image. LightMover enables precise, independent control over light position, color, and intensity, and achieves high PSNR and strong semantic consistency (DINO, CLIP) across different tasks.
Paperid: 3094,   Poster  
Authors: Liao Kailun, Jianfeng Yang, Tao Tao, Wu Wenfei, Jiaming Jiang, Jinsheng Xiao
Title: Multi-Prototype Compactness and Boundary-Aware Synthesis for Unsupervised Anomaly Detection
Abstract: Unsupervised Anomaly Detection (UAD) is crucial for industrial quality control. Many existing embedding-based methods rely on a single-prototype assumption, learning, for instance, a compact hypersphere to enclose all normal features. However, this strategy often fails when confronted with significant intra-class variance caused by factors like illumination, pose, and texture. To accommodate all diverse normal samples, the decision boundary of a single prototype must become overly general and loose, inevitably causing the model to miss subtle anomalies. To overcome this limitation, we propose PGBL (Prototype-Guided Boundary Learning), a framework that synergizes structured representation learning with targeted anomaly synthesis. First, we introduce the Multi-Prototype Compact Learning (MPCL) module, which explicitly models the complex normal feature distribution as a mixture of multiple semantic prototypes. This allows the model to learn tighter, local representations for each normal sub-pattern instead of a single loose, global boundary. Second, inspired by synthesis methods, we design the Boundary Pseudo-Anomaly Synthesis (BPAS) module. Unlike previous "blind" synthesis strategies, BPAS is a novel targeted strategy that first identifies feature points on the boundaries of the MPCL-defined clusters and then generates high-difficulty pseudo-anomalies only in these critical regions. Finally, a Discriminative Boundary Refiner (DBR) learns to shape the final decision surface by distinguishing between the compact normal clusters and the synthesized boundary anomalies. Extensive experiments demonstrate that PGBL achieves superior anomaly detection performance, significantly outperforming competitors.
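Illustrative note on the PGBL abstract above: a toy comparison of single-prototype and multi-prototype anomaly scoring, using Euclidean distance to the nearest prototype. This only illustrates why one loose prototype can miss anomalies that sit between normal modes; it is not the MPCL module, and the function names and 2D points are assumptions chosen for the example.

```python
import math


def single_prototype_score(x, normal_samples):
    """Distance to the mean of all normal samples (one global prototype)."""
    dims = len(normal_samples[0])
    center = tuple(
        sum(sample[d] for sample in normal_samples) / len(normal_samples)
        for d in range(dims)
    )
    return math.dist(x, center)


def multi_prototype_score(x, prototypes):
    """Distance to the nearest of several prototypes (one per normal sub-pattern)."""
    return min(math.dist(x, p) for p in prototypes)


# Two well-separated normal modes, e.g. two lighting conditions (toy data).
normal = [(0.0, 0.0), (10.0, 0.0)]
query = (5.0, 0.0)  # lies between the modes, far from every normal sample

score_single = single_prototype_score(query, normal)
score_multi = multi_prototype_score(query, normal)
```

The single global prototype lands exactly on the query, so the query looks perfectly normal, while the nearest-prototype score flags it as far from every normal mode.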
Paperid: 3095,   Poster  
Authors: Guohao Zhao, Yuxin Peng
Title: PG-VTON: Single-Pass Training-Free Virtual Try-On via Patch-Guided Reference Alignment
Abstract: Virtual try-on (VTON) aims to render a target garment onto a person while preserving pose, identity, and fine-grained appearance. Most existing methods rely on supervised paired data, limiting cross-domain generalization, while recent training-free approaches, though more robust, require multiple diffusion calls and complex compositing, making deployment impractical. We propose PG-VTON, a single-pass, training-free framework based on Patch-Guided Reference Alignment. Our key insight is that modern inpainting diffusion models already possess strong in-context completion: given a masked person and a small garment patch, they can synthesize plausible, pose-consistent clothing without task-specific training. PG-VTON exploits this capability with two lightweight components: Patch-Anchored Identity Priming (PIP) injects a localized garment patch only in early denoising steps to anchor garment identity, and Reference-Aware Attention (RAA) strengthens attention from masked-region tokens to garment tokens to enhance detail transfer, all without modifying model weights. With a single diffusion pass, PG-VTON achieves state-of-the-art performance among training-free methods on DressCode and VITON-HD and generalizes effectively to subject insertion.
Paperid: 3096,   Poster  
Authors: Tarun Gehlaut, Difan Liu, Charu Bansal, Krutik Malani, Souymodip Chakraborty, Ankit Phogat, Matthew Fisher, Vineet Batra
Title: VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation
Abstract: Recent vision-language model (VLM)-based approaches have achieved impressive results on image vectorization tasks. However, they are typically evaluated on synthetic benchmarks, where clean SVGs are rasterized at high resolution and then re-vectorized. As a result, these methods generalize poorly to real-world scenarios, such as images with unknown rasterization methods or those generated by text-to-image models. We introduce VectorArk, a new VLM-based model designed for robust and practical image vectorization. VectorArk employs a novel rounded polygon representation that simplifies the learning process while naturally producing smooth, visually appealing primitives. We also propose a degradation model that enhances robustness across diverse and imperfect inputs. Our experiments show that, in contrast to previous methods, VectorArk achieves superior geometric completeness and artifact suppression across multiple datasets, with comprehensive ablations validating the contribution of each component.
Paperid: 3097,   Poster  
Authors: Da Peng, Xuesong Yang, Zonghao Guo, Yichen Zhang, Chi Chen, Yidan Zhang, Yuan Yao, Fang Wan, Wei Ke, Maosong Sun
Title: FlexiVideo: Variation-Aware Temporal Dynamics Modeling for Efficient Video Understanding
Abstract: Natural videos exhibit heterogeneous temporal dynamics, with certain segments undergoing high-dynamic scene transitions and others dominated by low-dynamic visual changes. However, treating all frames identically, a common practice in most MLLMs, leads to redundant visual encoding, which results in significant computational overhead. The recent state-of-the-art model, i.e., Qwen2.5-VL, adopts a fixed two-frame encoding scheme, but our pilot experiments indicate that it encounters a visual confusion problem under high-dynamic frame pairs. To address this issue, we propose FlexiVideo, an efficient MLLM that models temporal dynamics leveraging visual variation. FlexiVideo first employs an adaptive temporal segmentation module to estimate inter-frame differences, grouping consecutive frames into scene segments with subtle visual changes. Subsequently, a dynamical spatio-temporal embedding module adjusts the temporal window for scene-level encoding. By restructuring scene-level visual representations within a structured temporal organization, our approach models dynamics more effectively and reduces the encoding burden while preserving fine-grained visual variations. Extensive experiments show that FlexiVideo-3B consistently outperforms Qwen2.5-VL-3B across 6 general video benchmarks. Notably, when evaluated on MotionBench at 10 FPS, FlexiVideo-3B reduces visual tokens by 43.5% compared with Qwen2.5-VL-3B while achieving a 1.3% performance gain, striking a significantly better balance between efficiency and effectiveness. Code and checkpoints will be released soon.
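Note: the idea of grouping consecutive frames by inter-frame difference can be sketched as below. This is a minimal illustration, not FlexiVideo's segmentation module: the mean-absolute-difference measure, the fixed `threshold`, and the flat-list frame format are all assumptions made for the example.

```python
def segment_frames(frames, threshold):
    """Group consecutive frame indices into segments.

    A new segment starts whenever the mean absolute pixel difference
    between a frame and its predecessor exceeds `threshold`.
    Each frame is a flat list of pixel values of equal length.
    """
    if not frames:
        return []
    segments = [[0]]
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > threshold:
            segments.append([i])    # large change: open a new scene segment
        else:
            segments[-1].append(i)  # subtle change: extend the current segment
    return segments

# Four tiny hypothetical frames: a scene cut happens between index 1 and 2.
frames = [[0, 0, 0, 0], [0, 1, 0, 0], [9, 9, 9, 9], [9, 9, 8, 9]]
print(segment_frames(frames, threshold=2.0))  # [[0, 1], [2, 3]]
```

Frames inside one segment change only slightly, so a longer temporal window could be shared across them, whereas the cut between segments marks where separate encoding would be needed.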
Paperid: 3098,   Poster  
Authors: Jinjia Peng, Jican Tan, Jiazuo Yu, Zeze Tao, Huibing Wang
Title: Dynamic Magic: Unleashing Restricted Knowledge for Lifelong Person Re-Identification
Abstract: Lifelong Person Re-Identification (LReID) aims to adapt to new domains while preserving old knowledge. Existing methods, whether distillation-based or rehearsal-based, attempt to consolidate diverse knowledge within a fixed model architecture. However, the limited adaptability of such architectures often leads to catastrophic forgetting by overwriting previously acquired knowledge. To overcome this limitation, we propose Versatile Incremental Adaptation (VIA), a novel dynamic expansion framework for LReID, which unleashes restricted knowledge during continuous learning by large pre-trained models. Specifically, an Unseen-domain person Adapter (UnA) is embedded in VIA, which employs incremental modular learning to capture domain-specific knowledge, thereby reducing cross-domain interference and releasing task-specific capacity that was previously limited by static parameter sharing. Meanwhile, considering the substantial amount of shared knowledge across domains in LReID, we design the Domain-aware Dispatch (DAD) module to enable inter-domain collaboration and knowledge reuse through adaptive cooperation among multiple shared adapters. Furthermore, a Holistic Domain Controller (HDC) is designed to dynamically regulate the learning capacity for new domains based on knowledge similarity, thereby effectively unleashing the generalization potential of pre-trained models. Additionally, a lightweight Similarity-Guided Auto-Selector (SGAS) is proposed to automatically assign inputs to relevant adapters during inference. Extensive experiments validate the effectiveness of VIA, which surpasses state-of-the-art methods across both seen and unseen domains.
Paperid: 3099,   Poster  
Authors: Yunpeng Yin, Lihan Wang, Zhaoshen He, Xinqiang He, Xingming Liao, Zhuowei Wang, Lianglun Cheng
Title: MMVIP: A Visible-infrared Paired Dataset for Multi-weather Marine Vision
Abstract: Maritime multimodal vision faces significant challenges due to the complexity and variability of oceanic weather and environmental conditions. While modern vessels are commonly equipped with visible and infrared imaging systems, the complementary nature of these modalities fundamentally depends on accurate cross-modal registration. However, the absence of paired visible–infrared datasets that realistically capture diverse maritime scenarios has severely hindered progress in this field. To overcome this limitation, we present MMVIP, the first large-scale visible–infrared maritime vision dataset covering a wide spectrum of weather conditions and sea states. The dataset contains 128,100 images and 50 video sequences with precise spatial–temporal alignment. Comprehensive evaluations across image registration, fusion, maritime object detection, and cross-modal image translation tasks demonstrate the dataset’s effectiveness and challenge. Furthermore, MMVIP establishes a new benchmark for advancing multimodal maritime perception. The dataset and corresponding benchmarks are publicly available.
Paperid: 3100,   Poster  
Authors: Pengcheng Luo, Zexi Jia, Yijia Zhong, Jinchao Zhang, Jie Zhou
Title: GROW: Watermark Generation with Progressive Guidance for Diffusion Models
Abstract: Digital watermarking is a cornerstone for copyright protection. With the rapid advancement of generative models like diffusion models, in-generation and training-free watermarking techniques have garnered more attention for their endogeneity and convenience. These methods typically embed a watermark into the initial noise, where watermark extraction relies on Denoising Diffusion Implicit Models (DDIM) inversion. However, the computationally intensive extraction process severely hinders their path toward practical deployment. To overcome this critical bottleneck, we propose GROW, a novel training-free paradigm that reframes watermarking from a one-shot "embedding" to a progressive "growth". By progressively guiding generation with frequency-domain gradients, GROW naturally weaves the watermark into the image, which enables inversion-free extraction. Comprehensive experiments on multiple datasets show that GROW not only achieves superior robustness and imperceptibility but also offers a detection speed nearly 100x faster than inversion-based techniques. The code will be made publicly available.
Paperid: 3101,   Poster  
Authors: Wanting Geng, Xin Chen, Chuanyu Sun, Jie Zhao, Ben Kang, Dong Wang, Huchuan Lu
Title: TGTrack: Temporal Generative Learning for Unified Single Object Tracking
Abstract: Existing single object trackers typically treat temporal modeling superficially by passing limited inter-frame information, such as propagated tokens or template updates, without intrinsic temporal supervision learning. To address this limitation, we propose TGTrack, a new unified tracking framework that incorporates a temporally generative supervision task to guide the model in learning temporal dynamics. The core of TGTrack is a temporally generative learning paradigm equipped with a transformer-based generative decoder, which consists of a gated fusion module and an autoregressive prediction mechanism. This joint design enables the model to infer future scenarios from preceding information, thereby improving its ability to model both visual appearance and temporal dynamics. Furthermore, we introduce a time token embedding to explicitly encode the temporal position of each frame. Experiments on 11 benchmarks spanning five modalities show that TGTrack achieves state-of-the-art performance in robust unified tracking. For instance, TGTrack-B384 achieves an AUC of 75.3% on LaSOT. Code and models will be made available.
Paperid: 3102,   Poster  
Authors: Mashiat Mustaq, Xavier Michel Tricoche
Title: Globscope: Toward a Global View of the Loss Landscape
Abstract: Understanding the global structure of neural network loss landscapes is important for gaining insight into model merging, hyperparameter selection, generalization, and the relationships between distinct solutions. Visualizing the global structure of loss landscapes is very challenging because of the high dimensionality of the parameter space of neural networks. Prior work has primarily focused on visualizing the loss landscape around one single basin, missing how different minima or basins relate to each other. We introduce Globscope, a framework for providing a global view of the loss landscape across multiple solutions or basins. Globscope learns a low-dimensional non-linear manifold of model parameters using an autoencoder framework, enabling both latent-space visualization and reconstruction of full model weights. Then it summarizes the relations among minima and connecting regions on this manifold through topological data analysis. Our framework produces continuous, interpretable visualizations that reveal global connectivity patterns in the landscape. We compare Globscope with kernel-based methods and demonstrate how it performs in preserving the global structure across diverse solutions. We further show how Globscope can be used to analyze two applications: revealing global low-loss solution pathways between distinct solutions using mode connectivity algorithms, and visualizing permutation symmetries of different solutions using re-basin approaches.
Paperid: 3103,   Poster  
Authors: Yonghao Zhao, Yupeng Gao, Jian Yang, Jin Xie, Beibei Wang
Title: GOR-IS: 3D Gaussian Object Removal In the Intrinsic Space
Abstract: Recent advances in Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have made it standard practice to reconstruct 3D scenes from multi-view images. Removing objects from such 3D representations is a fundamental editing task that requires complete and seamless inpainting of occluded regions, ensuring consistency in geometry and appearance. Although existing methods have made notable progress in improving inpainting consistency, they often neglect global lighting effects, leading to physically implausible results. Moreover, these methods struggle with view-dependent non-Lambertian surfaces, where appearance varies across viewpoints, leading to unreliable inpainting. In this paper, we present 3D Gaussian Object Removal in the Intrinsic Space (GOR-IS), a novel framework for physically consistent and visually coherent 3D object removal. Our approach decomposes the scene into intrinsic components and explicitly models light transport to maintain the consistency of global lighting effects. Furthermore, we introduce an intrinsic-space inpainting module that operates directly in the material and lighting domains, effectively addressing the challenges posed by non-Lambertian surfaces. Extensive experiments on both synthetic and real-world datasets demonstrate that our framework substantially improves the physical consistency and visual coherence of object removal, outperforming existing methods by 13% in perceptual similarity (LPIPS) and 2 dB in peak signal-to-noise ratio (PSNR).
Paperid: 3104,   Poster  
Authors: Baicheng Chen, Mingda Zhang, Min Zhang, Haizhou Li, Baoyuan Wu
Title: AdapAction: Adaptive Target Action Backdoor Attack against GUI Agents
Abstract: Autonomous Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) are increasingly vital for complex task automation. However, their capacity for self-driven decision-making introduces significant, yet underexplored, security risks, among which backdoor attacks pose a particularly stealthy and high-impact threat. Prior work has shown GUI agents vulnerable to such attacks, but existing methods rely on static trigger-action mappings that execute fixed, context-agnostic behaviors, making them highly detectable. To address this limitation, we introduce AdapAction, a novel backdoor attack that subverts the agent’s decision-making by embedding an adaptive, context-aware policy. Unlike traditional approaches, AdapAction enables the agent to autonomously select environmentally coherent malicious actions based on the current GUI state and user instruction, thereby evading detection while preserving functional utility. Extensive experiments on the Android-In-The-Zoo (AitZ) and AndroidControl benchmarks show that AdapAction achieves up to 100% Attack Success Rate (ASR) while preserving benign task utility. More critically, AdapAction consistently evades a multi-principle-based LLM defense evaluating instruction alignment, visual coherence, and safety, whereas traditional fixed-action attacks are nearly 100% detected. This resilience stems from AdapAction’s contextually grounded malicious actions, which are semantically and visually indistinguishable from legitimate operations. As a result, AdapAction exhibits exceptional stealth and poses a significantly greater real-world threat to LLM-powered GUI agents.
Paperid: 3105,   Poster  
Authors: Jie Qiu, XIN LI, Fan Yang, Yan Wang, Dong Yu, Changying Wang, Linwei Dai, Yongxiang Chen, Youqin Chen, Jianzhang Chen
Title: HySeg: Learning Generative Priors for Structure-Aware Remote Sensing Segmentation
Abstract: High-resolution remote sensing imagery exhibits complex spatial regularities where topology, continuity, and region adjacency govern semantic organization. However, existing remote sensing image semantic segmentation (RSISS) networks, being predominantly discriminative, estimate strong posteriors from data while lacking generative priors that encode such structural dependencies. This imbalance leads to fragmented boundaries, texture overfitting, and poor cross-domain generalization. We address this challenge by reformulating RSISS as posterior inference grounded in generative structural priors, introducing HySeg, a hybrid generative–discriminative segmentation paradigm that learns structure-consistent priors through generative modeling and guides posterior inference for remote sensing segmentation. At its core, the MeanStruct module, a MeanFlow-based generative prior learner, models semantic topology as a continuous stochastic field, while the Prior-to-Affinity Projection (P2A) dynamically transforms this field into topology-aware, class-conditional affinities that guide posterior inference in the Dynamic Affinity-driven Segmentation (DAS) head. Our approach is model-agnostic and seamlessly integrates with diverse backbones, consistently improving structural coherence and generalization. Across four challenging RSISS benchmarks, HySeg achieves state-of-the-art performance and advances remote sensing segmentation from appearance-based perception to structural reasoning. All code and models will be released upon publication.
Paperid: 3106,   Poster  
Authors: Xufan He, Yushuang Wu, Xiaoyang Guo, Chongjie Ye, Jiaqing Zhou, Tianlei Hu, Xiaoguang Han, Dong Du
Title: UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents
Abstract: Part-level 3D generation is essential for applications requiring decomposable and structured 3D synthesis. However, existing methods either rely on implicit part segmentation with limited granularity control or depend on strong external segmenters trained on large annotated datasets. In this work, we observe that part awareness emerges naturally during whole-object geometry learning and propose Geom-Seg VecSet, a unified geometry–segmentation latent representation that jointly encodes object geometry and part-level structure. Building on this representation, we introduce UniPart, a two-stage latent diffusion framework for image-guided part-level 3D generation. The first stage performs joint geometry generation and latent part segmentation, while the second stage conditions part-level diffusion on both whole-object and part-specific latents. A dual-space generation scheme further enhances geometric fidelity by predicting part latents in both global and canonical spaces. Extensive experiments demonstrate that UniPart achieves superior segmentation controllability and part-level geometric quality compared with existing approaches.
Paperid: 3107,   Poster  
Authors: Sitong Gong, Yunzhi Zhuge, Lu Zhang, Jiazuo Yu, Pingping Zhang, Xu Jia, Huchuan Lu
Title: Reinforcing Video Object Segmentation to Think before it Segments
Abstract: Video reasoning segmentation (VRS) endeavors to delineate referred objects in videos guided by implicit instructions that encapsulate human intent and temporal logic. Previous approaches leverage large vision language models (LVLMs) to encode object semantics into <SEG> tokens for mask prediction. However, this paradigm suffers from limited interpretability during inference and suboptimal performance due to inadequate spatiotemporal reasoning. Drawing inspiration from seminal breakthroughs in reinforcement learning, we introduce Veason-R1, a specialized LVLM for VRS that emphasizes structured reasoning in segmentation. Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought (CoT) initialization. To begin with, we curate high-quality CoT training data to instill structured reasoning trajectories, bridging video-level semantics and frame-level spatial grounding, yielding the supervised fine-tuned model Veason-SFT. Subsequently, GRPO fine-tuning encourages efficient exploration of the reasoning space by optimizing reasoning chains. To this end, we incorporate a holistic reward mechanism that synergistically enhances spatial alignment and temporal consistency, bolstering keyframe localization and fine-grained grounding. Comprehensive empirical evaluations demonstrate that Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins (e.g., +1.3 $\mathcal{J}\&\mathcal{F}$ in ReVOS and +10.0 $\mathcal{J}\&\mathcal{F}$ in ReasonVOS), while exhibiting robustness to hallucinations (+8.8 $\mathcal{R}$).
Paperid: 3108,   Poster  
Authors: Shahira Abousamra, Asmita Sood, Sylvia Plevritis
Title: TopoSlide - Topologically-Informed Histopathology Whole Slide Image Representation Learning
Abstract: Histopathology whole slide images are massive gigapixel images that present significant challenges in generating effective representations that accurately capture their histological content and the spatial organization of their various components. In this study, we introduce TopoSlide, a novel approach for self-supervised representation learning specifically designed for whole slide histopathology images. Our method leverages topological features of image data to optimize the learning process. We demonstrate that TopoSlide, even when trained on relatively small datasets, achieves comparable or superior performance to existing pathology foundation models across multiple retrieval and linear probing benchmarks.
Paperid: 3109,   Poster  
Authors: Hongyu Zhang, Haipeng Chen, Zhimin Xu, Chengxin Yang, Yingda Lyu
Title: Diffusion-Based Native Adversarial Synthesis for Enhanced Medical Segmentation Generalization
Abstract: Diffusion models (DMs) demonstrate strong capabilities in generating anatomically realistic medical images, enabling promising avenues for improving model generalization via synthetic augmentation. However, bridging the gap between generative prowess (realism) and measurable improvements in downstream generalization (utility) remains a key challenge. This work unifies theory and practice to tackle two central questions: (1) What to synthesize? We identify synthetic adversariality (the expected empirical loss induced by synthetic data) as a key driver of generalization. Crucially, only native adversariality (i.e., hard examples drawn from the DM's distribution) yields consistent improvements, while artificial adversariality from attack-style perturbations degrades performance. (2) How to synthesize? We introduce the Adversariality Miner, a lightweight, plug-and-play module that efficiently selects initial noise to elicit native adversarial samples, without modifying or retraining the DM. Extensive experiments across diverse diffusion backbones and medical benchmarks confirm the effectiveness of our approach, establishing a principled path toward diffusion-driven generalization.
Paperid: 3110,   Poster  
Authors: Yearang Lee, Ho-Joong Kim, Seong-Whan Lee
Title: TF-CADE: Foreground-Concentrated Text-Video Alignment for Zero-Shot Temporal Action Detection
Abstract: Zero-Shot Temporal Action Detection (ZSTAD) aims to localize and recognize action instances from unseen action categories in untrimmed videos. Although existing methods have shown effectiveness by advancing architectural text-video alignment, they still struggle with capturing semantic distinctions between action classes, resulting in text-irrelevant predictions. To address this issue, we propose a Text-Foreground Concentrated Alignment for zero-shot temporal action DEtector (TF-CADE) that explicitly aligns textual information with action-relevant foreground regions. Specifically, we introduce Action Concentrate Aggregation (ACA), which extracts action concentrate scores to aggregate temporally informative video segments into a foreground-weighted video embedding. This foreground concentrated alignment enhances the semantic consistency between text and video features and improves inter-class discriminability. In addition, a Certainty-based Confidence Re-weighting (CCR) strategy refines per-snippet confidence scores by leveraging foreground-aware similarity, effectively suppressing irrelevant action classes during inference. Extensive evaluations show that our TF-CADE not only achieves state-of-the-art performance under in-distribution settings but also excels in cross-dataset generalization to unseen action classes.
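Note: aggregating segments into a score-weighted video embedding can be sketched as a weighted average. This is an illustrative assumption, not the ACA module itself: the softmax normalisation, the function name, and the toy features are hypothetical, and the abstract does not say how the action concentrate scores are computed or normalised.

```python
import math

def foreground_weighted_embedding(segment_features, scores):
    """Average segment features with softmax weights derived from `scores`.

    segment_features: list of equal-length feature vectors, one per segment.
    scores: one scalar per segment; higher means more action-relevant.
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(segment_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, segment_features))
        for d in range(dim)
    ]

# Two hypothetical segments: the first is scored as strongly foreground.
features = [[1.0, 0.0], [0.0, 1.0]]
print(foreground_weighted_embedding(features, scores=[4.0, 0.0]))
print(foreground_weighted_embedding(features, scores=[0.0, 0.0]))  # plain mean
```

With equal scores the result reduces to the ordinary mean; as one score grows, the embedding moves toward that segment's feature, which is the sense in which background segments contribute less.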
Paperid: 3111,   Poster  
Authors: Zhehan Kan, Xinghua Jiang, Yanlin Liu, Xiaochen Yang, ZHIXIANG WEI, Shifeng Liu, Yubo Zhu, Qingmin Liao, Wenming Yang, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun
Title: UVU: Improving Multimodal Understanding via Vision-Language Unified Autoregressive Paradigm
Abstract: Despite remarkable advancements in multimodal large language models (MLLMs), their fine-grained visual understanding is constrained by reliance on pure textual supervision. To unify understanding and generation capabilities, unified autoregressive multimodal models introduce visual supervision; however, they impair multimodal understanding due to the effects of visual feature discretization and orthogonality between image-text loss gradients. In this paper, we observe that pixel-level image patches and textual tokens coexist in raw high-dimensional spaces with inherent input symmetry. Motivated by this insight, we propose UVU, a novel vision-language unified autoregressive framework that eschews vector quantization. It uniquely employs continuous visual encoding for lossless representation of visual inputs and proposes a large-scale iterative hierarchical clustering algorithm to construct a pixel-level visual codebook, thereby extending the vocabulary for unified supervision and enabling autoregressive generation of pixel-level image tokens alongside textual tokens. UVU effectively synergizes pixel-level visual perception with semantic-level visual understanding, internalizing visual generation capabilities and, for the first time, unlocking the facilitative role of visual supervision in enhancing understanding. Extensive experiments across multiple tasks demonstrate that MLLMs are capable of achieving superior multimodal understanding performance under the supervised learning paradigm of UVU.
Paperid: 3112,   Poster  
Authors: Guohao Sun, Yufei Wang, Sizhuo Ma, Yuege Xie, Yuting Cheng, ZHIQIANG TAO, Jian Wang
Title: IF-Prune: Information-Flow Guided Token Pruning for Efficient Vision-Language Models
Abstract: Vision-language models (VLMs) with dynamic resolution vision encoders achieve strong performance, but face significant efficiency challenges due to long input sequences. A common approach is to assess the importance of tokens and prune those that are less informative. Recent methods utilizing a small VLM to provide the importance map of visual tokens have outperformed existing rule-based and similarity-driven pruning approaches, particularly under high pruning ratios. However, directly using the small VLM remains unreliable, as it utilizes the aggregated visual attention weights as the importance score, which can lead to noisy guidance if the generated tokens are incorrect. To address this, we invert the approach by having the small VLM detect non-informative visual tokens according to the user's input query. By adding a variational information bottleneck in the small VLM, we can approximate the entropy of each visual token as pruning guidance. Such an a posteriori-guided pruning method allows the large VLM to retain its reasoning capacity with improved efficiency. Extensive experiments on eight benchmarks demonstrate the effectiveness of our approach. With only 5% of visual tokens retained, the large VLM preserves 95% of its original performance, outperforming the state of the art by 8%.
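Note: once a per-token score is available, pruning to a fixed retention ratio amounts to a top-k selection. The sketch below shows only that generic selection step and assumes higher scores mean more informative tokens; it does not reproduce IF-Prune's variational information bottleneck or its entropy estimate. The function name and the example scores are hypothetical.

```python
import math

def prune_tokens(scores, keep_ratio):
    """Return the indices of the tokens to keep, in their original order.

    scores: one importance value per visual token (higher = keep).
    keep_ratio: fraction of tokens to retain, e.g. 0.05 for 5%.
    """
    k = max(1, math.ceil(keep_ratio * len(scores)))  # always keep at least one
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore positional order for the language model

# Four hypothetical token scores, keeping half of the tokens.
print(prune_tokens([0.1, 0.9, 0.3, 0.8], keep_ratio=0.5))  # [1, 3]
```

Returning the kept indices in positional order preserves the original token sequence, so the retained tokens can be gathered without reordering the visual input.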
Paperid: 3113,   Poster  
Authors: Congcong Bian, HaoLong Ma, Hui Li, Zhongwei Shen, Xiaoqing Luo, Xiaoning Song, Xiaojun Wu
Title: FusionRegister: Every Infrared and Visible Image Fusion Deserves Registration
Abstract: Spatial registration across different visual modalities is a critical but formidable step in multi-modality image fusion for real-world perception. Although several methods have been proposed to address this issue, existing joint registration-fusion methods typically require extensive pre-registration operations, limiting their efficiency. To overcome these limitations, a general cross-modality registration method guided by visual priors, termed FusionRegister, is proposed for the multi-modality image fusion task. Firstly, FusionRegister achieves robustness by learning cross-modality misregistration representations rather than forcing alignment of all differences, ensuring stable outputs even under challenging input conditions. Moreover, FusionRegister demonstrates strong generality by operating directly on fused results, where misregistration is explicitly represented and effectively handled, enabling seamless integration with diverse fusion methods while preserving their intrinsic properties. In addition, its efficiency is further enhanced by using the backbone fusion method as a natural visual prior provider, which guides the registration process to focus only on regions affected by misregistration, thereby avoiding redundant operations. Extensive experiments on three datasets demonstrate that FusionRegister not only inherits the fusion quality of state-of-the-art methods, but also delivers superior detail alignment, robustness, and adaptability, making it highly suitable for any infrared and visible image fusion method. The code is available in the supplementary material.
Paperid: 3114,   Poster  
Authors: Haolin Li, Yaohua Wang, Ze Yan, Lijie Wen, Biqing Huang
Title: AgentDet: A Shared-Blackboard Multi-Agent Framework for Zero-/Few-Shot Object Detection
Abstract: Large multimodal language models have made rapid progress on vision–language tasks, yet their potential for zero-/few-shot object detection (ZSOD/FSOD) under a closed set of target classes remains underexplored. ZSOD/FSOD is hampered by data scarcity and catastrophic forgetting. Although vision–language models (VLMs) report strong numbers on several benchmarks, they typically rely on massive visual pretraining, which is misaligned with FSOD’s goal of testing generalization to novel classes under limited supervision. We introduce AgentDet, a shared-blackboard multi-agent framework that unifies ZSOD and FSOD via pseudo-incremental learning. AgentDet decouples detection into four cooperating roles (Agent-Scout, Agent-Pinner, Agent-Curator, and Agent-Judge), which collaboratively maintain a Shared Blackboard and a Knowledge Base. For efficiency, we train only Agent-Judge, updating its image encoder and LLM-based detection head, yielding a lightweight recipe that encourages generalization to previously unseen categories. On PASCAL VOC and MS COCO ZSOD/FSOD protocols, AgentDet delivers strongly competitive performance with state-of-the-art results in several settings. Ablations confirm the contributions of blackboard collaboration, safe-write policies, and the pseudo-incremental schedule.
Paperid: 3115,   Poster  
Authors: Qingdong Xu, Jiajun Zhu, Shilin Zhu, XinJingHe XinJingHe, Chao.Lu Chao.Lu, wanghuanran wanghuanran, Jiyao Zhang
Title: PatchScene: Patch-based Voxel Diffusion Model for Large-Scale Scene Completion
Abstract: We propose PatchScene, a novel diffusion-based framework for large-scale LiDAR scene completion. Unlike existing methods that rely on global latent representations or dense voxel grids, PatchScene adopts a patch-based voxel diffusion paradigm that explicitly generates fine-grained geometry within localized 3D regions. To ensure coherent reconstruction at both spatial and temporal scales, we introduce a confidence-guided spatio-temporal fusion mechanism that integrates overlapping patches and adjacent frames in a unified generative process. Furthermore, we design an Annular-Flow diffusion strategy that leverages the radial density pattern of LiDAR scans to progressively propagate high-fidelity information from near-range to far-range regions, enabling spatially unbounded scene completion. Extensive experiments on the SemanticKITTI benchmark demonstrate that PatchScene achieves state-of-the-art performance across all standard metrics, surpassing previous approaches in both geometric accuracy and temporal consistency. Remarkably, the model trained on 20 m LiDAR ranges generalizes effectively to 50 m scenes without retraining, highlighting its strong scalability and generalization capability for real-world autonomous driving applications.
Paperid: 3116,   Poster  
Authors: Yurong Gao, Zicheng Zhang, Congying Han, Tiande Guo, Xinmin QIu
Title: Resolving Endpoint Underfitting in Diffusion Bridges via Noise Alignment
Abstract: Diffusion bridge models offer a powerful framework for connecting two data distributions, such as in image restoration and translation. Many existing methods learn this bridge by mimicking the score-matching formulation of standard diffusion models. In this work, we find that this approach leads to an anomalous underfitting phenomenon near the target endpoint, as the process approaches the target distribution ($t \to 0$). This underfitting, characterized by significant drift in the predicted variance and direction, results from an excessively large discrepancy in noise levels between the network's input and its regression target. To resolve this issue, we propose the Noise-Aligned Diffusion Bridge (NADB). Our approach reformulates the diffusion bridge by first employing a mean network to provide a cleaner conditional target, and then introducing a novel, noise-aligned mapping relationship. This new formulation resolves the noise mismatch and corrects the underfitting near the target endpoint. Experimental validation across multiple image restoration and image translation tasks demonstrates the effectiveness of our approach.
Paperid: 3117,   Poster  
Authors: Kazu Mishiba
Title: Semantic Scale Space: A Framework for Controllable Image Abstraction
Abstract: Image abstraction, a fundamental component of non-photorealistic rendering (NPR), aims to simplify photographs into stylized depictions while preserving perceptually important structures. A central difficulty is selectivity: removing fine textures while preserving semantically meaningful boundaries. Existing approaches often expose only a few entangled controls, so smoothing strength and structural scale cannot be adjusted independently, which limits intuitive user control. We propose the Semantic Scale Space (SSS), a framework that organizes abstraction on two decoupled axes, abstraction strength and semantic granularity. SSS externalizes the stopping criteria by using a controllable semantic boundary detector to specify which structures act as barriers to smoothing, independently of how strongly homogeneous regions are simplified. We instantiate SSS with Adaptive Granularity Scheduling Smoothing (AGSS), which combines a donor-gated diffusion operator with a fine-to-coarse granularity schedule, and we introduce an effect-matched evaluation protocol based on a Region Homogeneity Index that compares methods at matched smoothing levels. On SBD and DIV2K, AGSS achieves higher boundary preservation and lower geometric drift than strong baselines at the same degree of smoothing, and a user study shows that its abstractions are consistently preferred in downstream NPR pipelines. These results demonstrate that SSS and AGSS provide practical, controllable image abstraction for modern creative applications.
Paperid: 3118,   Poster  
Authors: Huijie Fan, Pengrui huang, Qiang Wang, Baojie Fan, Jiahua Dong, Liangqiong Qu
Title: STUR3D: Spatio-Temporal Unified Representation Learning for 3D Object Detection
Abstract: Surrounding-view 3D object detection is a fundamental task in autonomous driving, which aims to locate 3D objects from multiple camera views. Existing methods predominantly follow a 2D-to-3D pipeline, leveraging 2D detectors to enhance 3D detection performance. However, these methods ignore the inherent disparities in both temporal and feature dimensional representations between 2D and 3D detection, resulting in positional deviations in 3D space. Furthermore, the absence of temporal information in 2D detection leads to object omission in occluded scenarios. To address these limitations, we propose STUR3D, a unified framework that builds spatio-temporal alignment between 2D and 3D perception. First, we project historical 3D detection features onto the 2D image plane, guiding the 2D detector to distill the requisite representations for 3D detection, thereby harmonizing feature representations across different dimensional spaces. Second, we integrate temporal information into 2D detection to establish temporal coherence and unify spatio-temporal reasoning across both paradigms, which yields more robust and accurate 3D detection in dynamic scenes. Additionally, we integrate depth cues into feature encoding to guide the lifting of 2D detections into 3D queries, suppressing their inherent biases. Extensive experiments on the nuScenes benchmark demonstrate the effectiveness of our framework, and STUR3D achieves state-of-the-art results of 57.9% mAP and 64.6% NDS on the nuScenes test set.
Paperid: 3119,   Poster  
Authors: Ayesh Abu Lehyeh, Xiaohan Zhang, Ahmad Arrabi, Waqas Sultani, Chen Chen, Safwan Wshah
Title: GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction
Abstract: Accurate and fast localization is vital for safe autonomous navigation in GPS-denied areas. Fine-Grained Cross-View Geolocalization (FG-CVG) aims to estimate the precise 2-Degree-of-Freedom (2-DoF) location of a ground image relative to a satellite image. However, current methods force a difficult trade-off, with high-accuracy models being too slow for real-time use. In this paper, we introduce GeoFlow, a lightweight and highly efficient framework that breaks this accuracy-speed trade-off. Our technique learns a direct probabilistic mapping, predicting the displacement (in distance and direction) required to correct any given location hypothesis. This is complemented by our novel inference algorithm, Iterative Refinement Sampling (IRS). Instead of trusting a single prediction, IRS refines a population of hypotheses, allowing them to iteratively 'flow' from random starting points to a robust, converged consensus. Despite its iterative nature, this approach offers flexible inference-time scaling, allowing a direct trade-off between performance and computation without any re-training. Experiments on the KITTI and VIGOR datasets show that GeoFlow achieves state-of-the-art efficiency, running at real-time speeds of 29 FPS while maintaining competitive localization accuracy. This work opens a new path for the development of practical real-time geolocalization systems.
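As an illustration only, the population-based refinement described in the GeoFlow abstract can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: `predict_displacement` is a hypothetical stand-in for the learned model (it simply points at a known target with added noise), and taking the per-axis median as the "consensus" is an assumption, since the abstract does not say how the consensus is formed.

```python
import math
import random
import statistics


def predict_displacement(hypothesis, target, rng, noise=0.5):
    """Hypothetical stand-in for the learned model.

    Returns a noisy (distance, direction) that would move `hypothesis`
    toward `target`. A real system would predict this from image features.
    """
    dx = target[0] - hypothesis[0]
    dy = target[1] - hypothesis[1]
    distance = math.hypot(dx, dy) + rng.gauss(0.0, noise)
    direction = math.atan2(dy, dx) + rng.gauss(0.0, 0.1)
    return distance, direction


def iterative_refinement(target, num_hypotheses=64, num_steps=5, extent=50.0, seed=0):
    """Refine a population of random 2-DoF hypotheses and return a consensus."""
    rng = random.Random(seed)
    # Start from random locations in a square of side 2 * extent.
    hypotheses = [
        (rng.uniform(-extent, extent), rng.uniform(-extent, extent))
        for _ in range(num_hypotheses)
    ]
    for _ in range(num_steps):
        moved = []
        for x, y in hypotheses:
            dist, ang = predict_displacement((x, y), target, rng)
            # Apply the predicted correction to this hypothesis.
            moved.append((x + dist * math.cos(ang), y + dist * math.sin(ang)))
        hypotheses = moved
    # Assumed consensus rule: per-axis median of the refined population.
    return (
        statistics.median(h[0] for h in hypotheses),
        statistics.median(h[1] for h in hypotheses),
    )


estimate = iterative_refinement(target=(12.0, -7.0))
```

In this sketch, raising `num_hypotheses` or `num_steps` spends more computation for a steadier estimate, which mirrors the inference-time scaling the abstract mentions.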
Paperid: 3120,   Poster  
Authors: Yuqi Lin, Hao Zhang, Wenqi Shao, Shiqu Liu, Zhihong Gu, Wenxiao Wang, Xiaofei He, Kaipeng Zhang
Title: MatchMask: Mask-Centric Generative Data Augmentation for Label-Scarce Semantic Segmentation
Abstract: Current semantic segmentation models are very data-hungry and require massive costly pixel-wise human annotations. Generative data augmentation, which scales the train set using generative models, provides a potential remedy. In this paper, we propose MatchMask, a novel mask-centric generative data augmentation approach tailored for label-scarce semantic segmentation. By leveraging a limited set of labeled semantic masks, MatchMask generates diverse, realistic, and well-aligned image-mask pairs, thereby enhancing the performance of semantic segmentation models. Specifically, to adapt existing text-to-image models for semantic image synthesis in the few-shot setting, we first propose a Gradient Probe Method to investigate the role of each layer in the diffusion model. On this basis, a lightweight LoRA-style adapter is designed for critical layers to enable efficient adaptation, coupled with a Layer-adaptive Cross-attention Fusion mechanism. Meanwhile, we present a robust relative filtering principle to suppress incorrectly synthesized regions. Moreover, the proposed approach is extended to MatchMask++ in the semi-supervised setting to take advantage of additional unlabeled data. Experimental results on PASCAL VOC, COCO and ADE20K demonstrate that MatchMask remarkably enhances the performance of segmentation models, surpassing prior data augmentation techniques in various benchmarks, e.g., 67.5%->74.3% mIoU on PASCAL VOC. Our code will be made publicly available.
Paperid: 3121,   Poster  
Authors: Leezy Han, SEUNGGYU KIM, Dongseok Shim, Hyeonbeom Lee
Title: PR Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency
Abstract: Monocular depth estimation (MDE) has been widely adopted in the perception systems of autonomous vehicles and mobile robots. However, existing approaches often struggle to maintain temporal consistency in depth estimation across consecutive frames. This inconsistency not only causes flickering artifacts in the resulting depth maps but may also lead to estimation failures when the depth range changes abruptly due to camera motion. To address these challenges, this paper proposes a consistency-aware monocular depth estimation framework that leverages wheel odometry information from a mobile robot to achieve stable and coherent depth predictions over time. Specifically, we develop pose estimation and sparse depth estimation modules based on optical flow computed from consecutive image frames. The resulting sparse depth estimates are then used to rescale and refine the relative depth predicted by a pre-trained depth estimation foundation model. Through this refinement process, our method produces dense and temporally consistent depth maps. The proposed method is evaluated on the KITTI, TartanAir, MS2, and our self-collected datasets, demonstrating robust and accurate performance in both pose estimation and depth prediction tasks.
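The rescaling step mentioned in the PR Depth abstract can be illustrated with a minimal sketch. It assumes a single global scale estimated as the median ratio between sparse metric depths and the relative depth at the same pixels; the abstract does not specify the actual rescaling or refinement rule, and the function name and data layout here are hypothetical.

```python
import statistics


def rescale_relative_depth(relative_depth, sparse_depth):
    """Align a relative depth map to metric scale using sparse samples.

    relative_depth: 2D list of positive relative depth values.
    sparse_depth:   dict mapping (row, col) -> metric depth at that pixel.
    Returns (scaled_depth_map, scale). Median-ratio alignment is an
    assumption made for this sketch.
    """
    ratios = [
        metric / relative_depth[r][c]
        for (r, c), metric in sparse_depth.items()
        if metric > 0 and relative_depth[r][c] > 0
    ]
    if not ratios:
        raise ValueError("no valid sparse depth samples")
    # The median keeps the scale robust to a few bad sparse estimates.
    scale = statistics.median(ratios)
    scaled = [[scale * value for value in row] for row in relative_depth]
    return scaled, scale


relative = [[1.0, 2.0], [3.0, 4.0]]
sparse = {(0, 0): 2.0, (0, 1): 4.4, (1, 1): 8.0}
scaled, scale = rescale_relative_depth(relative, sparse)
```

Applying the same estimate frame by frame is one simple way a sparse, pose-derived signal could keep the scale of consecutive depth maps from drifting.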
Paperid: 3122,   Poster  
Authors: Yuzhou Liu, Lingjie Zhu, Hanqiao Ye, Yujun Liu, Shangfeng Huang, Xiang Gao, Ruisheng Wang, Shuhan Shen
Title: BuildingGPT: Auto-Regressive Building Wireframe Reconstruction Model with Reinforcement Learning
Abstract: In this paper, we propose BuildingGPT, a novel auto-regressive model for building wireframe reconstruction from point clouds with reinforcement learning. Unlike prior works based on detection or diffusion models, BuildingGPT reformulates the building wireframe reconstruction task into a sequence prediction problem. Based on the hierarchical building wireframe tokenization, the wireframe sequences are organized in a structurally- and semantically-aware order for the next-token prediction. The point cloud encoder first transforms the input point cloud into a fixed-length latent code that serves as the start of the sequence. Then, BuildingGPT auto-regressively predicts tokens conditioned on the latent code and previously generated tokens. Once the token sequence is predicted, the building wireframe is obtained through detokenization. To enhance the model performance, we adopt a two-stage training paradigm comprising pre-training and post-training. After the auto-regressive pre-training, Direct Preference Optimization (DPO) is employed as a post-training strategy to align reconstruction results with human preferences. Extensive experiments on the large-scale MunichWF dataset show that BuildingGPT outperforms existing state-of-the-art methods. We commit to releasing the code and dataset.
Paperid: 3123,   Poster  
Authors: Xuecong Liu, Mengzhu Ding, Zixuan Sun, Zhang Li, Xichao Teng
Title: CRFT: Consistent–Recurrent Feature Flow Transformer for Cross-Modal Image Registration
Abstract: We present Consistent–Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework that learns feature flow for robust cross-modal registration. CRFT learns a modality-consistent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. To enhance geometric adaptability, an iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coherence across modalities. Extensive experiments on diverse cross-modal datasets demonstrate that CRFT consistently outperforms state-of-the-art registration methods in both accuracy and robustness. Beyond registration, CRFT provides a generalizable paradigm for multimodal spatial correspondence, offering broad applicability to remote sensing, autonomous navigation, and medical imaging.
Paperid: 3124,   Poster  
Authors: Jungwook Seo, Minjeong Kim, Younkwan Lee, Seungho Shin, Sungyong Baik
Title: Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization
Abstract: Detecting subtle visual anomalies in images remains challenging, particularly when only normal samples are available a priori. Such unsupervised anomaly detection is typically solved by measuring feature similarity of a query patch to a memory of normal patches. However, similarity alone does not reveal how strongly a query patch violates the structure of the normal feature manifold. We propose a training-free Laplacian graph energy optimization formulation, named ANoCo, that scores Anomaly by the cost of Non-Conformity of a query patch to align with a fixed normal manifold. For each query patch, we construct a bipartite query-to-normal graph weighted by cosine affinity, explicitly removing query-query and normal-normal edges to prevent evidence dilution. We formulate anomaly scoring as a convex Laplacian energy with anchored normal nodes, and solve it in closed form. In particular, we do not use the optimized features themselves; instead, the anomaly score is the magnitude of the update required to satisfy normality constraints, reframing the graph Laplacian as a non-conformity operator rather than a smoothing prior. The proposed method introduces no learnable parameters, message passing, or sampling, and has complexity comparable to a single linear solve. Across standard benchmarks, it delivers strong image-level AUROC, stable localization maps, and improved robustness over prior methods, demonstrating the effectiveness of using optimization-induced feature drift as an anomaly measure.
Paperid: 3125,   Poster  
Authors: Xueliang Cui, Juncai Zhang, Jiacheng Hou, Dan Lu, Hao Zhang, Ruxin Wang
Title: BiomedCCPL: Causal Conditional Prompt Learning for Biomedical Vision-Language Models
Abstract: Vision-language models (VLMs) have demonstrated strong potential for adapting to downstream biomedical tasks with limited training samples. However, their generalization to unseen classes within the same dataset remains limited, as the image–text alignment semantics often rely on spurious cues present in seen classes that do not transfer. To tackle this, we propose BiomedCCPL (Causal Conditional Prompt Learning), a framework that uses VGAP (Visual Grounder with Adaptive Prototype) to generate image-conditional prompts from multi-scale adaptive prototypes and employs SCD (Synergistic Causal Disentanglement) to regularize the generation of image-conditional prompts. Guided by insights from a causal analysis of generalization to unseen classes, SCD leverages multiple synergistic learning objectives to perform front-door adjustment, ensuring that the dynamically generated image-conditional prompts focus on underlying diagnostic image features shared across seen and unseen classes. Experiments on 11 datasets across 9 modalities demonstrate that BiomedCCPL effectively enhances the model's data efficiency and generalization ability. In particular, on the Base-to-Novel task, BiomedCCPL achieves an average HM of 79.98%, surpassing the previous state-of-the-art by 6.45%.
Paperid: 3126,   Poster  
Authors: Yue Wu, Tao Peng, Yongzhe Yuan, Kaiyuan Feng, Hao Li, Maoguo Gong, Qiguang Miao, Wenping Ma
Title: SRGCD: Stability-Driven Region Growth Framework for 3D Change Detection
Abstract: With the growing accessibility of large-scale 3D point clouds from LiDAR and photogrammetric techniques, 3D change detection (3DCD) has become essential for understanding dynamic scenes. Existing methods typically formulate this as segmentation, treating each point independently for binary classification. This leads to isolated misclassified noise points inside regions. Meanwhile, feature similarity at boundaries causes boundary ambiguity. The more severe class imbalance inherent to change detection further exacerbates this issue. To address these challenges, we propose SRGCD, a Stability-Driven Region Growth Framework that redefines 3DCD as region growing rather than segmentation. Our key insight is that progressively expanding from highly confident seeds avoids pitfalls of point-wise classification while elegantly alleviating class imbalance. Specifically, we first apply strict constraints through a Mutual Geometric Consistency Prior to identify minimal highly reliable unchanged seeds. From these seeds, Stability-Guided Controlled Attention modules progressively propagate stability from stable regions to neighboring uncertain points, enabling unchanged regions to grow layer-by-layer from interior cores toward boundaries. This coarse-to-fine growing process naturally forms coherent regions, avoiding isolated noise while achieving compact, well-defined boundaries through progressive expansion. Extensive experiments on the synthetic dataset Urb3DCD and the real-world dataset HKCD demonstrate that SRGCD achieves state-of-the-art performance, significantly improving interior completeness and boundary compactness compared with existing methods.
Paperid: 3127,   Poster  
Authors: shikun zhang, Yong Li, Yiqun Wang, Qiuhong Ke, Cunjian Chen
Title: Fresco: Frequency–Spatial Consistent Optimization for Fine-Grained Head Avatar Modeling
Abstract: We propose Fresco, a unified optimization paradigm designed to mitigate early over-sharpening and cross-view drifting in head avatar reconstruction. Fresco combines a Laplacian-pyramid-based frequency curriculum with UV-space consistency regularization to progressively enhance reconstruction quality. The optimization begins by stabilizing low-frequency appearance in the image domain, which suppresses spurious details and promotes reliable convergence. As learning proceeds, consistency across different viewpoints is reinforced through pixel-level alignment on shared UV texture coordinates. Finally, high-frequency components are refined under explicit frequency-band constraints, and seam boundary regularization is applied to preserve local continuity. By optimizing in a frequency- and UV-aligned space, Fresco achieves robust convergence without pseudo high-frequency artifacts and yields consistent, high-fidelity results across views. Experiments on the NeRSemble dataset validate the effectiveness of our design. Our method outperforms previous state-of-the-art methods while avoiding additional training overhead through frequency scheduling and UV-bake caching.
Paperid: 3128,   Poster  
Authors: Merve Gulle, junno yun, Yasar Utku Alcalar, Mehmet Akcakaya
Title: PnP-CM: Consistency Models as Plug-and-Play Priors for Inverse Problems
Abstract: Diffusion models have found extensive use in solving numerous inverse problems. Such diffusion inverse problem solvers aim to sample from the posterior distribution of data given the measurements, using a combination of the unconditional score function and an approximation of the posterior related to the forward process. Recently, consistency models (CMs) have been proposed to directly predict the final output from any point on the diffusion ODE trajectory, enabling high-quality sampling in just a few NFEs. CMs have also been utilized for inverse problems, but existing CM-based solvers either require additional task-specific training or utilize data fidelity operations with slow convergence, not amenable to large-scale problems. In this work, we reinterpret CMs as proximal operators of a prior, enabling their integration into plug-and-play (PnP) frameworks. We propose a solver based on PnP-ADMM, which enables us to leverage the fast convergence of the conjugate gradient method. We further accelerate this with noise injection and momentum, dubbed PnP-CM, and show it maintains the convergence properties of the baseline PnP-ADMM. We evaluate our approach on a variety of inverse problems, including inpainting, super-resolution, Gaussian deblurring, and magnetic resonance imaging (MRI) reconstruction. To the best of our knowledge, this is the first CM trained for MRI datasets. Our results show that PnP-CM achieves high-quality reconstructions in as few as 4 NFEs, and can produce meaningful results in 2 steps, highlighting its effectiveness in real-world inverse problems while outperforming comparable CM-based approaches.
Paperid: 3129,   Poster  
Authors: Heng Li, Xingyuan Wang, Yang Fan, Yunan Zhang, Xiangping Wu, Qingcai Chen
Title: MMDIR: Multimodal Instruction-Driven Framework for Mixed-Degradation Document Image Restoration
Abstract: Restoring degraded document images is essential for both improving visual quality and optimizing performance in downstream document analysis tasks. Although existing methods have demonstrated substantial improvements in restoration outcomes, they primarily address single-type degradation scenarios. Current approaches typically necessitate training multiple specialized models for specific degradation types or rely on explicit prior knowledge of degradation patterns to guide the training process. To overcome these limitations, we propose MMDIR, a multimodal instruction-driven framework designed for document image restoration under mixed and uncertain degradation conditions. By leveraging semantically structured instructions, MMDIR dynamically identifies present degradation types (blur, shadow, text watermark, and seal), while enhancing degradation-aware representation learning. Furthermore, we introduce a novel benchmark named MixedDoc comprising complex mixed degradations, where each image contains randomized combinations of the aforementioned types. This benchmark addresses a critical gap in existing datasets, which lack realistic multi-degradation samples and often overlook common obstructions such as seals and text watermarks. The effectiveness of our approach is thoroughly validated across both released public benchmarks and our newly proposed dataset.
Paperid: 3130,   Poster  
Authors: Naiyu Yin, Hanjing Wang, Yue Yu, Tian Gao, Amit Dhurandhar, Chung-Hao Lee, Qiang Ji
Title: CGU-Bayes: Causal Graph Uncertainty-Guided Bayesian Inference for Domain Generalization
Abstract: Causal graphs play a crucial role in AI research as they reveal the data generation processes underlying real-world machine learning and computer vision tasks. Recent studies have leveraged causal graphs to develop more robust and interpretable models. However, limited or biased data often lead to inaccurate causal graph estimation, reducing a model’s transferability to unseen domains. To address this challenge, we propose a novel framework that performs Bayesian inference over causal graphs to capture potential underlying causal relations and identify invariant causal features for domain generalization (DG) prediction. The key advantage of our framework lies in its ability to quantify causal graph uncertainty in the context of prediction tasks and incorporate it into the prediction process. Our proposed uncertainty provides valuable insights into (i) the reliability of our method on specific datasets, (ii) the alignment between learned causal graphs and unseen test domains, and (iii) the confidence of our predictions. In particular, we go beyond merely quantifying uncertainty and leverage it as weighting factors in a weighted Bayesian inference scheme. Empirical results on multiple benchmark distribution-shift datasets show that our algorithm, Causal Graph Uncertainty-guided Bayesian Inference (CGU-Bayes), outperforms existing DG methods on challenging datasets and achieves state-of-the-art performance overall.
Paperid: 3131,   Poster  
Authors: Zilong Wang, Xiang Zheng, Xiaosen Wang, Bo Wang, Xingjun Ma
Title: GenBreak: Red Teaming Text-to-Image Generation Using Large Language Models
Abstract: Text-to-image (T2I) models such as Stable Diffusion have advanced rapidly and are now widely used in content creation. However, these models can be misused to generate harmful content, including nudity or violence, posing significant safety risks. While most platforms employ content moderation systems, underlying vulnerabilities can still be exploited by deliberate adversaries. Recent research on red-teaming and adversarial attacks against T2I models faces a critical limitation: existing methods struggle to balance prompt stealthiness with high toxicity in generated images. Some studies successfully generate highly toxic images but use adversarial prompts that are easily detected and blocked by safety filters, while others focus on bypassing safety mechanisms but fail to produce genuinely harmful outputs, neglecting the discovery of truly high-risk prompts. Consequently, there remains a lack of reliable tools for evaluating the safety of defended T2I models. To address this gap, we propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities in T2I generators. Our approach combines supervised fine-tuning on curated datasets with reinforcement learning via interaction with a surrogate T2I model. By integrating multiple reward signals, we guide the LLM to craft adversarial prompts that enhance both evasion capability and image toxicity, while maintaining semantic coherence and diversity. These prompts demonstrate strong effectiveness in black-box attacks against commercial T2I generators, revealing practical safety weaknesses.
Paperid: 3132,   Poster  
Authors: Ziyao He, Yingjie Liu, ZhangYangRui ZhangYangRui, Mingsong Chen, Xuan Tang, Xian Wei
Title: Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding
Abstract: Accurate 3D scene description is fundamental to robotic navigation and augmented reality, yet current dense captioning methods face significant limitations in processing sparse point cloud data. Existing approaches that apply Euclidean embedding spaces struggle to simultaneously preserve fine-grained local geometric details and model exponentially growing global semantic hierarchies, leading to either inaccurate localization or disjointed, shallow scene descriptions. In this work, we propose a novel Curvature-Aware Captioning framework, integrating novel non-Euclidean geodesic attention mechanisms, to resolve the localization-contextualization conflict. Specifically, self-attention within Oblique space enforces dimensional homogeneity while establishing long-range dependencies. Bidirectional geodesic cross-attention within Lorentz space models hierarchical semantic relationships across scene instances, enabling simultaneous precision in object localization and coherence in scene descriptions. Theoretical analysis confirms that the curvature complementarity between the Oblique manifold and Lorentz hyperboloid resolves the Euclidean-hyperbolic conflict, ensuring feature stability via isotropic optimization while preserving inherent hierarchical relationships. Extensive experiments on ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, with significant gains in both localization accuracy and descriptive richness.
Paperid: 3133,   Poster  
Authors: Masafumi Mori, Shinya Gongyo, Mitsuru Ambai
Title: Rethinking Asymmetric Quantization: Hidden Symmetry in Vision Model Weights
Abstract: Post-training quantization (PTQ) enables rapid deployment of deep pretrained models. In the low-bit regime, recent PTQ methods for vision models adopt asymmetric quantization (AsymQ), introducing zero-point offsets to mitigate quantization errors. However, these offsets impose substantial hardware overhead and fail to fully capture the non-symmetric structure of pretrained weight distributions, leaving many quantization levels unused. In this paper, we reveal a hidden symmetry in the pretrained weights: after removing a few sparse outliers, the distribution becomes nearly symmetric. Accordingly, we propose Dense and Additive Sparse Quantization (DASQ), which decomposes the weights into dense and sparse matrices. The dense component captures the symmetric structure around zero, while the sparse component models the removed outliers; both can be processed in parallel and implemented with efficient zero-point-free computation. Experiments on image classification, object detection, and instance segmentation show that DASQ surpasses state-of-the-art PTQ methods with lower BOPs. On an FPGA, DASQ also demonstrates higher accuracy and lower power consumption than AsymQ at comparable throughput.
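To make the dense-plus-sparse idea in the DASQ abstract concrete, here is a minimal sketch under stated assumptions. It is not the authors' algorithm: outliers are chosen as the largest-magnitude entries (`num_outliers` is a hypothetical parameter), the sparse part is kept in full precision, and the dense part uses plain symmetric (zero-point-free) uniform quantization.

```python
def dense_sparse_quantize(weights, num_outliers=1, bits=4):
    """Split a flat weight list into quantized dense codes plus sparse outliers.

    Returns (codes, scale, sparse) such that
    weights[i] is approximated by codes[i] * scale + sparse.get(i, 0.0).
    Outlier selection by magnitude is an assumption of this sketch.
    """
    # Indices of the largest-magnitude weights are treated as sparse outliers.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    outlier_idx = set(order[:num_outliers])
    sparse = {i: weights[i] for i in outlier_idx}
    # The dense part has the outliers zeroed out.
    dense = [0.0 if i in outlier_idx else w for i, w in enumerate(weights)]

    # Symmetric quantization: integer codes in [-qmax, qmax], no zero point.
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in dense)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in dense]
    return codes, scale, sparse


def reconstruct(codes, scale, sparse):
    """Dequantize the dense codes and add the sparse outliers back."""
    out = [c * scale for c in codes]
    for i, w in sparse.items():
        out[i] += w
    return out


weights = [0.1, -0.2, 0.3, -0.35, 5.0, 0.05]
codes, scale, sparse = dense_sparse_quantize(weights)
approx = reconstruct(codes, scale, sparse)
```

In this toy input, the single outlier 5.0 would otherwise stretch the quantization range and collapse most small weights onto a few levels; removing it lets the 4-bit grid cover the remaining, roughly symmetric values.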
Paperid: 3134,   Poster  
Authors: Miaowei Wang, Qingxuan Yan, Zhi Cao, Yayuan Li, Oisin Mac Aodha, Jason Corso, Amir Vaxman
Title: BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation
Abstract: Text-guided dynamic 3D character generation has advanced rapidly, yet producing high-quality motion that faithfully reflects rich textual descriptions remains challenging. Existing methods tend to generate limited sub-actions or incoherent motion due to fixed-length temporal inputs and discrete frame-wise representations that fail to capture rich motion semantics. We address these limitations by representing motion with continuous differentiable B-spline curves, enabling more effective motion generation without modifying the capabilities of the underlying generative model. Specifically, our closed-form, Laplacian-regularized B-spline solver efficiently compresses variable-length motion sequences into compact representations with a fixed number of control points. Further, we introduce a normal-fusion strategy for input shape adherence along with correspondence-aware and local-rigidity losses for motion-restoration quality. To train our model, we collate BIMO, a new dataset containing diverse variable-length 3D motion sequences with rich, high-quality text annotations. Extensive evaluations show that our feed-forward framework BiMotion generates more expressive, higher-quality, and better prompt-aligned motions than existing state-of-the-art methods, while also achieving faster generation. The code is available in the supplemental material and will be made publicly available upon publication.
Paperid: 3135,   Poster  
Authors: Guanghui Shi, Xuefeng Liang, Qixiang Wen
Title: Balanced Dataset Distillation via Modeling Multiple Visual Pattern Distribution
Abstract: Dataset Distillation (DD) aims to compress large-scale datasets into a small number of condensed Images Per Class (IPC), enabling efficient network training. Previous core-set selection and synthetic-based DD methods achieve reasonable performance. However, our in-depth investigation reveals that existing methods share a common issue: pattern imbalance. Specifically, they either overemphasize class-general patterns representing the majority of each class or focus on fewer marginal patterns critical for model generalization. To address this issue, we propose a novel framework, Balanced Patterns Selection (BPS). Unlike prior methods that assume each class forms a single cluster, BPS models the multiple visual pattern distribution within each class via a hierarchical semantic structure inherent to the dataset. It then selects two complementary subsets in a balanced manner from the center (class-general patterns) and the margins (marginal patterns) of each pattern, producing a pattern-balanced coreset. Theoretically, we prove that the BPS-selected coreset aligns with the original dataset in both distribution and performance. Moreover, its model-agnostic selection nature ensures cross-architecture generalization, while the Optimize-Once-for-All-IPCs property guarantees efficiency. Extensive experiments on four benchmarks demonstrate that BPS significantly outperforms existing state-of-the-art methods.
Paperid: 3136,   Poster  
Authors: Li-Jun Zhao, Zhen-Duo Chen, Xin Luo, Xin-Shun Xu
Title: From Few-way to Many-way: Rethinking Few-shot Fine-grained Image Classification
Abstract: Few-shot fine-grained image classification (FSFG) aims to recognize novel fine-grained categories from only a few labeled samples. Existing FSFG methods primarily focus on fine-grained feature extraction and modeling query–support interactions within training episodes containing a small number of classes. Relying on the episodic training strategy, these methods typically assume that the capabilities learned on training samples can directly transfer to evaluation episodes with a few novel classes (few-way). However, in more practical and challenging scenarios involving many novel classes (many-way), existing approaches lack a reliable and global characterization of the feature space, making it difficult for episodic adaptation alone to generalize effectively. In this paper, we pioneer a theoretical analysis of novel class behavior in FSFG and derive a class discriminative index bound. Guided by this analysis, we propose a novel SCEG method that incorporates Self and Collaborative feature extraction as well as Episodic and Global feature space optimization. Extensive experiments demonstrate that our method consistently and significantly outperforms existing methods under both conventional few-way and the new many-way settings.
Paperid: 3137,   Poster  
Authors: Sara Sabour, Richard Tucker, Marcus A. Brubaker, Saurabh Saxena, Junhwa Hur, Andrea Tagliasacchi, Deqing Sun, David J. Fleet, Richard Szeliski, Noah Snavely
Title: ORBIT: Benchmarking SfM in the Wild with 360° Video
Abstract: Structure-from-Motion (SfM) is a cornerstone of 3D perception, yet current methods often fail when applied to complex videos involving challenging camera motions or dynamic scenes. Compounding the problem, the field lacks reliable ground-truth benchmarks for such difficult scenarios, making it hard to gauge real-world progress or pinpoint where improvements are most needed. To address this gap, we introduce a new benchmark for evaluating camera pose estimation. Our key insight is to leverage online panoramic 360° videos as a source of data from which to construct challenging clips, while still enabling robust ground-truth trajectory recovery. The panoramic nature of these videos provides richer visual context for tracking camera motion, even when parts of the view are affected by blur, motion, or dynamic objects. By tracking camera motion across full 360° videos, we crop and reproject selected portions to generate perspective-view clips that serve as our benchmark, ORBIT, a diverse collection of 100 video clips. Experiments show that COLMAP and other state-of-the-art SfM methods struggle to accurately estimate camera positions on our benchmark, indicating that it remains a challenging and open problem space for future research. As a result, ORBIT provides a valuable testbed where researchers can meaningfully compete and measure progress on truly challenging, real-world SfM problems.
Paperid: 3138,   Poster  
Authors: Chuan Mao, Haoqi Yuan, Ziye Huang, Chaoyi Xu, Kai Ma, Zongqing Lu
Title: DemoFunGrasp: Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning
Abstract: Reinforcement learning (RL) has achieved great success in dexterous grasping, significantly improving grasp performance and generalization from simulation to the real world. However, fine-grained functional grasping, which is essential for downstream manipulation tasks, remains underexplored and faces several challenges: the complexity of specifying goals and reward functions for functional grasps across diverse objects, the difficulty of multi-task RL exploration, and the challenge of sim-to-real transfer. In this work, we propose DemoFunGrasp for universal dexterous functional grasping. We factorize functional grasping conditions into two complementary components (grasping style and affordance) and integrate them into an RL framework that can learn to grasp any object with any functional grasping condition. To address the multi-task optimization challenge, we leverage a single grasping demonstration and reformulate the RL problem as one-step demonstration editing, substantially enhancing sample efficiency and performance. Experimental results in both simulation and the real world show that DemoFunGrasp generalizes to unseen combinations of objects, affordances, and grasping styles, outperforming baselines in both success rate and functional grasping accuracy. In addition to strong sim-to-real capability, by incorporating a vision-language model (VLM) for planning, our system achieves autonomous instruction-following grasp execution.
Paperid: 3139,   Poster  
Authors: Wei Feng, Yiwen Jiang, Sijin Zhou, Zhuang Qi, Zhongxing Xu, Zhonghua Wang, Feilong Tang, Zongyuan Ge
Title: Seeing Through the Shift: Causality-Inspired Robust Generalized Category Discovery
Abstract: Generalized Category Discovery (GCD) aims to transfer knowledge from known categories to automatically discover new, unseen ones while preserving recognition of the known classes. Despite recent progress, existing GCD approaches typically assume that all data are drawn from the same distribution, which is rarely valid in real-world scenarios. In practice, data often experience simultaneous domain shifts and novel category emergence, causing severe performance degradation of existing systems. To address this challenge, we propose CausalGCD, a causality-inspired framework designed to mitigate domain-shift bias in category discovery. Specifically, we first analyze the causal graph to uncover the relationships among key variables in cross-domain GCD. We then introduce the concept of causal dependency risk and propose a Causal Dependency Risk Estimator to capture causal semantics, further deriving a theoretically computable upper bound to optimize this risk under cross-domain GCD settings. Furthermore, we propose a Causal Geometric Manifold Constraint that enforces invariant manifold-level associations between known and unknown categories across domains, thereby facilitating robust discovery of novel classes. Extensive experiments on the SSB-C and DomainNet benchmarks demonstrate the effectiveness of CausalGCD and highlight the significance of causal reasoning in open-world category discovery.
Paperid: 3140,   Poster  
Authors: Yaozong Zheng, Qihua Liang, Bineng Zhong, Shuimu Zeng, Yuanliang Xue, Ning Li, Shuxiang Song
Title: Boosting Self-Supervised Tracking with Contextual Prompts and Noise Learning
Abstract: Learning robust contextual knowledge from unlabeled videos is essential for advancing self-supervised tracking. However, conventional self-supervised trackers lack effective context modeling, while existing context association methods based on non-semantic queries struggle to adapt to unlabeled tracking scenarios, making it difficult to learn reliable contextual cues. In this work, we propose a novel self-supervised tracking framework, named \tracker, which introduces a dual-modal context association mechanism that jointly leverages fine-grained semantic prompts and contextual noise to drive the model toward learning robust tracking representations. Adhering to the easy-to-hard learning principle, our contextual association mechanism operates in two stages. During early training, instance patch tokens (prompts) are assigned to both forward and backward tracking branches to facilitate the acquisition of tracking knowledge. As training progresses, contextual noise is gradually injected into the model to perturb features, encouraging the tracker to learn robust tracking representations in a more complex feature space. Thus, this novel contextual association mechanism enables our self-supervised model to learn high-quality tracking representations from unlabeled videos, while being applied exclusively during training to preserve efficient inference. Extensive experiments on multiple tracking benchmarks demonstrate the superiority of our method, achieving SOTA performance.
Paperid: 3141,   Poster  
Authors: Haidong Wu, Snehal Bhayani, Janne Heikkilä
Title: Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation
Abstract: Estimating camera geometry typically involves solving minimal problems formulated as systems of multivariate polynomial equations, which often pose computational challenges when using existing Gröbner-basis or resultant-based methods due to the matrix inversion needed in the online solver. Here we propose a sampling-based, matrix inversion-free method that constructs the solvers using sparse hidden-variable resultants. The determinant polynomial in the hidden variable is efficiently reconstructed via inverse fast Fourier transform interpolation from sampled evaluations, avoiding symbolic expansion. Solving this polynomial yields the hidden variable, and the remaining unknowns are recovered by identifying rank-1 deficient submatrices and applying Cramer's rule. A greatest common divisor-based criterion ensures robust submatrix identification under noise. Experiments on diverse minimal problems demonstrate that the proposed solver achieves strong numerical stability and competitive runtime, particularly for small-scale problems, providing a practical alternative to traditional Gröbner-basis and resultant-based solvers.
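The interpolation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' solver: the 2×2 matrix `toy_matrix`, the helper names, and the hand-written radix-2 FFT are all assumptions made for illustration. The idea shown is only the generic one: evaluate the determinant of a polynomial matrix at roots of unity, then apply an inverse FFT to the samples to recover the coefficients of the determinant polynomial without expanding it symbolically.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    sign = 1 if inverse else -1
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def det_coefficients(matrix_at, n_samples):
    """Recover coefficients c_0..c_{N-1} of det(M(z)) from N samples.

    Sampling at z_k = exp(-2*pi*i*k/N) makes the samples equal to the
    forward DFT of the coefficient vector, so an inverse FFT recovers it.
    n_samples must be a power of two larger than the polynomial degree.
    """
    samples = [
        det2(matrix_at(cmath.exp(-2j * cmath.pi * k / n_samples)))
        for k in range(n_samples)
    ]
    return [c / n_samples for c in fft(samples, inverse=True)]


def toy_matrix(z):
    """Hypothetical matrix, linear in the hidden variable z."""
    return [[1 + z, 2], [3, 4 - z]]


coeffs = det_coefficients(toy_matrix, 4)
```

For this toy matrix the determinant is -2 + 3z - z², so `coeffs` should be close to -2, 3, -1, 0 up to floating-point error. A real resultant matrix would be larger and would need a general determinant routine in place of `det2`.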
Paperid: 3142,   Poster  
Authors: Weibo Shu, Antoni B. Chan
Title: Adapting Lightweight Image-based Counting Models for Video Crowd Counting
Abstract: Video crowd counting aims to predict the people count in each frame of a video. It requires effectively leveraging spatiotemporal (ST) information in videos while satisfying real-time constraints. However, most existing methods use ST information from neighboring frames through auxiliary extraction and fusion modules, resulting in large computational cost and the need to buffer multiple frames during inference. Such designs limit their practicality in real-world applications with limited computational resources or stringent real-time requirements. To address these issues, we revisit video crowd counting from the perspective of lightweight image-based counting models that enable real-time deployment under limited resources. We analytically define ST information in a model-independent and statistically interpretable manner, and incorporate it into training via a statistical regularizer that effectively enhances model performance without adding modules or inference overhead. Most framework hyperparameters are further formulated as statistical inference problems, allowing automatic estimation from data and consequently efficient adaptation to new scenarios. Our framework unifies video crowd counting and image-based counting models under a compact, principled formulation that is lightweight, portable, and efficient. We also establish theoretical foundations for adapting image-based counting models to video crowd counting and achieve state-of-the-art accuracy and efficiency across six benchmarks, including challenging DRONECROWD and VSCROWD.
Paperid: 3143,   Poster  
Authors: Venkata Kesav Venna, Sai Madhusudan Gunda, Jyothi Swaroopa Jinka, Hrithik Rachakonda, Anirudh Srinivasan, Ravi Kiran Sarvadevabhatla
Title: M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA
Abstract: Document QA requires not only accurate answers but also identifying where each answer is grounded on the page. Most models treat the task as text-only generation, while existing answer grounding methods generate coarse bounding boxes that fail to capture curved text. We introduce M3Grounder, a hybrid vision–language and segmentation architecture that formulates document grounding as pixel-level segmentation. It produces fine-grained evidence masks refined by a bleed-suppression loss to prevent spillover. M3Grounder autoregressively generates answer text interleaved with [GROUND] tokens that link individual answer spans to their corresponding evidence regions. M3Grounder also grounds evidence hierarchically across phrase, line, and block levels using an enclosure loss that enforces spatial containment. We release the GroundingDocQA dataset (200K documents, 2M multi-span and multi-granular QA pairs with pixel-level grounding masks), built through a data engine that handles complex layouts, curved text, and graphics-rich documents. We also release GroundingDocQA-Bench, a diverse and challenging human-verified benchmark. M3Grounder sets a new state of the art in grounded DocVQA, advancing from coarse boxes to hierarchical, fine-grained and contextually grounded evidence.
Paperid: 3144,   Poster  
Authors: Zhizhou Chen, Shanyan Guan, Zhanxin Gao, En Ci, Yanhao Ge, Wei Li, Zhenyu Zhang, Jian Yang, Ying Tai
Title: VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset
Abstract: Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge in modeling high-frequency textual details. We introduce VINS-120K, the first large-scale dataset for instruction-based UHR image editing, comprising 120K carefully curated triplets of instruction, input image, and edited image. Each image exceeds 4K resolution (≥ 4096×4096) and is filtered through a rigorous multi-stage pipeline to ensure visual quality, instruction alignment, and aesthetic fidelity. For the second challenge, we propose a high-frequency-aware post-adaptation strategy that allows previous non-high-resolution models to accurately generate fine-grained, high-frequency details. We further present VINS-4KEval, a benchmark covering diverse editing types, to facilitate consistent evaluation in UHR settings. Experiments confirm that our work delivers superior fine-grained detail and texture realism in UHR image editing. The dataset and code will be released.
Paperid: 3145,   Poster  
Authors: Qianhao Luo, Jiajia Mi, Mingtao Yan, JingSheng Liu, ShuYang Pang, Weiling Li
Title: Dual-Prototype-Guided Multi-task Learning for Unsupervised Anomaly Detection and Classification
Abstract: Unsupervised Anomaly Detection (UAD) and anomaly classification are frequently used in industrial and medical scenarios. Specifically, UAD identifies anomalous regions at the fine-grained pixel level, while anomaly classification distinguishes anomaly types at the anomaly region level. However, existing approaches typically treat these tasks independently and sequentially, overlooking the benefits of jointly training them to suppress Local Visual Ambiguity (LVA) caused by the similarities of different types of anomalies in local visual patterns. Moreover, a multi-task learning framework cannot be directly applied to jointly train the two tasks, since UAD and anomaly classification exhibit feature preference incompatibility. To address these limitations, we propose the Prototype-Guided Semi-Supervised Feature Disentanglement (PG-SFD) framework, which makes a paradigm shift from implicit feature sharing to explicit feature disentanglement and explicitly constructs normal and category prototypes to eliminate implicit normal-abnormal semantic coupling via a Dual-Prototype Disentanglement Module (DPRM). Moreover, for cross-task feature differential injection and gradient conflict mitigation, the Differential Gated Interaction (DGI) and Geometry-Regularized Optimization (GRO) are proposed to form a cohesive framework with DPRM. PG-SFD demonstrates high effectiveness in both UAD tasks and weakly supervised classification tasks. Meanwhile, it exhibits stable performance across multiple types of datasets, including industrial and medical datasets, indicating its strong generalizability.
Paperid: 3146,   Poster  
Authors: Wei Feng, Chi Zhang, Nan Li, Qian Zhang, Qi Zhang, Mingyan Li
Title: EgoRoC: Towards Egocentric Robotic Control via Task-Agnostic Visual Alignment
Abstract: Recent Vision-Language-Action (VLA) models map visual-textual inputs to robotic actions via end-to-end architectures, yet this approach entangles visual understanding with task-specific actions. This leads to an exhaustive collection of full operational sequences and parameter redundancy across tasks, while generic third-person camera setups require fine-tuning for different hardware due to implicit hand-eye assumptions. We argue that decoupling how robots see from how robots act is a missing primitive in VLA systems. We present EgoRoC, a plug-and-play egocentric alignment head that precedes any task policy and exposes only a thin 6-DoF pose interface. EgoRoC establishes task-agnostic viewpoint consistency from a wrist-mounted (first-person) camera and then alternates alignment with manipulation, while a diffusion-based online hand–eye module corrects the action in the end-effector frame for hardware-agnostic deployment. Trained once from static wrist–target image pairs with relative poses, rather than full manipulation trajectories, EgoRoC leaves downstream VLAs unchanged. By turning egocentric alignment into a reusable capability, EgoRoC reduces training redundancy, strengthens zero-shot cross-scene transfer, and scales across VLA backbones without manual calibration. Across simulation and real settings, attaching EgoRoC consistently boosts success rates, especially on long-horizon and out-of-distribution tasks, and improves data efficiency during fine-tuning.
Paperid: 3147,   Poster  
Authors: Soumyaratna Debnath, Bui Manh, Zinan Liu, Lin Wang
Title: LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
Abstract: Vision-Language Models (VLMs) typically assume a uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision to even the uninformative regions. By contrast, human vision is neither uniform nor static. It is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is to explore a Bio-inspired Adaptive Sampling Strategy (BASS). This empowers us to design a Möbius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we introduce closed-loop semantic feedback (CSF) via test-time adaptation to align the perceptual saliency with the textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering (VQA) benchmarks. The results show that our method achieves dramatic gains, with average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA compared to uniform sampling under tight pixel budgets. More surprisingly, results reveal that LLMind can retain up to 82%, 92% and 97% of the full-resolution performance with only 1%, 3% and 5% of the pixels, respectively. Moreover, LLMind is lightweight, plug-and-play, and compatible with existing VLMs without requiring architectural changes.
Paperid: 3148,   Poster  
Authors: Yuhua Wang, Qinnan Zhang, Xiaodong Li, Huan Zhang, Yifan Sun, Wangjie Qiu, Hainan Zhang, Yongxin Tong, Zhiming Zheng
Title: Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning
Abstract: Prototype-based Personalized Federated Learning (ProtoPFL) enables efficient cross-domain adaptation by communicating compact class prototypes, but directly sharing prototypes raises privacy risks. A common defense involves per-example ℓ2 clipping before prototype computation to limit sensitivity, followed by the addition of isotropic Gaussian noise during upload to enforce Local Differential Privacy (LDP). However, this Isotropic Gaussian Prototype Perturbation (IGPP) often over-perturbs key discriminative dimensions and struggles to balance the clipping threshold with representation fidelity. We propose VPDR, a client-side privacy plug-in that can be seamlessly integrated into existing ProtoPFL frameworks. Motivated by the statistical prior that dimension-wise class variance reflects discriminability, we introduce Variance-adaptive Prototype Perturbation (VPP), which uses groupwise calibration to apply less noise to discriminative subspaces, preserving semantic separability while ensuring privacy. We further design Distillation-guided Clipping Regularization (DCR), which enables feature norms to adaptively concentrate near the predefined clipping threshold while maintaining prediction consistency. Theoretical analysis shows that our groupwise noise provides privacy guarantees no weaker than those of the isotropic mechanism under the same privacy constraints. Extensive experiments on multiple cross-domain benchmarks demonstrate that VPDR achieves a superior privacy-utility trade-off, outperforming IGPP in personalized federated fine-tuning while maintaining strong privacy protection under realistic attack scenarios.
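The baseline defense this abstract describes (per-example ℓ2 clipping followed by isotropic Gaussian noise on the prototype) can be sketched as below. This is only an illustration of the generic clip-then-perturb pattern, not the VPDR method, and the function names, the plain-list feature format, and the free parameter `sigma` are assumptions. In particular, `sigma` is not calibrated to any privacy budget here.

```python
import math
import random


def clip_l2(vec, threshold):
    """Scale vec so that its L2 norm is at most threshold."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def noisy_prototype(features, threshold, sigma, rng):
    """Clip each example, average into a class prototype, add isotropic noise.

    features:  list of per-example feature vectors for one class
    threshold: L2 clipping threshold (bounds each example's contribution)
    sigma:     standard deviation of the Gaussian noise, identical for every
               dimension (the isotropic case); chosen by the caller
    rng:       a random.Random instance, passed in for reproducibility
    """
    clipped = [clip_l2(f, threshold) for f in features]
    dim = len(clipped[0])
    mean = [sum(f[d] for f in clipped) / len(clipped) for d in range(dim)]
    return [m + rng.gauss(0.0, sigma) for m in mean]


# Hypothetical two-example class with 2-D features.
prototype = noisy_prototype([[3.0, 4.0], [0.0, 0.5]], 1.0, 0.1, random.Random(0))
```

Because the same `sigma` is applied to every dimension, discriminative and uninformative dimensions are perturbed equally, which is the behaviour the abstract attributes to the isotropic baseline.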
Paperid: 3149,   Poster  
Authors: Linjun Wu, Jiejia Yu, Leyang Jin, He Wang, Bowen Zheng, Xu Yang, Hao Jiang, Fei Xia, Fei Ling, Jun Deng, Xiaogang Jin
Title: Unifying Precise Keyframes and Semantic Control via Multi-level Diffusion
Abstract: Text-conditioned human motion in-betweening leverages keyframes for spatio-temporal control, with text providing high-level semantic guidance for the transitions. However, existing methods are unable to establish a coherent alignment between textual semantics and the spatio-temporal constraints provided by keyframes, often resulting in insufficiently constrained motions with unintended behavior. Moreover, they struggle with precise spatial control, often generating motions that deviate from keyframe constraints. To address these issues, we propose a multi-level diffusion framework that integrates textual semantics with implicit cues from keyframe sequences to modulate global motion dynamics, while leveraging individual keyframes to guide local transitions around them. During inference, to ensure strict keyframe adherence, we propose a novel trajectory refinement strategy that adjusts the root positions of the generated motion, followed by diffusion imputation to refine the poses of the generated keyframes. Additionally, our framework enables semantics-preserving motion editing, allowing for plausible modifications while retaining the original motion semantics. Extensive experiments demonstrate that our method generates high-quality motions that strictly satisfy keyframe constraints while achieving precise semantic alignment.
Paperid: 3150,   Poster  
Authors: Fang Liu, Yuhao Liu, Ke Xu, Gerhard Hancke, Rynson W.H. Lau
Title: GenSplat: Bridging the Generalization Gap in 3DGS Language Comprehension
Abstract: In this paper, we propose GenSplat, a novel approach for language comprehension in 3D Gaussian Splatting (3DGS). Unlike previous methods that either achieve cross-scene generalization by being bounded to a predefined vocabulary or handle free-form language by overfitting to individual scenes, GenSplat is robust to free-form language queries and generalizable across 3DGS scene representations. Our key insight for this problem is to formulate a structured learning process to progressively align linguistic concepts with 3D Gaussians. It contains two novel technical contributions. First, we propose a Progressive Language Grounding Curriculum that structurally guides the model through learning category-level semantics to instance-level concepts and free-form language, preventing overfitting by building a generalizable language feature space. Second, we design a Multi-modal Large Language Model (MLLM)-guided Reasoning Module that leverages the MLLM’s semantic and spatial priors to enhance 3D localization and reasoning. To further improve spatial alignment and computational efficiency, we introduce a Geometry-Aware Frame Selector (GAFS), which adaptively selects the most informative views based on Gaussian and textural cues. Extensive cross-task evaluations (including 3D referring segmentation, 3D visual question answering, and 3D open-vocabulary understanding) demonstrate the state-of-the-art performance and strong generalization capability of GenSplat. We will release the code.
Paperid: 3151,   Poster  
Authors: Chongyang Xu, Li Haipeng, Shen Cheng, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu
Title: Action–Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation
Abstract: Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. However, existing methods typically rely on 2D features with limited spatial awareness, or require explicit point clouds that are difficult to obtain reliably in real-world settings. At the same time, recent 3D geometric foundation models show that accurate and diverse 3D structure can be reconstructed directly from RGB images in a fast and robust manner. We leverage this opportunity and propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses a diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense pointmap. By explicitly predicting how the 3D scene will evolve together with the action sequence, the policy gains strong spatial understanding and predictive capability using only RGB observations. We evaluate our method both in simulation on the RoboTwin benchmark and in real-world robot executions. Our approach consistently outperforms 2D-based and point-cloud-based baselines, achieving state-of-the-art performance in manipulation success, inter-arm coordination, and 3D spatial prediction accuracy. Code and pretrained weights will be released.
Paperid: 3152,   Poster  
Authors: Yiyang Zou, Tianhao Zhao, Peilun Xiao, Hongyu Jin, Longyu Qi, Yuxuan Li, Liyin Liang, Yifeng Qian, Chunbo Lai, Yutian Lin, Zhihui Li, Yu Wu
Title: RiskProp: Collision-Anchored Self-supervised Temporal Constraints for Early Accident Anticipation
Abstract: Accident anticipation aims to predict impending collisions from dashcam videos and trigger early alerts. Existing methods rely on binary supervision with manually annotated “anomaly onset” frames, which are subjective and inconsistent, leading to inaccurate risk estimation. In contrast, we propose Risk Propagation (RiskProp), a collision-anchored supervised framework enhanced with self-supervised temporal constraints, which removes the need for anomaly onset annotations by leveraging only the reliably labeled collision frame. RiskProp models temporal risk evolution through two observation-driven losses: first, since future frames contain more definitive evidence of an impending accident, we introduce a future-frame regularization loss that uses the model’s next-frame prediction as a soft target to supervise the current frame, enabling backward propagation of risk signals; second, inspired by the empirical trend of rising risk before accidents, we design an adaptive monotonic constraint to encourage a non-decreasing progression over time. Experiments on CAP-DATA and Nexar demonstrate that RiskProp achieves state-of-the-art performance and produces smoother, more discriminative risk curves, improving both early anticipation and interpretability.
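The two temporal losses named in this abstract can be sketched in a simplified form. The abstract does not give the exact formulas, so everything below is an assumption: a squared-error pull of each frame's risk toward the next frame's prediction (treated as a fixed soft target), and a plain hinge penalty on decreases in risk. The hinge here is a fixed, non-adaptive stand-in for the paper's adaptive monotonic constraint, and the function names are hypothetical.

```python
def future_frame_loss(risk):
    """Mean squared gap between each frame's risk and the next frame's risk.

    In training, the next-frame value would be held fixed (no gradient) so
    that risk evidence flows backward in time; that detail is an assumption
    and is not modelled in this plain-Python sketch.
    """
    pairs = list(zip(risk[:-1], risk[1:]))
    return sum((cur - nxt) ** 2 for cur, nxt in pairs) / len(pairs)


def monotonic_penalty(risk, margin=0.0):
    """Mean hinge penalty on every step where risk drops by more than -margin."""
    pairs = list(zip(risk[:-1], risk[1:]))
    return sum(max(0.0, cur - nxt + margin) for cur, nxt in pairs) / len(pairs)


# Hypothetical per-frame risk scores leading up to a collision frame.
rising = [0.1, 0.2, 0.5, 0.9]
noisy = [0.1, 0.6, 0.3, 0.9]

rising_penalty = monotonic_penalty(rising)  # no decreases, so no penalty
noisy_penalty = monotonic_penalty(noisy)    # penalised for the 0.6 -> 0.3 drop
```

A non-decreasing curve incurs zero monotonic penalty, while a curve that dips before the collision is penalised in proportion to the size of the dip.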
Paperid: 3153,   Poster  
Authors: Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao
Title: Knowing Thyself: Ego-Grounding for Personalized Question-Answering in Egocentric Videos
Abstract: We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding, the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, and small vs. large scales, all struggle on MyEgo. Top models such as GPT-5 achieve only 46% accuracy, trailing human performance (85%) by 39 percentage points. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance.
Paperid: 3154,   Poster  
Authors: Youyu Chen, Junjun Jiang, Yueru Luo, Kui Jiang, Xianming Liu, Xu Yan, Dave Zhenyu Chen
Title: Reliev3R: Relieving Feed-forward 3D Reconstruction from Multi-View Geometric Annotations
Abstract: With recent advances, Feed-forward Reconstruction Models (FFRMs) have demonstrated great potential in reconstruction quality and adaptiveness to multiple downstream tasks. However, the excessive reliance on multi-view geometric annotations, e.g. 3D point maps and camera poses, makes the fully-supervised training scheme of FFRMs difficult to scale up. In this paper, we propose Reliev3R, a weakly-supervised paradigm for training FFRMs from scratch without cost-prohibitive multi-view geometric annotations. Relieving the reliance on geometric sensory data and compute-exhaustive structure-from-motion preprocessing, our method draws 3D knowledge directly from monocular relative depths and sparse image correspondences given by zero-shot predictions of pretrained models. At the core of Reliev3R, we design an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss to facilitate supervision for multi-view geometric consistency. Training from scratch with less data, Reliev3R catches up with its fully-supervised sibling models, taking a step towards low-cost 3D reconstruction supervision and scalable FFRMs.
Paperid: 3155,   Poster  
Authors: Feiyu Huang, Jia Li, Zhao CHEN, Yang WU, Caleb Chen Cao, Lei Chen
Title: KAMP: Knowledge-Anchored Multimodal Pretraining Framework for Medical Image Representation
Abstract: Cross-modal biomedical signals such as pathology and genomics can provide richer and more robust semantic guidance for medical image representation. However, semantic guidance remains limited, as privacy constraints and acquisition costs severely restrict the availability of medical images paired with other biomedical data. A further challenge is modality discrepancy, which propagates intra-modal statistical bias and cross-modal noise, degrading medical image representation quality. To this end, we propose KAMP, a large language model (LLM)-driven multimodally guided pretraining framework for medical image representation learning. KAMP leverages textual priors as semantic anchors to enhance medical image representations and align medical images with multimodal biomedical data, enabling the generation of rich and robust representations even under scarce paired data. KAMP operates in three stages. First, the LLM generates personalized diagnostic knowledge from patient clinical text and imaging metadata. We inject this knowledge as a prior to enrich the medical image representation and use it as a semantic anchor to reduce the distance between the medical image representations and other biomedical modalities. Second, the LLM is optimized via the Group Relative Policy Optimization (GRPO) strategy, with the cross-modal aligner pretrained in the first stage serving as the reward model. Third, the optimized knowledge is employed to retrain the cross-modal aligner, yielding more robust medical image representations while mitigating bias and noise introduced by other modalities. Comprehensive evaluations on brain, bladder, and liver cancer datasets demonstrate that KAMP consistently outperforms existing methods on downstream few-shot prediction tasks.
Paperid: 3156,   Poster  
Authors: Joshua Cho, Sara Aghajanzadeh, Zhen Zhu, David Forsyth
Title: MR. Illuminate: Zero-Shot Low-Light Image Enhancement with Diffusion Prior
Abstract: The primary axes of interest in low-light image enhancement (LLIE) are color constancy (ensuring consistent outputs across inputs of the same scene under varying illumination and noise) and generalization across diverse datasets. Existing methods, whether supervised, unsupervised, or zero-shot, rely on auxiliary loss functions and empirically selected hyperparameters, which yield strong results on the datasets used for evaluation but often exhibit limited generalization. To overcome these constraints, we propose MR. Illuminate (pronounced "Mister Illuminate"), the first deep learning-based solution for LLIE that requires no optimization and no degradation assumption. "MR." emphasizes our Modulate–Refine design: global illuminance and color are modulated via Adaptive Instance Normalization (AdaIN), while local structure and color are refined through self-attention features within a pre-trained diffusion model, an approach distinct from prior methods. Extensive quantitative evaluations show that our approach surpasses SOTA methods on standard LLIE benchmarks, while qualitative results demonstrate improved color fidelity. Moreover, without any modification to our framework, our method achieves competitive results on the auto white balance (AWB) task, underscoring its strong generalization capability.
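For readers unfamiliar with the AdaIN operation this abstract relies on, the standard definition re-normalizes each content channel to the mean and standard deviation of a reference channel: AdaIN(x, y) = σ(y) · (x − μ(x)) / σ(x) + μ(y). The sketch below shows only that generic operation on plain Python lists. How MR. Illuminate chooses the reference statistics and where inside the diffusion model it applies them is not stated in the abstract, so the inputs here are hypothetical.

```python
import math


def channel_stats(channel, eps=1e-5):
    """Return the mean and (eps-stabilised) standard deviation of one channel."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return mean, math.sqrt(var + eps)


def adain(content, reference, eps=1e-5):
    """Adaptive Instance Normalization, applied channel by channel.

    content, reference: lists of channels, each channel a flat list of floats.
    Each content channel is whitened with its own statistics and then
    rescaled and shifted to the matching reference channel's statistics.
    """
    out = []
    for c_chan, r_chan in zip(content, reference):
        c_mean, c_std = channel_stats(c_chan, eps)
        r_mean, r_std = channel_stats(r_chan, eps)
        out.append([r_std * (v - c_mean) / c_std + r_mean for v in c_chan])
    return out


# Hypothetical single-channel example: a dark channel and a brighter reference.
dark = [[0.0, 2.0, 4.0]]
bright = [[10.0, 10.0, 13.0]]
modulated = adain(dark, bright)
```

The output keeps the spatial ordering of the content values while taking on the reference channel's mean and, up to the `eps` stabiliser, its standard deviation.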
Paperid: 3157,   Poster  
Authors: chaocan xue, Qihua Liang, Bineng Zhong, Yanting Zu, Yuanliang Xue, Haiying Xia, Shuxiang Song
Title: Toward Low-Cost yet Effective Temporal Learning for UAV Tracking
Abstract: The utilization of temporal information has always been an open topic in the tracking community. However, existing trackers tend to employ more and more inputs or parameters for temporal learning, hindering their deployment in resource-constrained unmanned aerial vehicles (UAVs). More importantly, this makes it ambiguous whether the performance gains come from the temporal learning itself or from the increased inputs and parameters. In this study, we advocate designing temporal learning components from a more balanced perspective that jointly considers performance gains and computational costs. To achieve this goal, we introduce a new evaluation metric, precision per FLOPs (PPF). PPF quantifies the tracking precision gains achieved by temporal learning components per unit of FLOPs, thus enabling fair and efficiency-aware comparisons among these components and driving them toward more efficient designs. Based on this metric, we propose a low-cost yet effective temporal learning (LETL) approach to efficiently model contextual relationships. This approach continuously propagates and merges representative appearance tokens in video streams, allowing the tracker to efficiently capture the changing patterns of targets with relatively low computational costs. We integrate the LETL approach into existing one-stream frameworks, thereby building a simple yet effective tracker, namely LETrack, for robust UAV tracking. Extensive experimental results on multiple aerial tracking datasets demonstrate the superiority of our LETrack, and show that the proposed LETL approach achieves higher PPF scores, outperforming other temporal learning strategies.
Paperid: 3158,   Poster  
Authors: Jing-Yao Zhang, Heng Zhang, Mingsen Zhang, Binbin Yang, Fei Yin
Title: SAM2Text: Towards Prompt-Free and Multi-Resolution Video Scene Text Segmentation
Abstract: We introduce a novel method for video Scene Text Segmentation (STS), a task critical for understanding dynamic visual content. Despite the success of foundation models like SAM2 in generic segmentation, their application to video STS is hindered by the reliance on external prompts, limited output resolution, and instability in video sequences. To address these, we present a comprehensive framework based on SAM2. First, we fine-tune the image encoder using LoRA and integrate a self-prompting module, enabling the model to autonomously generate text-specific prompts. Second, we augment the decoder with additional upsampling branches at 512×512 and 1024×1024 resolutions, complementing the original 256×256 output to produce high-fidelity, multi-resolution masks. Third, we enhance the memory mechanism by combining short-term memory with a top-k selection strategy, ensuring temporally consistent and stable segmentation across video frames. A significant obstacle in video STS is data scarcity. To this end, we contribute two datasets: STS-SynthV, containing 1,410 synthetic video clips generated via FlowText, and STS-RealV, comprising 660 meticulously annotated real-world video sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple video and image scene text benchmarks.
Paperid: 3159,   Poster  
Authors: Ziyao Wang, Chen Chen, Jingtao Li, Weiming Zhuang, Jiabo Huang, Ang Li, Lingjuan Lyu
Title: UniCompress: Token Compression for Unified Vision–Language Understanding and Generation
Abstract: Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and cross-modal synergy, which facilitates shared parameterization, consistent training objectives, and seamless transfer between modalities. However, the large number of visual tokens required by such models introduces substantial computation and memory overhead, and this inefficiency directly hinders deployment in resource-constrained scenarios such as embodied AI systems. In this work, we propose a unified token compression algorithm, UniCompress, that significantly reduces visual token count while preserving performance on both image understanding and generation tasks. Our method introduces a plug-in compression and decompression mechanism guided by learnable global meta tokens. The framework is lightweight and modular, enabling efficient integration into existing models without full retraining. Experimental results show that our approach reduces image tokens by up to 4×, achieves substantial gains in inference latency and training cost, and incurs only minimal performance degradation, which demonstrates the promise of token-efficient unified modeling for real-world multimodal applications.
Paperid: 3160,   Poster  
Authors: Shanliang Yang, wangxiaoxiao wangxiaoxiao
Title: Cross-Modal Guided Visual Synthesis for Data-Efficient Multimodal Depression Recognition
Abstract: The performance of multimodal learning systems, particularly in high-stakes domains like automated depression recognition, is fundamentally constrained by the challenge of learning robust visual representations from limited and complex clinical data. To overcome this, we introduce Cross-Modal Guided Visual Synthesis (CMG-VS), a novel training framework that internally enhances the learning process by synthesizing new, task-relevant visual features. At its core, CMG-VS leverages the rich context from audio and text modalities to guide a conditional generative model. This model learns the intricate mapping from speech and language to visual expression, generating a diverse manifold of plausible visual behaviors to enrich the training distribution. Crucially, this synthesis is not a separate pre-processing step. Through a task-guided joint optimization scheme, the generative process is dynamically steered by the downstream multimodal recognizer's performance. This closed-loop feedback mechanism ensures the synthesized visual features are optimized to be maximally discriminative for the recognition task, rather than merely realistic. Comprehensive experiments on the widely-used DAIC-WOZ and E-DAIC benchmark datasets demonstrate that CMG-VS significantly outperforms existing state-of-the-art methods across all standard regression and classification metrics. Ablation studies further validate that our task-guided synthesis is the key driver of this performance gain, proving its effectiveness as a new paradigm for robust multimodal representation learning.
Paperid: 3161,   Poster  
Authors: Nissim Maruani, Peiying Zhang, Siddhartha Chaudhuri, Matthew Fisher, Nanxuan Zhao, Vladimir G. Kim, Pierre Alliez, Mathieu Desbrun, Wang Yifan
Title: Illustrator’s Depth: Monocular Layer Index Prediction for Image Decomposition
Abstract: We introduce Illustrator’s Depth, a novel definition of depth that addresses a key challenge in digital content creation: decomposing flat images into editable, ordered layers. Inspired by an artist’s compositional process, illustrator’s depth assigns a layer index to each pixel, forming an interpretable image decomposition through a discrete, globally consistent ordering of elements optimized for editability. We also propose and train a neural network using a curated dataset of layered vector graphics to predict layering directly from raster inputs. Our layer index inference unlocks a range of powerful downstream applications. In particular, it significantly outperforms state-of-the-art baselines for image vectorization while also enabling high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. By reframing depth from a physical quantity to a creative abstraction, illustrator’s depth prediction offers a new foundation for editable image decomposition.
Paperid: 3162,   Poster  
Authors: Ankit Dhiman, Tao Lu, Srinath Ravi, Emre Arslan, Angela Xing, Yuanbo Xiangli, R. Venkatesh Babu, Srinath Sridhar
Title: Turbo-GS: Accelerating 3D Gaussian Fitting for High-Resolution Radiance Fields
Abstract: Novel-view synthesis plays a crucial role in computer vision with applications in 3D reconstruction, mixed reality, and robotics. Recent approaches, such as 3D Gaussian Splatting (3DGS), have emerged as state-of-the-art solutions, offering high-quality novel view synthesis in real time. However, training 3DGS models remains slow, particularly for high-resolution images, often requiring hours to fit a scene with 200 views. In this work, we aim to accelerate the fitting process by reducing computational overhead and improving learning efficiency. Specifically, we introduce a dilated rendering technique that renders only a subset of pixels instead of the full image, significantly reducing computational costs. To enhance learning efficiency, we develop a convergence-aware budget control mechanism that balances the addition of new Gaussians with the optimization of existing ones. Additionally, to improve densification efficiency and prevent vanishing gradients, we incorporate both positional and appearance errors during densification. With these improvements, we achieve fast 4K-resolution fitting while maintaining, or even improving, novel view rendering quality. Extensive experiments demonstrate that our method achieves significantly faster optimization than existing approaches while preserving high rendering fidelity.
Paperid: 3163,   Poster  
Authors: Yubin Gu, Boyang Hou, Yuan Meng, Wenting Luo, Jiayi Ji, Xiaoshuai Sun
Title: CrackSSM: Reviving SSMs for Crack Segmentation via Dynamic Scanning
Abstract: Crack segmentation (CS) is crucial for structural inspection and maintenance in production scenarios. To achieve both high accuracy and efficiency, recent methods have adopted Mamba-based architectures built upon state space models (SSMs), which enable linear-complexity modeling of long-range dependencies. However, existing approaches typically rely on static multi-directional scanning to flatten visual features into sequences. This fixed flattening order disrupts spatial continuity and weakens the SSM’s ability to model irregular crack patterns effectively. To address this limitation, we propose CrackSSM, a novel crack-aware segmentation framework featuring a dynamic scanning strategy that adapts the token sequence to the underlying structure of each image. Specifically, we compute directional response strength along four orientations from high-level semantic features, and use these values to reorder tokens so that crack-relevant regions remain adjacent in sequence. This alignment improves the causal modeling ability of SSMs while preserving their efficiency and better suits the irregular, fine-grained nature of cracks. Additionally, we design a wavelet-guided decoding mechanism to recover detailed features. It incorporates high-frequency components extracted from the input image and applies them to guide feature refinement and edge-aware fusion, further enhancing segmentation precision. Experiments on three benchmark datasets demonstrate that our method achieves superior segmentation accuracy with fewer parameters and faster inference compared to existing state-of-the-art models. Source code is available in supplementary materials.
Paperid: 3164,   Poster  
Authors: Xinxin Liu, Xue Wang, Guoqing Zhou, Qing Wang
Title: ManifoldNeuS: Manifold-aware View Optimizability for Pose-Free Neural Surface Reconstruction
Abstract: Jointly optimizing camera poses and object geometry from unposed images is a challenging task in neural surface reconstruction. Existing methods often suffer from pose drift and geometric distortion, stemming from the easy-view bias: uniform view optimization favors easy-to-optimize views with abundant texture and good overlap that dominate gradient updates, while hard-to-optimize counterparts with weak texture or limited overlap, yet critical for geometric completeness, are progressively marginalized. To address this, we propose ManifoldNeuS, a novel framework that explicitly models and leverages per-view optimizability to guide pose-free neural surface reconstruction. Specifically, we introduce the manifold-aware view optimizability score (MaVOS), which jointly assesses immediate fitness (the ease of optimizing each view) and long-term coverage gain (the value of optimizing each view) over the view-coherent manifold. Building on the MaVOS, we further devise a reconstruction pipeline that incorporates the per-view optimizability as a state control signal to guide the joint optimization process through three key components: dynamic view scheduling, gated positional encoding, and anti-score loss weighting. Experimental results on the benchmark dataset demonstrate that ManifoldNeuS outperforms existing methods in terms of accurate pose estimation and high-quality reconstruction, achieving robust joint optimization without known camera poses.
Paperid: 3165,   Poster  
Authors: Koichiro Ito
Title: Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention
Abstract: Recent advances in chart recognition have been driven by the supervised fine-tuning (SFT) of vision-language models (VLMs), which unify multiple related tasks, and by diversifying training corpora. In parallel, research on leveraging large language models (LLMs) for object detection has shown that jointly training phrase grounding alongside SFT enhances a model’s generative capabilities. Inspired by this, we hypothesize that chart recognition can also benefit from phrase grounding, which aligns textual phrases with chart regions, a setting that remains underexplored due to the lack of corresponding datasets. In this work, we introduce a phrase-grounding-aware SFT via a Side-Masked Attention Module (SMAM), which is inserted into each transformer layer of the LLM. SMAM performs masked attention within the annotated region (aligned with the corresponding phrase) to produce an additional logit. We supervise this logit and use it as a reference to guide the LLM’s output prediction during fine-tuning, alongside the standard SFT objective. To enable this approach, we also develop an automated pipeline for generating phrase-to-region alignments, which augments existing datasets. Experiments show that our method effectively incorporates phrase grounding into chart recognition via VLM fine-tuning. Code and datasets will be released upon acceptance.
Paperid: 3166,   Poster  
Authors: Yijiang Li, Kunal Kotian, Ali Marjaninejad, Meir Friedenberg, Kaushik Pavani, Sunny Dasgupta
Title: RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval
Abstract: Current multimodal image retrieval benchmarks focus on relatively simple queries where target images are either described directly or by simple composition with an input image. When retrieval requires complex reasoning to determine the target image, the task becomes significantly more challenging, yet standardized benchmarks for this setting do not exist. To fill this gap, we introduce RMIR, a benchmark dataset of 1,634 queries requiring reasoning across three categories: functional (object affordances), temporal (time-based relationships), and causal (cause-effect reasoning). Each query combines visual and textual inputs that demand robust visual understanding together with logical inference, beyond surface-level matching, to identify correct target images. Evaluation of state-of-the-art models on RMIR reveals significant performance gaps, with the best model achieving only 46.53% recall@20 averaged across reasoning categories. Our systematic analysis exposes fundamental limitations in current multimodal retrieval systems and establishes RMIR as a challenging testbed for developing multimodal, reasoning-capable retrieval models.
Paperid: 3167,   Poster  
Authors: Yu Chenglong, Shuai Shen, Xiangsheng Li, Yang Li
Title: D$^2$-FOSA: Dual-Diffusion Guided EEG-to-Image Reconstruction with Frequency-Oriented Semantic Alignment
Abstract: Reconstructing visual semantics from Electroencephalography (EEG) signals enables a deeper understanding of human visual cognition and supports next-generation brain–computer interface (BCI) applications. Despite notable advances in recent years, most existing EEG encoders still struggle to capture the frequency-specific neural dynamics that reflect perceptual and cognitive rhythms. Moreover, the cross-modal alignment between EEG and visual content remains insufficiently tackled, leading to limited semantic consistency and visual fidelity. To address these issues, we propose D^2-FOSA, a unified dual-diffusion guided framework with frequency-oriented semantic alignment, which strengthens the frequency-aware EEG representation for more semantically aligned image reconstruction. Specifically, we design a Frequency-Spatio-Temporal Dynamics Encoder (FSTDE) based on the Frequency-Oriented Mamba (FOMamba) to explicitly model oscillatory patterns and long-range dependencies in EEG signals. The extracted features are then pulled into the CLIP-aligned visual semantic space via contrastive learning. Meanwhile, a Dual Diffusion Latent Generator (DDLG) with bidirectional EEG–image conditioning is designed to enforce cross-modal alignment and promote cycle-consistent generation. Extensive experiments on four challenging datasets demonstrate that our proposed D^2-FOSA significantly outperforms existing methods in both retrieval and reconstruction tasks. In particular, D^2-FOSA surpasses the contemporary MB2C method by over 20 FID in the reconstruction task on THINGS-EEG, indicating a substantial improvement in perceptual fidelity. The source code is in the supplementary material.
Paperid: 3168,   Poster  
Authors: WANG XINYUAN, Yingxin Lai, Zhiming Luo, Zhihui Liu
Title: PPM-CLIP: Probabilistic Prompt Modeling for Generalizable AI-Generated Image Detection
Abstract: The rapid rise of highly realistic AI-generated images necessitates reliable and generalizable detection methods. However, existing methods are constrained by their discriminative nature: by learning a single static decision boundary, they tend to memorize generator-specific artifacts and consequently fail to generalize to the unseen distributions of new generative models. To overcome this limitation, we propose PPM-CLIP, a new framework that shifts from static classification to conditional generative modeling based on the CLIP vision–language model. Instead of learning a fixed decision boundary, a Probabilistic Prompt Modeling (PPM) module is used as a generator that produces an adaptive distribution of prompts according to the input image. This allows the model to flexibly capture novel artifacts, rather than matching them against fixed templates. In addition, to enhance the visual encoder's sensitivity to subtle artifacts, a Patch-Wise Contrastive Learning (PWCL) strategy is introduced. Extensive experiments on Ojha, GenImage, and DRCT benchmarks demonstrate that our generative paradigm significantly outperforms state-of-the-art methods, especially in cross-domain detection. Code will be released on GitHub.
Paperid: 3169,   Poster  
Authors: Korada Sri Vardhana, Soma Biswas
Title: GenErase: Generalizable and Semantically-Aware Concept Erasure in Diffusion Models
Abstract: Text-to-Image (T2I) diffusion models power modern creative tools, but their open-ended generative nature raises safety, ethical, and copyright concerns. Retraining or fine-tuning to remove every unsafe or copyrighted concept is impractical, motivating training-free interventions that suppress specific semantics while preserving general visual quality. Existing guard-railing methods face a core trade-off: they are either rigid, failing to generalize to paraphrased or context-shifted prompts, or coarse, distorting unrelated content and fidelity. We present GenErase (GENeralizable ERAsure with SEmantic Awareness), a training-free, geometry-grounded framework for robust concept removal in diffusion models. GenErase enforces semantic orthogonality in the cross-attention value space via an explicit erase-and-replace operation, guided by a per-token preserve projector and a hard geometric gate. This design enables precise erasure, explicit protection of critical semantics, and stability across layers, paraphrases, and multi-concept cases. Extensive experiments on identity, object, and style erasure, together with a new GenBench-40 benchmark, show that GenErase achieves state-of-the-art erasure fidelity and superior paraphrase-level generalization, establishing it as a practical and principled guard-rail for safe, real-time diffusion deployment. Code will be released upon acceptance.
Paperid: 3170,   Poster  
Authors: Binghui Zuo, Lin Zhou, Haoxuan Xu, Jianan Yan, ZhiPeng Yu, Zekai Liu, Yangang Wang
Title: MaskDexGrasp: Generative Masked Modeling for Part-Aware Dexterous Grasp Synthesis
Abstract: Dexterous grasp generation is a predominant task that enables robots to perform human-level manipulation. However, a dexterous hand has a high-dimensional DoF and actuation space, making it difficult for existing approaches that rely on holistic latent representations to produce high-quality and semantically aligned grasps. In this paper, we propose MaskDexGrasp to address these challenges. We first present a part-aware grasp tokenizer that decomposes dexterous grasps into discrete tokens, facilitating compositional modeling of anatomical dependencies. Building upon this representation, a bidirectional masked grasp transformer is then developed to predict grasp tokens conditioned on object geometry and task description, ensuring coherent grasp generation while allowing fine-grained part-level editing. To facilitate evaluation, we construct a dexterous grasp dataset that comprises 64K grasping instances and 256K richly annotated descriptions covering 11 tasks. Comprehensive experiments demonstrate that our method achieves state-of-the-art performance. Our code and dataset will be released upon acceptance.
Paperid: 3171,   Poster  
Authors: Xin Qiu, Wenjie Liu
Title: RPGFusion: 4D Radar Prior-Guided Multi-Modal Fusion for 3D Detection
Abstract: Accurate 3D object detection in autonomous driving relies on effectively combining complementary information from multiple sensors. 4D millimeter-wave radar provides sparse yet physically reliable measurements, whose potential for enhancing sensor fusion has not been fully utilized. In this work, we propose Radar Prior Guided Fusion (RPGFusion), a practical 4D radar–camera fusion framework. We first generate radar prior maps that encode spatial confidence and depth cues. These priors guide image feature sampling while preventing the uneven BEV feature distribution (near-dense, far-sparse) caused by Lift-Splat-Shoot view transformation. To address the sparsity and noise inherent in point clouds, we adopt a hybrid robust encoding and sparse-to-dense feature propagation. We further introduce spatial alignment and semantic fusion modules to reconcile geometric and semantic differences between modalities, yielding more consistent and complementary BEV representations. Extensive experiments on the public View-of-Delft and TJ4DRadSet show that RPGFusion outperforms prior radar–camera fusion methods, achieving SOTA performance. Our work not only uses 4D radar signals to guide image BEV queries, but also enables robust radar feature encoding and densification for 3D perception, demonstrating the strong potential of 4D radar.
Paperid: 3172,   Poster  
Authors: Jun Li, Lizhi Xiong, Ziqiang Li, Weiwei Jiang, Zhangjie Fu, Yong Li, Guo-Sen Xie
Title: Beyond Text Prompts: Precise Concept Erasure through Text–Image Collaboration
Abstract: Text-to-image generative models have achieved impressive fidelity and diversity, but can inadvertently produce unsafe or undesirable content due to implicit biases embedded in large-scale training datasets. Existing concept erasure methods, whether text-only or image-assisted, face trade-offs: textual approaches often fail to fully suppress concepts, while naive image-guided methods risk over-erasing unrelated content. We propose TICoE, a Text–Image Collaborative Erasing framework that achieves precise and faithful concept removal through a continuous convex concept manifold and hierarchical visual representation learning. TICoE precisely removes target concepts while preserving unrelated semantic and visual content. To objectively assess the quality of erasure, we further introduce a fidelity-oriented evaluation strategy that measures post-erasure usability. Experiments on multiple benchmarks show that TICoE surpasses prior methods in concept removal precision and content fidelity, enabling safer, more controllable text-to-image generation.
Paperid: 3173,   Poster  
Authors: Shaolin Wang, Yuying Li, Lei Zhong, Shigang Li, Jianfeng Li
Title: Edge-Focused Super-Resolution for Omnidirectional Images with Spherical Geometric Augmentation
Abstract: Omnidirectional image super-resolution (ODISR) remains challenging due to extreme magnification factors (e.g., 8×, 16×) and projection-specific distortions, which degrade edge integrity and limit model performance. This paper proposes an edge-focused framework combined with spherical geometric augmentation to address these issues. Our approach includes an Edge Focused Block (EFB) that integrates spatial-channel attention via Edge Enhanced and Refined Blocks, strengthening edge feature capture and optimization. We also design an Edge-Aware Multi-Scale (EAM) pipeline, leveraging shallow convolutions for initial feature extraction, local modules for deep mining, and a Global Integration Block for multi-scale aggregation, ensuring coherent edge reconstruction in distorted regions. To mitigate data scarcity, we introduce a rotation-translation augmentation strategy based on spherical projections, expanding datasets while preserving scene continuity. Extensive experiments show our method outperforms state-of-the-art approaches on public datasets.
Paperid: 3174,   Poster  
Authors: Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li
Title: Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning
Abstract: Vision-language models (VLMs) like CLIP have shown impressive generalization capabilities, yet their potential for Cross-Domain Few-Shot Learning (CDFSL) remains underexplored, where the model needs to transfer source-domain information to target domains with scarce training data. While the attention sink phenomenon has been observed in VLMs for certain tasks, its role in CDFSL scenarios has not been studied. In this paper, we uncover a critical issue overlooked by prior works: standard target-domain few-shot fine-tuning in CDFSL significantly exacerbates the attention sink problem, leading to poor discriminability across classes. To understand this phenomenon, through extensive experiments, we interpret it as the model's shortcut learning for domain adaptation: to overcome the huge domain gap between the source and target domains, the model shows a high tendency to push tokens that are initially closer to target-domain classes (i.e., simple tokens) to be even closer to these classes, exacerbating the attention sink and wasting the capability of learning other discriminative but initially further tokens (i.e., hard tokens). To address this, we propose a novel approach to dynamically re-weight tokens according to their relevance to target-domain classes during target-domain fine-tuning, which explicitly suppresses the model's reliance on these simple tokens and enhances the learning of hard tokens, reducing sink tokens and enhancing discriminability. Extensive experiments on four benchmark datasets validate the rationale of our method, demonstrating new state-of-the-art performance. Our code will be released.
Paperid: 3175,   Poster  
Authors: Genki Kinoshita, Shu Nakamura, Ryo Kawahara, Shohei Nobuhara, Yasutomo Kawanishi, Ko Nishino
Title: Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements
Abstract: Effective human behavior modeling requires a representation of the human body movement that capitalizes on its compositionality. We propose a hierarchical representation consisting of Action Atoms, which capture the atomic joint movements, and Action Motifs, which are formed by their temporal compositions and encode similar body movements found across different overall human actions. We derive A4Mer, a nested latent Transformer to learn this hierarchical representation from human pose data in a fully self-supervised manner. A4Mer splits a 3D pose sequence into variable-length segments and represents each segment as a single latent token (Action Atoms). Through bottom-up representation learning, temporal patterns composed of these Action Atoms, which capture meaningful temporal spans of reusable, semantic segments of body movements, naturally emerge (Action Motifs). A4Mer achieves this with a unified pretext task of masked token prediction in their respective latent spaces. We also introduce Action Motif Dataset (AMD), a large-scale dataset of multiview human behavior videos with full SMPL annotations. We introduce a novel use of cameras by mounting them on the feet to achieve their frame-wise annotations despite frequent and heavy body occlusions. Experimental results demonstrate the effectiveness of A4Mer for extracting meaningful Action Atoms and Action Motifs that significantly benefit human behavior modeling tasks including action recognition, motion prediction and synthesis.
Paperid: 3176,   Poster  
Authors: Tian Ding, Hongtao Yang, Liangtao Shi, Jun Li, Xiantao Hu, Jian Yang, Ying Tai
Title: Adaptive Depth Lightweight RGB-T Tracking with Holistic Token Routing
Abstract: fails under night scenes, glare, fog, and partial occlusion. Despite notable accuracy gains, recent architectures emphasize deep fusion and large parameter counts, driving up FLOPs and bandwidth. This computational burden constrains real-time performance and limits scalability beyond high-end GPUs. To balance accuracy and efficiency, we propose Adaptive Early-Exit (AEE): we augment the backbone with anytime heads and pair them with a confidence-calibrated early-exit policy that halts inference at the earliest reliable layer, skipping redundant computation. For cross-modal interaction, we design a Holistic-Token-Guided Interaction (HTGI) module, where each modality is compressed into a compact set of holistic state tokens and injected into the other modality’s modeling stream without layer-wise alignment, enabling targeted information exchange at extremely low cost. On RGB-T benchmarks, the lightweight tracker substantially reduces latency while maintaining competitive accuracy; on LasHeR, it achieves 70.2% precision and 56.3% success, running at 148.3 FPS on GPU, 50.2 FPS on CPU, and 28.7 FPS on an edge device.
Paperid: 3177,   Poster  
Authors: Junhao Du, XUE JIALONG, Anqi Li, Jincheng Dai, Guo Lu
Title: Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention
Abstract: Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. Existing token compression methods typically adopt a two-stage spatiotemporal compression strategy, relying on stage-specific metrics and an implicit assumption of spatiotemporal separability. Under extremely low retention ratios, however, such approaches often result in unbalanced allocation and loss of visual evidence essential for question answering. We reformulate token compression as a spatiotemporal allocation task within a global token retention pool. We propose a unified selection mechanism that integrates attention weights and semantic similarity to globally select tokens with high contribution and low redundancy. Unselected tokens are merged via clustering and refilled, preserving information integrity. Inside the LLM, we further introduce text-aware merging to perform secondary compression based on query relevance. Without requiring retraining, our method serves as a plug-and-play module compatible with existing Video-LLMs. Experiments show that retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks, while reducing FLOPs to roughly 2.6%. These benefits generalize across diverse backbones, decreasing end-to-end inference latency and memory consumption. Our unified spatiotemporal token compression strategy establishes the state-of-the-art in video understanding under ultra-low token retention.
Paperid: 3178,   Poster  
Authors: Yue Lei, Siqi Yang, Ting Zhong, Fan Zhou
Title: Harmonic Canvas: Inversion-Free Editing for Visually-Guided Music Style Transfer
Abstract: Music style transfer (MST) aims to reinterpret existing musical pieces in new stylistic forms while maintaining their melodic coherence. Conventional approaches conditioned on text or audio overlook the profoundly multimodal character of musical style. Visual ambience, reflected in color, lighting, and composition, encodes affective attributes that parallel timbre, rhythm, and harmony, yet it remains underexplored in the MST context. We introduce a flow-based, inversion-free framework for multimodal music style transfer that unifies textual and visual guidance. Our approach tackles two challenges: (1) capturing cross-modal semantics beyond language through a dual-encoder fusion module that merges CLIP- and ViT-derived embeddings, and (2) preserving melodic identity using a differentiable normalized chroma constraint that regulates pitch-class consistency along the generative flow. We reorganize and extend the MeLBench and MusicCaps collections into a genre-structured multimodal dataset to support style-aware analysis. Quantitative and perceptual evaluations demonstrate that our approach achieves superior control, structural fidelity, and cross-modal expressiveness, underscoring the role of visual perception in music generation.
Paperid: 3179,   Poster  
Authors: Chenghao Li, Jun Liu, Songbo Zhang, HuaDong Jian, Hao Ni, LIK-HANG LEE, SUNG BAE BAE, Guoqing Wang, Yang Yang, Chaoning Zhang
Title: From Remember to Transfer: Interpretable Open-World Reasoning in MLLMs
Abstract: Multimodal agents, such as JARVIS-1, are rapidly advancing in open-world environments. Their core workflow typically follows a perception–reasoning–action–memory cycle. Existing studies primarily emphasize improving memory representations and storage formats, treating memory mainly as an information repository. However, distilling transferable knowledge from stored experiences remains an important yet underexplored challenge. In real-world settings, structures and patterns tend to recur. If an agent can capture and reuse these latent patterns, it can infer new actionable knowledge from prior experience, enabling more efficient and flexible task execution. To explore this capability, we propose Echo. Echo decomposes knowledge into five explicit dimensions of transferability: structure, attribute, process, function, and interaction. Based on this formulation, Echo leverages In-Context Analogy Learning (ICAL) to effectively retrieve past experiences and generalize them to new tasks. Experiments show that, under a from-scratch learning setting, Echo achieves a 1.3×–1.7× speed-up in object-unlocking tasks. Moreover, Echo exhibits a burst-like chain-unlocking phenomenon, rapidly unlocking multiple similar items within a short time interval. These results demonstrate that robust knowledge transfer, driven by effective utilization of contextual examples, is a highly promising direction for advancing open-world multimodal agents.
Paperid: 3180,   Poster  
Authors: Yasiru Ranasinghe, Elim Schenck, Florence Yellin, Shuowen Hu, Christopher Funk, Vishal M. Patel
Title: Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection
Abstract: Existing open-vocabulary detectors focus on RGB images and fail to generalize to thermal imagery, where low texture and emissivity variations challenge RGB-based semantics. We present Thermal-Det, the first large language model (LLM) supervised open-vocabulary detector tailored for thermal images. To enable large-scale training, we construct a synthetic dataset by converting GroundingCap-1M into the thermal domain and filtering captions to remove RGB-specific terms, yielding over one million thermally aligned samples with bounding boxes, grounding texts, and detailed captions. Thermal-Det jointly optimizes detection, captioning, and cross-modal distillation objectives. A frozen RGB teacher provides geometric and semantic pseudo-supervision for paired but unlabeled RGB–thermal data, transferring open-vocabulary knowledge without manual annotation. The model further employs a Thermal–Text Alignment Head for text calibration and a Modality-Fused Cross-Attention module for dual-modality reasoning. Unlike prior domain-adaptation methods, the detector is fully fine-tuned to internalize thermal contrast patterns while preserving language alignment. Experiments on public benchmarks show consistent 2–4% AP gains over existing open-vocabulary detectors, establishing a strong foundation for scalable, language-driven thermal perception.
Paperid: 3181,   Poster  
Authors: Yili Wang, Lu Dai, Tairan Huang, Yijie Xu, Hui Xiong
Title: VL-Eraser: Vacuum Distillation for Machine Unlearning in Vision-Language Models
Abstract: Machine unlearning (MU) aims to remove sensitive or undesired content from pretrained models. Existing MU methods are commonly characterized as gradually degrading model performance on undesired data to realize approximate forgetting. Despite their successes, their effectiveness in multimodal unlearning tasks remains largely unexplored. In this paper, we first conduct an in-depth analysis and reveal that traditional MU methods tend to disrupt cross-modal alignment, leading to incomplete forgetting in multimodal scenarios. To tackle this challenge, we propose VL-Eraser, a novel unlearning paradigm for vision-language models (VLMs). VL-Eraser reformulates unlearning in VLMs as a two-stage process: distillation and deletion. Specifically, VL-Eraser first introduces a vacuum distillation that disentangles undesired knowledge from the intricate parameters of VLMs and transfers it into low-rank adapters (LoRA). After distillation, unlearning is efficiently achieved by deleting the LoRA parameters from the original model. Extensive experiments across multiple benchmarks demonstrate that VL-Eraser achieves superior unlearning performance while preserving utility compared to state-of-the-art baselines.
Paperid: 3182,   Poster  
Authors: Dongyue Wang, Yang Lu, Jiandong Tian
Title: Polarization State Tracing for Reflection Removal and Color-Consistent Reconstruction
Abstract: Colored glass is widely used in everyday settings, yet its reflective and absorptive properties often introduce ghost shadows and color bias in captured images. However, existing methods typically neglect the absorption issue, making it difficult to address color bias caused by colored glass. To address this, we are the first to apply polarization imaging theory to model the light transmission process within glass. Specifically, we propose a novel imaging model, the Polarization State Tracing Model (PSTM), which traces polarized light along multiple propagation paths and accounts for wavelength-selective absorption, enabling joint reflection removal and color-consistent reconstruction. Guided by PSTM, we design a Channel Ring Attention (CRA) mechanism to efficiently capture inter-angle polarization dependencies and enhance feature interaction across polarization channels, ensuring physically consistent recovery. In addition, the recovered polarization information can be directly applied to advanced downstream tasks, such as Shape-from-Polarization (SfP). We construct a real-world dataset, GlassPol, containing a wide range of glass materials, enabling testing under diverse optical conditions. Extensive experiments show that our method outperforms existing state-of-the-art methods, achieving up to a 3 dB improvement in PSNR and establishing a new benchmark for polarized reflection removal.
Paperid: 3183,   Poster  
Authors: Enhui Ma, Jiahuan Zhang, Guantian Zheng, Tao Tang, Shengbo Eben Li, Yuhang Lu, xia zhou, Xueyang Zhang, Yifei Zhan, Kun Zhan, Zhihui Hao, XianPeng Lang, Kaicheng Yu
Title: DriveCTR: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving
Abstract: Multimodal Large Language Models (MLLMs) are rapidly becoming the intelligent core of end-to-end autonomous driving systems. A key challenge is to assess whether MLLMs can truly understand and follow complex real-world traffic rules. However, existing benchmarks mainly focus on single-rule scenarios like traffic sign recognition, neglecting the complexity of multi-rule concurrency and conflicts in real driving. Consequently, models perform well on simple tasks but often fail or violate rules in complex real-world situations. To bridge this gap, we propose DriveCTR, a text and vision-based benchmark for compositional traffic rule reasoning. Inspired by human drivers’ cognitive development, we propose a systematic Five-Level Cognitive Ladder that evaluates reasoning from single-rule understanding to multi-rule integration and conflict resolution, enabling quantitative assessment across cognitive stages. We further propose a Rule2Scene Agent that maps language-based traffic rules to dynamic driving scenes through rule crafting and scene generation, enabling scene-level traffic rule visual reasoning. Evaluations of 14 mainstream MLLMs reveal a pronounced decline in performance as task difficulty increases, especially in the rule conflict resolution task. After splitting the dataset and fine-tuning on the training set, we further observe substantial improvements in both traffic rule reasoning and downstream planning capabilities. These results highlight the effectiveness of DriveCTR in advancing compliant and intelligent autonomous driving systems.
Paperid: 3184,   Poster  
Authors: Long Chen, Hui Wang, Man Xu, Zexuan Li, Zizhu Fan
Title: AKCMamba-YOLO: Selective State Space Models For Real-Time Object Detection
Abstract: The YOLO (You Only Look Once) series has been a cornerstone in real-time object detection, renowned for its efficient convolutional design and rapid inference. However, its reliance on convolutional operations inherently limits its ability to capture long-range dependencies and rich contextual information, leading to suboptimal performance in complex scenes. Recently, State Space Models (SSMs) have emerged as an efficient alternative to attention mechanisms, offering global representation with linear time complexity. In this paper, we propose AKCMamba-YOLO, a novel object detector that incorporates SSMs into the YOLO architecture. We introduce 3CAKCMamba and 4CAKCMamba modules that enable enhanced channel interaction and cross-layer semantic fusion. This design improves multi-scale feature modeling while maintaining computational efficiency. To support safety-critical applications, we provide railway pedestrian detection datasets with 2,975 annotated images under complex scenarios. Experiments on COCO2017, power tower foreign object detection datasets, and our custom dataset show that AKCMamba-YOLO achieves superior accuracy and speed compared to state-of-the-art baselines, making it well-suited for real-time and resource-constrained environments.
Paperid: 3185,   Poster  
Authors: Yuxuan Zhao, Zhongao Zhou, Bin Yang, He Li, Jian Liang, Jun Chen, Bo Du, Mang Ye
Title: MSAG: A Multispectral Aerial–Ground Benchmark for Any-Scenario Person Re-Identification
Abstract: Recent person re-identification (ReID) leverages heterogeneous sensing with multiple modalities and viewpoints to improve robustness across diverse conditions. However, most approaches target predefined scenario pairs (e.g., visible-infrared or aerial-ground) and train separate task-specific models. In contrast, real-world applications require retrieving identities from galleries that cover all scenarios, making such designs inefficient and complex to deploy. To bridge this gap, we introduce Any-Scenario ReID (AS-ReID): given a query from any (modality, viewpoint) scenario, a single model retrieves the same identity from a heterogeneous gallery spanning all scenarios. Progress toward AS-ReID is limited by two factors: (i) the lack of a real-world-aligned benchmark with broad scenario coverage, and (ii) the challenge of learning representations that are cohesive within identities and strongly discriminative across identities under diverse scenarios. To this end, we construct MSAG, a Multispectral Aerial-Ground benchmark with 2,337 identities and 434,620 images captured by RGB, near-infrared, and thermal infrared cameras on both ground and UAV platforms. MSAG spans day-night, multiple seasons, and varied weather conditions, and supports AS-ReID as well as conventional ReID tasks. We further propose the Unified Alignment and Discrimination (UAD) framework. Progressive Center Alignment (ProCA) aggregates multi-view features into modality centers and then aligns them toward identity centers to reduce scenario bias. Global Prototype Discrimination (GPD) contrasts samples against global identity prototypes to enforce large-margin discrimination. Extensive experiments highlight the challenges of MSAG and demonstrate the effectiveness of UAD on AS-ReID. The dataset and code will be released.
Paperid: 3186,   Poster  
Authors: Daixun Li, Zirui Li, Sibo He, Jiayun Tian, Mingxiang Cao, Weiying Xie, Yunke Wang, Xin Zhang, Yusi Zhang, Yunsong Li, Chang Xu, Leyuan Fang
Title: GeoCoT: Towards Reliable Remote Sensing Reasoning with Manifold Perspective
Abstract: Multimodal Large Language Models (MLLMs) have shown strong potential in remote sensing (RS) through multitask reasoning and cross-modal generalization. However, existing RS-MLLMs mainly rely on a single shared expert for all tasks, making it hard to produce reliable results. Meanwhile, the intrinsic redundancy and homogeneity of RS images bring substantial difficulties for both training and inference. These challenges directly conflict with the demands of remote sensing, which values task precision and trustworthy reasoning. To address these limitations, we propose GeoCoT, a manifold-driven mixture-of-experts (MoE) system with Chain-of-Thought (CoT) reasoning. GeoCoT introduces Mani-MoE, a sparse expert architecture grounded in local manifold mapping. It projects high-dimensional tokens onto low-rank subspaces adaptively to eliminate redundancy and uncover intrinsic structure, and then routes them through a sparse expert pathway, where gating decisions are guided by the manifold structure of the input. To optimize this architecture, we adopt a CoT-driven multi-stage training strategy. It leverages a cold-start phase for domain adaptation, followed by our RS Vision Group Relative Policy Optimization (RSV-GRPO) to systematically strengthen structured reasoning from global to objectives. Furthermore, we build the RS-CoT-20k dataset for task-specific supervision. Extensive experiments on multi-task datasets demonstrate that GeoCoT outperforms prior approaches, achieving 5.27% higher average accuracy than the state-of-the-art method. Our code will be available.
Paperid: 3187,   Poster  
Authors: Xun Jiang, Yufan Gu, Disen Hu, Yuqing Hou, Yazhou Yao, Fumin Shen, Heng Tao Shen, Xing Xu
Title: Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration
Abstract: Multimodal learning often grapples with the challenge of low-quality data, which predominantly manifests as two facets: modality imbalance and noisy corruption. While these issues are often studied in isolation, we argue that they share a common root in the predictive uncertainty towards the reliability of individual modalities and instances during learning. In this paper, we propose a unified framework, termed Conformal Predictive Self-Calibration (CPSC), which leverages conformal prediction to equip the model with the ability to perform self-guided calibration on-the-fly. The core of our proposed CPSC lies in a novel self-calibrating training loop that seamlessly integrates two key modules: (1) Representation Self-Calibration, which decomposes unimodal features into components and selectively fuses the most robust ones identified by a conformal predictor to enhance feature resilience; and (2) Gradient Self-Calibration, which recalibrates the gradient flow during backpropagation based on instance-wise reliability scores, steering the optimization towards more trustworthy directions. Furthermore, we devise a self-update strategy for the conformal predictor to ensure the entire system co-evolves consistently throughout the training process. Extensive experiments on six benchmark datasets under both imbalanced and noisy settings demonstrate that our CPSC framework consistently outperforms existing state-of-the-art methods. Code will be made publicly available.
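For readers unfamiliar with conformal prediction, the sketch below shows the standard split conformal recipe for classification, which the abstract above builds on. It is generic background rather than the CPSC training loop: the function names are hypothetical, and reading the size of the prediction set as a reliability signal is an assumption made here for illustration.

```python
import math
from typing import List, Sequence


def conformal_threshold(calibration_scores: Sequence[float], alpha: float) -> float:
    """Split conformal quantile of nonconformity scores at miscoverage alpha.

    Uses the finite-sample rank ceil((n + 1) * (1 - alpha)); if that rank
    exceeds n, the threshold is infinite and every label is admitted.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_set(probabilities: Sequence[float], threshold: float) -> List[int]:
    """Labels whose nonconformity score (1 - probability) is within the threshold."""
    return [k for k, p in enumerate(probabilities) if 1.0 - p <= threshold]


# Calibration scores are 1 - p(true label) on held-out data (toy values).
threshold = conformal_threshold([0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6], alpha=0.25)
confident = prediction_set([0.7, 0.2, 0.1], threshold)
ambiguous = prediction_set([0.5, 0.5, 0.0], threshold)
print(threshold, confident, ambiguous)
```

A single-label set suggests a reliable input, while a larger set flags ambiguity. A score of this kind could in principle weight modalities or instances, though the paper's exact use may differ.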
Paperid: 3188,   Poster  
Authors: Boce Kang
Title: PMRNet: Physics-informed Multi-scale Refinement Network for Medical Image Segmentation
Abstract: Medical image segmentation demands both high accuracy and computational efficiency, yet existing methods face a critical trade-off: CNNs lack global context while transformers incur prohibitive costs for deployment on resource-constrained devices. To address this challenge, we propose the Physics-informed Multi-scale Refinement Network (PMRNet), integrating symplectic geometry, renormalization group theory, and entropy diffusion to guide feature learning. PMRNet features three innovations: (1) a physics-informed encoder with Enhanced Symplectic Convolution for boundary detection and Renormalization Group-informed Downsampling for information preservation; (2) a Pseudo-Global Receptive Field module achieving near-global context with linear complexity through entropy-driven diffusion; and (3) a boundary-aware decoder for precise delineation. With only 0.87M parameters and 3.43 GFLOPs, PMRNet achieves 87.25% IoU and 92.56% Dice on the challenging Clinic dataset, outperforming state-of-the-art (SOTA) models, even those with 100× more parameters, across 12 medical imaging datasets while maintaining computational efficiency.
Paperid: 3189,   Poster  
Authors: Ting Yang, Qilong Wang, Qibin Hou, Qinghua Hu
Title: Test-Time Multi-Prompt Adaptation for Open-Vocabulary Remote Sensing Image Segmentation
Abstract: The rise of vision-language models (VLMs) has driven the initial exploration of open-vocabulary remote sensing image semantic segmentation (OVRSIS), enabling recognition of unseen categories in complex Earth observation scenes. However, existing methods primarily focus on enhancing visual representations of domain-specific remote sensing images, while overlooking the effect of textual information. In this paper, we argue that textual ambiguity is a crucial issue in the OVRSIS task, limiting the final segmentation performance. Therefore, we propose a plug-and-play yet effective Test-time Multi-Prompt Adaptation (TMPA) method to mitigate textual ambiguity in OVRSIS. Specifically, TMPA first generates a group of diverse, context-aware descriptions for each category instead of the naive class name by executing a large language model with a task-driven prompt, which can effectively avoid some textual ambiguity, e.g., the background class has different meanings in different tasks. Furthermore, TMPA develops a visual-guided test-time adaptation strategy for the generated multi-prompts, which adaptively refines the prompt representations of each category with high-confidence visual features for uncertain predictions with high entropy, making TMPA better applicable to different scenarios. In particular, a pixel-level loss with entropy minimization is proposed to optimize the text prompt with a bias during inference, where the prompt bias is constructed from a weighted combination of high-confidence visual features. TMPA can be flexibly integrated into existing methods to boost their performance. Extensive experiments are conducted on 17 remote sensing datasets, and the results show that TMPA can significantly improve its counterparts while achieving state-of-the-art performance.
Paperid: 3190,   Poster  
Authors: yujia wang, Yuyan Li, Jiuming Liu, Fang-Lue Zhang, Xinhu Zheng, Neil Dodgson
Title: RL‑ScanIQA: Reinforcement-Learned Scanpaths for Blind 360° Image Quality Assessment
Abstract: Blind 360° image quality assessment (IQA) aims to predict perceptual quality for panoramic images without a pristine reference. Unlike conventional planar images, 360° content in immersive environments restricts viewers to a limited viewport at any moment, making viewing behaviors critical to quality perception. Although existing scanpath-based approaches have attempted to model viewing behaviors by approximating the human view‑then‑rate paradigm, they treat scanpath generation and quality assessment as separate steps, preventing end-to-end optimization and task-aligned exploration. To address this limitation, we propose RL‑ScanIQA, a reinforcement‑learned framework for blind 360° IQA. RL‑ScanIQA optimizes a PPO-trained scanpath policy and a quality assessor, where the policy receives quality-driven feedback to learn task-relevant viewing strategies. To improve training stability and prevent mode collapse, we design multi-level rewards, including scanpath diversity and equator-biased priors. We further boost cross‑dataset robustness using distortion‑space augmentation together with rank‑consistent losses that preserve intra‑image and inter‑image quality orderings. Extensive experiments on three benchmarks show that RL‑ScanIQA achieves superior in‑dataset performance and cross‑dataset generalization. Code will be released upon publication.
Paperid: 3191,   Poster  
Authors: Peng Wu, Jiapeng Zhang, Yingjie Song, Xiong Xiao, Zhuo Tang
Title: FedAlign: Differentially Private Distribution Alignment for Non-IID Federated Learning
Abstract: Federated Learning (FL) enables collaborative model training without sharing raw data, but client data are often Non-Independent and Identically Distributed (Non-IID), which often slows convergence and degrades global performance. Meanwhile, privacy preservation is also a critical concern in FL. To address these two issues, we propose FedAlign, a differentially private framework that aligns local data distributions via client-side statistical moment alignment. Clients upload perturbed distribution statistics, which the server aggregates to infer global distribution characteristics and guide local alignment, thereby reducing inter-client discrepancies. Experiments and theoretical analysis show that FedAlign accelerates convergence and improves accuracy under Non-IID settings while preserving rigorous privacy guarantees.
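The "perturbed distribution statistics" step can be pictured with the classic Gaussian mechanism applied to a clipped first moment. The sketch below is an illustration under stated assumptions, not the FedAlign protocol: it handles a single scalar feature and only the mean, the function names are hypothetical, and the noise calibration sigma = sensitivity * sqrt(2 ln(1.25 / delta)) / epsilon is the textbook bound, which is stated for epsilon below 1.

```python
import math
import random
from typing import Sequence


def gaussian_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """Noise scale of the classic Gaussian mechanism (intended for epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def private_mean(
    values: Sequence[float],
    clip: float,
    epsilon: float,
    delta: float,
    rng: random.Random,
) -> float:
    """Client side: clip values to [-clip, clip], average, and add Gaussian noise."""
    clipped = [max(-clip, min(clip, v)) for v in values]
    mean = sum(clipped) / len(clipped)
    # Replacing one record moves the clipped mean by at most 2 * clip / n.
    sensitivity = 2.0 * clip / len(clipped)
    return mean + rng.gauss(0.0, gaussian_sigma(sensitivity, epsilon, delta))


def aggregate(client_means: Sequence[float], client_sizes: Sequence[int]) -> float:
    """Server side: size-weighted average of the uploaded noisy means."""
    total = sum(client_sizes)
    return sum(m * n for m, n in zip(client_means, client_sizes)) / total


rng = random.Random(0)
noisy = private_mean([0.5] * 10000, clip=1.0, epsilon=0.5, delta=1e-5, rng=rng)
global_mean = aggregate([1.0, 3.0], [1, 3])
print(noisy, global_mean)
```

Each client could then shift its local features toward the aggregated statistic. Higher moments and privacy accounting across rounds are left out of this sketch.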
Paperid: 3192,   Poster  
Authors: Zhongjie Ma, Di Lin, Xin WANG, Haotian Dong, Chong Wang, Dongdong Wu, Changqing Zhang
Title: BA-GS: Bayesian Adaptive Gaussian Splatting for SFM-Free 3D Reconstruction
Abstract: 3D Gaussian Splatting (3DGS) has demonstrated exceptional performance in reconstruction and novel view synthesis tasks. However, its reliance on Structure-from-Motion preprocessing may lead to degraded performance under sparse-view scenarios. Recent works attempt to address this limitation by leveraging pre-trained image matching models to generate Gaussian primitives but overlook the probabilistic uncertainty embedded in both the initial primitive distribution and iterative position updates. This uncertainty can accumulate and degrade reconstruction fidelity. Hence, we propose BA-GS, a Bayesian framework that models both the global distribution and local uncertainty of Gaussian primitives. At global initialization, a Variational Bayesian Gaussian Mixture Model (VB-GMM) models the latent distribution of primitives, capturing region-wise density and gradient patterns. At local refinement, an Adaptive Kalman Filter refines each primitive’s position by recursively fusing noisy gradient observations with spatial priors, dynamically adjusting its covariance according to local uncertainty. This hierarchical Bayesian formulation effectively bridges probabilistic distribution modeling and uncertainty-aware optimization, resulting in improved reconstruction quality under sparse-view conditions. Experiments across multiple benchmark datasets including Tanks and Temples, MVimgNet, and LLFF demonstrate that our method consistently outperforms existing approaches.
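The local refinement step rests on the Kalman measurement update, which blends a prior estimate with a noisy observation in proportion to their uncertainties. The sketch below shows the textbook scalar update only. It assumes the observation is a direct noisy reading of one position coordinate, whereas the paper fuses gradient observations and adapts the covariance; the name kalman_update is hypothetical.

```python
from typing import Tuple


def kalman_update(
    position: float,
    variance: float,
    observation: float,
    observation_noise: float,
) -> Tuple[float, float]:
    """One scalar Kalman measurement update.

    position, variance: prior mean and variance of one coordinate.
    observation, observation_noise: noisy reading and its variance.
    """
    if variance < 0.0 or observation_noise < 0.0 or variance + observation_noise == 0.0:
        raise ValueError("variances must be non-negative and not both zero")
    gain = variance / (variance + observation_noise)
    new_position = position + gain * (observation - position)
    new_variance = (1.0 - gain) * variance
    return new_position, new_variance


# Equal trust in prior and observation: the estimate moves halfway.
refined = kalman_update(position=0.0, variance=1.0, observation=2.0, observation_noise=1.0)
# A very noisy observation barely moves the estimate.
cautious = kalman_update(position=0.0, variance=1.0, observation=2.0, observation_noise=99.0)
print(refined, cautious)
```

An adaptive variant would re-estimate observation_noise from recent residuals, so uncertain regions take smaller steps.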
Paperid: 3193,   Poster  
Authors: Xin Ma, Peng Lu, Yisong Chen, Chengwei Pan, Sheng Li
Title: CoRoGS: Contextual Gaussian Splatting for Robust Large-Deviation View Synthesis
Abstract: Novel view synthesis (NVS) under large view deviations remains an underexplored challenge for 3D Gaussian Splatting (3DGS). In urban scenes with limited training coverage, models often fail to maintain geometric consistency when extrapolating to unseen viewpoints, resulting in severe distortions and degraded rendering quality. We introduce Context-Aware Gaussian Splatting (CoRoGS), a Context-aware framework for Robust large-deviation novel view synthesis (LD-NVS) that embeds contextual reasoning into 3DGS. Instead of treating Gaussians as independent primitives, CoRoGS adopts a contextual formulation that explicitly models inter-Gaussian dependencies. This representation is implemented by constructing a 3D Gaussian graph, which propagates relational geometry and semantics via message passing, resulting in context-aware Gaussian updates. To further maintain structural consistency under substantial view deviation, we incorporate a progressive graph expansion strategy that adaptively grows and prunes Gaussians, leading to more coherent and complete scene reconstructions. Extensive experiments demonstrate that CoRoGS outperforms state-of-the-art 3DGS-based methods, producing higher-quality results. We highlight that CoRoGS robustly handles a wide range of view shifts, including lateral deviations (e.g., lane-level offsets) and cross-level transitions such as from ground-level driving views to elevated perspectives.
Paperid: 3194,   Poster  
Authors: Qianqian Tang, Jinchi Zhu, Xiaolu Zhou, Yongchao Xu
Title: Physics-Guided Multistep Deformation Reversal for Ancient Bamboo Slip Restoration
Abstract: Bamboo slips are essential media for recording ancient East Asian civilizations, but excavated slips often suffer severe deformation due to dehydration and stress effects, creating substantial challenges for restoration. Traditional manual restoration is time-consuming and risks damage, while existing generative models struggle with the complex non-linear deformations in bamboo materials. We propose a novel framework for inverse restoration of deformed bamboo slips that combines progressive physical deformation modeling with stepwise inverse displacement prediction. Our approach establishes a computable mathematical model of deformation based on wood fiber microstructure and stress-diffusion coupling effects, enabling the forward process to simulate physically plausible deformation trajectories as a deterministic, physics-driven progressive evolution. The inverse process shifts from predicting abstract noise to learning physically meaningful inverse displacement fields that progressively restore deformations. Experimental results show substantial gains in restoration fidelity while preserving delicate textual features, enabling the reliable correction of complex non-linear deformations that defeat traditional techniques. By integrating physical insights into bamboo material behavior with progressive restoration modeling, this work establishes a new paradigm for digital archaeological restoration, one that holds significant potential to transform how deformed cultural relics are reconstructed and studied.
Paperid: 3195,   Poster  
Authors: Jiayi Fan, Zheyun Qin, Xiaoming Xi, Xiushan Nie, Yilong Yin
Title: SPOT: Spatiotemporal Prompt Optimization for Motion-Stabilized MLLM-Guided Video Segmentation
Abstract: The synergistic framework of multimodal large language models (MLLMs) and vision foundation models demonstrates exceptional performance in image understanding tasks, yet encounters severe temporal inconsistency challenges in video segmentation scenarios. Existing methods predominantly rely on MLLMs trained on static images to generate per-frame segmentation prompts, neglecting the physical continuity of video motion. This paper posits that performance limitations in video understanding tasks stem from inadequate constraints on model output behavior. Consequently, we propose a spatiotemporal co-optimization mechanism that achieves temporally consistent video segmentation solely by constraining MLLM output behavior, eliminating the need for large-scale video pretraining or complex architectural modifications. Our method features two complementary mechanisms: a Brownian bridge loss that models object trajectories as endpoint-constrained Gaussian processes to ensure temporal smoothness, and a geometry-aware prompt quality loss that enforces spatial consistency with target structures. Experiments on referring expression video segmentation and reasoning video segmentation tasks demonstrate that our method significantly surpasses state-of-the-art techniques on the Ref-YouTube-VOS, Ref-DAVIS-2017, MeVIS, A2D-Sentences, JHMDB-Sentences and ReVOS benchmarks. This work establishes that explicit modeling of physical world constraints can unlock the full potential of statically trained foundation models in dynamic visual understanding tasks.
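A Brownian bridge pinned at a start value a (time 0) and an end value b (time T) has mean a + (t / T)(b - a) and variance sigma^2 t (T - t) / T at time t, so deviations are tolerated most mid-trajectory and not at all at the endpoints. The sketch below turns those moments into a simple smoothness penalty for one coordinate of a prompt trajectory. It is a simplification for illustration: it scores each frame by its marginal distribution and ignores correlations between frames, and the function names and the sigma parameter are hypothetical.

```python
from typing import Sequence, Tuple


def bridge_moments(
    t: int, horizon: int, start: float, end: float, sigma: float = 1.0
) -> Tuple[float, float]:
    """Mean and variance of a Brownian bridge at an interior time 0 < t < horizon."""
    if not 0 < t < horizon:
        raise ValueError("t must lie strictly between 0 and horizon")
    mean = start + (t / horizon) * (end - start)
    variance = sigma ** 2 * t * (horizon - t) / horizon
    return mean, variance


def bridge_penalty(track: Sequence[float], sigma: float = 1.0) -> float:
    """Average variance-normalised squared deviation of interior frames.

    The first and last entries of `track` act as the pinned endpoints.
    """
    horizon = len(track) - 1
    if horizon < 2:
        return 0.0  # no interior frames to penalise
    total = 0.0
    for t in range(1, horizon):
        mean, variance = bridge_moments(t, horizon, track[0], track[-1], sigma)
        total += (track[t] - mean) ** 2 / variance
    return total / (horizon - 1)


smooth = bridge_penalty([0.0, 1.0, 2.0, 3.0, 4.0])   # straight-line motion
jittery = bridge_penalty([0.0, 3.0, 2.0, 3.0, 4.0])  # a jump in frame 1
print(smooth, jittery)
```

Straight-line motion between the endpoints incurs no penalty, while a jump near an endpoint is penalised more heavily than the same jump mid-trajectory because the bridge variance is smaller there.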
Paperid: 3196,   Poster  
Authors: Chao Ning, Minghe Shen, Naoto Yokoya
Title: MD2E: Modeling Depth-to-Edge Cues for Monocular Metric Depth Estimation
Abstract: We study monocular metric depth estimation (MMDE) without camera intrinsics at training or inference. When focal length and scene depth vary together, depth changes are difficult to perceive from the image, yet the edge-frequency statistics exhibit systematic, scale-correlated shifts. Building on this observation, we introduce a spectral quantile estimator (SQE) that analyzes the Fourier spectrum of a predicted edge map and outputs a single score used as a proxy for metric scale. We propose MD2E, a method that models depth-to-edge cues by deriving edge targets from depth annotations, calibrating metric scale using the spectral score, and using edge predictions to regularize depth boundaries while producing metric depth. Across diverse cameras and datasets, MD2E achieves state-of-the-art monocular metric depth in both zero-shot and fine-tuning settings without camera metadata.
Paperid: 3197,   Poster  
Authors: Fatimah Zohra, Chen Zhao, Hani Itani, Bernard Ghanem
Title: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment in CLIP
Abstract: CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long detailed captions. In this work, we propose a multi-granular text-conditioned contrastive learning framework, β-CLIP, to achieve hierarchical alignment across multiple textual granularities, from full captions to sentences and phrases, and their corresponding visual regions. For each level of textual granularity, β-CLIP uses cross-attention to dynamically pool image patches, producing contextualized visual embeddings. A β-weighted contrastive objective jointly optimizes multi-granular text–contextualized visual pairs, with both soft cross-entropy and hard binary cross-entropy formulations, enabling controllable intra-image competition and balanced fine-to-coarse alignment. Through extensive experiments on various benchmarks with diverse granularities, we show that β-CLIP achieves 30.9% on FG-OVD (Hard) and, on long-text retrieval, 63.6% I2T R@1 on DCI and 92.2% T2I R@1 on Urban1K, reaching the state of the art among methods not trained with Hard Negatives. β-CLIP establishes a strong, adaptive baseline for dense vision–language correspondence.
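The text-conditioned pooling step can be pictured as a single cross-attention read-out: a text embedding acts as the query, image patches act as keys and values, and the softmax weights decide which patches contribute to the pooled visual embedding. The sketch below is a minimal single-head version without learned projections, which is an assumption made here; the function name and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def text_conditioned_pool(
    text_query: Sequence[float],
    patches: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Pool image patches with softmax attention driven by one text embedding.

    Returns the pooled embedding and the attention weight of each patch.
    """
    if not patches:
        raise ValueError("need at least one patch")
    scale = math.sqrt(len(text_query))
    logits = [
        sum(q * p for q, p in zip(text_query, patch)) / scale for patch in patches
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(patches[0])
    pooled = [
        sum(w * patch[d] for w, patch in zip(weights, patches)) for d in range(dim)
    ]
    return pooled, weights


# A phrase embedding aligned with the first patch pulls the pooled
# embedding toward that patch.
pooled, weights = text_conditioned_pool([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(pooled, weights)
```

Running the same image through this read-out with caption, sentence, and phrase embeddings yields one visual embedding per granularity, which is the kind of pairing a multi-granular contrastive objective would then compare.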
Paperid: 3198,   Poster  
Authors: Jixin Zhao, Zhouxia Wang, Peiqing Yang, Shangchen Zhou
Title: Precise Object and Effect Removal with Adaptive Target-Aware Attention
Abstract: Object removal requires eliminating not only the target object but also its associated visual effects such as shadows and reflections. However, diffusion-based inpainting and removal methods often introduce artifacts, hallucinate content, alter the background, and struggle to remove object effects accurately. To address these challenges, we propose ObjectClear, a novel framework that decouples foreground removal from background reconstruction via an adaptive target-aware attention mechanism. This design empowers the model to precisely localize and remove both objects and their effects while maintaining high background fidelity. Moreover, the learned attention maps are leveraged for an attention-guided fusion strategy during inference, further enhancing visual consistency. To facilitate the training and evaluation of this framework, we construct OBER, a large-scale dataset for OBject-Effect Removal, which provides paired images with and without object-effects, along with precise masks for both objects and their effects. The dataset comprises high-quality captured and simulated data, covering diverse objects, effects, and complex multi-object scenes. Extensive experiments demonstrate that ObjectClear outperforms prior methods, achieving superior object-effect removal quality and background fidelity, especially in challenging real-world scenarios. Code and dataset will be released.
Paperid: 3199,   Poster  
Authors: Yu Luo, Xiaogang Zhu, Shan Zeng, Wei Xiang, Thomas Bishop, Zhiyong Wang, Kun Hu
Title: PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction
Abstract: Accurate crop yield prediction is crucial for sustainable agriculture and global food security. Existing methods are predominantly developed for single-crop prediction and often struggle to generalize across diverse crop types, because they do not address the unique crop phenological responses that are dynamically modulated by complex weather patterns. In this paper, we propose PhenoYieldNet, a multi-crop yield prediction framework that learns crop-specific phenology by explicitly modeling crop responses to temporal drivers. Specifically, we develop a crop-aware temporal decoder consisting of a Crop Phenology Bank (CPB) and a Crop Phenology Attention (CPA) module. The CPB integrates a set of learnable embeddings, which leverage a query to guide the CPA module to learn the most relevant phenology patterns for the specific crop. The CPA module explicitly captures multi-scale trend and variation components to construct temporal contexts, enabling the model to dynamically adjust the attention across different phenological stages. To learn robust and generalizable features for multi-crop prediction, the encoder is initialized with a pre-trained foundation model, and further adapted via a self-supervised Temporal Contrastive Adaptation strategy to align with agricultural temporal dynamics. Extensive experiments conducted on multi-crop datasets indicate that our proposed method significantly outperforms state-of-the-art methods, exhibiting strong generalization capabilities across different regions and crops.
Paperid: 3200,   Poster  
Authors: Yi Yang, Gaoyang Zhang, Jun Tan, Xinguo Liu
Title: Opti-NeuS: Neural Reconstruction for Dual-Layered Transparent and Opaque Objects
Abstract: 3D reconstruction of transparent objects from multiple views has been a longstanding challenge. In contrast to opaque objects, transparent objects exhibit complex refraction that causes serious image distortion, resulting in a highly ill-posed problem. Existing reconstruction methods commonly depend on special capture devices or controlled environments, which provide more priors and simplify the modeling of refraction. More importantly, these methods lack the capability for reconstruction of mixed transparent and opaque objects, being confined to transparent or opaque materials. To address these challenges, we propose Opti-NeuS, a novel method for reconstructing transparent and opaque objects without controlled environments or additional input. Opti-NeuS incorporates a novel IoRNetwork to obtain spatially-varying IoR for tracing the refractive ray paths, which can finally model refractive visual distortion. To deal with dual-layered transparent and opaque objects, we devise a two-stage hierarchical reconstruction strategy that decouples outer and inner geometry, combined with alpha-blending for transparency-aware surface separation. Experiments show that Opti-NeuS achieves practical utility and effectiveness and outperforms prior works.
Paperid: 3201,   Poster  
Authors: Haoqing Wu, Alexa Nawotki, Jochen Garcke
Title: PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration
Abstract: Point clouds are a fundamental 3D representation in computer vision, enabling a wide range of perception tasks. However, real-world point clouds often suffer from degradations such as incompleteness, noise, outliers, and irregular density, caused by sensor limitations or occlusions. Recovering clean and detailed shapes from such degraded data is crucial for downstream applications. While existing learning-based methods achieve progress on individual tasks like completion or denoising, they typically rely on global bottleneck features, which lose fine-grained geometry and remain sensitive to varying input quality. We propose a unified 3D restoration network that directly takes point clouds as input and adaptively reconstructs high-quality geometry under diverse degradation scenarios. At the core of our approach is a Pseudo-Query module, implemented within a Transformer backbone, which reformulates geometric translation into two cooperative stages to enhance structural clarity, robustness, and local detail preservation. Extensive experiments on curated benchmarks demonstrate that our approach surpasses state-of-the-art performance in general 3D restoration. It effectively handles complex combinations of completion, deformation, and denoising degradations. With this work, we provide a novel unified, point-only backbone for robust 3D restoration, paving the way for more versatile 3D perception.
Paperid: 3202,   Poster  
Authors: Shengdong Xue, Haoxiang Ma, Hao Chen, Zhen Yang, Yongjian Deng
Title: Event-Based Motion Deblurring Using Task-Oriented 3D Gaussian Event Representations
Abstract: Event-based motion deblurring has attracted increasing attention as the high temporal resolution of event cameras provides motion cues unavailable to RGB sensors, enabling stronger deblurring. In real-world scenes, motion blur is often complex and nonlinear, with different regions exhibiting diverse speeds and directions. However, most existing approaches rely on handcrafted event representations that overlook such spatiotemporal motion heterogeneity, resulting in suboptimal deblurring performance. To address this issue, we design a learnable 3D Gaussian event representation module that adaptively selects key spatiotemporal coordinates beneficial for deblurring based on the distributions of the blurred image and event density, and integrates the event stream using a 3D Gaussian weighting kernel, thereby extracting local motion features sensitive to motion direction and velocity. In addition, to fully exploit the motion information aggregated in our event representation, a two-stage fusion strategy is employed. Local motion features are used in the first stage to enhance detail restoration, followed by a bidirectional attention fusion module that leverages the one-dimensional Gaussian-weighted event frames for global position correction, thereby achieving precise alignment of the overall structure. Extensive experiments on synthetic and real-world datasets validate the effectiveness of our approach and yield a substantial improvement over state-of-the-art methods.
Paperid: 3203,   Poster  
Authors: Hailin Luo, Yifan Yang, Jiazhi Shu, Zixiong Huang, Qi Chen, Qing Du, Mingkui Tan
Title: Tavatar: Topology-Aware Gaussian Attribute Derivation for Animatable Human Avatars
Abstract: Reconstructing high-fidelity, animatable human avatars from monocular videos remains a critical challenge. Existing 3DGS-based human animation methods constrain Gaussian parameters but exclude scale, which we argue is crucial for adapting avatars to challenging out-of-distribution poses. To achieve robust animation under unseen poses, we propose Tavatar, which derives key parameters such as scale, rotation, and other geometric attributes directly from the local mesh geometry, instead of learning them through unconstrained optimization. This paradigm shift enforces topological consistency by design, as each Gaussian is analytically anchored to the local mesh geometry, inheriting its spatial structure and deformation behavior. Specifically, we bind Gaussians to mesh faces and vertices, deriving their scales and orientations from triangle properties and local edge lengths to ensure coherent surface coverage. To ensure the stability of this analytical mapping, we introduce a crucial equilateral regularization term that preserves mesh integrity. Extensive experiments demonstrate that Tavatar achieves superior animation robustness on challenging out-of-distribution poses, reducing normal error by 13.8% on X-Avatar and 17.9% on PeopleSnapshot against the best baseline, while maintaining competitive rendering quality.
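The idea of deriving Gaussian attributes analytically from a mesh triangle can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything here is an assumption: the function names `gaussian_from_triangle` and `equilateral_penalty`, the choice of centroid as position, face normal as orientation, mean edge length as scale, and edge-length variance as the regularizer are all hypothetical.

```python
import math

def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def _norm(a):
    return math.sqrt(sum(x * x for x in a))

def _edge_lengths(v0, v1, v2):
    return [_norm(_sub(v1, v0)), _norm(_sub(v2, v1)), _norm(_sub(v0, v2))]

def gaussian_from_triangle(v0, v1, v2):
    """Hypothetical mapping: triangle -> (center, unit normal, isotropic scale)."""
    center = [(v0[i] + v1[i] + v2[i]) / 3 for i in range(3)]
    n = _cross(_sub(v1, v0), _sub(v2, v0))
    length = _norm(n)  # twice the triangle area; zero for degenerate triangles
    normal = [x / length for x in n]
    scale = sum(_edge_lengths(v0, v1, v2)) / 3  # assumed: mean edge length
    return center, normal, scale

def equilateral_penalty(v0, v1, v2):
    """Assumed regularizer: variance of edge lengths (zero for equilateral triangles)."""
    edges = _edge_lengths(v0, v1, v2)
    mean = sum(edges) / 3
    return sum((e - mean) ** 2 for e in edges) / 3

right_triangle = ([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
center, normal, scale = gaussian_from_triangle(*right_triangle)
```

Because every attribute is a function of the vertices, deforming the mesh moves and rescales the Gaussian with it; a penalty of this kind would discourage sliver triangles, for which the normal becomes numerically unstable.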
Paperid: 3204,   Poster  
Authors: Chengsheng Zhang, Chenghao Sun, Xinyan Jiang, Wei Li, Xinmei Tian
Title: Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable progress in visual-textual understanding, yet their reliability is critically undermined by hallucinations, i.e., the generation of factually incorrect or inconsistent responses. While recent studies using steering vectors demonstrated promise in reducing hallucinations, a notable challenge remains: they inadvertently amplify the severity of residual hallucinations. We attribute this to their exclusive focus on the decoding stage, where errors accumulate autoregressively and progressively worsen subsequent hallucinatory outputs. To address this, we propose Prefill-Time Intervention (PTI), a novel steering paradigm that intervenes only once during the prefill stage, enhancing the initial Key-Value (KV) cache before error accumulation occurs. Specifically, PTI is modality-aware, deriving distinct directions for visual and textual representations. This intervention is decoupled to steer keys toward visually-grounded objects and values to filter background noise, correcting hallucination-prone representations at their source. Extensive experiments demonstrate PTI's significant performance in mitigating hallucinations and its generalizability across diverse decoding strategies, LVLMs, and benchmarks. Moreover, PTI is orthogonal to existing decoding-stage methods, enabling plug-and-play integration and further boosting performance.
Paperid: 3205,   Poster  
Authors: Junhyuk Seo, SANG HYUK SEO, Dawoon Kim, Heeseok Oh
Title: R3-PCQA: Ray-Reprojection-Reinforcement for No-Reference 3D Point Cloud Quality Assessment
Abstract: Prevailing no-reference 3D point cloud quality assessment methods predominantly treat 2D projections and 3D point clouds as independent modalities and rely on simplistic feature fusion, thereby neglecting fundamental mechanisms underlying human 3D perception. To address this limitation, we introduce R3-PCQA (Ray-Reprojection-Reinforcement 3D Point Cloud Quality Assessor), a novel and principled framework that explicitly encodes perceptual priors into the assessment pipeline through three components: (1) a geometry-aware ray-based reprojection pipeline that simulates viewpoint-dependent observation of 3D structure; (2) a reinforcement-learning-based quality-salient subcloud selector that adaptively attends to perceptually informative regions; and (3) a global view attention module that aggregates local quality responses across viewpoints, forming a unified representation that facilitates reliable cross-view inference. Extensive experiments demonstrate that R3-PCQA achieves state-of-the-art performance on SJTU-PCQA, WPC, and WPC2.0.
Paperid: 3206,   Poster  
Authors: Ao Zhou, Zhiwei Jiang, Zifeng Cheng, Cong Wang, Yafeng Yin, Shufan Yang, Qing Gu
Title: Rethinking BCE Loss for Multi-Label Image Recognition with Fine-tuning
Abstract: Fine-tuning vision–language models such as CLIP has become the mainstream paradigm for multi-label image recognition, and prompt tuning is widely adopted due to its lightweight parameter cost and strong transferability. However, we find that when these methods use binary cross-entropy (BCE) as the supervision loss, the model’s confidence structure becomes systematically distorted, leading to pronounced miscalibration. Existing calibration techniques, such as temperature scaling or regularization-based methods, largely fail in multi-label settings because they cannot capture inherent semantic dependencies between classes, nor can they correct the global structural shifts introduced during fine-tuning. To address this issue, we propose Class-wise Covariance Regularization (CCR), which aligns the predicted covariance structure of class confidences with the semantic correlations encoded in pretrained text embeddings. This alignment preserves the geometric consistency of the class space throughout fine-tuning, resulting in more stable and interpretable confidence distributions across categories. Experiments on multi-label benchmarks show that CCR significantly reduces calibration errors while maintaining or even improving classification accuracy.
Paperid: 3207,   Poster  
Authors: Thanh Van Le, Yun Fu
Title: VisiLock: Authorizing Instruction-based Image editing with Dual Score Distillation
Abstract: While open-sourcing instruction-guided image editing models accelerates research, it surrenders control over their capabilities to anyone who downloads the weights. Existing protection methods are reactive: they verify ownership after generation, but the underlying model remains fully functional for unauthorized users. We introduce VisiLock, where access control is baked into model weights, rendering the model unusable without a visual trigger in the input. The challenge is training a model that retains editing capability for authorized input and remains unusable for unauthorized input, without destabilizing training. Naive multi-task objectives create gradient conflicts that collapse training, while contrastive approaches like FMLock destroy the denoising manifold. We develop Diverged Score Distillation, a dual-teacher framework where a degraded teacher defines locked behavior and an original teacher guides editing quality, eliminating gradient interference through separate frozen targets. A key risk is that released models could be unlocked through post-hoc fine-tuning. To prevent this, we initialize the student model from the degraded teacher so that it begins in a locked state, and only regains editing ability for authorized inputs via distillation. This impedes adversarial fine-tuning from recovering full editing capability. Evaluation on InstructPix2Pix shows authorized edits maintain baseline quality (CLIP-I: 0.821, DINO: 0.726) while unauthorized attempts degrade substantially (CLIP-I: 0.481, DINO: 0.072) with 41% and 90% drops in image and semantic similarity. The lock remains robust to key corruptions, spatial perturbations, and adversarial unlock fine-tuning. Code and data will be available for research purposes.
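The reported 41% and 90% drops can be checked against the four scores quoted in the abstract. The short sketch below assumes "drop" means the relative decrease from the authorized score to the unauthorized score; the helper name `relative_drop` is illustrative only.

```python
def relative_drop(authorized, unauthorized):
    """Relative decrease from the authorized score to the unauthorized score."""
    return (authorized - unauthorized) / authorized

# Scores quoted in the abstract above.
clip_i_drop = relative_drop(0.821, 0.481)
dino_drop = relative_drop(0.726, 0.072)

print(f"CLIP-I drop: {clip_i_drop:.1%}, DINO drop: {dino_drop:.1%}")
# prints: CLIP-I drop: 41.4%, DINO drop: 90.1%
```

Under that reading, both figures round to the percentages stated in the abstract.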
Paperid: 3208,   Poster  
Authors: Linjie Li, HUIYU XIAO, Jiarui Cao, Zhenyu Wu, JI Yang
Title: Quantum-Gated Task-interaction Knowledge Distillation for Pre-trained Model-based Class-Incremental Learning
Abstract: Class-incremental learning (CIL) aims to continuously accumulate knowledge from a stream of tasks and construct a unified classifier over all previously seen classes. A key challenge of CIL lies in the discrepancy between clear task boundaries during training and blurred boundaries during inference, where samples from different tasks often occupy overlapping subspaces. Although pretrained models (PTMs) have shown promising performance in CIL, they still struggle with the entanglement of multi-task subspaces, leading to catastrophic forgetting when task routing parameters are poorly calibrated or task-level representations are rigidly fixed. To address this issue, we propose a novel Quantum-Gated Task-interaction Knowledge Distillation (QKD) framework that leverages quantum gating to guide inter-task knowledge transfer. Specifically, we introduce a quantum-gated task modulation mechanism to model the relational dependencies among task embeddings, dynamically capturing the sample-to-task relevance for both joint training and inference across streaming tasks. Furthermore, we employ lightweight adapters to adapt PTMs to downstream tasks while freezing previously learned adapters. Using the task-embedding-level correlation weights produced by the quantum gating, we perform task-interaction knowledge distillation from old to new adapters, enabling the model to bridge the representation gaps between independent task subspaces and jointly calibrate the unified classifier. Extensive experiments on five benchmark datasets demonstrate that QKD effectively mitigates catastrophic forgetting and achieves state-of-the-art performance in class-incremental settings.
Paperid: 3209,   Poster  
Authors: Le Jiang, Yan Huang, Zhen Xu, Yong Xu, Hau San Wong, Si Wu
Title: Defect Cue-Preserved Structural Feature Refinement for Few-Shot Anomaly Detection
Abstract: Modern industrial quality control heavily relies on automated anomaly detection. While few-shot anomaly detection addresses the challenge of limited labeled data, real-world inspection faces a vast diversity of anomaly types, sizes, and shapes. We identify the primary cause of the anomaly detection difficulty as the progressive loss of defect cues as they pass through deep feature extraction pipelines. To counteract this defect cue fading, we propose a Defect Cue-Preserved Structural Feature Refinement model, referred to as DCP-SFR. Recognizing that early-stage cues are paramount, we design a conditional anomaly cue amplification module to produce an initial anomaly score map, which is then enhanced to increase the contrast between anomalous and normal regions. The amplified cues are subsequently used for reconstruction-based anomaly localization, by anchoring attention on true anomaly regions to preserve spatial integrity and prevent drift. Further, we incorporate a structure-aware segmentation refinement stage to improve anomaly segmentation in terms of edge alignment, thereby significantly improving boundary accuracy. On the MVTec AD and VisA benchmarks, DCP-SFR achieves state-of-the-art performance, with an image-level AUROC of 97.3% and a pixel-level AUROC of 98.2%, demonstrating strong cross-domain generalization performance.
Paperid: 3210,   Poster  
Authors: Yifan Liu, Fangneng Zhan, Wanhua Li, Haowen Sun, Katerina Fragkiadaki, Hanspeter Pfister
Title: RoboTAG: End-to-end Robot Pose Estimation via Topological Alignment Graph
Abstract: Estimating robot pose from a monocular RGB image is a challenge in robotics and computer vision. Existing methods typically build networks on top of 2D visual backbones and depend heavily on labeled data for training, which is often scarce in real-world scenarios, causing a sim-to-real gap. Moreover, these approaches reduce the 3D-based problem to the 2D domain, neglecting the 3D priors. To address these issues, we propose Robot Topological Alignment Graph (RoboTAG), which incorporates a 3D branch to inject 3D priors while enabling co-evolution of the 2D and 3D representations, alleviating the reliance on labels. Specifically, RoboTAG consists of a 3D branch and a 2D branch, where nodes represent the states of the camera and robot system, and edges capture the dependencies between these variables or denote alignments between them. Closed loops are then defined in the graph, on which a consistency supervision across branches can be applied. This design allows us to utilize in-the-wild images as training data without annotations. Experimental results demonstrate that our method is effective across robot types, highlighting its potential to alleviate the data bottleneck in robotics.
Paperid: 3211,   Poster  
Authors: Kecheng Ye, Mao Chen, Xiangkai Zhang, Xu Yang
Title: Fusion of Depth and Semantic for Probabilistic Floorplan Localization
Abstract: Floorplan localization aims to estimate the camera pose of a query image with respect to a 2D floorplan, providing a lightweight and long-term stable alternative to localization based on 3D maps or large image databases for indoor robotics and AR. Recent methods frame the problem as ray-based matching, representing the image as a set of rays annotated with depth or semantic labels and aligning them with the floorplan. However, they still face challenges in addressing the complexity of indoor environments, which can be decomposed into environmental, geometric, and semantic ambiguities. To address these ambiguities, we propose a floorplan-aware probabilistic fusion framework that models both depth and semantic information within a unified architecture. Our framework also combines a distribution-based ray confidence estimator, which down-weights uncertain geometric hypotheses, with a probabilistic semantic matching scheme based on Jensen–Shannon divergence (JSD), which preserves and leverages informative semantic ambiguity instead of collapsing it into hard labels. Experiments on challenging benchmarks demonstrate that our approach significantly outperforms prior methods in both robustness and accuracy.
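For readers unfamiliar with the Jensen–Shannon divergence mentioned above, the following minimal sketch computes it for two categorical distributions. It shows only the standard definition, not the paper's matching scheme; the three-class example (wall, door, window) and the variable names are hypothetical.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln 2 with natural logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical soft semantic predictions for one ray over (wall, door, window).
ray_semantics = [0.5, 0.4, 0.1]
# Hypothetical semantic distribution expected from the floorplan at a candidate pose.
floorplan_semantics = [0.6, 0.3, 0.1]

score = jsd(ray_semantics, floorplan_semantics)
```

Comparing full distributions in this way keeps the information that a ray is, say, nearly as likely to hit a door as a wall, which a hard argmax label would discard.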
Paperid: 3212,   Poster  
Authors: Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon, Hadrien Sablon, Ryan Park, Jacob Morrison, Alexandra Buraczynski, Karen Farley, Josh Hansen, Andrew Howe, Patrick Johnson, Mark Otterlee, Ted Schmitt, Hunter Pitelka, Stephen Daspit, Rachel Ratner, Christopher Wilhelm, Sebastian Wood, Mike Jacobi, Hannah Kerner, Evan Shelhamer, Ali Farhadi, Ranjay Krishna, Patrick Beukema
Title: Helios: Stable Latent Image Modeling for Multimodal Earth Observation
Abstract: Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present Helios: a multimodal, spatiotemporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss, all designed for the Earth observation domain. Helios achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners. When evaluating embeddings, Helios achieves the best performance on 15 out of 24 tasks, and with full fine-tuning it is the best on 19 of 29 tasks. We deploy Helios as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models. The Helios platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world's biggest problems. Helios source code, training data, and pre-trained weights are available at REDACTED.
Paperid: 3213,   Poster  
Authors: Song Lai, Zhe Zhao, Fei Zhu, Ji Cheng, Xi Lin, Qingfu Zhang, Gaofeng Meng
Title: Beyond Myopic Alignment: Lookahead Optimization for Online Class-Incremental Learning
Abstract: Rehearsal-based methods are the cornerstone of modern online class-incremental learning (OCIL), yet they face a fundamental challenge: the gradient of the current task often conflicts with that of the rehearsal data from the memory buffer, leading to catastrophic forgetting. Recent works have implicitly addressed this by using hypergradients, but the underlying mechanism has remained poorly understood. In this paper, we first provide a formal analysis revealing that hypergradients mitigate forgetting by aligning task-specific gradients towards a common meta-objective, thereby reducing their conflict. However, we argue that this conflict-reducing alignment is inherently myopic: it only considers the immediate gradient directions, failing to account for the loss landscape geometry just one step ahead. To overcome this limitation, we introduce a novel framework: Lookahead Optimization for Rehearsal (LOR). Instead of committing to a single update, LOR first explores a set of potential future model states by taking lookahead steps along different directions that balance plasticity and stability. To ensure the final update is robust, we formulate the optimization as a min-max problem, seeking parameters that perform well even under the worst-case lookahead scenario. This objective is made tractable by a smooth Log-Sum-Exp approximation, enabling efficient end-to-end training. Theoretical analysis from both optimization and statistical perspectives corroborates the robustness of our approach. Extensive experiments on Seq-CIFAR10, Seq-CIFAR100, and Seq-TinyImageNet demonstrate that LOR significantly outperforms state-of-the-art methods, establishing a new and more robust paradigm for rehearsal-based OCIL.
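The role of the Log-Sum-Exp approximation can be illustrated with a small sketch: the hard maximum over worst-case lookahead losses is replaced by a smooth upper bound that is differentiable everywhere. The abstract does not state the exact objective, so the temperature `tau`, the function name `smooth_max`, and the example loss values are assumptions made for illustration.

```python
import math

def smooth_max(values, tau=0.1):
    """Log-Sum-Exp approximation of max(values); tau > 0 controls the smoothing."""
    m = max(values)  # subtract the max before exponentiating for numerical stability
    return m + tau * math.log(sum(math.exp((v - m) / tau) for v in values))

# Hypothetical losses evaluated at three lookahead model states.
lookahead_losses = [0.8, 1.5, 1.2]

hard = max(lookahead_losses)
soft = smooth_max(lookahead_losses, tau=0.1)
```

For n values the approximation satisfies max <= smooth_max <= max + tau * ln(n), so a smaller `tau` tracks the worst case more tightly, while a larger `tau` spreads gradient across all lookahead states.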
Paperid: 3214,   Poster  
Authors: Mianzhao Wang, Fan Shi, Xu Cheng, Chen Jia, Shengyong Chen
Title: Uncertainty-Aware Modality Fusion for Unaligned RGB-T Salient Object Detection
Abstract: Unaligned RGB-T salient object detection (SOD) remains challenging due to severe cross-modal spatial discrepancies and unreliable feature fusion. Existing methods often assume perfect alignment or rely on geometric registration, which is computationally demanding and sensitive to cross-modal inconsistencies. To address these limitations, we propose an uncertainty-aware modality fusion network (UMFNet) that reformulates RGB-T SOD as an uncertainty-aware representation learning problem. Specifically, the proposed uncertainty alignment module (UAM) models pixel-wise features as Gaussian latent distributions to estimate local uncertainty and identify cross-modal consistency regions within the feature space, thereby achieving implicit alignment without explicit registration. Furthermore, the confidence-guided global modulation (CGM) mechanism leverages confidence maps derived from uncertainty estimation to adaptively regulate the fusion of RGB and thermal features, enhancing salient cues in reliable regions while suppressing noisy or inconsistent information. Extensive experiments on five unaligned and three aligned RGB-T SOD benchmarks demonstrate that UMFNet achieves state-of-the-art performance across diverse alignment conditions.
Paperid: 3215,   Poster  
Authors: Jinyang Bo, Fan Dou, Wenrui Quan, Shangxun Liu, Yang Xu, Yuhe Zhang, Kang Li, GuoHua Geng
Title: GHPT: Real-Time Relightable Gaussian Splatting using Hybrid Path Tracing
Abstract: 3D Gaussian splatting (3DGS) has emerged as a promising approach for high-fidelity 3D scene representation. However, relighting and composition of Gaussian splatting remain challenging because path tracing is not directly applicable. Existing relighting methods for Gaussian splatting typically adopt either approximate rendering formulations or rely on Gaussian ray tracing, yielding low relighting performance and low rendering efficiency. To address these limitations, we propose Gaussian hybrid path tracing (GHPT), a three-stage framework to acquire relightable Gaussian splatting models. The first stage utilizes planar-based Gaussian splatting reconstruction representation (PGSR) to enable multi-view consistent depth rendering and reconstruct the surface mesh of a scene. The second stage performs physically-based differentiable rendering on the obtained mesh to reconstruct the material maps and the environment map. The third stage utilizes factorized inverse path tracing (FIPT) on the G-buffer rendered by the PGSR, and visibility and indirect illumination are evaluated by hardware-accelerated ray tracing on the mesh with the material maps and the environment map reconstructed in the second stage. Experiments demonstrate that the relighting performance of GHPT outperforms the baselines, and our method can perform real-time relighting and composition of Gaussian splatting.
Paperid: 3216,   Poster  
Authors: Ziye Geng, Guang Yang, Yihang Chen, Changqing Luo
Title: IrisFP: Adversarial-Example-based Model Fingerprinting with Enhanced Uniqueness and Robustness
Abstract: We propose IrisFP, a novel adversarial-example-based model fingerprinting framework that enhances both uniqueness and robustness by leveraging multi-boundary characteristics, multi-sample behaviors, and fingerprint discriminative power assessment to generate composite-sample fingerprints. Three key innovations distinguish IrisFP: 1) It positions fingerprints near the intersection of all decision boundaries (unlike prior methods that target a single boundary), thus increasing the prediction margin without placing fingerprints deep inside target class regions, enhancing both robustness and uniqueness; 2) It constructs composite-sample fingerprints, each comprising multiple samples close to the multi-boundary intersection, to exploit collective behavior patterns and further boost uniqueness; and 3) It assesses the discriminative power of generated fingerprints using statistical separability metrics developed based on two reference model sets, one for pirated models and one for independently trained models, and assigns fingerprint-specific thresholds to retained fingerprints. Extensive experiments show that IrisFP consistently outperforms state-of-the-art methods, achieving reliable ownership verification by enhancing both robustness and uniqueness.
Paperid: 3217,   Poster  
Authors: Muhammad Zarar, Mingzheng Zhang, Xiaowang Zhang, Zhiyong Feng
Title: NeuroRule: Bridging Vision and Logic with Differentiable Rule Induction
Abstract: Scene Graph Generation (SGG) aims to structurally represent visual scenes by detecting objects and their pairwise relationships. Despite significant progress, current models encode visual knowledge with ambiguous visual context and logically inferred implicit relations due to their purely neural, pipeline-based nature. This limitation underscores the need to advance beyond identifying what relations exist to explaining why they exist and how they can be compositionally reasoned about through logical rule chaining. To address these challenges, we introduce NeuroRule, the first Neurally-Guided Rule Induction Network that integrates Mask2Former pixel-precise visual understanding with a differentiable rule induction engine. Our proposed method enables automatic learning of compositional logical rules directly from visual data while providing transparent explanations for relational predictions. NeuroRule introduces three key innovations: (1) a neural-symbolic bridge that maps visual features to probabilistic symbolic representations; (2) a differentiable rule-learning mechanism that automatically discovers interpretable first-order logic rules without manual engineering; and (3) a compositional chain rule system that enables complex inference while propagating confidence scores through an end-to-end trainable pipeline. Extensive experiments on the benchmark datasets, including Visual Genome (VG), Panoptic Scene Graph (PSG), and OpenPSG, demonstrate that NeuroRule achieves state-of-the-art performance. Our method significantly improves few-shot relation extraction while maintaining full interpretability in its rule-based explanations. To ensure reproducibility, we will release the code after publication.
Paperid: 3218,   Poster  
Authors: HAO ZHANG, Yanping Zha, Zizhuo Li, Meiqi Gong, Jiayi Ma
Title: MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement
Abstract: This paper focuses on a highly practical scenario: how to continue benefiting from the advantages of multimodal image fusion under harsh conditions when only visible imaging sensors are available. To achieve this goal, we propose a novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. Specifically, we develop MagicFuse, a novel single image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch based on the diffusion models. They mine scene information obscured in the visible spectrum and learn thermal radiation distribution patterns transferred to the infrared spectrum, respectively. Building on them, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from the diffusion streams of these two branches, from which a cross-spectral scene representation can be obtained through successive sampling. Then, we impose both visual and semantic constraints to ensure that this scene representation can satisfy human observation while supporting downstream semantic decision-making. Extensive experiments show that our MagicFuse achieves visual and semantic representation performance comparable to or even better than state-of-the-art fusion methods with multi-modal inputs, despite relying solely on a single degraded visible image.
Paperid: 3219,   Poster  
Authors: Xiuting Weng, Ruizhi Pu, Yuanhang Yao, Kun Yue, Zhiwen Tang, Lixing Yu
Title: GDFA: Geometry-Driven Federated Unlearning with Directional Task Vector Alignment
Abstract: Federated Learning (FL) is a decentralized framework that not only enables collaborative training with different clients but also ensures their local data privacy. However, when deletion requests arise under privacy regulations, efficiently removing the data contributions of specific target clients can be challenging. Existing unlearning methods face significant limitations under Non-IID (Non-Independent and Identically Distributed) data distributions when attempting to unlearn specific target clients in FL. Models in sharp optimization regions can suffer catastrophic knowledge loss from minor parameter changes, and conflicting parameter updates across clients caused by Non-IID data distributions exacerbate this forgetting. Empirically, we observe that conflicting updates under Non-IID settings generate misaligned task vectors that fail to isolate target knowledge. Therefore, we exploit the loss landscape geometry in unlearning specific target clients. We demonstrate that migrating models to flat regions can enhance unlearning robustness in Non-IID FL. Correspondingly, we introduce GDFA, a framework that initially transitions the global model to a flat loss domain. Subsequently, relevant clients generate unlearning task vectors, which GDFA filters to retain only directionally consistent components. This process isolates shared knowledge attributes before precise removal through reverse vector aggregation, maximizing knowledge retention. Extensive experiments demonstrate that GDFA outperforms state-of-the-art methods in unlearning efficacy and efficiency across diverse datasets and architectures, with minimal accuracy loss on retained tasks.
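Editorial sketch (not from the paper): one possible reading of "filter task vectors to retain only directionally consistent components, then remove them by reverse aggregation". The abstract does not give the filtering rule, so the per-coordinate sign-majority vote used here is an assumption, and the names `filter_consistent`, `unlearn`, and `scale` are hypothetical.

```python
import numpy as np

def filter_consistent(task_vectors):
    """Average only the coordinates whose sign agrees with the majority direction.

    Assumption: "directionally consistent" is interpreted as per-coordinate
    sign agreement across clients; the paper may use a different criterion.
    """
    tv = np.stack(task_vectors)                # shape: (clients, params)
    majority = np.sign(tv.sum(axis=0))         # elected direction per coordinate
    agree = (np.sign(tv) == majority) & (majority != 0)
    kept = np.where(agree, tv, 0.0)            # zero out conflicting components
    counts = np.maximum(agree.sum(axis=0), 1)  # avoid division by zero
    return kept.sum(axis=0) / counts

def unlearn(global_params, task_vectors, scale=1.0):
    """Reverse aggregation: subtract the filtered task vector from the model."""
    return global_params - scale * filter_consistent(task_vectors)

# Toy example: the second coordinate conflicts across clients and is dropped.
vectors = [np.array([1.0, -2.0, 0.5]), np.array([3.0, 2.0, 0.5])]
filtered = filter_consistent(vectors)
updated = unlearn(np.ones(3), vectors)
```

In this toy case the conflicting coordinate contributes nothing to the removal step, so the corresponding parameter of the global model is left unchanged.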
Paperid: 3220,   Poster  
Authors: Xin Cai, Zhiyuan You, Zhoutong Zhang, Tianfan Xue
Title: DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment
Abstract: Reducing the number of tokens in latent diffusion is important for both efficient training and inference, especially at high resolution. A common approach is to design high-compression image tokenizers that store more information per token by increasing the number of channels. However, packing more details into each token tends to make the latent space less structured, which in turn makes diffusion training difficult. To solve this, current solutions use semantic alignment or training-time dropout to impose structure on the latent space, which often requires retraining the diffusion model from scratch. Can we increase the compression ratio of the image tokenizer without requiring expensive retraining? We find that a simple solution is to explicitly add channels to the existing latent to capture image details and align them with the latent from the pre-trained diffusion model. Our method, Detail-Aligned VAE (DA-VAE), increases the compression ratio of a pretrained VAE while requiring only a lightweight adaptation stage for the corresponding pretrained diffusion backbone. Specifically, DA-VAE imposes an explicit latent structure: the first C channels of the latent space are given by the pre-trained VAE, encoding the input image at half the resolution. We use an extra D channels to encode details of the image at full resolution. To make this new latent diffusion-friendly, we introduce a simple detail alignment strategy that constrains the extra D channels to have a structure similar to that of the first C channels. With such a design, we provide a warm-start finetuning recipe that enables 1024×1024 image generation with Stable Diffusion 3.5 using only 32×32 tokens, 4× fewer than the original model. This adaptation takes only 5 H100 days. We also show that we can unlock 2048×2048 image generation with SD3.5, with a 6× speedup and more stable image structure. We further validate the effectiveness of our method and design decisions quantitatively on ImageNet.
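Editorial sketch (not from the paper): the token counts quoted above can be checked with a little arithmetic. The 8× VAE spatial downsampling and the 2×2 patchification of the diffusion backbone are assumptions about Stable Diffusion 3.5 that the abstract does not state, and `token_grid_side` is a hypothetical helper.

```python
def token_grid_side(image_size, latent_scale=1.0, vae_downsample=8, patch_size=2):
    """Side length of the token grid for a square image.

    Assumptions: the VAE downsamples by `vae_downsample` per side and the
    diffusion backbone groups `patch_size` x `patch_size` latents per token.
    `latent_scale=0.5` models encoding the image at half resolution.
    """
    latent_side = int(image_size * latent_scale) // vae_downsample
    return latent_side // patch_size

baseline_side = token_grid_side(1024)                    # full-resolution latent
da_vae_side = token_grid_side(1024, latent_scale=0.5)    # half-resolution latent
baseline_tokens = baseline_side ** 2
da_vae_tokens = da_vae_side ** 2
reduction = baseline_tokens // da_vae_tokens
```

Under these assumptions the half-resolution latent gives a 32×32 grid against a 64×64 baseline, which matches the 4× reduction reported in the abstract.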
Paperid: 3221,   Poster  
Authors: Chenhui Zhang, Guoqing Dong, WeijiePeng WeijiePeng
Title: ProgTrack: A Multi-Object Tracking Algorithm with Progressive Matching Strategy
Abstract: Multi-object tracking (MOT) based on unmanned aerial vehicles (UAVs) aims to identify and continuously track the positions of multiple ground targets during UAV flight. Current mainstream methods use appearance matching and motion matching to associate targets across consecutive frames. However, these methods often fail in the following scenarios: first, scenarios with multi-scale targets, where small targets have weak appearance features and small bounding boxes; second, scenarios with complex backgrounds or occlusions, where the background or occlusions interfere with the appearance features and change the bounding box size of targets; third, scenarios where the UAV lens shakes, rotates, or zooms, leading to misalignment between consecutive frames; and fourth, scenarios with high target similarity, where the appearance features of different targets are difficult to distinguish, such as vehicles on a road. To address these issues, we propose a multi-object tracking algorithm, ProgTrack, based on a multi-stage progressive matching mechanism. This algorithm simulates human eye tracking strategies, employing a progressive process of "first matching easily matched large targets, then matching difficult-to-match small targets, and finally matching the remaining mixed-scale targets." Accordingly, ProgTrack employs three matching strategies for targets of different scales and appearances: a simple Local Motion Information (LMI) matching strategy for large targets, a complex Context Enhancement Feature (CE-Feature) matching strategy for small targets, and a Global Motion Information (GMI) matching strategy for the remaining multi-scale targets. On the VisDrone2019 UAV tracking dataset, ProgTrack achieves MOTA, MOTP, and IDF1 scores of 40.2, 77.5, and 52.8, respectively, demonstrating state-of-the-art performance among ten methods.
Paperid: 3222,   Poster  
Authors: Gabriel Fiastre, Antoine Yang, Cordelia Schmid
Title: CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects
Abstract: Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatiotemporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to training strategies with limited data, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM, and extend the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap). Moreover, we introduce an end-to-end model, CaptionFormer, capable of jointly detecting, segmenting, tracking and captioning object trajectories. With pretraining on LVISCap and LV-VISCap, CaptionFormer achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT.
Paperid: 3223,   Poster  
Authors: Huaizhi Qu, Hossein Nourkhiz Mahjoub, Vaishnav Tadiparthi, Kwonjoon Lee, Tianlong Chen
Title: $\texttt{MonoVLM}$: Monocular 3D Visual Grounding with Vision Language Models
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in instruction following and 2D visual understanding. However, state-of-the-art VLMs, including GPT-5, still struggle with 3D perception, particularly in tasks such as monocular 3D visual grounding. While specialized vision-only models excel in this domain, they often lack the rich semantic understanding inherent to VLMs. To bridge this gap, we propose MonoVLM, a novel triple-stage training framework that equips VLMs with accurate monocular 3D grounding. The core of our method is a progressive training process, which utilizes Group Relative Policy Optimization (GRPO) to gradually teach the model to first localize the described object, then understand its 3D structure, and finally perform accurate estimation. Comprehensive experiments show that MonoVLM models significantly outperform existing VLMs and even surpass the performance of specialized vision-only models. We validate our design via extensive comparisons and ablation studies.
Paperid: 3224,   Poster  
Authors: Xuan Lu, Kangle Li, Haohang Huang, Rui Meng, Wenjun Zeng, Xiaoyu Shen
Title: Beyond Global Similarity: Multi-Conditional Retrieval for Fine-Grained Cross-Modal Understanding
Abstract: Recent advances in multimodal large language models (MLLMs) have substantially expanded the capabilities of multimodal retrieval, enabling systems to align and retrieve information across visual and textual modalities. Yet, existing benchmarks largely focus on coarse-grained or single-condition alignment, overlooking real-world scenarios where user queries specify multiple interdependent constraints across modalities. To bridge this gap, we introduce MCMR (Multi-Conditional Multimodal Retrieval), a large-scale benchmark designed to evaluate fine-grained, multi-condition retrieval under natural-language queries. MCMR spans five product domains (upper and bottom clothing, jewelry, shoes, and furniture) and preserves rich long-form metadata essential for compositional matching. Each query integrates complementary visual and textual attributes, requiring models to jointly satisfy all specified conditions for relevance. We benchmark a diverse suite of MLLM-based multimodal retrievers and vision–language rerankers to assess their condition-aware reasoning abilities. Experimental results reveal: (i) distinct modality asymmetries across models; (ii) visual cues dominate early-rank precision, while textual metadata stabilizes long-tail ordering; and (iii) MLLM-based pointwise rerankers markedly improve fine-grained matching by explicitly verifying query–candidate consistency. Overall, MCMR establishes a challenging and diagnostic benchmark for advancing multimodal retrieval toward compositional, constraint-aware, and interpretable understanding.
Paperid: 3225,   Poster  
Authors: Mario Markov, Stefan Ailuro, Luc Van Gool, Konrad Schindler, Danda Paudel
Title: FireScope: Wildfire Risk Raster Prediction With a Chain-of-Thought Oracle
Abstract: Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce FireScope-Bench, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose FireScope, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, FireScope achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that FireScope-Bench has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.
Paperid: 3226,   Poster  
Authors: Deyu Bo, Xinchao Wang
Title: Beyond Soft Label: Dataset Distillation via Orthogonal Gradient Matching
Abstract: Condensing the large-scale, high-resolution ImageNet-1K dataset remains a challenge for dataset distillation (DD). Existing methods typically match batch normalization (BN) statistics, i.e., mean and variance, between real and synthetic datasets. Although effective with soft labels, their performance degrades substantially under hard labels. In this paper, we theoretically identify that BN matching mainly aligns the scales of real and synthetic gradients but overlooks their directions. However, experimental evidence demonstrates that gradient direction, rather than scale, is pivotal to model training, clarifying the limitations of prior methods. Building on this insight, we introduce Orthogonal Gradient Matching (OGM), which explicitly aligns the intrinsic direction of gradients, i.e., singular vectors. Specifically, OGM first orthogonalizes real and synthetic gradients by setting all singular values to one, eliminating their scales, and then minimizes the distance between these orthogonal gradients so that their singular vectors coincide. To further reduce computation, OGM employs a least-squares loss whose gradients can be obtained in the forward pass, avoiding back-propagation. Extensive experiments on ImageNet-1K validate the effectiveness of OGM. With only ten images per class (IPC = 10), OGM achieves 47.0% accuracy with soft labels and 16.7% with hard labels, outperforming training-based DD methods and RDED.
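Editorial sketch (not from the paper): the orthogonalization step described above, "setting all singular values to one", can be illustrated with a plain SVD in NumPy. Treating a gradient as a 2D matrix and measuring the mismatch with a squared Frobenius distance are assumptions; the function names are hypothetical, and the paper's forward-pass least-squares loss is not reproduced here.

```python
import numpy as np

def orthogonalize(grad):
    """Replace every singular value of `grad` with one: U S V^T -> U V^T."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def ogm_distance(real_grad, syn_grad):
    """Squared Frobenius distance between the orthogonalized gradients.

    Assumption: an illustrative distance, not the loss used in the paper.
    """
    diff = orthogonalize(real_grad) - orthogonalize(syn_grad)
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
g_real = rng.standard_normal((4, 3))
g_scaled = 5.0 * g_real                      # same direction, different scale
g_other = rng.standard_normal((4, 3))        # unrelated direction

same_direction = ogm_distance(g_real, g_scaled)
different_direction = ogm_distance(g_real, g_other)
```

Because orthogonalization discards scale, a rescaled copy of a gradient is at (numerically) zero distance from the original, while a gradient with different singular vectors is not.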
Paperid: 3227,   Poster  
Authors: Xiao Zitong, Yuda Qiu, Zisheng Ye, Xiaoguang Han
Title: OMGTex: One-stage Multi-style Facial Texture Reconstruction without Geometry Guidance
Abstract: We propose OMGTex, an end-to-end diffusion-based framework for reconstructing high-quality and editable facial UV textures from multi-style facial images. Existing texture reconstruction methods face two major limitations: (1) Fragility due to reliance on 3D geometry priors, which are difficult to estimate accurately, especially under facial occlusions or in stylized domains; and (2) A lack of semantic disentanglement, inhibiting region-specific texture editing and style transfer. Our work addresses both challenges simultaneously. Our core innovation is a geometry-free pipeline that directly maps a 2D face image to its corresponding editable UV texture. We introduce two key techniques: First, to address the challenge of UV misalignment common in diffusion generation, we introduce a gradient-guided refinement strategy at inference time, which explicitly corrects structural consistency. Second, we leverage the inherent semantic distribution capability of diffusion models and design a novel training paradigm to enhance this tendency, enabling semantic-aware editing of facial texture. Furthermore, to address the data scarcity in multi-style texture reconstruction, we construct CANVAS, the first comprehensive paired texture reconstruction dataset covering realistic and diverse stylized domains. To the best of our knowledge, OMGTex is the first geometry-free inference framework that achieves robust, style-consistent, and editable facial texture reconstruction across diverse domains. Our method achieves state-of-the-art performance on facial texture benchmarks. Both the dataset and the pretrained model weights will be publicly released.
Paperid: 3228,   Poster  
Authors: Ziyi Gao, Zhipeng Wei, Jingjing Chen, Stewart Tan, Hao li, Yi-Ping Phoebe Chen
Title: Aligning Multi-Character Narrative Image Generation with Multi-Aspect Human Preferences
Abstract: Narrative image generation aims to create images featuring multiple distinct characters while capturing their interrelationships, posing significant challenges for current text-to-image diffusion models. As a result, general personalized methods often suffer from poor semantic alignment, identity blending, and aesthetic implausibility. These issues are inadequately captured by existing evaluation metrics such as CLIP, ArcFace, and conventional reward models, which fundamentally fail to align with human perceptual preferences. To align with human preferences, we first construct a fine-grained human preference dataset, NI-RLHF, by collecting both detailed human critiques and preference judgments across three core dimensions: prompt following, identity consistency, and visual quality. This comprehensive dataset facilitates the training of NIReward, a critique-based reward model capable of generating interpretable image evaluations. Building upon the interpretable reward signal from NIReward, we propose Adaptive Dominance-based Preference Optimization (ADPO) to balance learning across diverse preference dimensions while dynamically adapting to reward margins. Experimental results indicate that NIReward significantly outperforms existing evaluation models and reward models, and ADPO yields a significant improvement across the three key preference dimensions. By introducing NIReward and ADPO, our work paves the way for generating narrative images aligned with actual human preferences.
Paperid: 3229,   Poster  
Authors: Boyu Wang, Jun Xia, Mingsong Chen
Title: Robust3DGSW: Toward Robust Watermarking for Quantization-Aware 3D Gaussian Splatting
Abstract: Although current watermarking techniques for 3D Gaussian Splatting (3DGS) are promising in protecting the copyrights of both 3DGS models and their rendered images, they greatly suffer from low watermark robustness and poor rendering quality when applying quantization to large 3DGS models to accommodate resource-limited devices. To address these problems, this paper introduces a novel two-stage quantization-aware 3DGS watermarking approach called Robust3DGSW. By properly embedding watermarks into the mid-frequency bands of both the 3D Gaussian parameters and 2D rendered images, the first stage of Robust3DGSW can effectively counteract the quantization-induced signal loss and mitigate the adverse effects of watermarks on rendered images. In the second stage, Robust3DGSW trains both 2D and 3D decoders using our proposed multi-scale adversarial perturbation approach, alongside a gradual quantization process, which enables robust watermark extraction even under excessive quantization. Comprehensive experimental results obtained from the well-known Blender, LLFF, and MipNeRF-360 datasets demonstrate that, when compared to leading 3DGS watermarking techniques, Robust3DGSW not only mitigates the negative effects of quantization on watermarks but also enables fast rendering with high quality.
Paperid: 3230,   Poster  
Authors: hualiang wang, Siming Fu, Weinan Jia, Yuning Lu, Mu Liu, Jidong Jiang, Xiaomeng Li
Title: Prompt Yourself: Awakening Textual Semantics in 1D Visual Tokenizers
Abstract: One-dimensional (1D) visual tokenizers offer notable semantic compactness by discarding local spatial priors, and have become increasingly popular for image reconstruction and generation tasks. However, such global and sequential representations struggle to preserve fine-grained visual content; simply increasing network size or token count offers only superficial mitigation. To address this, we introduce VLTok, a novel 1D hybrid tokenizer that unifies Visual and Language representations in a shared Token space through a self-prompted training paradigm. During training, VLTok simultaneously generates 1D visual and textual tokens from images, aligning the textual tokens with embeddings from a pre-trained language model. This cross-modal alignment infuses implicit linguistic cues into the tokenizer, enhancing fine-grained image encoding. At inference, the self-prompted paradigm eliminates the need for external text, maintaining the simplicity of the image-only framework while benefiting from multi-modal guidance. Extensive experiments on the ImageNet benchmark demonstrate that VLTok achieves state-of-the-art performance in both image reconstruction and image generation. For example, under the same model parameter budget, our method yields relative reductions of 11.1% in rFID and 18.7% in gFID compared to GigaTok.
Paperid: 3231,   Poster  
Authors: Aviral Chharia, Fernando De la Torre
Title: Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation
Abstract: Generating large-scale 3D head avatars of non-existent identities with high fidelity and strong multi-view consistency (MVC) is essential for applications such as synthetic crowds, digital twins, and large asset libraries. For high scalability, avatars must be generated from minimal resources, without costly MV studio captures or any 3D data. First, we target this challenging minimal-resource setting for 3D head generation. Second, we argue that the common strategy of enforcing MVC via intermediate MV image generation is both expensive and fundamentally fragile. Instead, we analyze how MVC can be induced by design, showing that intermediate view synthesis is unnecessary. To this end, we introduce MVCHead, a fast, single-shot state space model that directly predicts Gaussians, without intermediate generation. At its core, we propose a Hierarchical State Space (HiSS) block that enforces grid-aligned coherence while capturing long-range dependencies. We further modify Mamba's standard unidirectional scanning into a Hierarchical Bi-directional State Scan (HiBiSS), scanning the render grid to better propagate geometric and appearance cues. Finally, we design an SE(3) MV Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without real MV data. In this setting, MVCHead surpasses SOTA in perceptual quality and on all three MVC axes: shape, texture, and geometry. The code has been submitted and will be open-sourced with model weights upon acceptance.
Paperid: 3232,   Poster  
Authors: Bing Han, Weiyuan Liu, changlong Zhang, Chenxi Wang, Zhibin Zhao, Zhi Zhai
Title: GeoDexGrasp: Geometry-aware Generation for Data-efficient and Physics-plausible Dexterous Grasping
Abstract: Achieving dexterous grasping remains a key challenge in robotics. Recent generative approaches enable diverse grasps through large-scale data-driven training, yet they often neglect geometric priors of objects, which leads to low data efficiency and poor physical plausibility. We propose GeoDexGrasp, a geometry-aware generation framework for dexterous grasping built upon object-centric geometric representations. We introduce a SIM(3)-equivariant network equipped with a self-supervised disentanglement strategy to extract interpretable and transferable geometric features, including shape, size, pose, and interaction direction. The overall generation process is then decomposed into two stages: first, root rotation generation conditioned on pose and interaction direction; second, hand grasp generation guided by shape and size. By leveraging geometric representations, GeoDexGrasp achieves SOTA physical plausibility (reducing penetration depth by 40%) across five datasets, and exhibits improved data efficiency. Additionally, GeoDexGrasp is lightweight (using less than 20% of the parameters of the previous SOTA method) and attains a comparable grasp success rate.
Paperid: 3233,   Poster  
Authors: Shiang-Feng Tsai, Yuan-Hong Liao, Jin-Cheng Jhang, Nan Qiao, Min Sun
Title: Pointing at Parts: Training-Free Few-Shot Grounding in Multimodal LLMs
Abstract: Part-level pointing is important for fine-grained interaction and reasoning, yet existing Multimodal Large Language Models (MLLMs) remain limited to instance-level pointing. Part-level pointing presents unique challenges: annotation is costly, parts are long-tail distributed, and many are difficult to specify precisely in language. We introduce POinting at Parts (POP), a training-free, plug-and-play approach that addresses these challenges under a few-shot setup. POP fuses textual and visual attention maps with self-supervised visual correspondences from the query image and few-shot examples. On average across the three evaluated datasets, POP achieves accuracy gains of up to 8.9 points in the one-shot setting and 16.4 points in the three-shot setting for the pointing-capable MLLMs Qwen2.5-VL, Ovis2.5, and Molmo. Notably, even MLLMs without pointing capability benefit significantly from the proposed approach. These results establish a simple yet effective path toward fine-grained spatial grounding in MLLMs.
Paperid: 3234,   Poster  
Authors: WENBIN LUO, Takafumi Iwaguchi, Ryusuke Sagawa, Hiroshi Kawasaki
Title: Revisiting Optimal Coding for I-ToF under Practical Sensor Constraints
Abstract: The depth precision of an indirect time-of-flight (I-ToF) camera is highly dependent on its coding scheme. However, identifying the optimal coding scheme is challenging due to the infinitely many possible combinations of modulation and demodulation functions. Although previous works have derived depth-precision metrics to guide coding-scheme design, they either do not satisfy the constraints of real-world I-ToF devices or rely heavily on large-scale deep-learning optimization. In this work, we first analyze the error-propagation process in I-ToF depth sensing and derive a new metric for guiding the design and search of coding schemes. Then we incorporate practical hardware constraints of I-ToF sensors directly into the coding-scheme design, which greatly reduces the space of feasible modulation and demodulation functions and makes metric-based search feasible. The coding schemes obtained by our search method outperform previous schemes in both simulations and real-world experiments.
Paperid: 3235,   Poster  
Authors: Seong Je Oh, Ju Hwan Lee, Chae Yeon Lim, Donghwan Lee, Myung Jin Chung, Kyungsu Kim
Title: GHNAF: Grid-Adaptive Hash-Level–Attended Neural Attenuation Fields for Discrepancy-Aware CBCT
Abstract: The advent of hash encodings has turned neural radiance field (NeRF)-based methods into fast and efficient 3D reconstruction techniques. In medical imaging, this framework has been extended to CT/CBCT reconstruction through neural attenuation fields (NAF), which directly model attenuation properties from projection data. Existing NeRF-based attenuation fields typically assume an idealized monoenergetic CBCT setting and therefore fail to model real-world projection inconsistencies such as scatter and noise contamination. Moreover, uniformly concatenating multi-resolution hash-grid features blends heterogeneous frequency components and noise into a single representation, causing artifacts: homogeneous regions acquire spurious high-frequency patterns, structural boundaries become blurred, and projection-induced bias propagates throughout the learned field. Given these limitations, we introduce the Grid-Adaptive Hash-Level–Attended Neural Attenuation Field (GH-NAF). Instead of collapsing noise-corrupted projection signals into a single feature space, GH-NAF trains each hash-grid level independently, guided by uncertainty-based confidence scores. This enables stable low-frequency modeling in homogeneous tissues while selectively preserving high-frequency detail around structural boundaries. Experiments on synthetic and real CBCT datasets demonstrate that GH-NAF reliably preserves intra-material contrast and achieves superior reconstruction quality compared with state-of-the-art methods.
Paperid: 3236,   Poster  
Authors: Zhiqiang Kou, Junxiang Wu, Wenke Huang, Wenwen He, Ming-Kun Xie, Changwei Wang, Yuheng Jia, Di Jiang, Yang Liu, Xin Geng, Qiang Yang
Title: FedHarmony: Harmonizing Heterogeneous Label Correlations in Federated Multi-Label Learning
Abstract: Multi-label representations encode higher-order label dependencies, yet in federated settings the local estimates of these dependencies are statistically inconsistent, causing structural drift across clients and rendering naive quantity-weighted aggregation suboptimal. We propose FedHarmony, a federated multi-label learning framework that harmonizes heterogeneous label correlations without sharing raw data. A Correlation Expert is formed by leave-one-out consolidation of clients’ label–label correlation statistics to provide a round-wise global consensus. Guided by this expert, each client performs consensus-guided correction that aligns its local correlation to the consensus within clusters of strongly related labels obtained via spectral clustering of the expert matrix. This block-wise alignment targets dense, high-signal subspaces. We establish two guarantees: (i) restricting alignment to in-cluster pairs strictly improves optimization curvature and linear convergence rate; (ii) ignoring cross-cluster entries incurs only a bounded, quantitatively small information loss when the consensus is near block-diagonal. Finally, a correlation-aware central aggregation combines data quantity with a dynamic measure of correlation learning quality, using a dynamic balance factor that transitions from quantity-driven weighting in early rounds to structure-driven weighting later. Extensive experiments under diverse non-IID regimes (varying label distributions, client heterogeneity, and client counts) show consistent gains over federated baselines in mAP/F1/Hamming Loss, with improved stability and communication efficiency.
Paperid: 3237,   Poster  
Authors: Jiarui Zhao, Libo Huang, Xiangqi Li, Zhulin An, Chuanguang Yang, Yu Wang, boyu diao, Yongjun Xu
Title: Representation-Steered Incremental Adapter-Tuning for Class-Incremental Learning with Pre-Trained Models
Abstract: Class-Incremental Learning (CIL) aims to develop models that continuously learn new classes without forgetting previously learned ones. Recent advances combine pre-trained models with parameter-efficient fine-tuning, achieving promising results. However, these approaches typically allocate new trainable parameters for each task, causing the model size to grow linearly with the number of tasks. Moreover, they lack explicit mechanisms to structure a coherent and discriminative representation space across tasks. To address these limitations, we propose Representation-Steered Incremental Adapter Tuning (RSIAT). RSIAT maintains a single shared adapter for all tasks, eliminating parameter growth during incremental learning. In the base task, we introduce a representation-steering loss that enhances discriminative feature learning while facilitating future task adaptation. During incremental tasks, a residual autoencoder–based projector aligns feature distributions between old and new tasks, preserving representation consistency without over-constraining the shared adapter. Extensive experiments on six CIL benchmarks demonstrate that RSIAT significantly outperforms state-of-the-art methods in both performance and parameter efficiency, achieving superior stability–plasticity trade-offs with minimal trainable parameters.
Paperid: 3238,   Poster  
Authors: Mingjie Xie, heguangjun heguangjun, Dongli Xu, Youtian Lin, Hongjue Li, Pengming Feng, Jian Guan, Yue Deng
Title: SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception
Abstract: Open-vocabulary dense perception (OVDP) aims to localize objects unseen during training by leveraging textual knowledge. Despite the remarkable progress of recent CLIP-based approaches, we identify a critical limitation: synonym-induced grounding inconsistency, where semantically equivalent expressions yield disparate spatial attention patterns. This inconsistency undermines the robustness and performance of existing methods in real-world OVDP applications. To address this issue, we propose SynCLIP, a Synonym-Coherent Language-Image Pretraining framework that enhances synonym-robust grounding for OVDP tasks. SynCLIP introduces a Semantic-consistent Spatial Attention alignment (SSA) module to enhance spatial attention consistency by minimizing discrepancies between attention maps of original and synonymous expressions. Furthermore, a Spatial Attention Refinement (SAR) module selectively strengthens the most semantically relevant spatial regions within aligned maps, resulting in more precise and stable grounding. To support synonym-coherent pretraining, we also construct a Synonym-Enriched Visual Corpus (SEViC), which augments each category with multiple synonyms and textual definitions. Extensive experiments on multiple benchmarks demonstrate that SynCLIP substantially improves grounding consistency under diverse linguistic variants and achieves state-of-the-art performance among CLIP-based OVDP methods.
Paperid: 3239,   Poster  
Authors: Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, zhipeng cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra, Yunyang Xiong
Title: VideoAutoThink: Video Auto Reasoning via Thinking Once, Answering Twice
Abstract: Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models in video understanding. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher compute cost. Motivated by this, we propose VideoAutoThink, a video understanding framework that adopts a "reason-when-necessary" strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised with verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAutoThink achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 144 to just 44 tokens. Moreover, we observe low activation of thinking on perception-oriented tasks, but higher activation on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
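The inference-time gating described in this abstract can be illustrated with a minimal sketch. It assumes the model exposes a scalar confidence for its initial answer and that a fixed threshold decides whether to reason; the function name `gated_answer`, the `reason_and_review` callback, and the threshold value are hypothetical and are not taken from the paper.

```python
from typing import Callable, Tuple


def gated_answer(
    initial_answer: str,
    confidence: float,
    reason_and_review: Callable[[str], str],
    threshold: float = 0.9,  # hypothetical value, not from the paper
) -> Tuple[str, bool]:
    """Return (final_answer, used_reasoning).

    If the initial answer is confident enough, return it directly and skip
    the costly reasoning pass; otherwise run reasoning and return the
    reviewed answer.
    """
    if confidence >= threshold:
        return initial_answer, False
    return reason_and_review(initial_answer), True


if __name__ == "__main__":
    # Stand-in for the reasoning pass: always revises the answer to "B".
    review = lambda first: "B"
    print(gated_answer("A", 0.95, review))  # confident: reasoning skipped
    print(gated_answer("A", 0.40, review))  # uncertain: reasoning invoked
```

Under this assumption, the average response length shrinks in proportion to how often the confidence clears the threshold, which matches the efficiency argument made in the abstract.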
Paperid: 3240,   Poster  
Authors: Dingyi Zhao
Title: FedSST: Rethinking Fair Federated Graph Learning under Structural Shift
Abstract: Federated Graph Learning (FGL) offers a privacy-preserving paradigm for collaborative training on graph data, yet significant topological heterogeneity poses a critical threat to generalization fairness, often yielding a global model dominated by a subset of clients. This introduces two critical issues: at the global level, aggregation bias disproportionately amplifies the influence of dominant clients, while at the local level, blind optimization results in inefficient and inequitable training processes. To address these challenges, we propose FedSST, an adaptive fairness framework. FedSST introduces a fair, structure-based signal to quantify client contributions, which in turn guides fair aggregation and adaptive local training. Extensive experiments across diverse cross-domain and cross-dataset settings demonstrate that FedSST enhances generalization fairness and overall model performance, outperforming various state-of-the-art methods.
Paperid: 3241,   Poster  
Authors: Jiachen Lu, Hailan Shanbhag, Haitham Al Hassanieh
Title: Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals
Abstract: Reconstructing object geometry from radio frequency (RF) signals is fundamentally challenging due to the lensless imaging nature of RF sensing, which leads to low spatial resolution and high noise. Unlike light signals, RF signals can penetrate occlusions and thus capture information about hidden scenes. Existing Non-Line-of-Sight (NLoS) 3D neural reconstruction methods can recover coarse surfaces inside enclosed environments but often suffer from unstable optimization, noisy surface geometry, and surface ambiguity, failing to produce accurate zero-level sets from the signed distance field (SDF). These limitations largely stem from neglecting the role of Line-of-Sight (LoS) geometry outside the enclosed region, which provides valuable physical constraints for modeling signal propagation. In this paper, we introduce a unified LoS and NLoS neural geometry reconstruction framework that leverages the outside LoS geometry to model and guide RF propagation from the LoS region into the NLoS region. By integrating visual LoS priors into the neural field formulation, our system achieves stable training and physically consistent reconstruction of both visible and hidden geometry, setting a new state of the art in RF-based geometry reconstruction.
Paperid: 3242,   Poster  
Authors: Seonho Kim, JUNHYEONG HONG, Kyungjae Lee, Yoonseon Oh
Title: INSIGHT Bench: Towards Grounded IN-SItu Guidance for Robotic ManipulaTion
Abstract: Humans intuitively rely on text and symbols inscribed on objects (e.g., "PULL", "Squeeze and Turn") to perform tasks safely and correctly. In contrast, vision-language-action (VLA) models excel at following external language commands, but remain largely unaware of this object-centric information. This capability is essential for reliable robotic operation, yet progress remains unmeasured due to the absence of standardized benchmarks. To address this gap, we introduce INSIGHT Bench, a benchmark that formalizes the task of "in-situ guide grounding". INSIGHT Bench provides a comprehensive taxonomy that evaluates how agents utilize diverse guide information, including action-direction cues and procedural instructions. It also includes a scalable simulation framework that procedurally generates tasks and programmatically links each visual guide to its corresponding physical constraint. We release both the benchmark and the resulting trajectory dataset to support future research. Our evaluation of state-of-the-art VLA models reveals a critical limitation: their ability to ground in-situ guides is inconsistent and strongly dependent on the type of information. While models succeed on some guide categories, they frequently fail on others. However, performance improves substantially when the same information is provided as language instructions, indicating that in-situ guides could contribute to manipulation performance if VLAs were capable of interpreting them. These findings underscore the need for further research on understanding and grounding in-situ guides.
Paperid: 3243,   Poster  
Authors: Wenkang Zhang, Kaicheng Yang, Xiang An, Qiang Li, Ziyong Feng, Wankou Yang, Jiankang Deng
Title: Towards Streaming Referring Video Segmentation via Large Language Model
Abstract: Current referring video segmentation methods typically operate in an offline manner, where sparse frames are first selected for image-level referring segmentation, and the resulting masks are then propagated across the video. Although video sampling captures global context, its isolated processing steps not only complicate optimization but also restrict applicability to real-world streaming scenarios. In this paper, we propose StreamingRVOS, a simple but efficient MLLM-based framework that extends image-level segmentation to the video level via a streaming pipeline without introducing extra parameters. Specifically, we employ a Semantic Embedding Recycling (SER) method to propagate temporal context across frames, enabling the model to perceive semantic representations in the video. Then, we propose an Online Mask Consistency Perception (OMCP) strategy to adaptively invoke the MLLM to re-perceive the current scene and regenerate the semantic embedding. We conduct extensive experiments on multiple downstream datasets to demonstrate the effectiveness of StreamingRVOS. Compared to previous methods, our method achieves excellent performance in referring video segmentation (the 1B variant improves upon Sa2VA by 19.2 on the MeViS dataset), while operating at an average speed of 7 FPS under streaming inference on a single A800 GPU.
Paperid: 3244,   Poster  
Authors: Shoubin Yu, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara, Maria Wang, Jindong Chen, Mohit Bansal, Boqing Gong
Title: Ego2Web: A Web Agent Benchmark Grounded on Egocentric Videos
Abstract: Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and perception, lacking grounding in the user's real-world physical surroundings. This limitation prevents evaluation in crucial scenarios, such as when an agent must use egocentric visual perception (e.g., via AR glasses) to recognize an object in the user's surroundings and then complete a related task online (e.g., making a purchase). To address this gap, we introduce Ego2Web, the first benchmark designed to bridge egocentric video perception and multimodal web agent execution. Ego2Web pairs real-world first-person video recordings with web tasks that require visual understanding, web task planning, and interaction in an online environment for successful completion. We utilize an automatic data-generation pipeline combined with human verification to curate well-constructed, high-quality video-task pairs across diverse web task types, including e-commerce, navigation, and media search. To facilitate more accurate and scalable evaluation on our benchmark, we also develop a novel LLM-as-a-Judge automatic evaluation method, Ego2WebJudge, and demonstrate around 85% agreement with human judgment, substantially higher than existing evaluation methods. Experiments with diverse SoTA multimodal agents show that they perform significantly below the human level, revealing a major gap in capability. We also conduct a comprehensive ablation study on task design, highlighting the necessity of video perception in the proposed task and the limitations of current agents. We hope Ego2Web can be a critical new resource for developing truly capable AI assistants that can seamlessly see, understand, and act across the physical and digital worlds.
Paperid: 3245,   Poster  
Authors: Zhenghao Huang, Kaikai Wang, HUILIN YAO, Lin Shu
Title: Reading Your Actions: Learning Generalizable Action Representations via Pre-training AEMG
Abstract: Electromyography (EMG) is crucial for decoding human motor intentions and achieving natural human-computer interaction, but its generalization ability across subjects, devices, and tasks has long been limited by data heterogeneity, scarce annotations, and the lack of a unified representation paradigm. In this work, we introduce a novel perspective on EMG signals, treating muscle contractions as words and activation sequences as sentences. Based on this perspective, we design a Neuromuscular Contraction Tokenizer (NCT) that generates semantically consistent EMG sentences from raw signals. Building on this, we propose Any Electromyography (AEMG), the first large-scale pre-training framework for EMG and a general EMG representation learning framework based on self-supervised pre-training. Furthermore, we construct the largest cross-device EMG vocabulary to date, which supports seamless transfer across arbitrary channel topologies and sampling rates. Extensive experiments demonstrate that AEMG outperforms state-of-the-art baselines by 5.79–9.25% in zero-shot leave-one-subject-out accuracy, and achieves over 90% few-shot adaptation performance with only 5% of the target user's data. Our work proposes the concept of EMG signals as a cross-device physiological language, learns its grammar from massive amounts of data, and lays the groundwork for a train-once, universally applicable EMG foundation model.
Paperid: 3246,   Poster  
Authors: Lilin Zhang, Yimo Guo, Li Yue, Jiancheng Shi, Xianggen Liu
Title: Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation
Abstract: Deep neural networks are highly vulnerable to adversarial examples, i.e., small perturbations that can significantly degrade model performance. While adversarial training has become the primary defense strategy, most studies focus on balanced datasets, overlooking the challenges posed by real-world long-tail data. Motivated by the fact that perturbations in adversarial examples inherently alter the training distribution, we theoretically investigate their impact. We first revisit adversarial training for long-tail data and identify two key limitations: (i) a skewed training objective caused by class imbalance, and (ii) unstable evolution of adversarial distributions. Furthermore, we show that perturbations can simultaneously address both adversarial vulnerability and class imbalance. Based on these insights, we propose Rebalanced Adversarial Intensity for Long-Tailed Data (RAIL), a plug-and-play framework that adaptively adjusts perturbations during adversarial training. Extensive experiments demonstrate that RAIL consistently enhances adversarial robustness and class balance on long-tailed datasets.
Paperid: 3247,   Poster  
Authors: Jinming Chai, Lingling Li, Licheng Jiao, Xiaoqiang Lu, Long Sun, Xu Liu, Wenping Ma, Weibin Li
Title: ContourVertex: Bridging Semantics and Geometry for Referring Remote Sensing Interpretation
Abstract: The referring expression comprehension and segmentation (RECS) task plays a vital role in remote sensing due to its high efficiency in multitasking. However, RECS has reached a performance bottleneck rooted in representational insufficiency, primarily due to cross-task representational fragmentation in multi-task interpretation. In this paper, we propose RECS4R, a unified multi-task framework to improve RECS performance. At the representation level, we introduce a language-guided unified contour decoding paradigm (LCUDP) that takes a language-conditioned contour as the intermediate carrier to decode REC and RIS synchronously, structurally preserving geometric and semantic consistency and enabling lightweight, efficient decoding. At the refinement level, we introduce residual coarse-to-fine encoding (RCE), shifting the fine stage from learning from scratch to error correction. At the reaggregation level, we design channel-isolated multi-scale fusion (CIMF) to achieve lossless feature fusion. At the regularization level, we employ a gradient consistency loss (GCL) to enhance LCUDP and improve boundary adherence. Moreover, we validate RECS4R on remote-sensing and natural datasets, including RefDIOR, RRSIS-D, OPT-RSVG, RefCOCO, RefCOCO+, and RefCOCOg, and verify the image encoder under CNN, Transformer, and Mamba backbones, achieving advanced performance. Code will be released soon.
Paperid: 3248,   Poster  
Authors: Shangjie Xue, Jesse Dill, Dhruv Ahuja, Frank Dellaert, Panagiotis Tsiotras, Danfei Xu
Title: Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field
Abstract: We present Gaussian Splatting Anisotropic Visibility Field (GAVIS), a novel framework for uncertainty quantification and active mapping in 3DGS. Our key insight is that regions unseen from the training views yield unreliable predictions from the 3DGS. To address this, we introduce a principled and efficient method for quantifying the visibility field in 3DGS, defined as the anisotropic visibility of each particle with respect to the training views, and represented using spherical harmonics. The resulting visibility field is integrated into a Bayesian Network–based uncertainty-aware volume rendering process, enabling real-time (200 FPS) uncertainty quantification for synthesized views. Active mapping is further performed within a maximum information gain framework building on this formulation. Extensive experiments across diverse environments demonstrate that GAVIS consistently and significantly outperforms prior approaches in both accuracy and efficiency. Moreover, beyond standalone use, our method can be applied post-hoc to improve the performance of existing approaches.
Paperid: 3249,   Poster  
Authors: Yi-Lin Wei, Haoran Liao, Yuhao Lin, Pengyue Wang, Zhizhao Liang, Guiliang Liu, Wei-Shi Zheng
Title: CycleManip: Enabling Cycle-based Manipulation via Effective History Perception and Understanding
Abstract: In this paper, we explore an important yet underexplored task in robot manipulation: cycle-based manipulation, where robots need to perform cyclic or repetitive actions with an expected terminal time. These tasks are crucial in daily life, such as shaking a bottle or knocking a nail. However, few prior works have explored this task, leading to two main challenges: 1) imitation methods often fail to complete these tasks within the expected terminal time due to the ineffective utilization of history; 2) the absence of a benchmark with sufficient data and automatic evaluation tools hinders the development of effective solutions in this area. To address these challenges, we first propose the CycleManip framework to achieve cycle-based task manipulation in an end-to-end imitation manner without requiring any extra models, hierarchical structure, or significant computational overhead. The core insight is to enhance effective history perception through a cost-aware sampling strategy and to improve historical understanding through multi-task learning. Second, we introduce a cycle-based task manipulation benchmark, which provides diverse cycle-based tasks and an automatic evaluation method. Extensive experiments conducted in both simulation and real-world settings demonstrate that our method achieves high success rates in cycle-based task manipulation. The results further show strong adaptation performance in general manipulation, and plug-and-play ability on imitation policies such as Vision-Language-Action (VLA) models. Moreover, the results show that our approach can be applied across diverse robotic platforms, including bi-arm grippers, dexterous hands, and humanoid robots.
Paperid: 3250,   Poster  
Authors: Bingwen Dong, Gan Liu, Xiaoxi Lu, Guangcheng Chen, Jialu ZHANG, Yan Hu, Xiaoqing Zhang, Jiang Liu
Title: Real2Sim2Real: RetinalDepth-64K for Depth Estimation in Posterior Segment Ophthalmic Surgery
Abstract: Accurate depth estimation is crucial for 3D reconstruction and precise navigation in ophthalmic fundus surgery. However, acquiring annotated data remains challenging due to the impracticality of depth sensors under surgical microscopes. To overcome this limitation, we introduce RetinalDepth-64K, a novel synthetic dataset comprising 64,000 stereo image pairs across 1,280 diverse scenes, developed through a Real2Sim2Real pipeline that transforms real-world fundus surgery videos into synthetic data and facilitates model deployment in real scenarios. We analyzed key characteristics such as intricate retinal textures from real-world videos to guide the Real-to-Sim phase, enabling realistic data synthesis. To improve dataset fidelity for depth estimation, we created 3D eye models using Blender with ultra-wide-field retinal textures, glass-modeled aqueous humor, and dynamic instrument trajectories, enhanced by post-processing to ensure photorealism. The dataset provides RGB images, depth maps, normal maps, and instrument segmentation masks from a binocular view, supporting the training of monocular, binocular, and video-based depth estimation models to enhance robustness. In the Sim-to-Real phase, quantitative and qualitative experiments show that finetuning foundation models with RetinalDepth-64K produces accurate depth predictions for synthetic data. Comparative analysis of zero-shot and finetuned models further validates robust generalization to real fundus surgery scenes, offering significant potential to enhance surgical precision and support the training of novice surgeons through reliable depth cues. As the first dataset of its kind for retinal surgery, RetinalDepth-64K offers a vital resource for advancing 3D reconstruction and surgical navigation in ophthalmology.
Paperid: 3251,   Poster  
Authors: Yuntao Du, Yiming Wang, Renshuo Yuan, Jincheng Yue, Yijing Chen, Yue Fan, Bo Zhang, Qian Li, Lizhen Cui
Title: VKG-QA: Visual Knowledge Graph-based Question Answer for Large Multimodal Models
Abstract: Understanding and reasoning over structured knowledge is a fundamental capability for intelligent systems. While Large Language Models (LLMs) have leveraged textual knowledge graphs for relational reasoning, linearizing graph structures into text often leads to token inefficiency and loss of higher-order relational cues. Inspired by the advances of Large Multimodal Models (LMMs), and to capture higher-order relational structures explicitly, we propose a novel paradigm of visualized knowledge representation, where knowledge graphs are transformed into graphical visualizations that LMMs can directly perceive and reason over. To systematically evaluate this capability, we introduce VKG-QA, a benchmark for Visual Knowledge Graph-based Question Answering, covering three major categories and fourteen subtasks. VKG-QA is constructed via a semi-automatic pipeline ensuring high-quality, semantically aligned, and visually clear data. We evaluate 19 representative LMMs on VKG-QA and perform extensive quantitative and qualitative analyses. Results reveal that current models struggle with visualized relational understanding, graph-specific comprehension remains challenging, and closed-source models significantly outperform open-source counterparts. VKG-QA thus highlights critical limitations in current LMMs and provides a scalable platform for advancing graph-aware visual reasoning.
Paperid: 3252,   Poster  
Authors: Yaoyu Jin, Xiaochun Yang, Hong Liu, Leixia Wang, Jian Li, Rui Ding, Bin Wang
Title: COPYLENS: Towards Copyrighted Characters Infringement Detection via Copyright-Aware Prompt Learning
Abstract: Recent advances in text-to-image (T2I) generation can produce images that closely resemble copyrighted characters, often indistinguishable from official depictions, raising serious concerns about intellectual property infringement. Consequently, robust detection of copyrighted character infringement is urgently needed. Yet, existing methods exhibit limited alignment with human judgments regarding the likelihood of infringement. To bridge this gap, we propose CopyLens, a novel prompt optimization framework that automatically refines textual prompts for vision-language model-based detectors to better match human infringement judgments. Our approach establishes a closed-loop refinement process between a large vision-language model (LVLM) and a large language model (LLM): the LVLM assesses generated images for copyright detection, while the LLM iteratively optimizes detection prompts via meta-prompting, guided by feedback signals derived from human annotation consistency. To facilitate the assessment of prompt-human alignment, we introduce CopyChars, a new large-scale dataset of over 7,000 AI-generated images spanning more than 100 popular copyrighted characters, along with detailed human annotations on potential infringement. Extensive experiments on CopyChars show that the proposed CopyLens can improve detection performance by 5% to 10% compared to recent state-of-the-art methods. This work offers a scalable and automated solution for visual copyright protection and highlights the critical role of prompt engineering.
Paperid: 3253,   Poster  
Authors: Shuhao Han, Wenjie Liao, Haotian Fan, Hang Dong, Rui Zhang, Chun-Le Guo, Chongyi Li
Title: DNF-SR: Dual-Input and Negative-Aware Feature Fine-Tuning for Real-World Image Super-Resolution
Abstract: Benefiting from the powerful generative priors of diffusion models, diffusion-based real-world image super-resolution (Real-ISR) methods have demonstrated impressive performance. To achieve efficient Real-ISR, several recent works have designed one-step diffusion-based models. However, directly feeding low-resolution (LR) images into a diffusion model creates a distributional gap with the model's original input. A straightforward approach to reduce this gap is to introduce noise to the LR latents. However, directly adding noise inevitably corrupts the content of the LR images. In this study, we propose DNF-SR, a Dual-input and Negative-aware Feature fine-tuning method for Real-ISR. Specifically, we use a dual-input strategy that concatenates the original LR image with the noisy LR input and feeds them into a diffusion-based image editing model, ensuring both high-fidelity one-step super-resolution and improved perceptual and content consistency. Additionally, the noise present in the noisy LR input introduces randomness and diversity into the outputs. We exploit this property and propose a post-training optimization method, Negative-aware Feature Fine-Tuning (NF²T), which guides the model toward producing higher-quality results. NF²T classifies multiple outputs into positive and negative subsets and then defines implicit policy improvement directions in both the image and feature spaces, thereby further enhancing the stability of the optimization. Extensive experiments show that DNF-SR outperforms other methods. Code will be released.
Paperid: 3254,   Poster  
Authors: Zhiwen Zheng, Hao Zhou, Huiyu Qi, Zhao Huang, Guangyuan Zhang, Shaowei Jiang, Wenwen Tang, Bin Yang, Jin Liu, Xiaoshuai Zhang, Xingru Huang
Title: Similarity-Consistent Likelihood Diffusion enables Hidden Person Detection from Wall Reflections
Abstract: This paper studies passive non-line-of-sight corner-camera detection and human localization using faint indirect reflections on a visible wall. The challenge is twofold: multi-exposure wall observations are unstable and entangled with sensor nonlinearities, and mapping these observations to a hidden-view RGB image is severely underdetermined, making purely discriminative regressors brittle and unconstrained diffusion priors stochastic. To address these challenges, we introduce the Similarity-Likelihood Diffusion Network (SLD-Net), a two-stage framework that produces measurement-consistent, deterministic reconstructions. First, DeLi-Inversion forms an exposure-aware differential representation and jointly predicts an initial reconstruction and a pixel-wise precision map, yielding a heteroscedastic pseudo-likelihood. Second, SiCo-Diffusion injects this likelihood as precision-weighted energy into a deterministic DDIM trajectory and fuses it with the diffusion prior using an annealed Bayesian precision rule, producing a unique reconstruction for fixed observations and schedules. Extensive experiments on two real datasets, Reflect-Corridor and Reflect-Room, demonstrate that the proposed method outperforms generic, physics-inspired, and NLOS-specific baselines across PSNR, SSIM, LPIPS, and FID. In particular, relative to the best-performing baseline, it improves PSNR from 13.84 to 15.58 dB on Reflect-Corridor and from 11.58 to 12.49 dB on Reflect-Room, and reduces FID from 264.91 to 73.54 and from 177.05 to 26.89, respectively, while also achieving the lowest LPIPS on both datasets.
Paperid: 3255,   Poster  
Authors: Yulin Zhang, Cheng Shi, Sibei Yang
Title: WeaveTime: Streaming from Earlier Frames into Emergent Memory in VideoLLMs
Abstract: Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence, yielding two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past–current focus blindness, in which it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model-agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective (our Streaming Order Perception enhancement) that instills order-aware representations with minimal finetuning and no specialized streaming data. At inference, a Past–Current Dynamic Focus Cache performs uncertainty-triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into existing Video-LLMs without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time-aware streaming Video-LLMs under strict online, time-causal constraints. Code and weights will be made publicly available.
Paperid: 3256,   Poster  
Authors: Xincheng Shuai, Ziye Li, Henghui Ding, Dacheng Tao
Title: GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
Abstract: Generating accurate glyphs for visual text rendering is essential yet challenging. Existing methods typically enhance text rendering by training on a large amount of high-quality scene text images, but the limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods leverage reinforcement learning to alleviate this issue, yet their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose GlyphPrinter, a preference-based text rendering method that eliminates reliance on explicit reward models. However, the standard DPO objective only models overall preference between two samples, which is insufficient for visual text rendering where glyph errors typically occur in localized regions. To address this issue, we construct the GlyphCorrector dataset with region-level glyph preference annotations and propose Region-Grouped DPO (R-GDPO), a region-based objective that optimizes inter- and intra-sample preferences over annotated regions, substantially enhancing glyph accuracy. Furthermore, we introduce Regional Reward Guidance, an inference strategy that samples from an optimal distribution with controllable glyph accuracy. Extensive experiments demonstrate that the proposed GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision.
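For context on the sample-level baseline this abstract contrasts against, the standard DPO objective for a single preference pair can be sketched as follows. This is the generic DPO loss, not the region-grouped R-GDPO objective, whose exact form the abstract does not give. The function name `dpo_loss`, its argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative default, not from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy already prefers the chosen sample: the loss falls below log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Because each pair contributes a single scalar margin computed over whole samples, this form cannot express that only one region of an image carries the glyph error, which is the gap the abstract says region-level preferences are meant to close.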
Paperid: 3257,   Poster  
Authors: Mengmeng Sheng, Zeren Sun, Tao Chen, Jinshan Pan, Yazhou Yao, Fumin Shen
Title: Revisiting Learning with Noisy Labels: Active Forgetting and Noise Suppression
Abstract: Learning with noisy labels (LNL) has received growing attention, with most prior work following the paradigm of clean-sample reliance (e.g., sample selection). However, this reliance also imposes intrinsic limitations, as overfitting to even a few noisy samples is inevitable, creating a major bottleneck for further improvement. This limitation motivates us to go beyond mere clean-sample reliance and explore how to actively forget corrupted knowledge already internalized by models while suppressing further noise assimilation. To this end, we propose FINE, a fundamentally novel perspective for LNL that unifies active ForgettIng via machine unlearning (MU) and Noise supprEssion via negative learning (NL) within a cohesive framework. Specifically, we first reveal two key stages of noise fitting: early-stage generalized learning and later-stage noise overfitting. To actively forget early-stage noise accumulation, we introduce an MU-based module that employs a negative cross-entropy loss to erase corrupted knowledge, while an NL-based module leveraging complementary labels suppresses later-stage overfitting and mitigates reliance on noisy supervision. These modules act synergistically as plug-and-play regularizers, seamlessly integrating into existing baselines. Finally, extensive experiments on both synthetic and real-world noisy benchmarks demonstrate that FINE consistently boosts robustness and generalization.
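The two loss terms named in this abstract can be illustrated with a minimal sketch. It assumes the common formulations: negative learning penalizes probability mass on a complementary label ("this sample is not class k"), and a negative cross-entropy is simply the sign-flipped cross-entropy on a suspected noisy label. The abstract does not give the paper's exact definitions, so the function names and these forms are assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_learning_loss(logits: Sequence[float], complementary_label: int) -> float:
    """-log(1 - p_k): small when the model puts little mass on the class
    the sample is known NOT to belong to (assumed NL formulation)."""
    p = softmax(logits)[complementary_label]
    return -math.log(max(1.0 - p, 1e-12))


def negative_cross_entropy(logits: Sequence[float], noisy_label: int) -> float:
    """log(p_y) = -CE: minimizing it lowers confidence in a suspected
    noisy label (assumed sign-flipped cross-entropy)."""
    p = softmax(logits)[noisy_label]
    return math.log(max(p, 1e-12))


if __name__ == "__main__":
    uniform = [0.0, 0.0]
    print(negative_learning_loss(uniform, 1))  # log(2) for a uniform 2-class output
    print(negative_cross_entropy(uniform, 0))  # -log(2) for a uniform 2-class output
```

Under these assumptions both terms push probability away from a label rather than toward one, which is why they can be added to an existing training loss as regularizers without replacing its main supervised term.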
Paperid: 3258,   Poster  
Authors: Haiyang Xu, Ronghuan Wu, Li-Yi Wei, Nanxuan Zhao, Chenxi Liu, Cuong Nguyen, Zhuowen Tu, Zhaowen Wang
Title: SemLayer: Semantic Generative Segmentation and Layer Reconstruction for Vector Icons
Abstract: Graphic icons are a cornerstone of modern design workflows, yet they are often distributed as flattened single-path or compound-path graphics, where the original semantic layering is lost. This absence of semantic decomposition hinders downstream tasks such as editing, restyling, and animation. We formalize this problem as semantic layer construction for flattened vector art and introduce SemLayer, a visual generation–based pipeline that restores editable layered structures. Given an abstract icon, SemLayer first generates a chromatically differentiated representation in which distinct semantic components become visually separable. To recover the complete geometry of each part, including occluded regions, we then perform a semantic completion step that reconstructs coherent object-level shapes. Finally, the recovered parts are assembled into a layered vector representation with inferred occlusion relationships. Extensive qualitative comparisons and quantitative evaluations demonstrate the effectiveness of SemLayer, enabling editing workflows previously inaccessible for flattened vector graphics and establishing semantic layer reconstruction as a practical and valuable task.
Paperid: 3259,   Poster  
Authors: Hezhao Liu, jiacheng yang, Junlong Gao, Mengke Li, Yiqun Zhang, Shreyank Gowda Gowda, Yang Lu
Title: SECOS: Semantic Capture for Rigorous Classification in Open-World Semi-Supervised Learning
Abstract: In open-world semi-supervised learning (OWSSL), a model learns from labeled data and unlabeled data containing both known and novel classes. In practical OWSSL applications, models are expected to perform rigorous classification by directly selecting the most semantically relevant label from a candidate set for each sample. Existing OWSSL methods fail to achieve this because novel samples are trained without explicit supervision, and these methods lack mechanisms to extract latent semantic information, resulting in predicted labels that have no semantic correspondence to candidate textual labels. To address this, we introduce SEmantic Capture for Open-world Semi-supervised learning (SECOS), which directly predicts textual labels from the candidate set without post-processing, meeting the requirements of practical OWSSL applications. SECOS leverages external knowledge to extract and align semantic representations across modalities for both known and novel classes, providing explicit supervisory signals for training novel classes. Extensive experiments demonstrate that even when existing OWSSL methods are evaluated under the more lenient post-hoc matching setting, SECOS still surpasses them by up to 5.4% without such assistance, highlighting its superior effectiveness. Code is available in the supplementary materials.
Paperid: 3260,   Poster  
Authors: Hanchao Liu, Fang-Lue Zhang, Shining Zhang, Tai-Jiang Mu, Shi-Min Hu
Title: Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization
Abstract: Generating human motion that satisfies customized zero-shot goal functions, enabling applications such as controllable character animation and behavior synthesis for virtual agents, is a critical capability. While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or specified numbers of walking steps. To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. The key idea is to search within large motion datasets for guidance that can potentially satisfy difficult constraints. We introduce relational task parsing to group target constraints and identify the difficult ones to be handled by a retrieved reference. A better initialization for diffusion noise is then obtained via a reward-guided mask that combines random noise with retrieved noise. By optimizing diffusion noise from this improved initialization, we successfully solve highly constrained generation tasks. By leveraging an LLM for relational task parsing, the whole framework can further reason automatically about what to retrieve, improving the intelligence of moving agents under a training-free optimization scheme. Code will be released upon publication.
Paperid: 3261,   Poster  
Authors: miao xu, Xiangyu Zhu, Zidu Wang, XUSHENG LIANG, Bao Li, Jinlin Wu, Zelin Zang, Zhen Lei
Title: ReGenHOI: Unifying Reconstruction and Generation for 3D Human–Object Interaction Understanding
Abstract: Understanding 3D human–object interaction (HOI) involves two highly related abilities: reconstruction, which perceives observed geometry, and generation, which imagines plausible future interactions. However, most existing methods treat these abilities as separate tasks, limiting their capacity to capture the unified nature of human spatial reasoning. To address this, we propose a unified framework that bridges reconstruction and generation through a shared semantic–geometric reasoning space. Specifically, a 3D Contact Reasoning mechanism enables direct reasoning in 3D space, jointly modeling geometric structure and semantic relationships, while a Reasoning Trace Refinement module iteratively refines contact predictions by integrating geometric and semantic cues. The framework builds a unified latent representation via explicit reasoning on human–object contact regions. To further enhance realism and physical plausibility when generating the outputs of reconstruction and generation, we modify and adapt the Gravity-Field Based Diffusion Bridge to refine fine-grained contact geometry and ensure smooth, physically consistent human–object engagement. Extensive experiments demonstrate that our unified framework significantly improves both reconstruction accuracy and generative interaction quality, establishing a cohesive and interpretable paradigm for 3D HOI understanding.
Paperid: 3262,   Poster  
Authors: Shenyin Xu, Yishan Wang, Xinyu Li, Rui Liu, Zhongyuan Wang, Xin Tian
Title: MS^2Gait: A Multi-Scale Spatio-Temporal Fusion Network for LiDAR-based Gait Recognition
Abstract: 3D LiDAR-based gait recognition has gained increasing attention due to its robustness to illumination, privacy preservation, and capability for long-range and non-contact identity verification. However, existing point cloud-based methods suffer from two critical limitations: they fail to model semantically distant correlations across spatial scales and employ simplistic temporal aggregation that cannot handle gait's inherent heterogeneity. To address these limitations, we propose MS^2Gait, a multi-scale spatio-temporal framework tailored for raw point cloud gait recognition. Our Hierarchical Spatial Feature Extraction module introduces four complementary interaction strategies to explicitly capture long-range semantic dependencies and recover structural information under blockage. Additionally, a Similarity-based Temporal Enhancement Transformer strategy leverages multi-scale aggregation to dynamically weight frames based on motion coherence, effectively handling temporal heterogeneity without explicit supervision. Extensive evaluations on SUSTech1K and FreeGait demonstrate that MS^2Gait achieves 93.5% and 83.1% in Rank-1 accuracy, respectively, outperforming prior state-of-the-art methods, while exhibiting significant robustness against non-gait nuisance factors.
Paperid: 3263,   Poster  
Authors: ZIJIAN ZHU, Huang Qiusheng, AnboyuGuo AnboyuGuo, Xiaohui Zhong, Hao li
Title: AviaSafe: A Physics-Informed Data-Driven Model for Aviation Safety–Critical Cloud Forecasts
Abstract: Current AI weather forecasting models predict conventional atmospheric variables but cannot distinguish between cloud microphysical species critical for aviation safety. We introduce AviaSafe, a hierarchical, physics-informed neural forecaster that produces global, six-hourly predictions of these four hydrometeor species for lead times up to 7 days. Our approach addresses the unique challenges of cloud prediction: extreme sparsity, discontinuous distributions, and complex microphysical interactions between species. We integrate the Icing Condition (IC) index from aviation meteorology as a physics-based constraint that identifies regions where supercooled water fuels explosive ice crystal growth. The model employs a hierarchical architecture that first predicts cloud spatial distribution through masked attention, then quantifies species concentrations within identified regions. Trained on ERA5 reanalysis data, our model achieves lower RMSE for cloud species compared to the baseline and outperforms operational numerical models on certain key variables at 7-day lead times. The ability to forecast individual cloud species enables new applications in aviation route optimization where distinguishing between ice and liquid water determines engine icing risk.
Paperid: 3264,   Poster  
Authors: bo zhao, Junzhe Cao, Dan Guo, Dongmin Huang, Wenjin Wang, Tao Tan, Yue Sun, Zitong YU
Title: FLOW: Optimal Transport-Driven Feature Warping for Generalized Remote Physiological Measurement
Abstract: Remote photoplethysmography (rPPG) enables non-contact physiological measurement from facial videos but often suffers from severe performance degradation under domain shifts. Traditional STMap-based methods [niu2019rhythmnet] rely on predefined spatio-temporal representations that offer engineered robustness but discard fine-grained temporal cues. In contrast, end-to-end rPPG models directly learn hierarchical features from raw videos, capturing richer physiological patterns yet remaining highly sensitive to variations in illumination, motion, and camera sensors. To address these challenges, we propose FLOW (Feature-Level Optimal Warping), an Optimal Transport (OT)–driven and plug-and-play framework for multi-source domain generalization in rPPG measurement. FLOW formulates domain shifts as structured Optimal Transport problems and performs feature-level warping to align multiple source domains in a shared latent space. Specifically, a dual-consistency regularization is proposed to enforce both frequency fidelity and mapping invariance, while a shape-adaptive alignment module is designed to enable architecture-agnostic integration without re-training. We further derive a generalization bound based on conditional OT discrepancy, providing theoretical insight into FLOW’s robustness under distributional shifts. Extensive experiments across diverse rPPG benchmarks demonstrate that FLOW consistently improves cross-domain generalization while maintaining lightweight and modular deployment.
Paperid: 3265,   Poster  
Authors: Qichao Wang, Yunhong Lu, Hengyuan Cao, Junyi Zhang, Min Zhang
Title: DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models
Abstract: Dataset distillation enables efficient training by distilling the information of large-scale datasets into significantly smaller synthetic datasets. Diffusion-based paradigms have emerged in recent years, offering novel perspectives for dataset distillation. However, they typically necessitate additional fine-tuning stages, and effective guidance mechanisms remain underexplored. To address these limitations, we rethink diffusion-based dataset distillation and propose a Dual Matching Guided Diffusion (DMGD) framework, centered on efficient training-free guidance. We propose a pioneering theoretical framework for guidance design, proving that optimizing distributional distance under semantic alignment equivalently tightens the upper bound of dataset distillation objectives. Therefore, we first establish Semantic Matching via conditional likelihood optimization, eliminating the need for auxiliary classifiers. Furthermore, we propose a dynamic guidance mechanism that enhances the diversity of synthetic data while maintaining semantic alignment. Simultaneously, we introduce an optimal transport (OT)-based Distribution Matching approach to further align with the target distribution structure. To ensure efficiency, we develop two enhanced strategies for the diffusion-based framework: Distribution Approximate Matching and Greedy Progressive Matching. These strategies enable effective distribution matching guidance with minimal computational overhead. Experimental results on ImageNet-Woof, ImageNet-Nette, and ImageNet-1K demonstrate that our training-free approach achieves significant improvements, outperforming state-of-the-art (SOTA) methods requiring additional fine-tuning by average accuracy gains of 2.1%, 5.4%, and 2.4%, respectively.
Paperid: 3266,   Poster  
Authors: Juxin Lu, Haoyu Shi, Mengyao Wang, Huaiwen Zhang
Title: Machine Unlearning via Adaptive Gradient Reweighting and Multi-stage Objective Optimization
Abstract: Machine Unlearning (MU) focuses on removing the influence of training samples from pretrained models without retraining the model entirely. Existing MU methods have made several efforts to enable complete forgetting while preserving the model’s performance on remaining data. However, they typically apply equal weights across different data, overlooking the ambiguous decision boundaries between similar samples or approximate classes. This leads to unnecessary consumption of shallowly memorized samples and significant performance degradation for approximate retention classes. Additionally, the inherent inconsistency between forgetting and retention objectives results in gradient conflict and domination problems during training, hindering model convergence and degrading overall performance. To address these issues, we introduce a novel adaptive gradient reweighting scheme that assigns importance weights to individual forget samples or vulnerable retention classes, thereby enabling more efficient unlearning and preserving the performance of approximate classes. Subsequently, we propose a multi-stage objective optimization strategy, which comprises three optimization stages: Direction Rectification, Temporal Stabilization, and Adaptive Objective Combination. This strategy rectifies the direction of conflicting gradients and prevents one task (forgetting or retention) from dominating the model update. Comprehensive analyses and extensive experiments on multiple public datasets demonstrate that our method achieves considerable performance improvements in various tasks and scenarios.
Paperid: 3267,   Poster  
Authors: Yuwu Lu, Chunzhi Liu
Title: Black-Box Domain Adaptation for Object Detection with Retention-Driven Knowledge Compression
Abstract: Black-Box Domain Adaptation (BBDA) is a highly practical yet challenging strategy that enables the deployment of pre-trained detectors to new unlabeled target domains without accessing source data or models. Compared to previous domain adaptation studies, BBDA not only provides stronger data privacy protection but also offers greater portability. Despite growing interest, existing BBDA strategies remain difficult to apply directly to object detection, as most prior works focus on classification and segmentation tasks that do not involve bounding box localization and rely on different learning mechanisms. In this paper, inspired by lifelong learning, we propose Retention-Driven Knowledge Compression (RDKC), which applies a brain-inspired continual learning process to BBDA for object detection. Specifically, RDKC consists of two key components: Memory Retention (MR) and Scene Compression (SC). MR is designed specifically for object detection under the BBDA setting, where it performs memorized contrastive learning on partitioned regions to better utilize informative cues from reliable areas while filtering out potential noise from noisy prediction labels. SC introduces a contrastive mechanism between near- and far-view regions, which enables the model to better learn from far-view regions under the guidance of near-view cues. Experimental results demonstrate that under the BBDA setting, RDKC outperforms previous SOTA methods across all evaluated benchmarks, achieving superior performance improvements.
Paperid: 3268,   Poster  
Authors: Shawn Huang, Brian Price, Yifei Fan, Bryan Morse
Title: Beyond Single Images: A Comprehensive Benchmark for Album-Level Vision-Language Understanding
Abstract: Automatic album organization has been studied extensively over the past decades due to significant progress in digital photography. Recent vision-language models (VLMs) have shown strong performance on multi-image understanding, making them natural candidates for automating album organization workflows. While VLMs' abilities in multi-image understanding have been widely studied, their performance on album organization remains underexplored. To bridge this gap, we introduce AlbumBench, the first comprehensive benchmark for automatic album organization. Specifically, we (1) define album organization tasks as photo selection for album-specific user objectives, photo rating according to how well user intents are fulfilled, and album-specific photo grouping given a user query which requires contextual understanding of the album; (2) establish AlbumBench, a benchmark dataset containing 27051 images across 641 albums with 5 annotations per image; and (3) evaluate mainstream open-source and proprietary VLMs on AlbumBench. We show that AlbumBench presents unique challenges compared to traditional multi-image understanding benchmarks due to its requirement for understanding album context and user intent. Our findings reveal a significant performance gap between open-source and proprietary VLMs on album organization tasks. Despite this gap, even the best-performing proprietary models sometimes struggle with tasks that humans find relatively easy. We hope that AlbumBench can serve as a foundation for unifying album organization research and motivate improvements in VLMs' performance on these tasks.
Paperid: 3269,   Poster  
Authors: YuXin Song, Yu Lu, Haoyuan Sun, Huanjin Yao, Fanglong Liu, Yifan Sun, Haocheng Feng, Hang Zhou, Jingdong Wang
Title: CoLoGen: Progressive Learning of Concept–Localization Duality for Unified Image Generation
Abstract: Unified conditional image generation remains difficult because different tasks depend on fundamentally different internal representations. Some require conceptual understanding for semantic synthesis, while others rely on localization cues for spatial precision. Forcing these heterogeneous tasks to share a single representation leads to concept-localization representational conflict. To address this issue, we propose CoLoGen, a unified diffusion framework that progressively learns and reconciles this concept-localization duality. CoLoGen uses a staged curriculum that first builds core conceptual and localization abilities, then adapts them to diverse visual conditions, and finally refines their synergy for complex instruction-driven tasks. Central to this process is the Progressive Representation Weaving (PRW) module, which dynamically routes features to specialized experts and stably integrates their outputs across stages. Experiments on editing, controllable generation, and customized generation show that CoLoGen achieves competitive or superior performance, offering a principled representational perspective for unified image generation.
Paperid: 3270,   Poster  
Authors: Tianle Lyu, Mengjingcheng Mo, Ting Wen, Zhen Song, Zinan Xiong, Yanjie Zhu
Title: Breaking the Continuum: Discrete Distribution Learning for Structural MRI Reconstruction
Abstract: Anatomical structures in MRI exhibit strong spatial priors, including well-defined boundaries, low inter-subject variability, and consistent topology. These properties naturally induce clustered patterns in the latent space, which are difficult to capture using conventional continuous generative priors that assume smooth manifold distributions. To address this limitation, we propose DiCoS (Discrete–Continuous Synthesis), a generative reconstruction framework that integrates discrete structural reasoning with continuous refinement. DiCoS models an anatomy-aware discrete distribution and generates diverse reconstructions in one coarse-to-fine pass through a Discrete Prior Network (DPN). A Dual-domain Balanced Scoring (DBS) mechanism adaptively evaluates candidates using both image-domain fidelity and k-space consistency. Micro Diffusion Cycles (MDC) then perform efficient score-guided refinement to enhance texture realism without disturbing global topology. Experiments on the fastMRI knee and brain datasets demonstrate that DiCoS achieves state-of-the-art reconstruction quality with sharper boundaries and improved anatomical consistency. Beyond pixel metrics, segmentation-based evaluations further confirm superior structural overlap and semantic alignment, highlighting DiCoS's advantages in anatomy-aware reconstruction. Code and models will be released upon publication.
Paperid: 3271,   Poster  
Authors: Ruiqing Tian, Mohan Sai Singamsetti, Di Niu, Bahador Rashidi
Title: ReaGEN: Adaptive Generation of Structured Chains-of-Thought for Efficient Multimodal Reasoning
Abstract: Large Vision Language Models (LVLMs) exhibit strong perceptual and linguistic capabilities yet struggle with complex visual reasoning tasks that require structured, compositional, and adaptive inference. Existing approaches either rely on costly inference-time exploration, such as multi-path or tree-based Chain-of-Thought (CoT) search, or on expensive post-training with large curated CoT datasets. We propose ReaGEN, a lightweight framework for the adaptive generation of structured reasoning chains that enhances reasoning without modifying the underlying vision–language model (VLM). ReaGEN first employs a teacher-guided evolutionary search to collect sample-specific CoT structures, leveraging attention-derived stage importance to capture how information flows across reasoning stages. These adaptive CoT structures are then used to train a compact generator (GEN) that learns to refine and improve CoT structures by reflecting on attention feedback from the reasoning process. At inference, the GEN dynamically produces question-adaptive structured CoTs, and can be iteratively invoked to refine them based on the VLM’s internal state, achieving the flexibility of deep search with single-path efficiency. Across diverse multimodal reasoning benchmarks, ReaGEN achieves up to +26 accuracy points over test-time scaling methods while reducing the average inference-time token usage by 79%, establishing a scalable and model-agnostic approach for structured reasoning generation in VLMs.
Paperid: 3272,   Poster  
Authors: Torsten Sattler, Zuzana Kukelova
Title: Simple but Effective Triplet-Based Compression Strategies for Compact Visual Localization
Abstract: Visual localization, i.e., the problem of estimating the camera pose from which an image was taken, is an important part of applications such as augmented reality and autonomous robots. Many of these applications require a compact memory footprint. Thus, a considerable amount of work has been spent on designing memory-efficient scene representations for visual localization. In this paper, we focus on compressing the 3D structure of the scene by selecting a subset of points from a Structure-from-Motion (SfM) point cloud. In contrast to prior work, which aims to solve (complex) optimization problems, we propose a simple strategy that is almost trivial to implement. Our compression strategy is based on the idea of selecting triplets of points such that the camera pose of each database image (used to build the SfM point cloud) can be accurately estimated from these triplets. Despite its simplicity, our strategy performs similarly to or better than current state-of-the-art structure compression approaches. Combined with standard product quantization approaches to compress feature descriptors, our approach compares favorably with recent learning-based approaches for compact visual localization.
Paperid: 3273,   Poster  
Authors: xiongzhuang liang, Chuanbo Tang, Zhuoyuan Li, Li Li, Dong Liu
Title: Perceptual Neural Video Compression with Color Separation and Rank Chain
Abstract: Neural video compression (NVC) has achieved significant progress in recent years. The state-of-the-art (SOTA) NVC schemes, exemplified by the Deep Conditional Video Coding (DCVC) series, have focused on pursuing higher fidelity (e.g., PSNR), but lack sufficient exploitation of deep networks' advantages for better perceptual quality. We fill in this gap with two new techniques. First, we propose a color-separation-based framework, termed PNVC-C, which decouples luminance and chrominance processing to better align with human visual perception. This framework enables explicit and adaptive allocation of computation and bitrate budgets between luminance and chrominance components. Second, within this framework, we introduce the perceptual optimization scheme Rc-GAN, which leverages a bitrate-based rank chain loss to link variable-rate coding with perceptual quality ranking, enforcing consistent quality ordering and improving perceptual fidelity. Built upon these designs, we establish the PNVC-C framework with two variants: PNVC-C-Base, optimized for objective fidelity, and PNVC-CR, a perceptual variant that applies the Rc-GAN. Experimental results demonstrate that PNVC-C-Base achieves SOTA objective performance in YUV PSNR, while PNVC-CR attains SOTA perceptual quality on LPIPS, DISTS, KID, and FID metrics. Code and models will be publicly available.
Paperid: 3274,   Poster  
Authors: Hyeonseo Jang, Hyuk Kwon, Kibok Lee
Title: Enhancing Continual Learning of Vision-Language Models via Dynamic Prefix Weighting
Abstract: We investigate the recently introduced domain-class incremental learning scenarios for vision-language models (VLMs), where both domain and class distributions change across tasks. Recent works address this challenge using parameter-efficient methods such as prefix-tuning or adapters, which facilitate model adaptation to downstream tasks by incorporating task-specific information into input tokens through additive vectors. However, previous approaches often normalize the weights of these vectors, disregarding the fact that different input tokens require different degrees of adjustment. To overcome this issue, we propose Dynamic Prefix Weighting (DPW), a framework that dynamically assigns weights to prefixes, complemented by adapters. DPW consists of 1) a gating module that adjusts the weights of each prefix based on the importance of the corresponding input token, and 2) a weighting mechanism that derives adapter output weights as a residual of prefix-tuning weights, ensuring adapters are utilized only when necessary. Experimental results demonstrate that our method achieves state-of-the-art performance in domain-class incremental learning scenarios for VLMs. The code will be released.
Paperid: 3275,   Poster  
Authors: Yaxuan Qin, Hefei Li, Wenqi Mu, Yancheng He
Title: Efficient Frame Selection for Long Video Understanding via Reinforcement Learning
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have led to significant progress in video understanding. Due to limited context windows and computational overhead, most MLLMs adopt uniform frame sampling. This approach is at high risk of missing critical visual information and constrains performance, especially for long videos. To address this problem, we propose a lightweight frame selection method to identify keyframes and train it via a two-stage strategy. In the pre-training stage, the frame selector learns to model relevance between individual video frames and queries. In the reinforcement learning (RL) stage, we employ a hierarchical reward that evaluates selection quality at combination and frame levels. Through stochastic exploration of frame combinations, the selector learns to identify and retain frames that improve task performance rather than merely maximizing query relevance, which can be misleading. The selected frames serve as input to downstream MLLMs for video understanding and reasoning. Experimental results demonstrate that the proposed selector improves the performance of diverse downstream MLLMs across benchmarks spanning medium to long videos.
Paperid: 3276,   Poster  
Authors: Jing Huang, Luyuan Chen, Zhijie Xu, Yadong Li, Xingzhong Xu, Siye Chen, Jie Liu, Ming Kong, Qiang Zhu
Title: META: Meta Evolution of Tool Trajectory Adaptation for Long-Video Understanding
Abstract: Long-video understanding remains challenging due to extreme temporal redundancy, sparse yet decisive events, and the instability of long-horizon reasoning in visual–language models (VLMs). Existing agent-based methods invoke external micro-tools but remain static, repeatedly rebuilding long chains of fine-grained operations for each task without acquiring reusable multi-step perceptual skills. We propose META, the first training-free agent capable of self-evolving its tool-augmented reasoning. META operates through dual Solving and Evolving loops: it analyzes its own tool trajectories, abstracts recurring multi-step patterns into reusable macro-tools, and distills failed executions into structured failure priors that refine tool usage. Through symbolic consolidation and pruning, META progressively shortens reasoning paths and acquires more general perceptual and temporal abilities, without any parameter updates. META achieves state-of-the-art performance on long-video benchmarks, demonstrating a scalable, model-agnostic paradigm for long-video understanding that can continually evolve without additional training.
Paperid: 3277,   Poster  
Authors: Meng Yuan, Dawei Lin, Hongxia Xie, Tieru Wu, Rui Ma
Title: CAD-Refiner: A Unified Framework for CAD Generation and Iterative Editing
Abstract: Computer-Aided Design (CAD) modeling underpins a wide range of industrial applications. During the conceptual design phase, designers often refine initial solutions iteratively to achieve desired results. A key goal of AI-assisted CAD is to support the full modeling workflow from initial generation to iterative refinement. However, most existing approaches treat generation and editing as separate tasks, hindering coherence and adaptability in real-world scenarios. To address this limitation, we propose CAD-Refiner, a unified framework that supports free-form multimodal inputs and enables iterative refinement over previously generated results. Specifically, we design an agent named CAD Insighter that interprets multimodal inputs into topological structure graphs, which explicitly represent the fundamental elements and their relationships within CAD objects. We then propose a carefully designed decoder architecture and a Sequence Injection Strategy (SIS) to enable multiple applications within a unified modeling framework. Furthermore, we propose CAD Checker, an error-aware feedback module that performs geometry-based reward shaping during optimization, enhancing modeling quality and geometric validity. Additionally, we introduce MMCAD, a multimodal extension of DeepCAD tailored for CAD generation and editing. Extensive experiments demonstrate the effectiveness of CAD-Refiner across multiple tasks.
Paperid: 3278,   Poster  
Authors: Senyan Xu, Zhijing Sun, Kean Liu, Xin Lu, Ruixuan Jiang, Xueyang Fu, Zheng-Jun Zha
Title: Event-Illumination Collaborative Low-light Image Enhancement with a High-resolution Real-world Dataset
Abstract: Event-based low-light image enhancement (LIE) methods mainly focus on incorporating high dynamic range (HDR) information from events while overlooking the essential global illumination in images and the inherent noise sensitivity of event signals in real-world scenarios. To address these issues, we propose EIC-LIE, an event-illumination collaborative LIE framework. Concretely, we first design an Event-Illumination Collaborative Interaction (EICI) module, which contains two key processes: forward gathering, which gathers HDR features across varying lighting conditions, and backward injection, which provides complementary content for illumination and event representations. Next, we introduce an Illumination-aware Event Filter (IAEF) that dynamically reduces event noise based on brightness statistics derived from images. Additionally, we build a beam-splitter-based hybrid imaging system to collect high-quality event-image pairs with temporal synchronization from dynamic scenes, providing the first high-resolution, real-world event-based LIE dataset. Extensive experiments show that our EIC-LIE outperforms state-of-the-art methods on five real-world and synthetic datasets, significantly surpassing previous methods with improvements of up to 1.24 dB in PSNR and 0.069 in SSIM. The code and dataset will be released.
Paperid: 3279,   Poster  
Authors: Sanaz Karimijafarbigloo, Armin Khosravi, Alireza Kheyrkhah, Reza Azad, Mauricio Reyes, Dorit Merhof
Title: Harmonized Feature Conditioning and Frequency-Prompt Personalization for Multi-Rater Medical Segmentation
Abstract: Multi-rater medical image segmentation captures the inherent ambiguity of clinical interpretation, where diagnostic boundaries vary across experts and imaging devices. Existing approaches often reduce this diversity to consensus labels or treat rater differences as noise, resulting in overconfident and poorly calibrated models. We propose a harmonized probabilistic framework that disentangles acquisition artifacts from genuine annotator variability through adaptive feature conditioning and frequency-domain personalization. A lightweight Harmonizer Network implicitly models scanner-specific artifacts and performs dynamic feature modulation to standardize latent representations, ensuring that uncertainty reflects anatomy rather than noise. To represent rater-specific styles, we introduce High-Frequency Prompt Modules that operate in the spectral domain to encode annotator-dependent boundary precision and textural sensitivity. These prompts adaptively modulate harmonized features to produce personalized yet anatomically consistent segmentations. Furthermore, a Generalized Energy Distance (GED)–based regularization aligns the generative distribution with empirical annotation variability, promoting diversity where experts disagree and consensus where they converge. Experiments on LIDC-IDRI and NPC-170 show SOTA aggregated and individualized segmentation, with notable GED reductions and improved Dice scores, especially on noisy cases. Beyond accuracy, the model exhibits clinically meaningful uncertainty: confidence rises in agreement regions and declines in ambiguous areas, supporting its use as a reliable and interpretable tool for multi-expert clinical workflows.
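Editorial illustration (not part of the abstract): the GED mentioned above is commonly computed in the multi-rater segmentation literature as 2·E[d(S, Y)] − E[d(S, S′)] − E[d(Y, Y′)], with d = 1 − IoU, where S are model samples and Y are expert annotations. The sketch below assumes that common form and flat binary masks; the function names are hypothetical, and the abstract does not state which distance or estimator the authors' regularizer actually uses.

```python
from itertools import product


def iou_distance(a, b):
    """Return 1 - IoU for two binary masks given as equal-length 0/1 sequences."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    # Two empty masks are treated as identical (distance 0); this is an assumption.
    return 0.0 if union == 0 else 1.0 - inter / union


def mean_pairwise(xs, ys):
    """Average distance over all pairs drawn from the two mask sets."""
    total = sum(iou_distance(x, y) for x, y in product(xs, ys))
    return total / (len(xs) * len(ys))


def generalized_energy_distance(samples, annotations):
    """Squared GED between model samples and rater annotations (assumed form)."""
    return (2.0 * mean_pairwise(samples, annotations)
            - mean_pairwise(samples, samples)
            - mean_pairwise(annotations, annotations))
```

Under this definition the value is zero when the sample set and the annotation set coincide, and it grows as the model's samples drift away from the raters' masks or fail to reproduce their spread.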
Paperid: 3280,   Poster  
Authors: Tao Zhang, Shengtao Yao, Rong Zeng, Zunjie Zhu, Bolun Zheng, Yaoqi Sun, Ying Fu, Chenggang Yan
Title: EMR-Diff: Edge-aware Multimodal Residual Diffusion Model for Hyperspectral Image Super-resolution
Abstract: Hardware constraints make it challenging to simultaneously acquire hyperspectral images (HSIs) with both high spatial and high spectral resolutions. A promising solution is to fuse low-resolution HSI (LR-HSI) with high-resolution multispectral images (HR-MSI) to generate high-resolution HSI (HR-HSI). Recently, diffusion models have introduced possibilities for HSI super-resolution, but suffer from low-efficiency sampling, detail-limited generation, and insufficient denoising. To address these issues, we propose an Edge-aware Multimodal Residual Diffusion Model (EMR-Diff). Specifically, a multimodal residual mechanism is introduced to facilitate efficient information transfer among HR-MSI, LR-HSI, and HR-HSI, significantly improving the fusion efficiency. An edge-aware noise strategy is designed by exploiting the edge information of HR-MSI, which guides the model to prioritize high-frequency detail reconstruction by applying stronger noise perturbations to edge regions. In addition, we propose a Bilateral Attention Fusion UNet and design a multi-scale supervision mechanism to enable progressive reconstruction and collaborative optimization of spectral and spatial features. Extensive experiments demonstrate that our method achieves superior performance over existing approaches in both quantitative metrics and visual quality.
Paperid: 3281,   Poster  
Authors: Gabriele Pedroni, Rakshith Madhavan, Federica Arrigoni
Title: Solvability of the Viewing Graph Under the Affine Camera Model
Abstract: In this paper we focus on the viewing graph, which is used to represent cameras (as nodes) and their pairwise relationships (as edges) in the context of Structure from Motion. By analyzing this graph, it is possible to establish if the available pairwise relationships (e.g., fundamental matrices in the uncalibrated case) are theoretically enough to uniquely determine the cameras, in which case the graph is termed "solvable". Previous results considered calibrated and uncalibrated settings, whereas other camera models have not been explored in the context of viewing graph solvability: this work represents the first study under the affine camera model. We provide a characterization of the problem in terms of a linear system, from which we derive a practical method to check affine solvability. We complement this with theoretical results providing sufficient/necessary conditions for affine solvability, in order to give further insights on the problem. In our experiments, we analyze synthetic graphs and real graphs coming from structure-from-motion datasets, where we focus on understanding the differences among camera models (calibrated, uncalibrated and affine) in terms of solvability. In this context, we also raise an open research question and conjecture a possible answer, which is supported by empirical evidence.
Paperid: 3282,   Poster  
Authors: Yueying Wang, Yiteng Guo, Weidong Zhang, Jie Wen, Liquan Shen, Huaicheng Yan, Xin Xu
Title: RHCNet: Residual-Guided Hierarchical Calibration Network for Robust Underwater Object Detection
Abstract: Underwater images commonly suffer from foreground-background ambiguity, loss of structural details, and severely reduced contrast, which collectively make underwater object detection (UOD) an inherently challenging task. To handle this issue, we present a residual-guided hierarchical calibration network (RHCNet) designed to achieve more efficient and robust UOD, which comprises a residual-guided feature enhancement module (RGFE) and a hierarchical feature calibration pyramid module (HFCP). Concretely, RHCNet extends the standard ResNet-50 backbone by embedding the RGFE, which effectively strengthens the representation of edge and texture features in blurry regions by jointly leveraging convolutional operations and attention mechanisms to achieve more discriminative feature extraction for UOD. Subsequently, the HFCP integrates a bottom-up semantic enhancement path and a top-down fine-grained feature compensation path, while a K-means clustering–guided feature calibration module is jointly employed to ensure multi-level cross-scale semantic consistency and accurate alignment of salient region features. Extensive experiments on the DUO and UTDAC benchmark datasets demonstrate that our RHCNet attains the highest AP scores of 70.53% and 53.35%, respectively. Besides, our RHCNet also maintains excellent detection accuracy and strong generalization capability on the COCO dataset for terrestrial scenarios.
Paperid: 3283,   Poster  
Authors: Chenxu Wang, Kai Zhang, Jian Yang
Title: Retrieve-to-Restore: Efficient All-in-One Image Restoration with a Retrieval-Based Degradation Bank
Abstract: All-in-one image restoration aims to recover clean images from heterogeneous degradations with a single model, but joint training on multiple degradations with a shared backbone often induces cross-task interference and unstable optimization, making it hard to maintain strong performance across all tasks. To address this, we propose Retrieve-to-Restore (R2R), a lightweight framework that decouples degradation adaptation from backbone computation through a retrieval-based degradation bank. Specifically, R2R externalizes degradation knowledge as unified, degradation-specific priors stored in a compact Degradation Bank. A Degradation Amalgamator aggregates GT-guided intra-class features into task-level clean priors during training, while Degradation Matching retrieves the most relevant prior at inference to modulate backbone features for restoration. This retrieval-guided design explicitly separates degradation cues from shared reconstruction capacity, enabling stable multi-degradation training and straightforward scaling to additional degradation types. Extensive comparisons on benchmarks with one, three, and five degradations show that R2R achieves PSNR on par with state-of-the-art all-in-one methods while using about 91% fewer MACs. Our code and models will be made publicly available.
Paperid: 3284,   Poster  
Authors: Xuanzuo Lin, Min Zhang, Daizong Liu, Zhiwen Zuo, Xun Yang, Changting Lin, Xun Wang, Jianfeng Dong
Title: CAST: Context-Aware Dynamic Latent Space Transformation for Interactive Text-to-Image Retrieval
Abstract: Interactive Text-to-Image Retrieval (I-TIR) aims to refine image retrieval results through natural language dialogues, which allows users to progressively supplement or correct their search intention across multiple rounds, enabling a more precise and user-aligned visual search experience. However, existing methods perform cross-modal retrieval within a fixed multimodal feature space, mapping all dialogue text and images onto the same static embedding manifold. Such a static formulation easily causes semantic vagueness, making it difficult to capture subtle embedding shifts in the user's updated intention for fine-grained retrieval. To address this limitation, we propose Context-Aware Latent Space Transformation (CAST), a lightweight framework that dynamically transforms the common latent space of both textual and visual representations according to the user's evolving search intention, enabling fine-grained and adaptive semantic alignment. The core of CAST is the Context-Aware Space Regulator (CASR), a space transformation module composed of two key components: (1) the Context-Aware Low-Rank Projector (CLP), which learns to predict the projection direction of the embedding space based on the intent's context; and (2) the Context-Guided Modulator (CGM), which adaptively determines the appropriate projection strength. CASR is highly lightweight, adding negligible parameters and computational overhead, and can be seamlessly integrated into diverse I-TIR frameworks. Extensive experiments demonstrate the effectiveness of our proposed framework, indicating that it can serve as a general, plug-and-play solution for efficient and scalable interactive text-to-image retrieval. Our source code is provided in the supplementary material.
Paperid: 3285,   Poster  
Authors: Xuze Li, Haozhao Wang, Zhenyu Huang, Zhongxu Wang, Zhang Jinghua, Ruixuan Li
Title: MR-RAG: Multimodal Relevance-Aware Retrieval-Augmented Generation for Medical Visual Question Answering
Abstract: Large Vision Language Models (LVLMs) with retrieval-augmented generation (RAG) are emerging as a main paradigm for processing vision-language medical tasks due to their promising achievements. However, existing approaches exhibit two significant limitations in the two key stages, retrieval and generation. First, during the retrieval stage, most methods rely on a single similarity signal to estimate document relevance, ignoring the rich information available in multimodal data, and may therefore fail to retrieve matching content accurately. Second, in the generation stage, retrieved documents are integrated directly and uniformly into the input for LVLMs, without taking into account their varying relevance to the question, which may dilute crucial information and exacerbate the negative impact of irrelevant content. To address these limitations, we propose MR-RAG, a dual-stage RAG enhancement framework that considers multimodal relevance in both the retrieval and generation phases. Specifically, we first introduce a Multimodal Cooperative Retrieval (MCR) module that leverages both intra-modal and cross-modal signals to jointly retrieve semantically aligned documents. Then, we design an Importance-Aware Information Flow Augmentation (IFA) mechanism that augments attention paths based on the fused multimodal relevance, enabling more precise control over the information flow during answer generation. By coherently bridging retrieval and generation via multimodal signals, our method significantly enhances factual accuracy and robustness. Experiments on three medical datasets demonstrate that our method outperforms state-of-the-art baselines, achieving up to 6.4% accuracy improvement.
Paperid: 3286,   Poster  
Authors: Meihong Pan, Yefeng Zheng
Title: Dual-Level Confidence based Implicit Self-Refinement for Medical Visual Question Answering
Abstract: Medical Visual Question Answering models often face potential train-test distribution shifts that hinder generalization across unseen imaging and linguistic patterns. To address this challenge, we propose a dual-level confidence based framework (DuCoR) that achieves implicit self-refinement through iterative pseudo-supervised optimization. Instead of relying on fixed pseudo answers, the model progressively refines its predictions by estimating their reliability from two complementary perspectives. A loss-level confidence captures the reliability of supervision by modeling clean and noisy loss distributions, while a feature-level confidence measures the semantic coherence between sample representations and their pseudo-answer conditioned prototypes. Since these two confidences originate from distinct information sources, including the supervision signal and the input semantics, they provide mutually corrective cues. They are adaptively fused to derive per-sample reliability weights that guide pseudo-supervised optimization toward better alignment with the target distribution. Extensive experiments on multiple Med-VQA benchmarks show that our method achieves superior performance and exhibits improved cross-domain generalization over the fully supervised baseline.
Paperid: 3287,   Poster  
Authors: Boyuan Cheng, Yingjie Xi, Rui He, Jinhe Na, Ying Cao, Pengjie Wang, Jian Zhang, Xiaosong Yang
Title: Towards Storytelling Animations: Joint Synthesis of Human and Camera Motions
Abstract: Animation relies heavily on effective cinematography to enhance narrative clarity and emotional resonance, yet crafting optimal character interactions and camera positioning remains a resource-intensive challenge. Existing methods typically require extensive, predefined datasets, which restrict their effectiveness when encountering unfamiliar character interactions or novel animation contexts. We introduce an innovative approach to jointly generate character interactions and camera placements through unconditional diffusion-based generative models. Our method leverages a unified framework to simultaneously synthesize realistic two-person motions and corresponding cinematographic compositions without relying on predefined visual datasets. By integrating 3D motion representations and Toric features, our diffusion model effectively captures spatial orientation and relative positioning, enabling coherent and expressive scene generation. Experiments demonstrate that our approach can autonomously produce diverse and plausible dual-character interactions coupled with compelling camera movements, enhancing creative flexibility in animated storytelling.
Paperid: 3288,   Poster  
Authors: Hanxi Liu, Yifang Men, Zhouhui Lian
Title: FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing
Abstract: Single-image 3D human reconstruction holds significant promise due to its convenience and high demand in various applications. Previous methods have made tremendous progress by employing 2D multi-view diffusion models to generate auxiliary views as reconstruction priors, but they struggle with 3D inconsistencies and limited generalization capabilities. In this paper, we present FISHuman, which aims to generate fine-grained, high-fidelity, and content-wise diverse 3D humans from a single-view input, providing production-ready 3D assets. We propose an elaborately designed workflow that reconstructs dynamic 3D meshes from multi-view inconsistent guidance. Specifically, we adapt a dual-stream transformer-based video diffusion model to generate cross-modally aligned multi-view RGB and normal sequences. We find that naively employing static 3D reconstruction can lead to geometric distortions and texture blurriness, due to the lack of 3D awareness within the generated frames. To address this, we introduce a novel 4D remeshing module that explicitly disentangles the learning of the globally shared canonical mesh and transient variations by tracking per-vertex deformations under different viewpoints. The topological consistency of the deformed meshes inherently enables the optimization of a unified UV representation that effectively integrates appearance attributes across frames. Both qualitative and quantitative experimental results demonstrate the superiority of our method over prior works in terms of appearance realism, geometric fineness, and generalization diversity. We also showcase the applicability of our reconstructed avatars for downstream applications including animation and 3D editing.
Paperid: 3289,   Poster  
Authors: Kiseok Choi, Hyeongjun Cho, Inchul Kim, Min H. Kim
Title: Revisiting Pose Sensitivity in Splat-based Computed Tomography under Sparse-view Reconstruction
Abstract: X-ray computed tomography (CT) reconstructs volumetric representations of objects from projection images obtained by transmitting X-rays through a target. Recent splat-based tomography, which represents a volume as a continuous distribution of 3D Gaussians, has demonstrated both high reconstruction quality and fast convergence in cone-beam sparse-view CT. However, when deployed in real CT systems with limited and non-uniform view distributions, we observe distinctive streak and strip artifacts that are far more pronounced than in conventional reconstruction methods. Through detailed analysis, we show that these artifacts primarily originate from pose inaccuracies in the acquisition geometry rather than from view sparsity itself. We revisit pose sensitivity in the splatting formulation and derive a stable gradient-based framework that jointly refines geometric parameters during reconstruction. Our study not only identifies how pose perturbations propagate through the differentiable projection operator but also reveals why splat-based CT is particularly vulnerable to geometric misalignment. The resulting formulation remains lightweight and easily integrable into existing pipelines while substantially improving reconstruction fidelity under real-world sparse-view conditions.
Paperid: 3290,   Poster  
Authors: Junshu Zhang, Sicheng Zhao, Xin Zhao, Fan Yang, Ruike Chen, Jungong Han, Guiguang Ding
Title: Spe-BEVHead: Rethinking the Detection Head Design for Bird’s-Eye-View Object Detection
Abstract: Bird’s-Eye-View (BEV) detection has become a dominant paradigm for 3D object detection in autonomous driving, due to its strong perception capability. However, most existing methods mainly focus on constructing high-quality BEV feature representations, while neglecting the design of task-specific detection heads. In practice, they directly adopt the center-based head originally developed for 2D detection, without any specific optimization. This leads to three inherent limitations: (i) a geometric mismatch between the Gaussian kernel used for classification and the real BEV object, (ii) degraded end-to-end performance without Non-Maximum Suppression (NMS), and (iii) sparse supervisory signals. To address these issues, we propose Spe-BEVHead, a detection head specifically tailored for BEV 3D object detection. Spe-BEVHead introduces three BEV-specific adaptations: (1) a Rotated Box Kernel that generates geometry-aligned classification weights, (2) a Local Response Refinement Module (LRRM) that suppresses non-peak responses and improves end-to-end performance, and (3) a dual-branch architecture that provides richer supervisory signals to promote more robust learning while inherently preserving the performance for end-to-end inference. Extensive experiments show that Spe-BEVHead can be seamlessly integrated into existing BEV backbones, delivering direct performance gains while retaining competitive performance under the challenging end-to-end setting.
Paperid: 3291,   Poster  
Authors: Zijun He, Ping Wang, Xiaodong Wang, ChangChen ChangChen, Xin Yuan
Title: Joint Spectral Image Reconstruction and Semantic Segmentation with Cooperative Unfolding
Abstract: Coded Aperture Snapshot Spectral Imaging (CASSI) is an emerging hyperspectral image (HSI) acquisition technique for downstream semantic segmentation. Due to the ill-posed nature of CASSI systems, typical solutions are compelled to adopt a two-stage reconstruction-then-segmentation pipeline, treating the two as separate tasks. However, we observe that these two tasks are interrelated and mutually reinforcing for representation learning, and thus separating them limits the overall accuracy and efficiency. To this end, we propose the first Cooperative Reconstruction-Segmentation Deep Unfolding Network (CRSDUN) to solve the reconstruction and segmentation tasks in parallel. To make the two mutually reinforcing, we introduce the Cross-Aggregated Super-Token Attention (CASTA) mechanism to enhance the representation interactions between HSI reconstruction and semantic segmentation. Extensive experiments on both synthetic and real-world HSI reconstruction-segmentation datasets demonstrate that our method achieves state-of-the-art performance in both spectral reconstruction and semantic segmentation. The code and models will be released publicly.
Paperid: 3292,   Poster  
Authors: Yimin Liu, Nan Pu, Fengxiang Yang, Wenjing Li, Zhihui Li, Zhun Zhong
Title: SANER: Switchable Adapter with Non-parametric Enhanced Routing for Person De-Reidentification
Abstract: Person De-Reidentification (De-ReID) is an emerging and safety-critical task that aims to selectively forget specific individuals in surveillance systems while preserving the recognition capability for others. Existing methods typically learn both forgetting and retaining objectives within a unified feature space, which leads to conflicting optimization goals and may cause unexpected performance degradation on novel retaining identities. Although decoupling the pre-trained feature space for forgetting and retaining purposes is a promising solution, determining which feature space should be used for a given novel query remains unsolved. To alleviate these challenges, we propose SANER, advancing De-ReID with a switchable adapter (SA) and a test-time non-parametric enhanced routing (NER) algorithm. SA decouples the pre-trained feature space into two task-specific subspaces with a forgetting adapter (F-Adapter) and a retaining adapter (R-Adapter). The former suppresses identity-specific semantics for de-identification, while the latter preserves discriminative cues for accurate re-ID. Moreover, SA is further enhanced with NER to adaptively determine the optimal feature space routing for a given novel query at test time. Specifically, NER compares queries with pre-computed prototypes in the original feature space, mitigating the potential training–testing gap and thus ensuring accurate routing for De-ReID. Extensive experiments on multiple De-ReID benchmarks demonstrate SANER's efficacy, providing a new perspective for privacy-preserving visual perception.
Paperid: 3293,   Poster  
Authors: Joongmin Shin, Jeongbae Park, Jaehyung Seo, Heuiseok Lim
Title: M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models
Abstract: In large-scale industrial documents with scanned images, complex layouts, and multiple pages, the effectiveness of retrieval-augmented generation (RAG) is highly dependent on chunking quality. However, existing text-centric chunkers overlook the visual and structural cues present in real-world documents, leading to redundant or ambiguous chunks that impair retrieval and answer accuracy. To address this problem, we propose M3DocDep, which integrates (i) SharedDet for normalizing document parsing and OCR outputs into a document-level frame, (ii) multi-modal block embeddings with boundary-aware SoftROI, (iii) global document-tree reconstruction via biaffine scoring, and (iv) structure-aware dependency chunking that preserves boundaries and reduces redundancy. M3DocDep achieves consistent gains across both Document Hierarchical Parsing (DHP) and corpus-level RAG evaluations, improving STEDS by +28.5–39.6%, retrieval nDCG by +1.1–15.3%, and QA ANLS by +4.5–15.3%. These results demonstrate that modeling document-level dependencies with multi-modal, structure-aware chunking improves RAG performance on long, multi-page industrial documents.
Paperid: 3294,   Poster  
Authors: Antonio Luigi Stefani, Niccolò Bisagno, Nicola Conci, Eckehard Steinbach, Francesco De Natale
Title: Haptic Neural Fields: Bringing Tactile Interactions to 3D Rendered Scenes
Abstract: We address the problem of making 3D scenes interactive by asking: what would objects feel like if touched in a virtual environment? State-of-the-art 3D rendering methods provide compelling visual realism, but they fall short in modeling physical interactions, such as haptic feedback. We propose a framework that learns the correspondence between user actions and tactile responses, enabling the generation of touch-based signals directly from simulated interactions in 3D scenes. Our approach leverages a neural field representation conditioned on geometry and action to synthesize material-specific tactile signals. Experiments show that the generated signals reliably convey material properties and interaction dynamics. This paves the way toward interactive, touch-aware virtual environments with realistic haptic feedback.
Paperid: 3295,   Poster  
Authors: Hyeseong Kim, Geonhui Son, Deukhee Lee, Dosik Hwang
Title: TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting
Abstract: Novel view synthesis from sparse-view inputs poses a significant challenge in 3D computer vision, particularly for achieving high-quality scene reconstructions with limited viewpoints. We introduce TWINGS, a framework that enhances 3D Gaussian Splatting (3DGS) by directly addressing point sparsity. We employ Thin Plate Splines (TPS), a smooth non-rigid deformation model that minimizes bending energy to estimate a globally coherent warp from control-point correspondences, to align backprojected points from estimated depth with triangulated 3D control points, yielding calibrated backprojected points. By sampling these calibrated points near the control points, TWINGS provides a fast and geometrically accurate initialization for 3DGS, ultimately improving structural detail preservation and color fidelity in reconstructed scenes. Extensive experiments on DTU, LLFF, and Mip-NeRF360 demonstrate that TWINGS consistently outperforms existing methods, delivering detailed and accurate reconstructions under sparse-view scenarios.
Paperid: 3296,   Poster  
Authors: Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yanting Zhang, Xiangyu Yue, Weijie Su, Xizhou Zhu, Wei Shen, Jifeng Dai, Wenhai Wang
Title: MMBench-GUI: A Unified Hierarchical Evaluation Framework for Multi-Platform GUI Agents
Abstract: We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web. The benchmark spans four levels: Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. To assess both effectiveness and efficiency, we further propose the Efficiency–Quality-Aware (EQA) metric, which measures task success alongside action redundancy. Extensive evaluations reveal that precise visual grounding is the critical determinant of performance, underscoring the advantages of modular designs with specialized grounding modules. Moreover, all agents suffer from substantial inefficiencies, frequently completing tasks with excessive steps despite eventual success. Performance also degrades on complex or cross-application tasks, exposing weaknesses in memory, planning, and adaptive reasoning. By providing broad coverage, standardized protocols, and novel metrics, MMBench-GUI establishes the first comprehensive foundation for advancing GUI agent research.
Paperid: 3297,   Poster  
Authors: Jie Xiao, Yinchao Ma, Yuyang Tang, Dengqing Yang, Jianpeng Yang, Xu Zhou, Qiao Li, Wenfei Yang, Tianzhu Zhang
Title: Generalizable Structure-Aware Keypoint Correspondence for Category-Unified 3D Single Object Tracking
Abstract: 3D single object tracking (SOT) in point clouds is a fundamental component of autonomous perception but remains challenging due to sparse observations, irregular geometry, and frequent occlusion. Most prior methods adopt a category-specific paradigm, requiring individual models for different object types. This design hinders scalability and generalization, as object categories in the real world exhibit vast variations in scale and structure. In this work, we present UniKPT, a category-unified and structure-aware framework that performs robust 3D tracking across diverse object classes without relying on category priors. UniKPT introduces three key innovations: (1) an adaptive structural keypoint extractor that identifies scale-consistent and semantically meaningful points; (2) a progressive correspondence aligner that enforces hierarchical geometric consistency across frames; and (3) a confidence-aware localization module that adaptively refines tracking by suppressing uncertain correspondences and exploiting inter-keypoint structural relations. Experiments on the nuScenes and KITTI benchmarks demonstrate that a single UniKPT model not only generalizes across categories but also outperforms state-of-the-art category-specific trackers, achieving gains of +4.37% in Success and +5.16% in Precision on nuScenes.
Paperid: 3298,   Poster  
Authors: Jiawei Yu, Zijian Gao, Xingxing Zhang, Xuan Liu, Huaimin Wang, Kele Xu
Title: Decouple Your Discovery and Memory in Continual Generalized Category Discovery
Abstract: Continual Generalized Category Discovery (CGCD) seeks to incrementally discover new categories from unlabeled data while retaining knowledge of old categories, fostering model adaptability in real-world scenarios. In particular, the unlabeled data comes from both old and new classes, requiring the model to recognize previously learned classes while discovering new ones. In response, recent efforts focus on devising specific frameworks and various anti-forgetting strategies, striving for a typical stability-plasticity trade-off. Unlike previous studies, we first revisit these methods and find that most of them over-protect old classes, hampering the accurate discovery of novel ones. To address this challenge, we introduce Decouple Your Discovery and Memory (DYDM), a dual-branch architecture that decouples the discovery of new classes from the memorization of old classes. The discovery branch focuses on accurately recognizing new classes, while the memory branch consolidates all identified categories in a recursive manner and functions as the inference branch. Importantly, benefiting from the strong knowledge retention ability of the memory branch, the discovery branch can facilitate the recognition of novel classes from the unlabeled data, achieving a win-win outcome between plasticity and stability. Extensive experiments on various datasets and settings demonstrate the superiority of our approach, achieving leads of up to 9.87%, 7.30%, 3.18%, and 8.25%. Furthermore, our framework can integrate with existing approaches, consistently enhancing their performance.
Paperid: 3299,   Poster  
Authors: Khanh-Binh Nguyen, Chae Park
Title: SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
Abstract: Large-scale pre-trained image-text models exhibit robust multimodal representation, yet applying contrastive language-image pretraining (CLIP) to audio-visual localization remains challenging. Replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) struggles to capture semantic cues, and the prompt “a photo of a [V_A]” fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose sound-aware prompt learning (SouPLe), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench confirm that SouPLe significantly improves localization and segmentation performance.
Paperid: 3300,   Poster  
Authors: Yiming Li, Sisi You, Bing-Kun Bao
Title: Mixture-of-Experts based Feature Decoupling for Open Vocabulary Scene Graph Generation
Abstract: In recent years, while Scene Graph Generation has advanced significantly, mainstream methods remain constrained by predefined object and relationship categories, limiting generalization to open real-world scenarios. Inspired by open vocabulary object detection, recent efforts have expanded SGG to the open vocabulary domain. However, these models often rely on off-the-shelf VLMs, lacking discriminative attribute extraction and suffering from limited object-relationship semantic interaction, which leads to misclassification of unseen categories. To address these issues, we propose the MoE Feature Decoupling (MoE-FD) framework for Open Vocabulary Scene Graph Generation. MoE-FD adaptively learns feature decoupling for objects and relationships via multiple experts, prioritizing critical features through gating network weights. Moreover, it models semantic interactions between objects and relationships using iterative cross-attention, enhancing relationship triple associations and visual-semantic alignment. The main contributions of MoE-FD are threefold: (1) A MoE-based feature decoupling framework that adaptively enhances discriminative feature representation for objects and relations. (2) Semantic interaction modeling between objects and relations to strengthen relationship triple associations and image-text alignment accuracy. (3) Extensive experiments demonstrate the effectiveness of MoE-FD on the Visual Genome dataset.
Paperid: 3301,   Poster  
Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Guocheng Qian, Kuan-Chieh Wang, Egor Nemchinov, Moayed Haji Ali, Riza Alp Guler, Willi Menapace, Ivan Skorokhodov, Anil Kag, Jun-Yan Zhu, Sergey Tulyakov
Title: Omni-Attribute: Open-vocabulary Image Attribute Encoder for Visual Disentanglement and Composition
Abstract: Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
Paperid: 3302,   Poster  
Authors: Ruijie Zhu, Jiahao Lu, Wenbo Hu, Xiaoguang Han, Jianfei Cai, Ying Shan, Chuanxia Zheng
Title: MotionCrafter: Repurposing Video Generators for Dense Geometry and Motion Reconstruction
Abstract: We introduce MotionCrafter, the first video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. To represent them effectively in latent space, we propose a 4D VAE that encodes point maps and scene flows as a unified latent compatible with pretrained video generators. Unlike prior work that forces the 3D value and latents to align strictly with RGB VAE latents, despite their fundamentally different distributions, we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in joint 4D geometry reconstruction and dense scene flow estimation, delivering 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, all without any post-optimization.
Paperid: 3303,   Poster  
Authors: Chong Bao, Shichen Liu, Lijun Yu, David Futschik, Stylianos Moschoglou, Shefali Srivastava, Ziqian Bai, Feitong Tan, Guofeng Zhang, Zhaopeng Cui, Sean Fanello, Yinda Zhang
Title: Archon: A Unified Multimodal Model for Holistic Digital Human Generation
Abstract: We introduce Archon, a unified multimodal framework that extends multimodal language models to address the fundamental challenge of holistic digital human generation. Archon unifies diverse human-centric modalities, including description, script, speech, animation, semantic segmentation, image, and video, within a single controllable generative system, enabled by modality-specific tokenization and auto-regressive cross-modal reasoning. For high-quality video outputs, we incorporate a semantic-driven video diffusion decoder that reconstructs photorealistic video from compact representations. We further analyze cross-modality ambiguity and explore an alternative modality generation chain that improves controllability and coherence. Experiments demonstrate strong performance across diverse multimodal generation tasks without task-specific fine-tuning.
Paperid: 3304,   Poster  
Authors: Xiaojun Chen, Sixiao Luo, Ziqi Liu, Min Yang, Qin Zhang, LiangJie Zhang
Title: ChartR: Evaluating Reasoning Accuracy and Robustness in Chart Question Answering
Abstract: Chart Question Answering (CQA) benchmarks are critical for evaluating Multimodal Large Language Models (MLLMs) on visual data reasoning. Existing benchmarks focus mainly on final-answer correctness, ignoring intermediate reasoning steps and the propagation of errors in multi-step processes. To address this, we introduce ChartR, a benchmark designed to assess both the accuracy and robustness of reasoning in chart-understanding tasks. Each question is decomposed into 4–10 sub-questions covering key reasoning types, and each chart includes four visually perturbed variants (blurred, noise-added, watermark-added, annotation-removed) to systematically evaluate robustness. ChartR contains 200 base charts, 800 variants, 1,652 questions, and 8,260 image–question pairs. We further propose a comprehensive evaluation framework with eight metrics that measure reasoning-chain accuracy and robustness under visual perturbations and enable analysis of potential error propagation patterns. Experiments on twelve MLLMs, including general-purpose and chart-specialized models, reveal low reasoning reliability, early-step errors that may propagate, value extraction as the primary bottleneck, and sharp performance drops under perturbations, highlighting reliance on textual cues over true visual understanding.
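The dataset sizes quoted for ChartR are mutually consistent under one reading: each question is asked against the base chart and against each of its four perturbed variants. That pairing is an assumption here, since the abstract does not state it. A quick arithmetic check under that assumption:

```python
# Sanity check of the ChartR counts quoted in the abstract.
# Assumption (not stated in the abstract): every question is paired with the
# base chart plus each of its four perturbed variants.
base_charts = 200
variants_per_chart = 4  # blurred, noise-added, watermark-added, annotation-removed
questions = 1652

variant_charts = base_charts * variants_per_chart
images_per_question = 1 + variants_per_chart
image_question_pairs = questions * images_per_question

print(variant_charts, image_question_pairs)
```

Both derived totals match the figures given in the abstract (800 variants and 8,260 image–question pairs).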
Paperid: 3305,   Poster  
Authors: HAO ZHANG, Shuhan Yang, Linfeng Tang, Xunpeng Yi, Jiayi Ma
Title: ReCoFuse: Ultra-Robust Image Fusion via Restorative Multi-Modal Diffusion Reciprocal Coupling
Abstract: Existing methods following the integrated hard-regression or decoupling optimization paradigms exhibit limited fusion performance under complex degradations. To address these paradigm-level shortcomings, we propose ReCoFuse, an ultra-robust image fusion framework based on restorative multi-modal diffusion reciprocal coupling. ReCoFuse redefines the relationship between information restoration and integration, deriving a novel reciprocal coupling optimization paradigm through their mutual reinforcement. It first constructs two restoration branches using diffusion modules (DiM) to capture modality-specific restoration priors. Then, time-aware cross-modal integration modules (TIM) are introduced as a bridge to couple restoration and integration, embedded at each DiM sampling timestep to aggregate multi-modal information. The aggregated variable not only feeds back to each restoration branch to enhance degradation removal via cross-modal complementarity, but also generates high-quality fused images that comprehensively represent the scene. Moreover, an alternating regularization mechanism is designed to iteratively optimize DiM and TIM along the gradient path, ensuring effective collaboration between restoration and integration. Extensive experiments show that ReCoFuse achieves state-of-the-art performance under challenging degradations such as low light, haze, noise, low contrast, and stripes.
Paperid: 3306,   Poster  
Authors: Xinyu Zhou, Jiawei Zhang, Stephen Wright
Title: Smoothing the Score Function to Enhance Generalization in Diffusion Models
Abstract: Diffusion models achieve remarkable generation quality, yet face a fundamental challenge known as memorization, where generated samples can replicate training samples exactly. We develop a theoretical framework to explain this phenomenon by showing that the empirical score function (the score function corresponding to the empirical distribution) is a weighted sum of the score functions of Gaussian distributions, in which the weights are sharp softmax functions. This structure causes individual training samples to dominate the score function, resulting in sampling collapse. In practice, approximating the empirical score function with a neural network can partially alleviate this issue and improve generalization. Our theoretical framework explains why: in training, the neural network learns a smoother approximation of the weighted sum, allowing the sampling process to be influenced by local manifolds rather than single points. Leveraging this insight, we propose two novel methods to further enhance generalization: (1) Noise Unconditioning enables each training sample to adaptively determine its score function weight to increase the effect of more training samples, thereby preventing single-point dominance and mitigating collapse. (2) Temperature Smoothing introduces an explicit parameter to control the smoothness. By increasing temperature in the softmax weights, we naturally reduce the dominance of any single training sample and mitigate memorization. Experiments across multiple datasets validate our analysis and demonstrate the effectiveness of both methods in improving generalization while maintaining high generation quality.
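The softmax-weighted structure described above can be illustrated with a small NumPy sketch. It assumes the simplest setting, an equally weighted mixture of isotropic Gaussians of width `sigma` centred on the training points; the function name `empirical_score` and the way `temperature` enters the softmax are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def empirical_score(x, train, sigma, temperature=1.0):
    """Score of (1/N) * sum_i N(x; x_i, sigma^2 I) when temperature == 1.

    temperature > 1 flattens the softmax weights (a hypothetical way to
    mimic the temperature smoothing mentioned in the abstract).
    """
    diffs = train - x                        # (N, D): x_i - x
    sq_dist = np.sum(diffs ** 2, axis=1)     # (N,)
    logits = -sq_dist / (2.0 * sigma ** 2 * temperature)
    logits = logits - logits.max()           # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()        # softmax over training points
    # Each Gaussian component has score (x_i - x) / sigma^2.
    score = (weights[:, None] * diffs).sum(axis=0) / sigma ** 2
    return score, weights

train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
x = np.array([0.1, 0.1])
_, sharp = empirical_score(x, train, sigma=0.1)
_, smooth = empirical_score(x, train, sigma=0.1, temperature=50.0)
print(sharp.round(3), smooth.round(3))
```

At small noise levels the nearest training point receives almost all of the weight, which is the single-point dominance the abstract describes; raising the temperature spreads the weight over more points.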
Paperid: 3307,   Poster  
Authors: Huidong Feng, Wentao Chen, Jie Chen, Xinqi Cai, Ruolong Ma, Yinglin Zheng, Yuxin Lin, Ming Zeng
Title: CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection
Abstract: With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalent, posing new challenges to public discourse and societal security. Despite remarkable progress in existing deepfake detection methods, AIGC forgery detection remains challenging, as existing datasets mainly rely on open-source video generation models with quality far below commercial AIGC systems. Even datasets containing a few commercial samples often retain visible watermarks, compromising authenticity and hindering model generalization to high-fidelity AIGC videos. To address these issues, we introduce CoCoVideo-26K, a contrastive, commercial-model-based AIGC video dataset covering 13 mainstream commercial generators and providing semantically aligned real–fake video pairs. This dataset enables deeper exploration of the differences between authentic and high-quality synthetic videos and establishes a new benchmark for highly realistic video forgery detection. Building on this dataset, we propose CoCoDetect, a detection framework integrating contrastive learning with confidence-gated multimodal large language model (MLLM) inference. An R3D-18 backbone extracts spatio-temporal representations, while a confidence gate routes uncertain cases to an MLLM for reasoning about physical plausibility and scene consistency. Extensive experiments on CoCoVideo-26K and public benchmarks demonstrate state-of-the-art performance, validating the framework’s robustness and generality. Dataset and code will be released.
Paperid: 3308,   Poster  
Authors: Minhyeok Lee
Title: Consensus vs. Controversy: Mapping the Decision Space Where Architectures Diverge
Abstract: Modern computer vision models from different architecture families (CNNs, Vision Transformers, and MLP-Mixers) achieve remarkably similar aggregate performance on standard benchmarks, masking potential systematic differences in how they process visual information. We introduce a simple yet revealing framework to identify where architectural inductive biases truly matter: by systematically mapping controversial images where pretrained models strongly disagree versus consensus images where all models agree. Analyzing 12 pretrained models spanning three architecture families on the ImageNet validation set, we discover that controversial images exhibit approximately 4.5× higher disagreement than consensus images (Controversy Score: 4.46). Despite mean accuracy around 80%, models show structured disagreement patterns: within-family agreement exceeds cross-family agreement, with CNNs and ViTs forming distinct clusters while MLPs show lower overall alignment. Crucially, only the top 10% most controversial images drive the majority of architectural divergence, constituting a small but informationally dense subset that reveals fundamental differences masked by aggregate metrics. Our analysis demonstrates that architectural choice matters most on this concentrated controversy space, providing researchers with actionable guidance for model selection and ensemble construction.
Paperid: 3309,   Poster  
Authors: YICHEN PENG, Jyun-Ting Song, Siyeol Jung, RUOFAN LIU, Haiyang Liu, Xuangeng Chu, Ruicong Liu, Erwin Wu, Hideki Koike, Kris Kitani
Title: DyaDiT: A Multi-Modal Diffusion Transformer for Socially-Aware Dyadic Gesture Generation
Abstract: Generating realistic conversational gestures is essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker’s motion, without considering social context or modeling the mutual dynamics between two people engaging in conversation. We present DyaDiT, a multimodal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on the Seamless Interaction Dataset, DyaDiT takes dyadic audio with optional social-context tokens to produce context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally utilize the conversational partner's gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially favorable motion generation. Code and models will be released upon acceptance.
Paperid: 3310,   Poster  
Authors: Quynh Phung, Sandesh Ghimire, Minsi Hu, Charles Tsai, Jia-Bin Huang
Title: UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization
Abstract: Personalized visual understanding has advanced significantly, yet existing approaches struggle to localize and extract specific concepts when input images contain multiple objects. Many prior methods rely heavily on segmentation-based supervision or exhibit poor compositional generalization, limiting their ability to accurately disentangle and manipulate individual concepts. In this work, we propose UniVerse, a Unified Modulation Framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers. Our method allows for composable and decomposable concept extraction, enabling fine-grained localization and representation of target objects without explicit segmentation masks. UniVerse learns to decompose complex scenes into concept-specific representations and then compose them in a unified manner, enabling robust personalization across diverse visual contexts. Through extensive experiments on multiple benchmarks, we demonstrate that UniVerse significantly outperforms state-of-the-art baselines in both localization accuracy and visual fidelity. Qualitative and quantitative results show that our approach can precisely extract target concepts in cluttered scenes, paving the way for more flexible, interpretable, and personalized visual generation and understanding.
Paperid: 3311,   Poster  
Authors: SANKARSHANA VENUGOPAL, Mohammad Mostafavi, Jonghyun Choi
Title: DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation
Abstract: Diffusion-based image-to-image (I2I) translation excels in high-fidelity generation but suffers from slow sampling in state-of-the-art Diffusion Bridge Models (DBMs), often requiring dozens of function evaluations (NFEs). We introduce DBMSolver, a training-free sampler that exploits the semi-linear structure of DBM's underlying SDE and ODE via exponential integrators, yielding exact 1st- and 2nd-order solutions. This reduces NFEs by up to 5× while boosting quality (e.g., FID drops 53% on DIODE at 20 NFEs vs. a 2nd-order baseline). Experiments on inpainting, stylization, and semantics-to-image tasks across resolutions up to 256×256 show DBMSolver sets new SOTA efficiency-quality tradeoffs, enabling real-world applicability.
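To illustrate why exponential integrators suit semi-linear dynamics, the sketch below applies a generic first-order exponential Euler step to a scalar test equation dx/dt = lam * x + g(x, t). This is a textbook scheme shown for intuition only; it is not DBMSolver's update rule, and the test equation and function names are assumptions chosen for the example.

```python
import math

def exponential_euler(x0, lam, nonlinear, t0, t1, steps):
    """Integrate dx/dt = lam * x + nonlinear(x, t) with exponential Euler.

    The linear part is solved exactly; the nonlinear term is frozen over
    each step, so the scheme is exact when that term is constant.
    """
    h = (t1 - t0) / steps
    x, t = x0, t0
    decay = math.exp(lam * h)
    phi = math.expm1(lam * h) / lam   # integral of exp(lam * s) for s in [0, h]
    for _ in range(steps):
        x = decay * x + phi * nonlinear(x, t)
        t += h
    return x

def explicit_euler(x0, lam, nonlinear, t0, t1, steps):
    """Plain forward Euler on the same equation, for comparison."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        x = x + h * (lam * x + nonlinear(x, t))
        t += h
    return x

# Stiff test problem: dx/dt = -50 x + 50, x(0) = 0, exact x(1) = 1 - exp(-50).
forcing = lambda x, t: 50.0
exact = 1.0 - math.exp(-50.0)
exp_result = exponential_euler(0.0, -50.0, forcing, 0.0, 1.0, steps=10)
euler_result = explicit_euler(0.0, -50.0, forcing, 0.0, 1.0, steps=10)
print(exp_result, euler_result, exact)
```

With only ten steps the exponential scheme reproduces the exact solution of this test problem, while forward Euler diverges at the same step size. That is the kind of step-count saving a semi-linear structure makes possible.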
Paperid: 3312,   Poster  
Authors: Martine Hjelkrem-Tan, Marius Aasan, Rwiddhi Chakraborty, Gabriel Arteaga, Changkyu Choi, Adín Ramírez Rivera
Title: MIM Representations Encode Non-Semantic Noise: Post-Hoc Suppression Boosts Zero-Shot Performance
Abstract: Masked Image Modeling (MIM) has become a ubiquitous self-supervised vision paradigm. In this work, we show that MIM objectives cause the learned representations to retain non-semantic information, which ultimately hurts performance during inference. We introduce a model-agnostic score for semantic invariance using Principal Component Analysis (PCA) on real and synthetic non-semantic images. Based on this score, we propose a simple method, Semantic Orthogonal Projection (SOaP), to directly suppress non-semantic information in patch representations, leading to consistent improvements in zero-shot performance across various MIM-based models. SOaP is a post-hoc suppression method, requires zero training, and can be attached to any model as a single linear head.
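A post-hoc orthogonal projection of this kind can be sketched in a few lines of NumPy. The sketch assumes the non-semantic directions are the top-k principal components of features extracted from non-semantic images; how SOaP actually selects them is not described in the abstract, and the function names and the choice of `k` are hypothetical.

```python
import numpy as np

def fit_nonsemantic_basis(nonsemantic_feats, k):
    """Top-k principal directions of features from non-semantic images."""
    centered = nonsemantic_feats - nonsemantic_feats.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]                     # (k, D), rows are orthonormal

def build_projection_head(basis):
    """Single linear map that removes the span of the basis directions."""
    dim = basis.shape[1]
    return np.eye(dim) - basis.T @ basis

rng = np.random.default_rng(0)
nonsemantic_feats = rng.normal(size=(32, 8))   # stand-in for real features
patch_feats = rng.normal(size=(5, 8))

basis = fit_nonsemantic_basis(nonsemantic_feats, k=2)
head = build_projection_head(basis)
cleaned = patch_feats @ head
print(np.abs(cleaned @ basis.T).max())
```

Because the projection is one fixed matrix, it can be applied after any frozen encoder without training, which matches the "single linear head" description.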
Paperid: 3313,   Poster  
Authors: Ziqian Yang, Xinqiao Zhao, Xiaolei Wang, Quan Zhang, Jimin Xiao
Title: Leveraging Class Distributions in CLIP for Weakly Supervised Semantic Segmentation
Abstract: Image-level Weakly Supervised Semantic Segmentation (WSSS) typically leverages Class Activation Maps (CAMs) for pixel-wise localization. However, existing CLIP-based methods often yield under-activated CAMs, primarily due to inaccurate semantic relationships in the affinity-based refinement. In this work, we propose a novel framework, CD-CLIP (Class Distribution based CLIP), which addresses this issue by introducing a Class Distribution Aware (CDA) module. The CDA module captures richer semantic relationships by modeling patch-wise distributions across all classes using Jensen-Shannon divergence, thereby enhancing the completeness of CAMs. While this significantly improves the coverage of the foreground class, over-activation at class boundaries may also occur because relationships between target classes are integrated comprehensively. To mitigate this adverse effect on segmentation supervision, we introduce a Super-class Boundary Exploration (SBE) module, which leverages structural knowledge of DINO to generate boundary-aware super-class prototype CAMs. By employing the boundary-enhanced loss, our SBE module effectively provides accurate boundary supervision for the final segmentation. Our proposed CD-CLIP framework achieves state-of-the-art performance on both PASCAL VOC and MS COCO benchmarks. Code will be released.
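As a rough illustration of comparing patches by their class distributions, the sketch below computes the Jensen-Shannon divergence between per-patch class probability vectors and turns it into an affinity in [0, 1]. The mapping `1 - JS / ln 2` and the function names are assumptions made for the example, not the CDA module's definition.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log((p + eps) / (m + eps)))
    kl_qm = np.sum(q * np.log((q + eps) / (m + eps)))
    return 0.5 * (kl_pm + kl_qm)

def patch_affinity(class_probs):
    """Pairwise affinity between patches; 1 means identical distributions."""
    n = len(class_probs)
    aff = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # JS divergence is bounded by ln 2, so this lies in [0, 1].
            aff[i, j] = 1.0 - js_divergence(class_probs[i], class_probs[j]) / np.log(2.0)
    return aff

# Three patches with distributions over two classes (illustrative values).
probs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
affinity = patch_affinity(probs)
print(affinity.round(3))
```

Patches with similar class distributions receive high affinity even when their top-1 class scores differ slightly, which is one plausible reason a distribution-level comparison captures richer relationships than a single-class one.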
Paperid: 3314,   Poster  
Authors: Nurjahan Sultana, Moi Hoon Yap, Xinqi Fan, Wenqi Lu
Title: CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference
Abstract: Models for AI-based skin cancer screening suffer a severe performance drop when shifting from expert dermoscopic (source) images to consumer-grade clinical (target) photos, hindering real-world deployment. Existing domain adaptation methods often ignore crucial semantic invariants, such as clinical concepts. While new foundation models like MONET can provide this semantic information as dense, probabilistic scores, this metadata is unavailable at test time, creating a deployment paradox for practical image-only screening tools. We address this gap by proposing CoFiDA-M, a privileged information framework that learns from concepts at training time but deploys as an image-only model. Our method trains a teacher network that uses MONET concept probabilities to guide a FiLM modulator, transforming visual features into a semantically "edited" feature space. A lightweight, image-only student is then trained to reproduce this edited representation, not just the teacher's final predictions. This distillation "bakes" the clinical reasoning into the student's weights. On a challenging multi-dataset benchmark, our image-only student significantly outperforms state-of-the-art approaches, especially in melanoma recall. Our work provides a practical and generalizable framework for leveraging noisy, probabilistic metadata as privileged information, demonstrating strong cross-dataset robustness and potential for real-world deployment beyond dermatology. The code will be made publicly available upon acceptance.
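The teacher-side modulation and the feature-level distillation target can be sketched as follows. FiLM applies a per-feature scale and shift predicted from a conditioning vector; here the conditioning vector stands in for the concept probabilities. The linear predictors, the `1 +` residual form of the scale, and the mean-squared-error loss are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def film_modulate(features, concepts, w_gamma, w_beta):
    """Feature-wise linear modulation: gamma(c) * f + beta(c).

    features: (B, D) visual features; concepts: (B, C) concept probabilities.
    w_gamma, w_beta: (C, D) hypothetical linear predictors.
    """
    gamma = 1.0 + concepts @ w_gamma   # identity scaling when w_gamma is zero
    beta = concepts @ w_beta
    return gamma * features + beta

def distillation_loss(student_features, edited_features):
    """Mean squared error between student features and the edited features."""
    return float(np.mean((student_features - edited_features) ** 2))

features = np.array([[1.0, 2.0]])
concepts = np.array([[1.0]])
w_gamma = np.array([[1.0, 0.0]])
w_beta = np.array([[0.0, 3.0]])

edited = film_modulate(features, concepts, w_gamma, w_beta)
loss = distillation_loss(features, edited)
print(edited, loss)
```

Only `film_modulate` needs the concept probabilities, so a student trained to match `edited` from images alone can be deployed without that metadata.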
Paperid: 3315,   Poster  
Authors: Yubo Jiang, Yitong An, Xin Yang, Abudukelimu Wuerkaixi, Xuxin Cheng, Fengying Xie, Zhiguo Jiang, Cao Liu, Ke Zeng, Haopeng Zhang
Title: Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding
Abstract: Vision-Language Models (VLMs) are frequently undermined by object hallucination (generating content that contradicts visual reality) due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast: the positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object's features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model's outputs from these two perspectives at each step, PND steers generation towards text that is not just linguistically probable, but visually factual. Extensive experiments on benchmarks like POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail, all without requiring any model retraining. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.
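The step-wise contrast between a positive and a negative path can be illustrated with the generic contrastive-decoding combination below. The formula `(1 + alpha) * pos - alpha * neg`, the value of `alpha`, and the toy logits are assumptions borrowed from common contrastive decoding practice; the abstract does not give PND's exact rule.

```python
import numpy as np

def contrast_logits(pos_logits, neg_logits, alpha=1.0):
    """Boost tokens favoured by the positive path, penalise tokens that stay
    likely even when the object's visual evidence has been degraded."""
    pos = np.asarray(pos_logits, dtype=float)
    neg = np.asarray(neg_logits, dtype=float)
    return (1.0 + alpha) * pos - alpha * neg

# Toy vocabulary: index 0 = visually grounded token,
# index 1 = token driven by the language prior, index 2 = unrelated token.
pos = [2.0, 2.1, 0.0]   # positive path still slightly prefers the prior token
neg = [0.5, 2.5, 0.0]   # with evidence degraded, the prior token dominates

plain_choice = int(np.argmax(pos))
contrast_choice = int(np.argmax(contrast_logits(pos, neg)))
print(plain_choice, contrast_choice)
```

In this toy case ordinary greedy decoding picks the prior-driven token, while the contrasted logits select the grounded one, because the prior-driven token scores highly on the negative path as well.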
Paperid: 3316,   Poster  
Authors: Ziliang Chen, Yulu Li, Liangda Fang, jusheng zhang, Yongsen Zheng, Quanlong Guan, Xipeng Chen
Title: Vocabulary Scaling Law: Tuning Open-vocabulary Predictors for Their Openness
Abstract: Open-vocabulary learning on CLIP generalizes remarkably well to diverse concepts; however, it falters under realistic streaming open-world evaluations of Stability against distractor classes and Extensibility to novel classes. Current fine-tuning methods often fail these tests because they are designed mainly for closed-set conditions, leading to performance gaps as the target vocabulary progressively scales. We formalize a "vocabulary scaling law" showing that these openness measures can be lower-bounded by performance on the full class-name universe, implying that robust fine-tuning should: (i) account for the entire vocabulary, (ii) tune class-name embeddings rather than context, and (iii) enforce orthogonality between prompt embeddings, including those of training and open-set class names. Guided by our analysis, we propose Submodular-Vocabulary Fine-tuning (SVFT), a bi-level optimization framework that approximates the intractable objective of tuning all class-name embeddings by selecting a small, informative subset of class names via constrained submodular maximization. This allows an efficient greedy algorithm to find a near-optimal class-name subset for fine-tuning CLIP instead of using all open classes. Across extensive experiments, SVFT consistently improves both stability and extensibility, advancing the openness and practical robustness of CLIP-based vision–language models.
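Greedy selection under a cardinality constraint is the standard way to approximately maximize a monotone submodular function. The sketch below uses a facility-location objective over a class-name similarity matrix as a stand-in; SVFT's actual objective and constraints are not given in the abstract, so the objective, the function name, and the toy similarities are all assumptions.

```python
import numpy as np

def greedy_facility_location(sim, budget):
    """Greedily pick `budget` columns maximising sum_i max_{j in S} sim[i, j].

    sim: (N, N) non-negative similarity between class names.
    """
    n = sim.shape[0]
    selected = []
    chosen = np.zeros(n, dtype=bool)
    best_cover = np.zeros(n)            # current max similarity per class name
    for _ in range(budget):
        # Marginal gain of adding each candidate column to the selection.
        gains = np.maximum(sim, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[chosen] = -np.inf         # never pick the same name twice
        j = int(np.argmax(gains))
        selected.append(j)
        chosen[j] = True
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected

# Two tight clusters of class names (illustrative similarities).
sim = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
])
subset = greedy_facility_location(sim, budget=2)
print(subset)
```

With a budget of two, the greedy rule picks one representative from each cluster rather than two near-duplicates, which is the behaviour one wants from a small but informative class-name subset.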
Paperid: 3317,   Poster  
Authors: Jie Long Lee, Gim Hee Lee
Title: ARES: Unifying Asymmetric RGB-Event Stereo for Probabilistic Scene Flow Estimation
Abstract: Estimating dense three-dimensional motion in dynamic high-speed scenes remains challenging due to motion blur, illumination variation, and the limited temporal resolution of conventional cameras. We introduce ARES, a unified framework for Asymmetric RGB-Event Stereo that addresses these issues through a hybrid setup where an event camera captures fine-grained temporal dynamics and an RGB camera provides rich spatial structure. To integrate these heterogeneous modalities, we propose Multimodal Contextual Attention, a transformer-based fusion mechanism that attends to spatial and temporal contexts under cross-view constraints and forms a unified correspondence space for disparity and optical flow estimation. Building on this shared representation, we introduce Temporal Disparity Posterior Fusion, a probabilistic framework that models the evolution of disparity posteriors to infer disparity change and recover metrically coherent scene flow. Trained with sparse supervision and dense self-consistency cues, ARES achieves geometrically consistent and temporally stable three-dimensional motion estimation across diverse driving scenarios. Experiments show that ARES attains state-of-the-art performance in scene flow estimation, establishing a principled path toward unified asymmetric multimodal stereo sensing. Our code will be released upon paper acceptance.
Paperid: 3318,   Poster  
Authors: Qianpeng Chong, Wenyi Zeng, Xiuxuan Shen, Jiajie Li, Qian Yin, Xin Zheng
Title: SegGBC: Justifiable Coarse-to-Fine Granular-Ball Computing for Enhancing Clustering Image Segmentation
Abstract: As an emerging multi-granularity clustering paradigm, granular-ball computing (GBC) hierarchically represents samples through granular-balls (GBs) to capture compact, multi-scale features. Nevertheless, its effective application to clustering-based segmentation methods (CSMs) remains challenging due to two key issues: representing intrinsic uncertainties and defining a justifiable, semantics-aware quality criterion. To address them, the first segmentation framework based on GBC (SegGBC) is proposed to alleviate the single‑granularity limitation of existing CSMs. Concretely, we leverage intuitionistic fuzzy sets (IFS) to explicitly quantify image uncertainty: membership and non‑membership encode evidence, and the IFS hesitation degree models residual ambiguity. In addition, a semantic compactness metric criterion (SCM_GB) is designed to characterize semantic information by considering the "stable region" in conjunction with the overall density of GBs. The "stable region" ensures robust semantics together with high computational efficiency. Extensive experiments demonstrate that the proposed SegGBC achieves promising performance for segmentation. The proposed segmentation GB representation is a plug-and-play front-end, boosting the performance of CSMs by more than 3.25% SA and 3.92% mIoU on standard image and COCO benchmarks. Code is available in the supplementary material.
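For readers unfamiliar with intuitionistic fuzzy sets, the hesitation degree is whatever remains after membership and non-membership are accounted for: pi = 1 - mu - nu, with mu + nu <= 1. A minimal sketch of that standard definition follows; how SegGBC derives mu and nu from image data is not covered by the abstract and is not modelled here.

```python
def hesitation_degree(membership, non_membership):
    """IFS hesitation degree pi = 1 - mu - nu, requiring mu + nu <= 1."""
    if not (0.0 <= membership <= 1.0 and 0.0 <= non_membership <= 1.0):
        raise ValueError("membership and non-membership must lie in [0, 1]")
    if membership + non_membership > 1.0:
        raise ValueError("membership + non-membership must not exceed 1")
    return 1.0 - membership - non_membership

# A pixel with 60% evidence for a region and 30% evidence against it
# leaves roughly 10% residual ambiguity.
pi = hesitation_degree(0.6, 0.3)
print(round(pi, 3))
```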
Paperid: 3319,   Poster  
Authors: Hyungjin Kim, Seokho Ahn, Young-Duk Seo
Title: Foundation Encoders are All You Need for Personalized Image Generation
Abstract: Personalized image generation based on user behaviors reflects individual preferences with minimal user intervention. However, existing studies often rely on inaccurate profiling, high resource costs, and model-specific designs, which jointly restrict creativity, diversity, and generality. To address these limitations, we propose FANG, a novel approach that enables personalization using only foundation encoders, without additional structures. FANG performs tailored profiling to capture user preferences, and reconstructs transformer-based encoders to integrate them while preserving target fidelity. Experiments show that FANG achieves robust, high-quality personalization across various foundation text-to-image models and applications (e.g., CLIP retrieval, unCLIP, vision-language models), seamlessly integrating into diverse encoders without fine-tuning.
Paperid: 3320,   Poster  
Authors: XINHAO YAN, Jiachen Xu, Yang Li, Changfeng Ma, Yunhan Yang, Chunshi Wang, Zibo Zhao, Zeqiang Lai, Yunfei Zhao, Zhuo Chen, Chunchao Guo
Title: X-Part: High Fidelity And Structure Coherent Shape Decomposition And Completion
Abstract: Generating 3D shapes at part level is pivotal for downstream applications such as mesh retopology, UV mapping, and 3D printing. However, existing part-based generation methods often lack sufficient controllability and struggle to produce semantically meaningful decompositions. To this end, we introduce X-Part, a controllable generative model designed to decompose a holistic 3D object into semantically meaningful and structurally coherent parts with high geometric fidelity. X-Part uses bounding boxes as prompts for part generation and injects point-wise semantic features for meaningful decomposition. Furthermore, we design an editable pipeline for interactive part generation. Extensive experimental results show that X-Part achieves state-of-the-art performance in part-level shape generation. This work establishes a new paradigm for creating production-ready, editable, and structurally sound 3D assets. Code will be released for public research.
Paperid: 3321,   Poster  
Authors: Zhenghui Zhao, Chen Wu, Xiangyong Cao, Di Wang, Hongruixuan Chen, Datao Tang, Liangpei Zhang, Zhuo Zheng
Title: ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing
Abstract: Spatiotemporal image generation is a highly meaningful task, which can generate future scenes conditioned on given observations. However, existing change generation methods can only handle event-driven changes (e.g., new buildings) and fail to model cross-temporal variations (e.g., seasonal shifts). In this work, we propose ChangeBridge, a conditional spatiotemporal image generation model for remote sensing. Given pre-event images and multimodal event controls, ChangeBridge generates post-event scenes that are both spatially and temporally coherent. The core idea is a drift-asynchronous diffusion bridge. Specifically, it consists of three main modules: a) Composed Bridge Initialization, which replaces noise initialization and starts the diffusion from a composed pre-event state, modeling a diffusion bridge process. b) Asynchronous Drift Diffusion, which uses a pixel-wise drift map, assigning different drift magnitudes to event and temporal evolution. This enables differentiated generation during the pre-to-post transition. c) Drift-Aware Denoising, which embeds the drift map into the denoising network, guiding drift-aware reconstruction. Experiments show that ChangeBridge generates better cross-spatiotemporally aligned scenarios than state-of-the-art methods. Additionally, ChangeBridge shows great potential for land-use planning and as a data generation engine for a series of change detection tasks.
Paperid: 3322,   Poster  
Authors: Kangjian Zhu, Haobo Jiang, Jianjun Qian, Jin Xie
Title: A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation
Abstract: In this paper, we propose a cross-view fusion framework that enhances the robustness of 6-DoF grasp pose estimation in corner views. Our framework alleviates occlusion by incorporating an auxiliary view and avoids the time-consuming, task-agnostic multi-view reconstruction through a post-fusion strategy. To enable cross-view fusion, we propose a self-supervised contrastive learning strategy that leverages cross-view associations to regularize point cloud features. In brief, a cross-view point pair is considered a match if the two points correspond to the same 3D location, and a non-match if they represent distinct grasp directions. The learning strategy significantly enhances the spatial consistency and direction distinctiveness of point features, thereby facilitating cross-view fusion and improving estimation robustness. Furthermore, we propose a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry into a comprehensive representation. Specifically, the module first aligns the cross-view points and features according to their similarity to enhance the robustness against noise. Subsequently, these points are registered into the cylindrical coordinate frame, emphasizing the rotation-symmetric geometry which is important for grasping. Finally, local self-attention and seed cross-attention layers are alternately employed, respectively enabling interactions within single views and across views, which supports fine-grained representation of grasp-relevant geometry. Our framework achieves strong performance on the GraspNet-1Billion benchmark and in real-world applications.
Paperid: 3323,   Poster  
Authors: Shuwei Li, Lei Tan, Robby T. Tan
Title: White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation
Abstract: Color constancy aims to keep object colors consistent under varying illumination. Cross-camera generalization in color constancy remains challenging because learning-based models often overfit to the color response characteristics of the training camera, resulting in degraded performance on images captured by other cameras. We propose VLM-CC, a vision-language model (VLM)-guided framework that formulates color constancy as an iterative refinement process. Instead of directly estimating the illuminant from raw input, VLM-CC performs iterative correction driven by VLM-based evaluation. At each iteration, the image is white-balanced using the current estimate and converted to pseudo-sRGB. A lightweight LoRA-tuned VLM then assesses the corrected image, identifying the dominant residual color cast and providing qualitative feedback. This feedback is mapped to a residual illumination direction (red, green, or blue) and used to update the illuminant estimate until convergence. Our key idea is to reframe color constancy as an iterative perceptual feedback problem, leveraging VLM evaluation instead of direct RGB regression. By replacing direct RGB estimation with VLM-guided perceptual feedback, VLM-CC achieves state-of-the-art robustness in cross-camera color constancy across multiple datasets.
Paperid: 3324,   Poster  
Authors: Yipeng Wu, Xin WANG, Chenghan Yang, Chong Wang, Dongdong Wu, Wanchao Su, Hengshuang Zhao, Wei Feng, Kairui Yang, Di Lin
Title: RecEdit-Drive: 3D Reconstruction-Guided Spatiotemporal Video Editing for Autonomous Driving Scenes
Abstract: High-quality video editing and processing are crucial in domains such as filmmaking and autonomous driving, where accurate visual refinement and data preparation are essential. However, it is challenging to achieve precise control over dynamic objects while maintaining spatiotemporal consistency. Current approaches typically utilize text prompts or 2D structural priors for video editing to ensure consistency, yet they struggle to effectively constrain the spatial variations of dynamic 3D objects. In this paper, we introduce RecEdit-Drive, a framework that integrates Spatial Feature Warping and Spatiotemporal Collaborative Modeling to effectively control 3D object variations and enhance video consistency. Spatial feature warping provides precise control over the edited foreground 3D objects, improving spatial consistency in the generated videos, while spatiotemporal collaborative modeling seamlessly integrates edited foreground objects into the background, yielding realistic and consistent edited videos. In addition, we design an inference strategy to reconstruct an accurate background structure through noise manipulation, providing a reliable reference for foreground instance editing at early denoising stages. We perform extensive qualitative and quantitative comparisons regarding general video editing and downstream tasks on public datasets, demonstrating the state-of-the-art performance of our proposed method.
Paperid: 3325,   Poster  
Authors: Junfeng Zhang, Zhe Xue, Yuankai Qi, Junping Du, Xiangyang Kong, Yishuo Yan, Amin Beheshti, Jian Yang, Anton van den Hengel, Ming-Hsuan Yang
Title: POGA: Paraphrased and Oppositional Graph Alignment for Fine-Grained Cross-Modal Retrieval
Abstract: Most of the models used to generate embeddings for retrieval are not trained for that purpose, which leads them to focus on coarse semantic alignment rather than particular object attributes or arrangements. This limits their performance, particularly on challenging problems such as cross-modal fine-grained retrieval. Furthermore, their training objectives lack the discriminative ability required to distinguish between descriptions that are semantically similar but factually different. To address these challenges, we propose POGA (Paraphrased and Oppositional Graph Alignment), a novel framework for fine-grained cross-modal alignment. POGA comprises two core innovations: (1) Multi-source Graph Augmentation (MSGA), which not only generates paraphrased positives and oppositional negatives, but also parses the image and all text variants into structured graphs to provide difference-rich supervisory signals; (2) Hybrid Multi-granularity Alignment (HMA), which defines a composite training objective that jointly optimizes the model at four distinct granularities: robust dual global alignment, plus precise matching at three fine-grained levels (node, relation, and focal disproving). Experiments on benchmarks such as DCI and DOCCI demonstrate that POGA performs favorably against several state-of-the-art methods in long-text understanding and complex relation discrimination.
Paperid: 3326,   Poster  
Authors: Le Yang, Hongping Gan
Title: Multi-Scale Gradient-Guided Unrolling Architecture with Adaptive Mamba for Compressive Sensing
Abstract: In the field of Compressive Sensing (CS), deep unrolling networks (DUNs) have demonstrated exceptional performance and interpretability by integrating traditional optimization solvers with deep networks. However, existing DUNs suffer from homogenization in cross-stage feature extraction and insufficient integration of gradient-guided information. Additionally, the feature extraction module struggles to balance the global receptive field and computational efficiency, which limits improvements in image reconstruction details. To address these challenges, we propose a multi-scale gradient-guided unrolling architecture with adaptive Mamba for CS, named MambaCS. Specifically, we utilize our customized Adaptive State-Space Block (A-SSB) to unroll the well-known Proximal Gradient Descent (PGD) algorithm across multiple feature levels to extract comprehensive image features while maintaining computational efficiency. Moreover, we design a High-Dimensional Gradient Fusion (HDGF) module that ensures the persistent and stable injection of gradient-guided information across various scales and dimensions, while effectively eliminating information bottlenecks. Finally, we develop a Feature-Adaptive Proximal Operator (FAPO), using A-SSB as an extension of the sparse basis associated with the PGD proximal operator, which enhances sensitivity to multi-scale features and improves detail reconstruction. Extensive experiments demonstrate the significant advantages of our proposed MambaCS over the current SOTA methods.
Paperid: 3327,   Poster  
Authors: Raghav Magazine, Xingjian Li, Min Xu
Title: MedLIME: A Distribution-Aligned and Evidence-Supported Framework for Medical Saliency Explanations
Abstract: Saliency-based explainability methods are widely used to interpret deep learning models in medical imaging, yet many existing approaches rely on white-box access to models, which is not always possible due to privacy concerns. In this work, we introduce MedLIME, a novel, model-agnostic explanation framework designed to enhance the robustness and fidelity of saliency maps for medical imaging abnormality localization. Building upon the Local Interpretable Model-agnostic Explanations (LIME) paradigm, MedLIME integrates three key components: (1) Generative Masking (GM), (2) Supervised Test-Time Adaptation (STT), and (3) Evidence-based Regularization (EBR) to improve the saliency map generation accuracy of LIME. Extensive experiments on multiple medical datasets and across three model architectures demonstrate that MedLIME consistently outperforms gradient-based and perturbation-based baselines in abnormality localization as measured by AUPRC. Our results highlight that incorporating generative reconstruction, adaptive perturbation, and data-driven regularization improves the reliability and interpretability of medical imaging models.
Paperid: 3328,   Poster  
Authors: Huayu Mai, Rui Sun, Yujia Chen, Wangkai Li, Bingzhou Wang, Aibing Li, Zhangyu He, Yuan Wang
Title: From Softmax to Dirichlet: Evidential Learning for Semi-supervised Semantic Segmentation
Abstract: The critical challenge of semi-supervised semantic segmentation lies in how to fully exploit a large volume of unlabeled data to improve the model's generalization performance for robust segmentation. However, existing softmax-score-based filtering methods tend to be affected by the overconfidence issue in neural networks, leading to the inclusion of incorrect pseudo-labels that negatively impact the training process. In this paper, we propose a novel evidential learning framework to explicitly model the prediction uncertainty for reliable pseudo-label selection. By modeling the distribution of class probabilities using Dirichlet distributions, we obtain principled and improved uncertainty estimates from a distributional perspective. Furthermore, we propose HESS (Hyper-ESS), decoupling the modeling of exclusive and collective evidence for comprehensive evidence perception, to yield more accurate uncertainty estimates. Extensive experiments on three challenging benchmarks demonstrate that integrating HESS into existing semi-supervised semantic segmentation frameworks consistently improves performance, benefiting from more reliable pseudo-label selection. Our work sheds light on the potential of evidential learning in semi-supervised semantic segmentation and opens up new avenues for future research. Code and models will be made available to facilitate future research.
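As a rough illustration of Dirichlet-based uncertainty for pseudo-label filtering, the sketch below uses the common subjective-logic formulation from evidential deep learning (concentration = evidence + 1, uncertainty = number of classes / Dirichlet strength). This formulation is an assumption; the abstract does not specify how HESS computes or decouples its evidence. The function names and the `max_uncertainty` threshold are hypothetical.

```python
def dirichlet_uncertainty(evidence):
    """Map non-negative per-class evidence to expected probabilities and an
    uncertainty score (assumed subjective-logic formulation, not HESS itself)."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet concentration parameters
    strength = sum(alpha)                    # Dirichlet strength S
    probs = [a / strength for a in alpha]    # expected class probabilities
    uncertainty = num_classes / strength     # equals 1 when there is no evidence
    return probs, uncertainty


def keep_pseudo_label(evidence, max_uncertainty=0.2):
    """Return (label, uncertainty); label is None when the pixel is too uncertain.
    The threshold value is a hypothetical choice for illustration."""
    probs, u = dirichlet_uncertainty(evidence)
    label = max(range(len(probs)), key=probs.__getitem__)
    return (label, u) if u <= max_uncertainty else (None, u)
```

With this formulation, a pixel with no evidence for any class gets uncertainty 1 and is discarded, whereas a softmax score would still assign it some maximum-probability class.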
Paperid: 3329,   Poster  
Authors: Fadi Boutros, Eduarda Caldeira, Tahar Chettaoui, Naser Damer
Title: IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbations
Abstract: Synthetic data has emerged as a practical alternative to authentic face datasets for training face recognition (FR) systems, especially as privacy and legal concerns increasingly restrict the use of real biometric data. Recent advances in identity-conditional diffusion models have enabled the generation of photorealistic and identity-consistent face images. However, many of these models suffer from limited intra-class variation, an essential property for training robust and generalizable FR models. In this work, we propose IDperturb, a simple yet effective geometric-driven sampling strategy to enhance diversity in synthetic face generation. IDperturb perturbs identity embeddings within a constrained angular region of the unit hyper-sphere, producing a diverse set of embeddings without modifying the underlying generative model. Each perturbed embedding serves as a conditioning vector for a pre-trained diffusion model, enabling the synthesis of visually varied yet identity-coherent face images suitable for training generalizable FR systems. Empirical results show that training FR on datasets generated using IDperturb leads to improved performance across multiple FR benchmarks, compared to existing synthetic data generation approaches. Code and generated datasets will be publicly released.
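One way to perturb an embedding within a constrained angular region of the unit hypersphere is to rotate it toward a random tangent direction by a bounded angle. The sketch below does this in plain Python. The uniform angle distribution, the function name `perturb_on_sphere`, and the `max_angle_rad` parameter are assumptions for illustration; the abstract does not state how IDperturb samples its perturbations.

```python
import math
import random


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def perturb_on_sphere(embedding, max_angle_rad, rng=random):
    """Return a unit vector whose angle to `embedding` is at most max_angle_rad.
    Hypothetical sampling scheme: uniform angle, random tangent direction."""
    if len(embedding) < 2:
        raise ValueError("embedding must have at least two dimensions")
    e = _normalize(embedding)
    while True:
        # Draw a Gaussian vector and remove its component along e,
        # leaving a direction tangent to the sphere at e.
        g = [rng.gauss(0.0, 1.0) for _ in e]
        dot = sum(gi * ei for gi, ei in zip(g, e))
        tangent = [gi - dot * ei for gi, ei in zip(g, e)]
        if math.sqrt(sum(t * t for t in tangent)) > 1e-12:
            break
    t = _normalize(tangent)
    theta = rng.uniform(0.0, max_angle_rad)
    # Rotating e toward t by theta keeps the result on the unit sphere.
    return [math.cos(theta) * ei + math.sin(theta) * ti for ei, ti in zip(e, t)]
```

Each returned vector could then serve as a conditioning vector for the generator. Sampling the angle uniformly does not give a uniform distribution over the spherical cap, which may or may not matter for diversity.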
Paperid: 3330,   Poster  
Authors: Huan Zhang, Shuyu Dong, Yujin Zheng, Dingwen Wang, Shenghua Fan, Fan Lyu
Title: Subspace Alignment for CLIP-based Continual Learning via Canonical Correlation Analysis
Abstract: Recent advances in CLIP-based continual learning have shown the potential of leveraging pre-trained vision–language models for sequential tasks. However, existing methods overlook a key problem we call Asymmetric Drift. In unimodal CLIP-based continual learning, the visual branch undergoes stronger adaptation because the visual distribution shifts significantly, whereas the text branch remains relatively stable due to the low variance of textual prompts. This imbalance increases the modality distance and degrades cross-modal alignment over time. To address this issue, we propose CCA-CL, a framework that accumulates visual-textual covariance statistics across tasks and solves Canonical Correlation Analysis to compute a shared subspace. In this subspace, the distance between visual and textual features is minimized, enabling better alignment without modifying CLIP parameters. This also makes our method naturally compatible with exemplar-free CL settings. To further capture nonlinear relationships that linear Canonical Correlation Analysis struggles to model, we introduce Random Fourier Projection as an extension. Experimental results demonstrate that CCA-CL effectively mitigates the asymmetric drift problem and achieves state-of-the-art performance on several benchmarks. Our code will be available.
Paperid: 3331,   Poster  
Authors: Zhan Wang, Wang Leiquan, Chunlei Wu, Yu Meng
Title: Convexity-Aware Noise Calibration: A Self-Supervised Framework for Noise-Level-Unknown Image Denoising
Abstract: Image denoising is a fundamental task in computer vision aimed at recovering clean images from noise-corrupted observations. While supervised deep learning methods achieve remarkable performance when trained on paired data with known noise levels, their real-world applicability is limited as noise characteristics are often unknown. Existing unsupervised techniques, such as blind-spot networks or methods based on statistical estimation, either compromise performance due to information loss or suffer from inaccuracies in noise level estimation. To address these challenges, we propose a novel two-stage self-supervised denoising framework that first accurately estimates the noise level directly from noisy images, without requiring clean references or prior noise knowledge. Building upon theoretical insights from Noisier2Noise, we rigorously derive a relationship between the noise level and the variance of the denoised image, enabling robust estimation via a deep learning model and a ternary search strategy. The estimated noise level is then used to synthesize training pairs for supervised denoising. Experiments demonstrate that our method outperforms existing unsupervised approaches and traditional noise estimation techniques, achieving performance competitive with, and in some cases surpassing, supervised methods trained with known noise levels. The proposed framework effectively overcomes the training data pair limitations of supervised approaches for unknown additive white Gaussian noise. Our code will be available.
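For readers unfamiliar with ternary search, the sketch below shows the generic technique: locating the minimizer of a function that decreases and then increases on an interval by discarding a third of the interval per step. The objective passed in is a hypothetical stand-in. The abstract does not give the actual relationship between noise level and denoised-image variance, so nothing here should be read as the paper's criterion.

```python
def ternary_search_min(f, lo, hi, tol=1e-6, max_iters=200):
    """Locate the minimizer of a unimodal function f on [lo, hi].
    Generic ternary search; f is a placeholder for whatever convex
    criterion over candidate noise levels is available."""
    for _ in range(max_iters):
        if hi - lo <= tol:
            break
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2   # the minimizer cannot lie in (m2, hi]
        else:
            lo = m1   # the minimizer cannot lie in [lo, m1)
    return (lo + hi) / 2.0
```

For example, `ternary_search_min(lambda s: (s - 25.0) ** 2, 0.0, 100.0)` converges to a value close to 25. Each iteration costs two evaluations of `f`, which matters when every evaluation involves running a denoiser.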
Paperid: 3332,   Poster  
Authors: Weiyu Li, Antoine Toisoul, Tom Monnier, Roman Shapovalov, Rakesh Ranjan, Ping Tan, Andrea Vedaldi
Title: MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based DiTs
Abstract: We present MeshFlow, a new method for compressing and generating artist-like 3D meshes. Current mesh generators often adopt Auto-Regressive (AR) next-token prediction, a natural choice given the discrete nature of mesh connectivity, which, however, scales poorly due to the inference cost being quadratic in mesh size. AR methods also require discretizing the vertex coordinates, which introduces quantization errors and can cause vertex collapse. To address these challenges, we introduce a Variational Autoencoder (VAE) that, supervised with a contrastive loss, represents both continuous vertex positions and discrete connectivity in a continuous latent space. This latent space is significantly more compact than prior token-based mesh representations. We then build a 3D generator based on a Rectified-Flow transformer, which generates all mesh vertices and edges in parallel. This model samples meshes 26× faster than the fastest AR generator while also achieving state-of-the-art accuracy across standard mesh-generation metrics.
Paperid: 3333,   Poster  
Authors: YiJun Sheng, Shipeng Zhu, Ruijia Zuo, Na Nie, Hui Xue
Title: MCHDoc: A Comprehensive Benchmark for Reading Multi-Carrier Chinese Historical Documents
Abstract: Chinese historical documents are essential carriers for the inheritance and dissemination of traditional Chinese culture. However, traditional manual digitization of different types of historical carriers is not only time-consuming and labor-intensive but also heavily reliant on experts with specialized knowledge of the specific carrier domains. In the past, experts read Chinese historical documents by first recognizing the text and then consulting a large number of professional books for citation and correction. With the emergence of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), we see new opportunities for uniformly reading different types of carriers. Nevertheless, existing studies mainly focus on evaluating the OCR capabilities of MLLMs, without incorporating citation or retrieval functionalities, and are restricted to a single type of carrier. To address this, we introduce MCHDoc, a comprehensive benchmark for reading multi-carrier Chinese historical documents. This benchmark consists of 15,723 documents and covers six types of carriers, including Inscription, AncientBook, Calligraphy, Oracle Bone, Silk, and JianDu (bamboo slip). Based on this benchmark, we evaluate various MLLMs and LLMs to test their capacity for reading multi-carrier Chinese historical documents. The results reveal that the top MLLMs and LLMs achieve excellent performance on some types of carriers, but they still fall short of reading all carrier types perfectly. Overall, MCHDoc is a standardized and comprehensive benchmark for reading Chinese historical documents, providing valuable insights for Chinese cultural study.
Paperid: 3334,   Poster  
Authors: Guangzhi Wang
Title: FlashIn: Fast and Accurate Image Inversion for Real-time Image Editing
Abstract: Given an image and a descriptive prompt, image inversion seeks to identify the initial noise that, when denoised, accurately reconstructs the original image. This is crucial for applications like image editing, which can be achieved by denoising the inverted noise with an edited prompt. Existing methods often rely on approximations and require many steps, leading to inaccuracies, slow processing, and artifacts due to the inherent intractability of the inversion process. To overcome these issues, in this work, we propose FlashIn, a novel algorithm for faster and more accurate image inversion, enabling high-quality, real-time editing. FlashIn offers two main contributions: i) A learnable neural network directly maps an image to its corresponding noise. Trained with a cycle-consistent strategy using generated data and seed noise, this approach yields a more efficient and precise inversion model. ii) Adversarial training aligns noise-reconstructed images with real ones, enhancing inversion accuracy and editing quality. These strategies enable a fast, accurate inversion process in a single step, with further improvements possible through additional steps. Integrated with few-step diffusion models such as Flux.1-Schnell, our method achieves high-quality image editing within one second on a single A100 GPU, facilitating real-time, interactive editing. Extensive experiments demonstrate that FlashIn delivers state-of-the-art inversion precision and impressive editing results across various scenarios and applications.
Paperid: 3335,   Poster  
Authors: Linqing Wang, zhiyong xu, XiMing Xing, YIJI CHENG, Zhiyuan Zhao, Donghao Li, Tiankai Hang, Zhenxi Li, Jiale Tao, wangqixun wangqixun, Ruihuang Li, Comi Chen, Xin LI, Mingrui Wu, Xinchi Deng, Shuyang Gu, Chunyu Wang, qinglin lu
Title: PromptEnhancer: Taming Your Rewriter for Text-to-Image Generation via Fine-Grained Reward
Abstract: Recent advances in text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in generating high-fidelity images. However, these models often struggle to faithfully render complex user prompts, particularly in aspects such as attribute binding, negation, and compositional relationships. To address this challenge, we introduce PromptEnhancer, a novel and universal prompt rewriting framework that enhances any pre-trained T2I model. Specifically, we adopt a multi-stage training pipeline to systematically boost the rewriter's understanding and rewriting performance. In the first stage, we conduct supervised fine-tuning (SFT) using CoT-enabled data to enable the rewriter to generate structured, chain-of-thought-style responses. In the second stage, we design a task-specific reward model, AlignEvaluator, to further align user prompts with fine-grained preferences through GRPO. The AlignEvaluator is trained to provide explicit and fine-grained feedback based on a systematic taxonomy derived from common T2I failure cases. By optimizing the rewriter to maximize the reward from AlignEvaluator, our framework learns to generate prompts that T2I models can interpret more precisely. Furthermore, we introduce a comprehensive human-aligned benchmark to facilitate future research in this direction. Extensive experiments demonstrate that PromptEnhancer significantly improves image-text alignment across a wide range of semantic and compositional challenges.
Paperid: 3336,   Poster  
Authors: YITING LI, Xulei Yang, Jingyi Liao, Jing Zhang, Fayao Liu
Title: GPFlow: Gaussian Prototype Probability Flow for Unsupervised Multi-Modal Anomaly Detection
Abstract: We address unsupervised multimodal anomaly detection (MAD) in few-shot regimes, where only a handful of normal exemplars are available per class. Existing approaches struggle with such data scarcity due to their inability to capture the distribution-level information of normal appearance and geometry. To capture diverse and continuous normality variations, we propose GPFlow, a probability-flow-inspired framework that embeds diverse normal patterns into a latent space of learnable Gaussian prototypes. At its core, GPFlow uses an analytical Posterior-Mean Path (PMP) router that iteratively moves features toward prototype-centered high-probability neighborhoods, acting as an explicit information bottleneck to prevent trivial reconstruction of anomalies. To exploit multi-modal cues, GPFlow employs a coupled reconstruction architecture that enforces both intra- and cross-modal consistency at the prototype level. Finally, to handle distribution shift between sparse training samples and unseen test samples, GPFlow incorporates inference-aware prototype refinement to dynamically expand the prototypes' coverage to new normal variations during test time. Extensive experiments on MVTec-3D-AD and Eyecandies show that GPFlow achieves state-of-the-art performance with only a few normal training samples, while remaining computationally efficient.
Paperid: 3337,   Poster  
Authors: Tingyun Li, Xinyi Liu, Yongjun Zhang, Yi Wan, Xiaoan Liu, Fan Weiwei, Jiahao Liu
Title: AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction
Abstract: Monocular UAV videos pose a fundamental challenge for 3D reconstruction: dynamic scene modeling requires accurate camera poses, yet recovering poses from long UAV trajectories often fails under texture-sparse regions and moving objects. Existing approaches typically handle either pose-free static reconstruction or dynamic reconstruction with known poses, but jointly solving both from casual aerial footage remains difficult due to motion coupling and severe scale variation. We introduce AeroGS, a scale-aware Gaussian splatting framework that jointly recovers camera trajectories and reconstructs dynamic scenes from pose-free monocular videos. Central to our method are scale-aware spatio-temporal anchors (S^2A-Anchors), which enable a unified optimization via three key decoupling mechanisms: (i) separating ego-motion from object motion, (ii) isolating static geometry from temporal deformation, and (iii) adapting scale between distant terrain and nearby objects. This design effectively stabilizes optimization under large motion and scale imbalance. Extensive experiments on UAV and driving benchmarks show that AeroGS achieves state-of-the-art rendering quality (PSNR/LPIPS), precise trajectory recovery (ATE/RPE), and faithful motion reconstruction, consistently surpassing recent pose-free baselines.
Paperid: 3338,   Poster  
Authors: yongxin yan, Weisen Chen, Xingye Chen, Yuanjie Shao, Zhengrong Zuo, Wenming Tan, Wenqi Ren, Changxin Gao, Nong Sang
Title: Semantic-Guided Global-Local Collaborative Prompt Learning for Few-Shot Class Incremental Learning
Abstract: Few-Shot Class-Incremental Learning (FSCIL) poses a critical challenge in machine learning, requiring models to continuously integrate novel classes with limited samples while preserving knowledge of previously seen classes. While existing FSCIL approaches have demonstrated promising results, they still suffer from catastrophic forgetting and few-shot overfitting due to the challenge of balancing old knowledge retention with new knowledge acquisition. To address these challenges, we propose an innovative Semantic-Guided Global-Local Collaborative Prompt Learning (SGLC) framework. Built upon powerful pre-trained Vision-Language Models (VLMs), the framework first introduces a dual-alignment mechanism: globally aligning visual features with visual-textual prototypes and locally aligning multi-view visual features with local textual attribute features, which facilitates effective knowledge learning while preserving existing knowledge via frozen prototypes of previous classes. Furthermore, to alleviate overfitting, we incorporate Large Language Models (LLMs) to generate semantically rich textual descriptions, which simultaneously guide both global and local prompt learning through knowledge distillation. Extensive experiments on the miniImageNet, CIFAR-100, and CUB200 datasets demonstrate that SGLC performs favorably against the state-of-the-art methods.
Paperid: 3339,   Poster  
Authors: Xiaoyu Han, Chenyang Wang, Jing Wang, Shunyuan Zheng, Quanling Meng, Shengping Zhang
Title: MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On
Abstract: Virtual try-on aims to fit an in-shop clothing image onto a specific human body. An optimal virtual try-on method should provide diverse and flexible dressing options, accurately reflecting the varied wearing styles encountered in real-life scenarios, tailored to individual preferences and fashion aspirations. However, current methods predominantly perform a direct replacement of the original clothing with the target clothing, following the same dressing pattern. This limited control over clothing adaptation may result in fixed and monotonous try-on outputs. To delve into More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On, we propose a novel virtual try-on method, termed MOFA-VTON, which allows adjustment for clothing adaptations in try-on results through simple sketches by users. Specifically, we first design a mask construction strategy that transforms user-drawn curve sketches into a dual-region mask, replacing the traditional clothing-agnostic mask and providing fine-grained layout guidance for the subsequent generation process. Further, we propose layout adjustment blocks that utilize the cross-attention mechanism to independently learn layout correspondences for upper and lower regions of the human body, refining the spatial arrangement of the two regions. With these implementations, our method enables flexible and fine-grained adaptations of target clothing, overcoming the constraints of a fixed layout. Extensive experiments on VITON-HD and DressCode datasets demonstrate that our proposed MOFA-VTON outperforms previous state-of-the-art methods and provides more fashion possibilities for virtual try-on.
Paperid: 3340,   Poster  
Authors: Jinhyeok Jang, Jaehong Kim, Jung Uk Kim
Title: Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting
Abstract: Pre-trained weights have become a cornerstone of modern deep learning, enabling efficient knowledge transfer and improving downstream task performance, especially in data-scarce scenarios. However, a fundamental question remains: how can we obtain better pre-trained weights that encapsulate more knowledge beyond the given dataset? In this work, we introduce KNowledge-Overflowed Weights (KNOW) prediction, a novel strategy that leverages structured forgetting and its inversion to synthesize knowledge-enriched weights. Our key insight is that sequential fine-tuning on progressively downsized datasets induces a structured forgetting process, which can be modeled and reversed to recover knowledge as if trained on a larger dataset. We construct a dataset of weight transitions governed by this controlled forgetting and employ meta-learning to model weight prediction effectively. Specifically, our KNowledge-Overflowed Weights Nowcaster (KNOWN) acts as a hyper-model that learns the general evolution of weights and predicts enhanced weights with improved generalization. Extensive experiments across diverse datasets and architectures demonstrate that KNOW prediction consistently outperforms naive fine-tuning and simple weight prediction, leading to superior downstream performance. Our work provides a new perspective on reinterpreting forgetting dynamics to push the limits of knowledge transfer.
Paperid: 3341,   Poster  
Authors: Shiyu Qin, XINJIE ZHANG, Zhening Liu, Jinpeng Wang, Bin Chen, Jiawei Li, Yifan Ren, Shu-Tao Xia, Jun Zhang
Title: MambaSIC: Mamba-based Stereo Image Compression with Bi-directional Multi-reference Entropy Model
Abstract: Stereo image compression (SIC) has become increasingly vital with its applications surging in fields such as 3D reconstruction and autonomous navigation. Previous methods leverage cross-attention to model inter-view redundancy and employ autoregressive entropy models to predict probability distributions, achieving impressive rate-distortion performance. However, they suffer from slow coding speed due to the quadratic complexity of cross-attention mechanisms and the spatial autoregressive iterations of the entropy models. To address these limitations, we propose MambaSIC, which introduces two key innovations. First, we propose a Mamba-based stereo visual state space block (stereo VSSB) that leverages its linear complexity and long-range modeling capabilities to more rapidly and efficiently capture redundancy information between the two views. Second, to accelerate the compression process and enhance the accuracy of probability distribution estimation, we introduce a bi-directional multi-reference entropy model that utilizes a checkerboard partitioning strategy and the stereo VSSB to obtain rich inter-view priors. Experimental results demonstrate that our MambaSIC outperforms the state-of-the-art methods in both rate-distortion performance and coding efficiency. Moreover, it achieves the smallest inter-view PSNR discrepancy, resulting in more balanced reconstruction quality.
Paperid: 3342,   Poster  
Authors: Ting-Hsuan Chen, Ying-Huan Chen, Tao Tu, Jie-Ying Lee, Cho-Ying Wu, Fangzhou Lin, Hengyuan Zhang, David Paz, Xinyu Huang, Yuliang Guo, Yu-Lun Liu, Yue Wang, Liu Ren
Title: Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
Abstract: Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial–temporal consistency, constraints that remain challenging for perspective video generators due to their limited field of view (FoV). Their narrow FoV forces long or multi-view trajectories, amplifying cross-view inconsistency and temporal drift. We argue that 360° video generation offers a natural solution: panoramic coverage simplifies trajectory design and provides strong global context for maintaining coherence. We introduce Pantheon360, a controllable 360° video generation framework that synthesizes high-fidelity videos from sparse 360° inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency. Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360° scene generation for downstream simulation and digital-twin applications.
Paperid: 3343,   Poster  
Authors: Lin Wang, Fang Liu, Xiaofen Xing, Kailing Guo, Xiangmin Xu
Title: CLEX: Complementary Label Exchange Learning for Noisy Facial Expression Recognition
Abstract: Facial expression recognition (FER) in the wild is severely hampered by label noise and annotation ambiguity. Existing methods, including sample selection, label ensembling, and consistency regularization, primarily rely on ordinary label supervision and offer limited control over non-target predictions, leading to spurious activations and overfitting to noisy labels. To address this limitation, we propose a novel learning framework, named Complementary Label Exchange Learning (CLEX), which enhances robustness by exchanging knowledge from non-target predictions across augmented views. Specifically, CLEX comprises three synergistic components. First, Stochastic Non-Target Logit Exchange randomly swaps a subset of non-target logits between original and augmented views to couple error-prone predictions, creating robust consistency constraints. Second, Scale-Invariant Logit Normalization eliminates magnitude artifacts through L_p-norm normalization, ensuring that regularization operates over geometrically meaningful directions rather than being dominated by arbitrary scales. Third, Complementary Suppression Loss selectively penalizes spurious activations over a randomly retained subset of non-target classes, avoiding the uniform shrinkage that hampers discriminative learning. To further stabilize training, we incorporate attention consistency regularization that enforces spatial alignment between augmented views, while retaining auxiliary cross-entropy to preserve semantic localization capability. Extensive experiments across multiple benchmark FER datasets (RAF-DB, FERPlus, and AffectNet) demonstrate that CLEX consistently outperforms existing robust FER learning approaches.
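The first CLEX component, Stochastic Non-Target Logit Exchange, can be illustrated with a minimal sketch. The abstract does not specify the swap rule, so the function name `exchange_non_target_logits` and the per-class swap probability `swap_prob` are assumptions made for illustration, not the authors' implementation.

```python
import random


def exchange_non_target_logits(logits_a, logits_b, target, swap_prob=0.5, rng=None):
    """Swap a random subset of non-target logits between two views.

    logits_a, logits_b: per-class logits for the original and augmented view.
    target: index of the labeled class; its logit is never swapped.
    swap_prob: hypothetical probability of swapping each non-target class.
    """
    if len(logits_a) != len(logits_b):
        raise ValueError("both views must have the same number of classes")
    rng = rng or random.Random()
    out_a, out_b = list(logits_a), list(logits_b)
    for k in range(len(out_a)):
        # Only non-target positions are candidates for exchange.
        if k != target and rng.random() < swap_prob:
            out_a[k], out_b[k] = out_b[k], out_a[k]
    return out_a, out_b
```

A consistency loss computed on the exchanged logits would then tie the two views' non-target predictions together; the form of that loss is not given in the abstract and is left out here.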
Paperid: 3344,   Poster  
Authors: Hao Sun, Yadong Huo, Qibing Qin, Wenfeng Zhang, Lei Huang
Title: Intra-class Distribution-guided Generative Hashing with Neighbor Refinement for Cross-modal Retrieval
Abstract: Recent cross-modal hashing methods have introduced sample generation strategies to enrich training signals. Despite these advances, sample generation-driven hashing still faces two major challenges: (1) Interpolation-based methods adopt deterministic and class-independent generation that restricts synthetic samples to a small region around the original data. Consequently, intra-class diversity is limited, which weakens the model’s ability to learn discriminative binary codes. (2) Generation network-based methods leverage a complex generative model to produce synthetic samples, which adds model complexity. To address these issues, we propose a novel Intra-class Distribution-guided Generative Hashing (IDGH) that adaptively generates synthetic samples directly from estimated intra-class distributions. Specifically, we suggest an Intra-class Distribution Estimation (IDE) scheme to model the characteristic distribution of each class, providing essential support for adaptive sample generation. Meanwhile, by utilizing the distribution information from neighboring classes, we design a Neighbor-guided Distribution Refinement (NDR) mechanism to correct flawed estimations for classes. With refined intra-class distributions, we propose a Distribution-aware Adaptive Generation (DAG) strategy that synthesizes informative training samples by shifting features along diverse directions guided by intra-class distribution patterns. The proposed approach is plug-and-play and can be seamlessly integrated into various objective functions, providing semantically diverse training samples, thus enhancing similarity learning. Extensive experiments on benchmark datasets demonstrate that IDGH outperforms existing methods.
Paperid: 3345,   Poster  
Authors: Bingfeng Zhang, Siyue Yu, Hui Li, Jiahua Lin, Wenwu Wang, Jimin Xiao
Title: The Power of Prior: Training-Free Open-Vocabulary Semantic Segmentation with LLaVA
Abstract: Multimodal Large Language Models (MLLMs) like LLaVA have demonstrated remarkable capabilities in multimodal understanding and generation. This success motivates us to investigate whether the inherent prior knowledge embedded within such MLLMs contains sufficient spatial awareness for dense prediction tasks, without requiring any task-specific fine-tuning. Thus, in this paper, we explore the utilization of LLaVA for training-free open-vocabulary semantic segmentation. We discover that certain layers within the LLM part of LLaVA can generate localized features corresponding to given object classes. Building on this intrinsic capability, we design three modules: a question-answer pipeline to identify target classes in the image, a text-visual response module to extract initial reliable pixel-level activations for the target class, and a visual generation module to produce reliable refined prompts, which further serve as guidance for SAM to generate the predictions. Our LLaVA-based approach achieves new state-of-the-art performance on "Thing" category datasets, e.g., PASCAL VOC 2012 and COCO-object. Moreover, our method does not require explicit background class names, demonstrating its exceptional potential for handling open-world scenarios. The code will be released.
Paperid: 3346,   Poster  
Authors: Yadong Liu, Qiaoqi Li, Yueying Wang, Lunke Fei, Jie Wen
Title: Cross-View Distillation and Adaptive Masking for Incomplete Multi-View Multi-Label Classification
Abstract: While existing incomplete multi-view multi-label learning methods have achieved promising performance, few studies have focused on the issue of multi-view imbalance. Existing methods using gradient modulation or alternating optimization strategies alleviate this problem but often oversimplify the interaction between views, resulting in persistently suboptimal performance. In response to this challenge, we propose the Cross-view Distillation and Adaptive Masking (CDAM) framework, a novel approach designed to achieve balanced multi-view optimization for the challenging double incomplete multi-view multi-label learning tasks. First, to overcome the performance bottleneck of views, we design a cross-view distillation module. This module aligns low-quality student representations with high-quality teacher representations, thereby effectively mitigating the multi-view imbalance problem. Second, recognizing that distillation may not rectify all low-quality views, we introduce a subsequent adaptive masking module to perform an explicit quality assessment. This module dynamically identifies and masks out any remaining unreliable representations before multi-view fusion, thus preventing low-quality information from corrupting the fused representation. Extensive comparisons with nine state-of-the-art methods on six datasets validate the effectiveness and stability of our method.
Paperid: 3347,   Poster  
Authors: Jonghee Back, Jongju Kim, Jeong-Uk Kim, Eunjin Kim, Minyong Jeon
Title: IFCSR: Inference-Free Fidelity-Realism Control for One-Step Diffusion-based Real-World Image Super-Resolution
Abstract: Diffusion models have recently achieved remarkable success in real-world image super-resolution (ISR), typically balancing a trade-off between fidelity (i.e., similarity to HR images) and realism (i.e., perceptual naturalness). To better account for subjective preferences in image quality, controllable diffusion-based methods have been explored, allowing personalized adjustment of this trade-off via tunable parameters. While existing controllable methods have shown effective control, they operate in the latent space and require repeated network inference during adjustment, eventually limiting their practicality. In this paper, we propose IFCSR, a simple yet practical approach for one-step diffusion-based real-world ISR that enables inference-free control between fidelity and realism. The key idea behind IFCSR is to design a controllable model that adjusts the fidelity-realism trade-off in the image space, rather than in the latent space. Such an image-space control allows users to seamlessly adjust the trade-off without extra inference after an initial inference of fidelity- and realism-specific images. We further introduce a two-stage training scheme and specialized losses that encourage the controllable space to span a broad spectrum of fidelity and realism. Our method achieves quality competitive with state-of-the-art models while providing a practical advantage through inference-free control.
Paperid: 3348,   Poster  
Authors: Lu Xu, Guosheng Yin
Title: Beyond Euclidean Gossip: KL-Barycentric Consensus on Heterogeneous and Imbalanced Images
Abstract: Fully decentralized deep learning removes global servers and ensures local data privacy. However, Euclidean consensus, which averages weights, gradients, or momentum, may degrade under non-i.i.d. data and client size imbalance. We propose a geometry-aware approach based on natural gradient variational inference. Clients communicate in the expectation parameter space of an exponential family, where simple linear mixing yields a forward KL barycenter consensus. The aggregate is the model closest to all client distributions, aligning updates across heterogeneous sites and mitigating distribution shift. We further provide a lightweight decentralized Adam implementation, in which each client maintains a diagonal-Gaussian posterior and both updates and gossips in the expectation space. We prove convergence for convex losses on connected graphs. On CIFAR-100 and a medical image segmentation benchmark, our method substantially outperforms Euclidean-space consensus baselines under severe non-i.i.d. and client-imbalance cases, achieving around 20% accuracy gain on CIFAR-100, while matching the communication budget and improving training stability. All code is included in the supplementary materials and will be publicly released.
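For a Gaussian, the expectation parameters are the first and second moments, so linear mixing in that space amounts to moment matching. The sketch below shows this for a single scalar coordinate of a diagonal-Gaussian posterior. The function names and the mixing weights are assumptions for illustration; the paper's natural-gradient updates, Adam state, and gossip schedule are not reproduced here.

```python
def mix_in_expectation_space(means, variances, weights):
    """Mix 1-D Gaussians by averaging their expectation parameters.

    For N(m, v) the expectation parameters are (E[x], E[x^2]) = (m, m^2 + v).
    Their weighted average defines the Gaussian q that minimizes
    sum_i w_i * KL(p_i || q), i.e. the moment-matched Gaussian.
    """
    total = sum(weights)
    w = [x / total for x in weights]  # normalize to a convex combination
    first = sum(wi * m for wi, m in zip(w, means))
    second = sum(wi * (m * m + v) for wi, m, v in zip(w, means, variances))
    return first, second - first * first  # (mean, variance)


def mix_euclidean(means, variances, weights):
    """Baseline for comparison: average means and variances directly."""
    total = sum(weights)
    w = [x / total for x in weights]
    mean = sum(wi * m for wi, m in zip(w, means))
    var = sum(wi * v for wi, v in zip(w, variances))
    return mean, var
```

With two clients at N(0, 1) and N(2, 1) and equal weights, the expectation-space mix gives mean 1 and variance 2, while the Euclidean average gives mean 1 and variance 1. The extra variance records the disagreement between the clients' means, which plain averaging discards.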
Paperid: 3349,   Poster  
Authors: Takayuki Hara, Yuya Otsuka
Title: Probabilistic Prompt Adaptation for Unified Image Aesthetics and Quality Assessment
Abstract: Recent advances in vision–language foundation models have enabled text-driven evaluation of image aesthetics and visual quality. However, existing models are typically optimized for fixed prompts or specific datasets, limiting their adaptability to diverse evaluation criteria. This paper presents Probabilistic Prompt Adaptation (PPA), a unified probabilistic framework that flexibly predicts aesthetic and quality scores conditioned on arbitrary text prompts. PPA formulates score prediction as a mixture over prompts, dynamically estimating prompt suitability based on both image content and task context. By marginalizing over prompts pre-sampled from a large language model (LLM), it enables annotation-free training using only triplets of task, image, and score. Experiments across multiple IAA and IQA benchmarks demonstrate that PPA achieves consistent and perceptually aligned prompt-based scoring, allowing fine-grained control over evaluation semantics.
Paperid: 3350,   Poster  
Authors: Xiaowan Hu, Jing Yang, Henan Liu, Li Huaqiu, Mai Xu
Title: Self-supervised Dynamic Heterogeneous Degradation Modeling for Unified Zero-Shot Image Restoration
Abstract: Zero-shot image restoration provides a flexible way to handle diverse degradations without task-specific training. However, existing methods typically rely on stacked layers or pre-trained features to enhance degradation expression, while overlooking physically consistent priors. Insufficient degradation prompts impose a heavy training burden and high sampling costs during zero-shot diffusion. Moreover, the fixed inference trajectory often collapses to suboptimal solutions under complex corruptions. We observe that heterogeneous degradations can be reparameterized into a minimal set of physically coherent parameters for compact representation. Based on this insight, we first propose a unified physical zero-shot image restoration (UP-ZeroIR) framework that explicitly models heterogeneous degradations into a homogeneous all-in-one distribution. The distribution can be optimized directly in the latent space, enabling principled solution exploration and effective prompt adaptation. In addition, we introduce a dynamic quality-refinement strategy that adaptively adjusts the diffusion trajectory for robust globally optimal convergence. Extensive experiments demonstrate that our method achieves state-of-the-art performance across both single and mixed degradations. The code will be publicly available soon.
Paperid: 3351,   Poster  
Authors: Zekun Li, wang ning, Tongxin Bai, Changwang Mei, Peisong Wang, Shuang Qiu, Jian Cheng
Title: SparseVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
Abstract: Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction paradigm. However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next scale resolution grows, the computational complexity of attention increases quartically with resolution, causing substantial latency. Prior accelerations often skip high-resolution scales, which speeds up inference but discards high-frequency details and harms image quality. To address these problems, we present SparseVAR, a training-free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross-scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from a sparse decision scale, and construct scale self-similar sparse attention via an efficient index-mapping mechanism, enabling high-efficiency sparse attention computation at large scales. Furthermore, we propose cross-scale local sparse attention and implement an efficient block-wise sparse kernel, which achieves > 5× faster forward speed than FlashAttention. Extensive experiments demonstrate that the proposed SparseVAR can reduce the generation time of an 8B model producing 1024×1024 high-resolution images to the 1 s level, without skipping the last scales. Compared with the VAR baseline accelerated by FlashAttention, our method achieves a 1.57× speed-up while preserving almost all high-frequency details. When combined with existing scale-skipping strategies, SparseVAR attains up to a 2.28× acceleration, while maintaining competitive visual generation quality.
Paperid: 3352,   Poster  
Authors: Ziyan He, Qiudan Zhang, Lin Ma, Xu Wang
Title: SO(3)-Equivariant ViT-Adapter for Data-Efficient Zero-Shot Sim-to-Real Indoor Panoramic Depth Estimation
Abstract: Panoramic depth estimation enables a complete 360° understanding of 3D environments but faces significant challenges in generalizing to real-world scenes. While recent zero-shot depth models like Depth Anything achieve remarkable generalization on perspective images, their performance sharply degrades on panoramas due to projection distortions and the lack of spherical geometric awareness. Moreover, collecting large-scale panoramic RGB-D data is costly, hindering the large-scale training of panoramic foundation models. To address these issues, we propose an SO(3)-Equivariant ViT-Adapter, which transfers the powerful zero-shot capability of the perspective pre-trained ViT to panoramic depth estimation by explicitly incorporating a rotation-equivariant inductive bias. Our adapter introduces an SO(3) deformable cross-attention mechanism to effectively align SO(3)-equivariant features with perspective features, enhancing rotational consistency without modifying the ViT backbone. Trained solely on synthetic panoramas, our framework achieves robust zero-shot sim-to-real performance on real indoor benchmarks, including Matterport3D and Stanford2D3D, demonstrating both data efficiency and strong generalization for panoramic depth estimation.
Paperid: 3353,   Poster  
Authors: Pei Geng, Shanshan Zhang, Jian Yang
Title: CrossHOI: Learning Cross-View Representations for Monocular 3D Human-Object Interaction Reconstruction
Abstract: Reconstructing 3D human-object interaction (HOI) from monocular images is highly challenging, especially when the human and object are mutually occluded. Existing methods primarily rely on single-view inputs, which fundamentally limit their ability to recover occluded regions and accurately estimate contact areas. To address these challenges, we introduce, for the first time, novel-view feature priors to enhance monocular 3D HOI reconstruction. We first design a cross-view generator that learns to infer novel-view image features from a single-view input, enriching spatial geometry at the feature level without requiring extra inputs during inference. Guided by both real and generated view features, a spatial cross-view feature fusion module adaptively aggregates complementary cues to enhance the initial reconstruction of human and object meshes. Built upon this reconstruction, we sample 3D vertex features from both views and introduce a bidirectional cross-view Transformer to integrate multi-view vertex representations for accurate contact estimation. Finally, the predicted contact maps are leveraged to refine human-object meshes, yielding geometrically consistent and physically plausible reconstructions. Experiments on BEHAVE and InterCap show that our proposed CrossHOI surpasses state-of-the-art methods in both reconstruction accuracy and contact prediction, especially under severe occlusions.
Paperid: 3354,   Poster  
Authors: Daikun Liu, Teng Wang, Changyin Sun
Title: Depth Hypothesis Guided Iterative Refinement for Event–Image Monocular Depth Estimation
Abstract: Event cameras hold excellent dynamic properties, showing great potential for monocular depth estimation (MDE). However, existing methods mainly improve performance by optimizing contextual features, but still struggle with the ill-posed and nonlinear nature of direct full-depth regression. In this paper, we propose HypoDepth, the first event–image monocular depth iterative refinement framework. By introducing a discrete Depth Hypothesis Volume (DHV), we transform the depth regression problem into a constrained depth search task. Specifically, we construct a 3D cost volume between the DHV features and contextual features and perform a multi-scale correlation search to guide stable residual optimization. This lightweight cost volume enables efficient global-to-local refinement across multiple resolutions. Our method outperforms existing approaches on DSEC and MVSEC with state-of-the-art results and strong zero-shot generalization. Meanwhile, our tiny model achieves an excellent balance between accuracy and efficiency, enabling real-time performance on resource-limited devices.
Paperid: 3355,   Poster  
Authors: Rui Wu, Shuo Zhang, Xiaoxuan Tang, Ruirui Zhang, Yi Liu, Tao Jiang, Wenhao Xu, Yong Li
Title: ReFAct: Empowering Multimodal Web Agents with Visual and Context Focusing
Abstract: Multimodal Web Agents demonstrate a practically valuable capability by fusing information from diverse modalities (e.g., text and vision), retrieved iteratively from the internet, to respond to complex user queries. However, the visual modality is prone to information overload, and the noise contained within it, such as irrelevant background details or complex structures, can disrupt the model's attention, misdirecting its operational focus toward an erroneous path. To address this challenge, we propose ReFAct (Reasoning, Focusing, and Acting), a novel framework that empowers the agent to actively manage its cross-modal context. This allows the agent to adjust its operational focus, thereby mitigating the impact of noise on multimodal Web Agents. Specifically, ReFAct employs a Grounding tool for active visual perception to dynamically filter information. We also design external memory-based Defocus/Refocus operations for selective information retention, further modulating information density within the multimodal context. Ultimately, this ensures the agent maintains focus during problem-solving. To evaluate and enhance agent capabilities in complex and noisy multimodal contexts, we first propose a pipeline for constructing datasets with flexible complexity. We introduce a new open-source benchmark: GroundedVQA. Finally, we experimentally demonstrate the effectiveness of our proposed method on GroundedVQA and other widely-used benchmarks.
Paperid: 3356,   Poster  
Authors: Mohammad Esfeh Esfeh, Qi Yan, Yongxing Zhang, Zahra Gholami, Renjie Liao, Purang Abolmaesumi
Title: Spectral Conformal Risk Control: Distribution-Free Tail Guarantees via Bayesian Quadrature
Abstract: Modern vision systems are deployed in settings where occasional catastrophic failures matter more than average accuracy, for example in medical imaging, autonomous driving, and safety monitoring. While conformal prediction gives distribution-free uncertainty guarantees, most existing methods only control mean error and are hard to tune toward rare but high-cost mistakes. We propose Bayesian-Quadrature Spectral Risk Control (BQ-SRC), a general framework for controlling tail-focused risks (such as conditional value at risk (CVaR)-style objectives) in a distribution-free way. BQ-SRC views conformal prediction through a Bayesian-quadrature lens and replaces mean-risk control with a flexible family of risk-averse criteria, while keeping the same black-box access to a trained model. A binomial testing scheme reduces the Monte Carlo conservatism of prior approaches, leading to tighter sets without sacrificing guarantees. We evaluate BQ-SRC across diverse vision tasks, including synthetic regression, closed-set and zero-shot image classification, multilabel classification, and semantic segmentation. Across these settings, BQ-SRC consistently maintains finite-sample risk guarantees and often yields smaller or otherwise more informative prediction sets than existing conformal and risk-controlling baselines, sometimes trading a modest amount of efficiency for stronger tail-risk control. We will make our implementation publicly available upon acceptance.
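To make the contrast between mean risk and a tail-focused risk concrete, the sketch below computes one common empirical CVaR estimator: the mean of the worst (1 - alpha) fraction of observed losses. The function name `empirical_cvar` and the rounding rule are assumptions for illustration; this is not the BQ-SRC procedure, and it carries no distribution-free guarantee by itself.

```python
import math


def empirical_cvar(losses, alpha):
    """Mean of the worst ceil((1 - alpha) * n) losses.

    alpha = 0 recovers the ordinary mean; alpha close to 1 focuses on
    the few largest losses.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(losses, reverse=True)  # largest losses first
    k = max(1, math.ceil((1.0 - alpha) * len(ordered)))
    return sum(ordered[:k]) / k
```

For nine losses of 0.1 and one loss of 1.0, the mean is 0.19 while `empirical_cvar(losses, 0.9)` is 1.0: a model can look acceptable on average and still have a severe tail.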
Paperid: 3357,   Poster  
Authors: Seemandhar Jain, Keshav Gupta, Kunal Gupta, Manmohan Chandraker
Title: NERFIFY: Multi Agent Framework for Turning NeRF Papers into code
Abstract: The proliferation of neural radiance field (NeRF) research requires significant efforts to reimplement papers before building upon them. We introduce NERFIFY, a multi-agent framework that reliably converts NeRF research papers into trainable Nerfstudio plugins, in contrast to generic paper-to-code methods and frontier models like GPT-5 that usually fail to produce runnable code. NERFIFY achieves domain-specific executability through six key innovations: (1) Context-free grammar (CFG): LLM synthesis is constrained by Nerfstudio formalized as a CFG, ensuring generated code satisfies architectural invariants. (2) Graph-of-Thought code synthesis: Specialized multi-file-agents generate repositories in topological dependency order, validating contracts and errors at each node. (3) Compositional citation recovery: Agents automatically retrieve and integrate components (samplers, encoders, proposal networks) from citation graphs of references. (4) Visual feedback: Artifacts are diagnosed through PSNR-minima ROI analysis, cross-view geometric validation, and VLM-guided patching to iteratively improve quality. (5) Knowledge enhancement: Beyond reproduction, methods can be improved with novel regularizers or architectural optimizations. (6) Benchmarking: An evaluation framework is designed for NeRF paper-to-code synthesis across 30 diverse papers. On papers without public implementations, NERFIFY achieves visual quality matching expert human code (±0.5 dB PSNR, ±0.2 SSIM) while reducing implementation time from weeks to minutes. NERFIFY demonstrates that a domain-aware design enables code translation for complex vision papers, potentiating accelerated and democratized reproducible research. Code, data and implementations will be publicly released.
Paperid: 3358,   Poster  
Authors: Tianlin Huo, Dongchuan Ran, Ranjie Duan, Yao Zhu, Peilun Du, ningbo yao, Huanqian Yan, Xu Han, Qiang Yun, Yuzheng Tan, baoyang baoyang, Yuan He
Title: Phantom: Physical Object Interactions as Dynamic Triggers for NMS-Exploited Backdoors
Abstract: Backdoor attacks pose potential threats to object detection models, highlighting the importance of studying their security. However, existing backdoor attacks mainly rely on trigger-specific intrinsic features, which limits their practicality in real-world scenarios. In this paper, we propose a novel backdoor attack that leverages dynamic object interactions in realistic scenarios to activate malicious behavior. By hijacking the Non-Maximum Suppression (NMS) process in object detectors, this attack demonstrates robust effectiveness, including misclassification, mislocalization, and object appearance/disappearance, while maintaining the model’s normal performance on clean inputs. Experimental results demonstrate that our attack exhibits significant attack performance across various object detectors and datasets, and remains effective both in physical environments and under existing defense mechanisms. These findings highlight the urgent need to develop efficient and robust defense strategies against backdoor attacks.
Paperid: 3359,   Poster  
Authors: Boya Shi, Naiyang Guan, Yi Xiaodong
Title: Real-Time Dynamic Scene Rendering with Controlled Compressibility and Contact Awareness
Abstract: Existing dynamic scene rendering methods often adopt rigid-body or direction-limited assumptions, yet real-world motion and contact routinely violate these, producing artifacts near occlusion boundaries. To address this, we introduce a unified, source-aware framework for dynamic rendering that enforces the consistency of Gaussian primitives under explicit manifold constraints. We project predicted velocities onto physically grounded priors via efficient, parallel inner solves: (i) a Helmholtz parameterization that separates divergence-free and potential-flow motion components; (ii) an anisotropic, compressible directional prior; and (iii) an affine family that disentangles rotation from isotropic scaling. Experiments on extensive benchmarks show consistent improvements over state-of-the-art methods in reconstruction fidelity and temporal coherence. Our approach ensures physically realistic rendering, especially near contacts, and substantially reduces motion-boundary artifacts.
Paperid: 3360,   Poster  
Authors: Hao Vo, Khoa Vo, Tran Phan Phan, Cuong Ngo, Gianfranco Doretto, Hien Van Nguyen, Anh Nguyen, Ngan Le
Title: SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection
Abstract: Camera-only 3D object detection has emerged as a cost-effective and scalable alternative to LiDAR for autonomous driving, yet existing methods primarily prioritize overall performance while overlooking the severe long-tail imbalance inherent in real-world datasets. In practice, many rare but safety-critical categories such as children, strollers, or emergency vehicles are heavily underrepresented, leading to biased learning and degraded performance. This challenge is further exacerbated by pronounced inter-class ambiguity (e.g., visually similar subclasses) and substantial intra-class diversity (e.g., objects varying widely in appearance, scale, pose, or context), which together hinder reliable long-tail recognition. In this work, we introduce SemLT3D, a Semantic-Guided Expert Distillation framework designed to enrich the representation space for underrepresented classes through semantic priors. SemLT3D consists of: (1) a language-guided mixture-of-experts module that routes 3D queries to specialized experts according to their semantic affinity, enabling the model to better disentangle confusing classes and specialize on tail distributions; and (2) a semantic projection distillation pipeline that aligns 3D queries with CLIP-informed 2D semantics, producing more coherent and discriminative features across diverse visual manifestations. Although motivated by long-tail imbalance, the semantically structured learning in SemLT3D also improves robustness under broader appearance variations and challenging corner cases, offering a principled step toward more reliable camera-only 3D perception.
Paperid: 3361,   Poster  
Authors: Menghao Zhang, Yiyan Zhu, Pengfei Ren, Haifeng Sun, Qi Qi, Zirui Zhuang, Huazheng Wang, Lei Zhang, Jianxin Liao, Jingyu Wang
Title: Fine-VAD: Towards Fine-Grained Video Anomaly Detection via Progressive Cross-Granularity Learning
Abstract: In this paper, we explore video anomaly detection (VAD) from a fine-grained perspective, which aims not only to detect anomalous events but also to identify their specific categories. Due to the limited number of examples per category, existing methods either fail to handle intra-class variation across diverse contexts or struggle with inter-class confusion caused by shared visual primitives. To address these challenges, we propose a progressive cross-granularity learning paradigm that leverages coarse- and fine-grained labels in a complementary manner to progressively refine representations from generic anomaly patterns to category-specific semantics. Building on this paradigm, we develop Fine-VAD, a progressive alignment framework that aligns video features with supervision signals at multiple granularities. Extensive experiments on two benchmark datasets demonstrate that Fine-VAD achieves up to a 48% improvement in fine-grained anomaly classification, while maintaining state-of-the-art performance in coarse-grained anomaly detection. Notably, our paradigm generalizes well across diverse model architectures, offering an adaptable and effective solution for real-world fine-grained VAD.
Paperid: 3362,   Poster  
Authors: Yue Wu, Feng Xiao, Yongzhe Yuan, Hao Li, Kaiyuan Feng, Maoguo Gong, Qiguang Miao, Wenping Ma
Title: MHopReg: Efficient Hierarchical Multi-Hop Graph Search for Point Cloud Registration
Abstract: Outlier rejection for correspondence-based point cloud registration confronts two fundamental challenges in real-world scenarios. First, low-overlap regions yield sparse and fragmented inlier distributions that are difficult to discover using conventional one-step global search strategies. Second, large-scale scenes present dense correspondence inputs that impose stringent requirements on the accuracy-efficiency trade-off of search algorithms. To this end, we propose a hierarchical multi-hop graph search framework that progressively refines correspondences to address these challenges. Our method constructs a compatibility graph with transformation-invariant embeddings to predict correspondence confidence, establishing the foundation for cluster-balanced seed sampling that ensures comprehensive coverage across fragmented regions. These strategically selected seeds subsequently drive hierarchical multi-hop expansion, progressively discovering inliers through multi-resolution graph layers while circumventing the high complexity of exhaustive global search. Finally, distribution-aware ranking jointly evaluates geometric consistency and spatial coverage to select well-distributed transformations from multiple hypotheses. Experiments on 3DMatch, 3DLoMatch, and KITTI demonstrate that our method significantly outperforms state-of-the-art methods in both low-overlap and large-scale scenarios.
Paperid: 3363,   Poster  
Authors: Jiayuan Chen, Ruoqi Liu, Zishan Gu, Ping Zhang
Title: Intervention-Aware Multiscale Representation Learning from Imaging Phenomics and Perturbation Transcriptomics
Abstract: Microscopy-based phenotypic profiling is scalable for drug discovery but lacks the mechanistic depth of transcriptomics, which remains costly and scarce. Existing multimodal approaches either use images to support other modalities or naively align representations by sample identity, ignoring cell-type and dose variations in weakly paired data, which limits generalization to unseen interventions. In this paper, we introduce an intervention-aware distillation framework that leverages perturbational transcriptomics to guide image representation learning. A transcriptome-conditioned teacher integrates gene expression and intervention metadata to produce soft distributions over a chemistry-aware codebook organized by drug similarity. The teacher employs a fine-tuned single-cell foundation model to encode cell-type context and disentangle dose effects. An image-only student learns to predict these distributions from microscopy alone, distilling mechanistic knowledge while operating independently at test time. This design emphasizes intervention semantics rather than identity alignment and explicitly handles dose and cell-type mismatches. We provide theoretical guarantees showing that transcriptomic guidance tightens the risk bound for image-based prediction. On Cell Painting and RxRx datasets paired with L1000, our method significantly improves one-shot transfer to unseen interventions and drug-target gene discovery compared to self-supervised and alignment baselines.
Paperid: 3364,   Poster  
Authors: Cheng Fang, Zimu Zhou, Ke Ma, Bin Guo
Title: TaskIT: Memory-Efficient Fine-Tuning of Multi-LoRA LLMs via Cross-Task Importance Transfer
Abstract: On-device AI systems increasingly adopt a single foundation model equipped with task-specific Low-Rank Adaptation (LoRA) modules, forming a multi-LoRA LLM that supports multiple tasks. We study how to adapt such a model to a new task on memory-constrained devices. Although LoRA reduces trainable parameters, fine-tuning a full set of modules remains memory-intensive. To improve efficiency, we apply sparse updating, training a subset of LoRA modules within the memory budget. However, existing sparse updating methods assume all candidate parameters are instantiated and cannot estimate the importance of modules that do not yet exist, while prior memory models designed for sequential networks fail to capture the blockwise parallel structure of Transformers. We propose TaskIT, a framework for memory-efficient fine-tuning via cross-task importance transfer. TaskIT predicts pre-insertion module importance by transferring from previously tuned tasks and employs a block-based memory predictor that captures parallel and sequential dependencies of Transformer blocks. A dynamic programming scheduler then selects module locations, numbers, and ranks to maximize accuracy within the memory budget. Experiments on uni-modal and cross-modal benchmarks show that TaskIT achieves superior accuracy-memory tradeoffs compared with Zero-FT, non-LoRA, and LoRA-based fine-tuning methods.
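Illustrative sketch (not from the paper): selecting modules and ranks under a memory budget can be pictured as a multiple-choice knapsack solved by dynamic programming. Everything below is an assumption made for illustration: the candidate tuples, the importance scores, the integer memory costs, and the rule that at most one rank is chosen per location. The abstract does not specify the authors' actual scheduler, so this should not be read as their algorithm.

```python
from typing import Dict, List, Tuple

# Hypothetical candidate: (location, rank, predicted_importance, memory_cost)
Candidate = Tuple[str, int, float, int]


def select_modules(
    candidates: List[Candidate], budget: int
) -> Tuple[float, List[Tuple[str, int]]]:
    """Pick at most one rank per location so that total memory <= budget
    and the summed (hypothetical) importance is maximal."""
    # Group the rank options by insertion location.
    groups: Dict[str, List[Candidate]] = {}
    for cand in candidates:
        groups.setdefault(cand[0], []).append(cand)

    # best[m] = (best total importance, chosen modules) with at most m memory units.
    best: List[Tuple[float, List[Tuple[str, int]]]] = [(0.0, [])] * (budget + 1)
    for location, options in groups.items():
        new_best = list(best)  # default: skip this location entirely
        for m in range(budget + 1):
            for _, rank, score, cost in options:
                if cost <= m:
                    prev_score, prev_choice = best[m - cost]
                    if prev_score + score > new_best[m][0]:
                        new_best[m] = (
                            prev_score + score,
                            prev_choice + [(location, rank)],
                        )
        best = new_best
    return best[budget]


# Toy usage with made-up numbers.
toy = [
    ("blk0", 4, 1.0, 2),
    ("blk0", 8, 1.5, 4),
    ("blk1", 4, 2.0, 3),
    ("blk2", 4, 0.5, 1),
]
score, chosen = select_modules(toy, budget=5)
```

In this toy setting the low-rank module at `blk0` plus the module at `blk1` fills the budget exactly and beats spending most of the budget on the higher rank at `blk0`.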
Paperid: 3365,   Poster  
Authors: Hao Zou, Runqing Zhang, Jin Ding, xue zhou, Jianxiao Zou, Mingzhu Cai
Title: Tackling Alignment Ambiguity in Person Retrieval through Conversational Attribute Mining
Abstract: Text-to-Image Person Retrieval (TIPR) aims to retrieve pedestrian images with a given natural-language description. It remains highly challenging due to the inherent ambiguity in cross-modal alignment: existing models often struggle to capture fine-grained correspondences, and their understanding of detailed pedestrian attributes is typically confined to partial or coarse cues, leading to mismatched or erroneous retrieval results. To overcome this challenge, we propose CECA, a Conversation-Enhanced Cross-modal Alignment framework. CECA strengthens the attribute correspondence between textual and visual modalities through dialogue guided by multimodal large language models (MLLMs), enhances detailed cross-modal matching via a Bidirectional Correlation Matching (BCM) mechanism, and stabilizes optimization with a Confidence-Aware Weighting Loss (CAWL) that reduces the impact of low-quality conversational responses. Extensive experiments on three public benchmarks demonstrate the superior performance and strong generalization ability of our approach.
Paperid: 3366,   Poster  
Authors: Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, Kai Qiu, Chong Luo, Zuxuan Wu
Title: FlashPortrait: 6× Faster Infinite Portrait Animation with Adaptive Latent Prediction
Abstract: Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6× acceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6× speed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.
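Illustrative sketch (not from the paper): weighted blending of overlapping windows can be pictured on a 1D sequence as below. The fixed stride, the triangular weights, and the function name `blend_windows` are assumptions made for illustration. The abstract describes a dynamic sliding-window scheme over diffusion latents, and it does not state the weighting actually used.

```python
from typing import List


def blend_windows(windows: List[List[float]], stride: int) -> List[float]:
    """Merge equal-length overlapping windows into one sequence.

    Each position is a weighted average of every window covering it; the
    weights ramp up toward the window centre (triangular), so a window
    contributes less near its edges, which smooths the transitions.
    """
    length = len(windows[0])
    total = stride * (len(windows) - 1) + length
    acc = [0.0] * total
    weight_sum = [0.0] * total
    for w_idx, window in enumerate(windows):
        start = w_idx * stride
        for i, value in enumerate(window):
            weight = min(i + 1, length - i)  # 1, 2, ..., 2, 1
            acc[start + i] += weight * value
            weight_sum[start + i] += weight
    return [a / s for a, s in zip(acc, weight_sum)]


# Two windows of length 4 overlapping by 2 frames (toy values).
merged = blend_windows([[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]], stride=2)
```

In the overlap the output moves gradually from the first window's values toward the second's instead of jumping at a hard boundary.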
Paperid: 3367,   Poster  
Authors: Zhichao Zeng, Jiasheng Zhang, Jiyun Sun, Jiangtao Cui, Xiaotian Qiao
Title: LayoutAD: Exploring Semantic-Geometric Misalignment Reasoning for Scene Layout Anomaly Detection
Abstract: Visual anomaly detection is vital for quality control applications, as it identifies deviations from normal patterns. Previous structural or logical anomaly detection methods mainly focus on pixel-level deviations like texture defects and reconstruction errors, ignoring object-level structural and contextual inconsistencies. These overlooked layout anomalies remain critical yet underexplored, e.g., the factually defective hallucinations that appear in generative text-to-image models. Based on this observation, we introduce scene layout anomaly detection, a new task that predicts an object-level anomaly map from the input image to reveal the semantic plausibility and geometric consistency of each object in the scene. Specifically, we propose LayoutAD, an unsupervised learning framework that constructs semantic and geometric graphs to jointly reason over semantic-geometric misalignment among objects. Under this formulation, we are able to detect diverse layout deviations, including object attribute implausibilities and relationship mismatches. Extensive experiments show that LayoutAD outperforms baselines qualitatively and quantitatively across scenarios, benefiting scene understanding and generation applications, including self-corrected image generation and video anomaly detection.
Paperid: 3368,   Poster  
Authors: Choo Sin Wai, Bo Li
Title: Scalable Feature Matching via State Space Modeling and Sparse Correlation
Abstract: Efficient and robust feature matching is crucial for latency-sensitive and resource-constrained applications. However, current semi-dense feature matching approaches commonly suffer from quadratic complexity in spatial resolution due to transformer-based long-range context modeling or redundant full correlation computations. To overcome these limitations, we present a novel scalable feature matching method that delivers reliable correspondences with low memory footprint and latency, especially at high resolutions. Our approach introduces three key innovations: (1) a hybrid Conv-Mamba backbone for efficient cross-scale and cross-view feature extraction with linear complexity, (2) a training-free norm-based feature filtering mechanism, enabling sparse correlation that significantly reduces computation overhead during inference, and (3) a lightweight recurrent coordinate refinement that surpasses expectation-based regression in subpixel accuracy. Experimental results demonstrate our method's superior accuracy and efficiency over state-of-the-art (SOTA) approaches on both indoor and outdoor datasets. Notably, in resolution scaling tests, our method achieves 45% lower memory usage and 2.4× faster inference than JamMa, while also outperforming Efficient LoFTR with 57% memory reduction and 1.8× speedup at high resolution. These results demonstrate the strong scalability and practical efficiency of our method.
Paperid: 3369,   Poster  
Authors: Qiang He, Yaozong Yang, KAIBIN WANG, Ziteng Wei, Feifei Chen, Caslon Chua, Yun Yang
Title: LoPrune: Efficient Data Pruning for LoRA-based Fine-Tuning of Vision Transformers
Abstract: Visual models are deployed on many Internet-of-Things (IoT) devices to power a variety of visual applications at the network edge. These models often need to be fine-tuned on-device continually to adapt promptly to changing operating environments. However, the computing and energy overheads incurred are often overwhelming for resource-constrained IoT devices. Existing methods score sample importance based on the entire model via multi-epoch training, incurring overhead that may even exceed the training overhead reduction. To reduce these fine-tuning overheads, this paper presents LoPrune, a novel data pruning method that identifies and removes samples with negligible contributions to model adaptation. The key idea is to evaluate sample importance via a Trainable Subspace Alignment (TSA) Score to align the importance estimation with accurate update directions of the learnable adapter, i.e., Low-Rank Adaptation (LoRA). Specifically, LoPrune projects the influence function onto the LoRA subspace, enforcing consistency between the importance score and the model’s updatable directions while substantially reducing the problem’s dimensionality. It then leverages Kronecker-Factored Approximate Curvature to approximate the change of the learnable adapter induced by a sample as its TSA score, retaining higher-scoring samples. Experiments with four representative visual models fine-tuned on three datasets demonstrate that, compared with the best state-of-the-art data pruning baselines, LoPrune can reduce fine-tuning overhead by up to 72.9%, achieving a 3.69× training speedup while improving fine-tuning accuracy by 3.50%.
Paperid: 3370,   Poster  
Authors: Zejian Li, Jiarui Ma, Han Xu, Weiting Zheng, Yangrui Zhu, Chenye Meng, Pei Chen, Ling Yang, Zhiyuan Yang, Changyuan Yang, Guang Yang, Immanuel Koh, Lingyun Sun
Title: Circular-DPO: Aligning Multi-Stage 3D Generative Models via Preference Feedback Loop
Abstract: Multi-stage generative models have shown great promise in 3D content creation due to focused generation of structure or texture in different stages, but their outputs often fail to align with human preferences. The key bottleneck in applying alignment methods is the presence of non-differentiable operations between generative stages. This disconnection stops preference signals applied to the final output from being backpropagated to the crucial, early stages of generation, while simple separated stage-wise alignment leads to texture-geometry inconsistency. To address this challenge, we introduce Circular-DPO, which builds a preference feedback loop to align multi-stage 3D generation models to human preference. Our method first applies Direct Preference Optimization (DPO) to refine the final 3D asset. We then construct new preference pairs by sampling and decoding the assets generated by the optimized model. These newly formed pairs are used to train the preceding generative stage, effectively creating a feedback loop that bridges the non-differentiable gap. Furthermore, to enhance robustness against noisy data, we introduce a quality-aware weighting mechanism that prioritizes reliable preference pairs during training. Experiments demonstrate that our approach improves the alignment of generated 3D content with human preferences by enabling holistic, multi-stage optimization.
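Illustrative sketch (not from the paper): the standard DPO objective, as originally formulated for language models, scores one preference pair from the log-likelihoods that the trained policy and a frozen reference model assign to the preferred and rejected samples. The `weight` argument below is a hypothetical stand-in for a per-pair reliability weight. The abstract does not say how the quality-aware weighting is computed, nor how the loss is adapted to 3D generative stages.

```python
import math


def dpo_loss(
    policy_logp_win: float,
    policy_logp_lose: float,
    ref_logp_win: float,
    ref_logp_lose: float,
    beta: float = 0.1,
    weight: float = 1.0,
) -> float:
    """Standard DPO loss for a single preference pair.

    loss = -log sigmoid(beta * [(log pi(w) - log ref(w)) - (log pi(l) - log ref(l))])
    `weight` is a hypothetical per-pair reliability weight (assumption).
    """
    margin = beta * (
        (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    )
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return weight * loss


# Toy log-likelihoods: the policy already prefers the winning sample.
example = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls as the policy raises the preferred sample's likelihood relative to the reference and lowers the rejected one's; a smaller `weight` reduces the influence of a pair judged unreliable.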
Paperid: 3371,   Poster  
Authors: Di Yang, Mahmoud Ahmed Mohamed ALI, Xuanlong Yu, Xi Shen, Quan Kong, Gianpiero Francesca, Francois Bremond
Title: MoVie: Broaden Your Views with Human Motion for Action Detection
Abstract: Human action detection in videos requires both semantic recognition and accurate modeling of motion. While recent video foundation models have advanced visual semantics, they still struggle to capture complex and compositional actions due to the limited representation ability of motion. Human skeleton sequences, which explicitly describe the body structure and movement, provide valuable physical and geometric motions that complement RGB videos. However, combining video and skeleton modalities faces two key challenges: (i) label-driven skeleton features are too coarse to describe fine-grained motion, and (ii) skeleton motion and RGB video lie in heterogeneous feature spaces, so current fusion strategies often cause feature interference. To address these, we propose MoVie, a unified Motion-Video processing framework that uses structured human motion as a bridge between the two signals. We first propose a Structural Motion Projection module that decomposes motion into primitive components using a learnable motion dictionary, to produce fine-grained descriptors. Then, we design a Motion-guided Feature Regularization mechanism that aligns visual features with motion through an orthogonality-based transformation, so that fine-grained motion cues can guide visual representations without collapsing semantic diversity. Extensive evaluations on Toyota Smarthome Untrimmed, Charades, Multi-THUMOS and PKU-MMD datasets demonstrate that MoVie significantly improves state-of-the-art action detection performance.
Paperid: 3372,   Poster  
Authors: Jiayang Wu, Xinyang Chen, Ke Lv, Weili Guan
Title: Boosting Visual Reprogramming for Vision-Language Models with Dual Granularity Alignment
Abstract: Model reprogramming adapts pretrained models to downstream tasks by modifying their input and output spaces. Visual reprogramming (VR), a prominent instance, introduces learnable input transformations (e.g., visual prompts) to repurpose vision-language models like CLIP for downstream visual tasks. Existing VR methods primarily focus on single-level alignment between prompted images and text descriptions, overlooking inherent structural information in data that facilitates alignment: semantic granularity (label hierarchies) and visual granularity (multi-scale representations). To address this gap, we propose Dual Granularity Alignment (DGA). First, for visual granularity, we generate multi-scale images and propose Uncertainty-calibrated Prediction Fusion (UPF) to capture hierarchical spatial information within images. Second, for semantic granularity, we propose Prototype-guided Label Hierarchization (PLF) to construct category hierarchies from visual semantic similarities and propose Hierarchical Knowledge Propagation (HKP) to achieve top-down superclass-to-subclass guidance for coherent multi-level visual prompt alignment. DGA integrates both granularities to enhance alignment effectiveness. Experiments across 12 downstream datasets demonstrate DGA's superiority over baselines on both ViT-based and ResNet-based CLIP architectures. Specifically, DGA achieves a 4.5% improvement over the previous state-of-the-art method on ViT-16-based CLIP. By explicitly modeling structural granularities, DGA establishes a new paradigm for visual reprogramming.
Paperid: 3373,   Poster  
Authors: Jianibieke Adalibieke, Qianwei Han, Xueyi Liu, Yuzhe Qin, Li Yi
Title: AdaDexTrack: Dynamic Modulation for Adaptive and Generalizable Dexterous Manipulation Tracking
Abstract: Language is a natural way to command robots, but converting a single instruction into a long-horizon, contact-rich hand–object interaction remains challenging: synthesized references are noisy, human-to-robot retargeting introduces embodiment bias, and fixed-reference tracking lets small errors snowball. We address this with AdaDexTrack, a modulator-in-the-loop framework for language-conditioned manipulation tracking. A distilled generalist tracker serves as the skill carrier, while a tightly aligned modulator performs three feedback corrections: reference modulation (continual adjustment of what to track), object-latent modulation (online adaptation of the object representation to recruit suitable skills), and positional-target modulation (small state-dependent refinements for execution). The tracker is learned via large-scale specialist-to-generalist distillation on a corpus of language-conditioned hand–object trajectories; the modulator is trained with RL under the same task objective, ensuring tight coupling. Across large-scale evaluations, AdaDexTrack consistently outperforms prior SOTA on unseen-trajectory and unseen-object sets in both average tracking error and success rate, demonstrating robustness and generalization. We further show zero-shot sim-to-real transfer on real hardware, where adding the modulator yields substantial gains over a tracker-only variant. AdaDexTrack reframes language-conditioned dexterous manipulation as modulated tracking, replacing the open-loop, fixed-reference tracking with in-loop modulation that adjusts the reference, object latent, and positional target, yielding drift-resistant execution from noisy text references.
Paperid: 3374,   Poster  
Authors: Zhenwu Shi, Jingyu Gong, Peiwei Wang, Xingzan Wang, Tianwen Qian, Wenxi Li, Yuan Fang, Jiao Xie, Lizhuang Ma, Shaohui Lin
Title: Omni-Supervised Motion Editing: Balancing Change and Invariance through Positive-Negative Learning
Abstract: Text-based human motion editing aims to modify existing motion sequences according to natural language instructions while maintaining the consistency of the original motion. Existing diffusion-based approaches often rely on heuristic similarity cues or coarse global conditioning, leading to motion distortion and suboptimal semantic alignment. The key challenge lies in balancing change (i.e., precisely editing target regions) and invariance (i.e., preserving unedited parts). To address this challenge, we propose an Omni-Supervised Positive-Negative Learning framework, named OmniME. Our method integrates three complementary components: (1) retrospective feature supervision that enforces coarse-to-fine consistency across transformer layers, (2) a motion preservation mechanism that focuses on subtle variations according to the source-target similarity, and (3) triplet-based semantic alignment that strengthens text-motion correspondence. Together, these components form a unified supervision paradigm that balances change and invariance. Extensive experiments on the MotionFix and STANCE Adjustment datasets demonstrate that OmniME achieves state-of-the-art performance in editing alignment, validating the effectiveness of our unified learning framework. The code will be made publicly available upon acceptance.
Paperid: 3375,   Poster  
Authors: Najibul Haque Sarker, Zaber Ibn Abdul Hakim, Ali Asgarov, Chia-Wei Tang, Alvi Md Ishmam, Chris Thomas
Title: Immunizing Models Against Harmful Long-Horizon Fine-Tuning via Contractive Optimization Dynamics
Abstract: Fine-tuning has become the default way to adapt powerful foundation models, but this also enables low-cost repurposing for harmful objectives. Existing immunization methods try to optimize local geometry or simulate short attacker horizons, and penalize observed loss drops. However, in practice, downstream tuners run thousands of updates and overcome these short-horizon defenses. In this paper, we propose CLAMP (Contractive Long-horizon Attacker Mitigation via Progress-bounding), an immunization method that traps harmful fine-tuning by shaping the attacker's optimization dynamics rather than only the initial landscape. Our key idea is to make harmful training locally contractive, making each update smaller than the last. This yields a closed-form bound on the attacker's training beyond the attacker's simulated training steps. We also introduce a Hessian-free directional curvature penalty to create adversarial landscapes along harmful descent directions. Our bi-level objective minimizes the attacker's predicted improvement from training step zero to infinity. Experiments show that our method withstands long-horizon fine-tuning across classification, generative, and autoregressive settings and substantially reduces harmful task adaptation, while preserving benign utility and fine-tuneability.
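Illustrative derivation (not from the paper): one way to see why contraction gives a closed-form bound is the geometric series. Assume, purely for illustration, that the norm of each attacker update shrinks by at least a constant factor. The symbols below are assumptions: Δ_t for the update at step t, ρ for the contraction factor, and T for the last simulated step. The paper's exact bound may take a different form.

```latex
\[
\|\Delta_{t+1}\| \le \rho\,\|\Delta_t\|, \quad 0 \le \rho < 1
\;\Longrightarrow\;
\sum_{t=T}^{\infty} \|\Delta_t\|
\;\le\; \|\Delta_T\| \sum_{k=0}^{\infty} \rho^{k}
\;=\; \frac{\|\Delta_T\|}{1-\rho}.
\]
```

Under this assumption, the total distance the attacker's parameters can travel after step T is finite and is controlled by the last observed update and the contraction factor, however many further steps are run.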
Paperid: 3376,   Poster  
Authors: haowen gu, Gensheng Pei, Zeren Sun, Mingwu Ren, Xiangbo Shu, Yazhou Yao, Fumin Shen
Title: MedFG-VQA: Low-Frequency Memory and Graph Attention for Lightweight Medical VQA
Abstract: Medical Visual Question Answering (MedVQA) holds significant promise for clinical decision support, yet faces challenges due to limited annotated data and the high computational demands of existing large vision-language models. We propose MedFG-VQA, a lightweight framework that leverages a memory bank to augment DCT-based low-frequency features and employs graph-enhanced cross-attention for effective visual-textual alignment. Specifically, our approach features two key components: Frequency-Memory Fusion (FMF), which enhances low-frequency features by retrieving from a learnable memory bank built on DCT decomposition, and Graph-Aware Cross-Attention (GACA), which aligns visual-textual features via cross-attention and refines them through graph-convolutional aggregation. To address data scarcity, we construct SynMed-VQA, a large-scale synthetic dataset comprising over 2 million question-answer pairs across 9 imaging modalities and 10 major organs, generated with GPT-4o. Extensive experiments on SynMed-VQA and three other standard biomedical VQA benchmarks demonstrate that MedFG-VQA achieves competitive or superior performance compared to much larger models while maintaining significantly lower computational costs, highlighting its efficiency and potential for clinical deployment.
Paperid: 3377,   Poster  
Authors: Chaoyu Gong, Han Zhang, Siqiang Luo
Title: DeepfakeImpact: A Two-Stage Benchmark with Real-World Impact in Deepfake Detection
Abstract: A fundamental yet overlooked limitation of current deepfake detection benchmarks is the lack of evaluation frameworks that align technical accuracy with real-world impact. We argue that technical metrics may fail to capture models' actual capacity to mitigate real-world harm, as they treat all errors as equally significant. To bridge this gap, we introduce DeepfakeImpact, a two-stage benchmark that moves beyond pure technical evaluation toward societally-aware assessment. In Stage I, we establish standardized technical baselines by evaluating 33 SOTA detection baselines across 12 widely used datasets. In Stage II, we propose a novel metric (Social Misjudgment Impact, SMI) that quantifies the potential social harm of misclassified videos, and construct an SMI-critical dataset containing high-risk samples. By integrating SMI-aware performance metrics, we shift the evaluation focus from "how accurate" to "how socially beneficial" a detector is. DeepfakeImpact thus provides a more realistic and ethically-grounded foundation for assessing deepfake detectors, urging the community to rethink what truly constitutes progress in this field. All resources will be publicly released at: https://anonymous.4open.science/r/DeepfakeImpact-Stage1-F5EC.
Paperid: 3378,   Poster  
Authors: Brandon Zhao, Diana Scognamiglio, Olivier Doré, Katie Bouman
Title: Generative Diffusion Priors for 3D Mapping of the Dark Universe
Abstract: Reconstructing the three-dimensional distribution of dark matter from weak-lensing observations is a central but highly ill-posed inverse problem in cosmology. Unlike standard 3D reconstruction with multiple viewpoints, we observe the universe from a single line of sight, through noisy shape distortions of galaxies with uncertain distances, so meaningful recovery of the 3D matter field requires strong prior assumptions. Existing methods either produce point estimates with handcrafted priors or use neural ensembles for approximate Bayesian uncertainty, and struggle to capture the non-Gaussian, filamentary structure of the cosmic web. With the advent of new high-resolution cosmological simulations, we now have an alternative source of prior knowledge that captures the nonlinear statistics of structure formation with far greater fidelity than analytic prescriptions. We leverage these simulations to build a new dataset, Conicus3D, which enables us to learn a data-driven diffusion-model prior capturing the full 3D distribution of dark matter structure across cosmic time. Building on recent plug-and-play approaches, we adapt a diffusion-based posterior sampling scheme to the 3D weak-lensing setting, combining the learned prior with a differentiable physical forward model. On realistic simulations targeting a modern weak-lensing survey, our approach yields substantially improved 2D and 3D reconstruction accuracy over baseline methods. Moreover, it produces posterior samples whose statistics closely track the underlying simulations, while remaining robust to moderate shifts in cosmology.
Paperid: 3379,   Poster  
Authors: You Li, Dewei Zhou, Fan Ma, Fu Li, Dongliang He, Yi Yang
Title: FoleyDirector: Directing Temporal Controllable Video-to-Audio Generation via Fine-Grained Temporal Scripts
Abstract: Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded/partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model’s audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, enabling parallel in-frame and out-of-frame audio generation and improving controllability. To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSound-Director and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable generation.
Paperid: 3380,   Poster  
Authors: Hongjie Li, Heng Yu, Jiaman Li, Hong-Xing Yu, Ehsan Adeli, Karen Liu, Jiajun Wu
Title: AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion
Abstract: Reconstructing 3D human motion and human-object interactions (HOI) from Internet videos is a fundamental step toward building large-scale datasets of human behavior. Existing methods struggle to recover globally consistent 3D motion under dynamic cameras, especially for motion types underrepresented in current motion-capture datasets, and face additional difficulty recovering coherent human-object interactions in 3D. We introduce a two-stage framework leveraging 2D diffusion that reconstructs 3D human motion and HOI from Internet videos. In the first stage, we synthesize multi-view 2D motion data for each domain, leveraging 2D keypoints extracted from Internet videos to incorporate human motions that rarely appear in existing MoCap datasets. In the second stage, a camera-conditioned multi-view 2D motion diffusion model is trained on these domain-specific synthetic data to recover 3D human motion and 3D HOI in the world space. We demonstrate the effectiveness of our method on Internet videos featuring challenging motions such as gymnastics, as well as in-the-wild HOI videos, and show that it outperforms prior work in producing realistic human motion and human-object interaction.
Paperid: 3381,   Poster  
Authors: Rui Zhang, Yaqi Wang, Yadong Li, Ruixu Geng, Jianyang Wang, Qijun Ying, Dongheng Zhang, Yang Hu, Yan Chen
Title: VRCLIP: Multimodal Canonical Correlation Alignment for CLIP-Driven Vision-Radio Person Re-Identification
Abstract: Person re-identification (ReID) is critical for public safety, yet the performance of RGB-based methods is limited under challenging lighting and occlusion conditions. In contrast, low-frequency radio frequency (RF) signals, with their superior penetration capability and illumination invariance, provide ideal complementary information. However, a key challenge in fusing these heterogeneous modalities lies in the conventional approach that relies heavily on cross-modal distribution matching, which often over-regularizes and weakens the discriminative capacity within each modality. Rather than enforcing direct distribution alignment, canonical correlation analysis (CCA) constructs a shared subspace that maximizes cross-modal correlation, inherently balancing modality specificity and shared semantics. Inspired by this, we reformulate cross-modal alignment as a correlation maximization problem, avoiding direct constraints on feature distributions and guiding the model to harmonize intra-modal discriminative learning with cross-modal alignment. Specifically, VRCLIP first refines CLIP’s visual encoder with illumination-disentangling objectives, then aligns RGB and RF embeddings in a canonical correlation subspace, and finally employs an RF-anchored reliability gate for adaptive fusion. To advance the area, we will release VRR, the first large-scale vision–radio ReID dataset with over 650K paired image–radar samples and position annotations for 31 participants. Extensive experiments show state-of-the-art 93.9% mAP and robust generalization across diverse lighting and occlusion conditions.
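For reference (textbook formulation, not taken from the paper): classical CCA finds one projection direction per modality so that the projected features are maximally correlated. The notation is an assumption made here: x stands for an RGB embedding, y for an RF embedding, Σ_xx and Σ_yy are their covariance matrices, and Σ_xy is their cross-covariance. How VRCLIP turns this into a trainable objective is not specified in the abstract.

```latex
\[
(a^{\ast}, b^{\ast}) \;=\; \arg\max_{a,\,b}\;
\frac{a^{\top} \Sigma_{xy}\, b}
     {\sqrt{a^{\top} \Sigma_{xx}\, a}\;\sqrt{b^{\top} \Sigma_{yy}\, b}}
\]
```

The objective constrains only the correlation between the projections a^T x and b^T y. It does not force the two feature distributions to match, which is the contrast with distribution matching that the abstract draws.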
Paperid: 3382,   Poster  
Authors: Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, Yisen Wang
Title: Monet: Reasoning in Latent Visual Space Beyond Image and Language
Abstract: Thinking with images has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent–vision alignment and insufficient supervision over latent embeddings. We address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text–image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning.
Paperid: 3383,   Poster  
Authors: Shiman He, Nuo Chen, Xinyi Ying, Yihang Luo, Yangsi Shi, Zaiping Lin, Miao Li
Title: Towards Persistence: Learning Topological Constraints for Event-based Small Object Detection
Abstract: Small object detection (SOD) plays a vital role in applications such as anti-UAV tasks, yet conventional image-based methods struggle in high-speed scenarios due to the limited frame rate. Event cameras offer a promising alternative by capturing spatiotemporal event streams with microsecond-level temporal resolution. To address the inherent sparsity of small objects in event data, existing methods typically formulate the detection task as semantic segmentation on spatiotemporal point clouds to leverage long-term contextual information. However, these methods often fail to enforce effective spatiotemporal consistency constraints, resulting in fragmented object trajectories. To mitigate these problems, we propose a topology-constrained sparse convolutional network (SpTopoNet), which models the topological structure of moving object trajectories in event point clouds. Our network comprises two key components: a Topology Learning Module (TLM) that discriminates local structures to separate genuine targets from noise, and a Spatial Consistency Module (SCM) that captures long-range spatiotemporal dependencies to enhance trajectory continuity. Additionally, we introduce an event topology-aware loss function that leverages topological correlations to guide the network to maintain the structural integrity of target event patterns. Experiments on the benchmark dataset demonstrate the superiority of our method in both detection performance and trajectory completeness.
Paperid: 3384,   Poster  
Authors: Yu Hu, Jianyang Gu, Hao Liu, Yue Cao, Jozsef Hamari, Zheng Liu, Mohsen Zardadi
Title: AVION: Aerial Vision–Language Instruction from Offline Teacher to Prompt-Tuned Network
Abstract: Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly significant in aerial scenes, which involve various visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semantically rich textual prototypes by collecting descriptions from a large language model and verifying validity using remote sensing image features. The student module integrates lightweight and learnable prompts into both vision and language encoders, guided by the teacher to align embeddings and their cross-modal relationships. Once trained, the student operates independently during inference. Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and base-class accuracy without degrading generalization to novel categories. It also enhances mean recall for cross-modal retrieval, with minimal additional trainable parameters.
Paperid: 3385,   Poster  
Authors: Chang Su, Beihong Jin, Qiwen Shi, Zhi Wang
Title: mmWaveFlow: Unified Enhancement and Generation of mmWave Human Point Clouds
Abstract: Millimeter-wave (mmWave) point clouds have attracted growing interest in human sensing due to their robustness, privacy preservation, and low cost. However, their practical use is hindered by the inherent sparsity of the data and the lack of large-scale data. We revisit generative modeling for mmWave point clouds and propose mmWaveFlow, a flow-matching framework that unifies enhancement and generation by learning an invertible transport between dense and sparse point clouds. We leverage paired data and a latent-alignment module to enforce semantic alignment and bridge the modality gap. We find that condition-free flow matching is more vulnerable to latent path crossings, which impair bidirectional transport. Therefore, we propose Origin-Aware Flow Matching (OA-Flow), which conditions the transport on the origin of the path to mitigate ambiguity in bidirectional transport. Experiments across multiple datasets demonstrate the effectiveness of mmWaveFlow for mmWave human point cloud generation and enhancement. We also observe consistent gains in downstream tasks, highlighting the promise of our framework for human sensing. We will release the code.
Paperid: 3386,   Poster  
Authors: Wenhan Lv, Shaopan Wang, Xiangyu Wu, Tianchu Hang, Zhongquan Jian, Qingqiang Wu
Title: Hierarchical Enhancement of Semantic Priors for Disentangled Text-Driven Motion Generation
Abstract: Text-to-motion generation aims to synthesize realistic and semantically aligned 3D human motions from natural language descriptions. However, existing diffusion-based methods often rely on isotropic latent priors and shallow cross-modal supervision, which lead to semantic entanglement, limited controllability, and poor interpretability. We propose HESP, a unified diffusion framework that hierarchically enhances semantic priors for disentangled text-driven motion generation. At its core, HESP introduces an Adaptive Gaussian Variational Autoencoder (AG-VAE) that structures the latent motion manifold into multiple semantically coherent submanifolds, enabling interpretable and controllable motion representations. To further bridge linguistic and kinematic semantics, we design a Dynamic Cross-Modal Memory (DCMM) module for adaptive semantic fusion and a Hierarchical Cross-Modal Attention (HCA) mechanism to capture multi-level text–motion correspondences. Extensive experiments on HumanML3D and KIT-ML demonstrate that HESP consistently outperforms state-of-the-art baselines such as SALAD, MoMask, and MDM while maintaining higher diversity and physical plausibility. Moreover, the structured latent space of HESP provides interpretable clusters that reveal clear semantic boundaries among different motion categories. Our work establishes a new paradigm for text-conditioned human motion generation by integrating hierarchical latent modeling with adaptive cross-modal reasoning, advancing both performance and interpretability.
Paperid: 3387,   Poster  
Authors: Dongxing Mao, Jinpeng Wang, Jiahao Tang, Kevin Qinghong Lin, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li, Jingru Tan
Title: Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering
Abstract: Visual Autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer. Despite demonstrating strong overall image generation ability, they still underperform on text rendering, producing blurred strokes and distorted letter shapes. In this work, we trace this limitation to the visual tokenizer, which struggles to reconstruct fine-grained detail. Improving the tokenizer is straightforward but expensive, as it necessitates retraining both the tokenizer and the AR model. Can we improve the text rendering performance of AR models without retraining the existing tokenizer and AR model? To achieve this, we propose the Residual Decoder Adapter, which upgrades an existing tokenizer post hoc without changing its token space. Specifically, it refines the decoder output of the visual tokenizer by introducing two novel components: (i) a paired codebook that shares the token distribution with the original one; (ii) a parallel branch that learns the small differences (residual) between the reconstructed image and the ground-truth image in pixel space. This residual design allows us to enhance the tokenizer non-invasively while preserving compatibility with prior AR models. The Residual Decoder Adapter improves text rendering by a large margin. For instance, the OCR accuracy of fine-tuned Janus-Pro rises from 24.52% to 58.26% (TextVisionBlend) and from 12.75% to 36.81% (StyledTextSynth) on the competitive TextAtlas benchmark. Code will be released.
Paperid: 3388,   Poster  
Authors: Mu Zhang, Tianren Ma, Yunfan Liu, Kun Hu, Qixiang Ye
Title: RebRL: Reinforcing Discrete Visual Diffusion Models with Rebalanced Timestep Credits
Abstract: Discrete Diffusion Models (DDMs) have shown great potential in image generation, especially when equipped with reinforcement learning (RL) techniques. However, our experiments reveal a fundamental yet overlooked limitation: a severe imbalance of credit assignment across timesteps during training. As a result, early generation timesteps, which carry higher exploration potential and determine the global structure, contribute less to policy optimization. To address this, we propose a simple yet effective approach to Rebalance timestep credit in Reinforcement Learning (RebRL) for a better exploration-exploitation trade-off and more efficient training of DDMs. RebRL is plug-and-play: it simply replaces the uniform temporal policy with strategic rebalancing along masking stages. RebRL is analytically plausible: derivation and analysis show that it enjoys a uniform token-level policy gradient, which benefits policy optimization. Experiments on text-to-image generation benchmarks show that RebRL achieves state-of-the-art performance on GenEval and improves human preference score by up to 3.40 while reducing training steps by ~40%. Code is enclosed in the supplementary material.
Paperid: 3389,   Poster  
Authors: Rui Chen, Jianfeng Zhang, Jing Lin, Xuanyu Yi, Yixun Liang, Guan Luo, Xiu Li, Zeming Li, Ping Tan
Title: ViLearn: Accelerating Training Convergence of Image-to-3D Generation via Visibility Learning
Abstract: Single-image-to-3D shape generation has seen remarkable progress, driven by latent diffusion models trained on the compressed latent space of 3D VAEs. However, the task remains intrinsically ill-posed: recovering complete 3D geometry, especially occluded surfaces, from a single view is inherently ambiguous. Existing VecSet-based approaches further exacerbate this challenge by treating shape tokens as an unordered set without explicit positional encoding. This design forces diffusion models to simultaneously learn visible correspondences from the input image and hallucinate invisible geometry within a large, permutation-invariant token space, where the lack of structure significantly hinders training efficiency and convergence stability. To address this, we propose Visibility Learning, a training paradigm that injects visibility structure and positional inductive bias into the image-to-3D pipeline. Our method comprises two synergistic components: (1) Visibility Grouping (VG), which explicitly partitions VecSet tokens into visible and invisible subsets by exploiting the spatial locality of VecSet VAE decoders; and (2) Visibility-Aware Positional Encoding (VAPE), which assigns shared positional embeddings to image tokens and visible shape tokens to amplify their correspondence, while using distinct encodings for invisible tokens to guide hallucination. By explicitly disentangling visible reconstruction from invisible hallucination, our approach shrinks the effective hypothesis space and provides clear structural guidance for diffusion models. Extensive experiments demonstrate that Visibility Learning accelerates training convergence by up to 4.4× while achieving superior generation quality compared to strong VecSet-based baselines.
Paperid: 3390,   Poster  
Authors: Nan Li, Yike Zeng, Qian Zhang, Qi Zhang, Zhiyi Pan, Wei Feng, Liang Wan
Title: Write Where It Matters: Policy-Guided Watermarks for 3D Gaussian Splatting
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) enable photorealistic real-time rendering but also increase the risks of unauthorized copying and redistribution. Existing 3DGS watermarking methods typically rely on handcrafted thresholds or globally fixed hyperparameters to balance invisibility and robustness, making their embedding behavior static and scene-agnostic. We instead formulate 3DGS watermarking as a goal-directed decision process and introduce Write Where It Matters (W2M), the first reinforcement learning-based framework that adaptively learns where and how much to embed. By modeling the embedding process as a Markov Decision Process, W2M uses a lightweight policy network to iteratively allocate precise Gaussian updates directly from immediate reward feedback. The reward incentivizes both rendering-space invisibility and decoding robustness under various image- and model-level distortions. To achieve efficient control, W2M operates on a structured 3DGS backbone organized around learnable anchors and applies policy-guided per-anchor gradient scaling. Extensive experiments across the Blender, LLFF, and Mip-NeRF 360 datasets demonstrate that W2M achieves state-of-the-art bit accuracy, strong perceptual fidelity, and structural consistency under both standard and adversarial conditions.
Paperid: 3391,   Poster  
Authors: Haobo Jiang, Liang Yu, Jianmin Zheng
Title: GM-R$^2$: Generative Matching Learning for Unsupervised Geometric Representation and Registration
Abstract: This paper proposes GM-R^2, a novel Generative Matching Learning framework for unsupervised geometric descriptor learning and correspondence matching. By reformulating descriptor learning as geometry-conditioned cross-view image generation, GM-R^2 leverages the proxy supervisory signal from structurally aligned view synthesis to implicitly enforce feature consistency across correspondences, enabling robust 3D matching. To instantiate GM-R^2, we introduce a Denoising-Agnostic Coupled ControlNet conditioned on depth maps as the required geometry-conditioned cross-view generator. It extends the single-view generation of the naive ControlNet to the cross-view setting via a coupled depth-map input design and further removes the latent noise dependency to support geometry-only inference (as expected by 3D matching). Moreover, we present Zoomable Equirectangular Projection for intrinsics-free point cloud-to-depth mapping, which adaptively zooms into the angular region occupied by the narrow-FOV input for dense range-map acquisition. Extensive experiments on the 3DMatch and ScanNet datasets verify the superior precision of GM-R^2, which even surpasses supervised methods.
Paperid: 3392,   Poster  
Authors: Chen Wang, Hao Tan, Wang Yifan, Zhiqin Chen, Yuheng Liu, Kalyan Sunkavalli, Sai Bi, Lingjie Liu, Yiwei Hu
Title: tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction
Abstract: We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model’s capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feed-forward 3D Gaussian reconstruction compared to state-of-the-art approaches on both objects and scenes.
Paperid: 3393,   Poster  
Authors: Wenfeng Song, Xuehan Wang, Shuai Li, Yi Chen, Yuting Guo, Zhenyu Wu, Xingliang Jin, Chenglizhao Chen, Fei Hou, Hongyu Wu, Aimin Hao
Title: MoCoDiff: A Controllable Autoregressive Diffusion Model for Expressive Motion Generation
Abstract: Diffusion-based motion generation has advanced rapidly, but current methods still struggle with long-horizon consistency, style control, and multi-condition guidance. A major reason is the fused-conditioning design, where semantic, stylistic, and temporal signals share a single pathway, causing interference and limiting controllability. We propose MoCoDiff, a controllable autoregressive diffusion framework that introduces Injection Modulation Controllers (IMC). IMCs are lightweight, modality-specific linear modulation modules that inject text, style, and history signals through separate conditioning paths. IMC preserves the simplicity of a frozen backbone while avoiding the entanglement inherent to fused conditioning, enabling more stable and interpretable multi-condition control. To further enhance long-range synthesis, we develop a controllable autoregressive diffusion model equipped with Temporal IMC (TIMC), which applies history as a timestep-dependent corrective signal. This controllable formulation actively suppresses drift, enforces smooth transitions across motion segments, and significantly improves temporal coherence over extended sequences. Experiments show that MoCoDiff achieves state-of-the-art style fidelity, transition quality, and efficiency, while supporting flexible and interpretable multi-condition motion synthesis without retraining.
Paperid: 3394,   Poster  
Authors: Eungi Lee, Seung-hyeok Back, Hyung-Il Kim, Seok Bong Yoo
Title: DeepProtect: Proactive Face-Swapping Defense using Identity Blending and Attribute Distortion
Abstract: Face-swapping deepfakes allow realistic identity transfer, which can serve creative purposes but increases the risk of identity abuse. A proactive defense aims to prevent deepfake creation by obstructing identity feature extraction from input images, which is essential for identity-driven face-swapping. Existing proactive defense approaches aim to protect faces by hindering accurate identity feature extraction, but tend to introduce visible artifacts and fail to degrade the visual quality of the face-swapping deepfakes. This work proposes a proactive face-swapping defense using identity blending and attribute distortion (DeepProtect) that integrates global identity fusion in the latent space and local prompt-driven adversarial watermarking to address these problems. This work dilutes distinct identity representations by channel-wise blending of multiple identities in the latent space and optimizing the generator for visual consistency. The proposed approach distorts facial components in the identity space, directly influencing how faces are reconstructed in deepfakes. This approach applies semantic directions derived from user-provided text prompts to embed imperceptible adversarial watermarks that selectively distort facial attributes, affecting the visual fidelity of deepfake results. The proposed method hinders face-swapping deepfakes while preserving the perceptual quality of the protected images, offering a robust and practical solution for facial privacy protection. The experimental results reveal that DeepProtect effectively defends against face-swapping deepfakes while preserving visual consistency.
Paperid: 3395,   Poster  
Authors: Viktoria Ehm, Dongliang Cao, Riccardo Marin, Daniel Scholz, Weikang Wang, Florian Bernard, Daniel Cremers
Title: Teaching DINOv3 About Partial 3D Geometry: A Self-Supervised Geometry-Aware Approach
Abstract: Partial shape matching is a crucial yet underexplored problem in 3D vision, with significant relevance to real-world scenarios where shapes are often only partially observed. Existing feature descriptors face difficulties in this setting, as traditional representations either struggle with the boundaries of partial shapes or heavily depend on the shape's spatial position. While existing approaches have employed DINO features for partial shape matching, these features are not inherently suited for handling partial observations. In this work, we propose a method to refine DINO features using LoRA-based self-supervised learning, enabling the generation of feature descriptors that are robust to partiality. Our features substantially improve performance on partial shape matching compared to traditional or vision foundation features. Additionally, when integrated into existing partial shape matching pipelines, we achieve state-of-the-art results on partial shape matching and left-right prediction benchmarks.
Paperid: 3396,   Poster  
Authors: Sosuke Yamao, Natsuki Miyahara, Yuankai Qi, Shun Takeuchi
Title: Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding
Abstract: In the context of long-term video understanding with large multimodal models, many frameworks have been proposed. Although transformer-based visual compressors and memory-augmented approaches are often used to process long videos, they usually compress each frame independently and therefore fail to achieve strong performance on tasks that require understanding complete events, such as temporal ordering tasks in MLVU and VNBench. This motivates us to rethink the conventional one-way scheme from perception to memory, and instead aim to establish a feedback-driven process in which past visual contexts stored in memory can benefit ongoing perception. To this end, we propose Question-guided Visual Compression with Memory Feedback (QViC-MF), a framework for long-term video understanding. At its core is Question-guided Multimodal Selective Attention (QMSA), which learns to preserve visual information related to the given question from both the current clip and the past related frames in memory. The compressor and memory feedback work iteratively for each clip of the entire video. This simple yet effective design yields large performance gains on long-term video understanding tasks. Extensive experiments on four benchmarks demonstrate that our method improves over current state-of-the-art methods by 6.1% on MLVU-test, 8.3% on LVBench, 18.3% on VNBench Long, and 3.7% on VideoMME Long.
Paperid: 3397,   Poster  
Authors: Xiao Liu, Shiwei Gan, Yafeng Yin, Bowen Guo, Zhiwei Jiang, Shunmei Meng, Lei Xie, Sanglu Lu
Title: SignPR: A Progressive Vector-Quantized Diffusion Framework for Sign Language Production
Abstract: Sign language production aims to generate sign sequences from spoken language, where the generation of sign pose sequences from text is often treated as a significant task. However, due to the differences in grammatical rules and modalities between sign language pose sequences and spoken language text, it is rather challenging to convert text into sign poses (i.e., Text2Pose) while maintaining semantic consistency, motion accuracy, and temporal coherence. In this paper, we focus on the Text2Pose task and propose SignPR, a progressive diffusion framework that jointly models the structural and temporal properties of signing. Structurally, we perform progressive structural refinement: a structural VQ-VAE encodes each frame into semantic-aware and region-based discrete representations; the diffusion process first produces semantically consistent poses and then progressively refines motion details under text and semantic conditioning. Temporally, we introduce block-wise causal diffusion, which progressively enforces temporal coherence and enables iterative refinement of earlier generated segments, yielding smoother transitions and reduced jitter. Extensive experiments on widely used datasets demonstrate that SignPR achieves superior results compared with prior Text2Pose methods across multiple metrics, producing pose sequences that are semantically faithful, motion-accurate, and temporally coherent.
Paperid: 3398,   Poster  
Authors: Jingjing Zheng, Anda Tang, Qiangqiang Mao, Zhouchen Lin, Yankai Cao
Title: ReFTA: Breaking the Weight Reconstruction Bottleneck in Tensorized Parameter-Efficient Fine-Tuning
Abstract: Tensor-based fine-tuning has attracted growing interest for its ability to reduce trainable parameters beyond matrix-based approaches such as LoRA and PiSSA, while capturing inter-layer correlations within networks. However, existing tensor-based methods typically require repeated reconstruction of model weights during training, leading to substantial computational and memory overhead. To overcome these limitations, we propose Reconstruction-Free Tensor-Based Adaptation (ReFTA), which offers four key advantages: (1) it eliminates repeated explicit tensor reconstruction by exploiting the algebraic properties of tensors; (2) it achieves lower quantization error by fine-tuning only the principal tensor components; (3) it is supported by a rigorous generalization guarantee rooted in the algebraic foundations of tensor product–based approaches; and (4) it adopts a unified design controlled by a single tensor rank configuration. Extensive experiments on both image classification (IC) and natural language understanding (NLU) tasks demonstrate that ReFTA achieves the best accuracy–efficiency trade-off among all evaluated methods. In most cases, ReFTA attains the highest average accuracy with the fewest trainable parameters. On NLU tasks with RoBERTa-Large, ReFTA improves the average accuracy by approximately 5% over most existing methods while using 86.4% fewer parameters than LoRA (r=1) and 97.5% fewer than PiSSA. In particular, ReFTA achieves substantially lower peak GPU memory consumption, reducing usage by over 30% compared with tensor-based baselines on the RTE dataset and demonstrating markedly improved scalability.
Paperid: 3399,   Poster  
Authors: Qilin Huang, Quynh Anh Huynh, Long Le, Chen Wang, Chuhao Chen, Ryan Lucas, Eric Eaton, Lingjie Liu
Title: UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching
Abstract: Recent progress in 3D reconstruction, such as NeRFs and 3D Gaussian Splatting, has made it easy to recover geometry and appearance from images. However, these static representations remain blind to the physics that govern how objects deform and respond to forces. Building interactive 3D worlds therefore requires predicting not only shape but also the underlying material properties. Prior approaches either rely on slow test-time optimization or, more recently, a fast feed-forward predictor such as Pixie. However, these models produce only a single point estimate of physical parameters and are limited to a single simulation backend, restricting both expressiveness and portability. We introduce UniPixie, a generative physics-from-pixels framework that overcomes both limitations. UniPixie predicts a controllable, continuous soft-to-stiff distribution of plausible material properties from a single visual input, capturing inherent physical ambiguity. In addition, UniPixie is the first unified architecture to generate simulation-ready parameters for multiple physics solvers, including Material Point Method (MPM), Linear Blend Skinning (LBS), and Spring-Mass systems. Trained on our new PIXIEMULTIVERSE dataset of annotated material ranges, UniPixie produces diverse, physically consistent dynamics and achieves state-of-the-art accuracy, outperforming deterministic baselines by over 2x while inheriting the fast and generalizable inference from the prior feed-forward work.
Paperid: 3400,   Poster  
Authors: Mikhail Kennerley, Angelica I Aviles-Rivero, Carola-Bibiane Schönlieb, Robby T. Tan
Title: Mind the Gap: Transferring Labels to Align Object Detection Datasets
Abstract: Combining multiple object detection datasets offers a path to improved model generalisation but is hindered by inconsistencies in class semantics and bounding box annotations. Some methods to address this assume shared label taxonomies and address only spatial inconsistencies; others require manual relabelling, or produce a unified label space, which may be unsuitable when a fixed target label space is required. We propose Label-Aligned Transfer (LAT), a label transfer framework that systematically projects annotations from diverse source datasets into the label space of a target dataset. LAT begins by training dataset-specific detectors to generate pseudo-labels, which are then combined with ground-truth annotations via a Privileged Proposal Generator (PPG) that replaces the region proposal network in two-stage detectors. To further refine region features and address pseudo-label noise, a Semantic Feature Fusion (SFF) module injects class-aware context and features from overlapping proposals using a confidence-weighted attention mechanism. This pipeline preserves dataset-specific annotation granularity while enabling many-to-one label space transfer across heterogeneous datasets, resulting in a semantically and spatially aligned representation suitable for training a downstream detector. LAT thus jointly addresses both class-level misalignments and bounding box inconsistencies without relying on shared label spaces or manual re-annotation. Across multiple benchmarks, LAT demonstrates consistent improvements in detection performance, achieving gains of up to +8.4 AP over baseline methods.
Paperid: 3401,   Poster  
Authors: Panjun Liu, Jiyuan Xia, Yuanshen Guan, Yong Li, Zhiqiang Lang, Ruikang Xu, Chang Chen, Dehua Song, Fenglong Song, Zhiwei Xiong
Title: RawMetaDiff: Unlocking Extreme Darkness from Dual-Exposure RAW with Meta-Guided Diffusion
Abstract: Extreme low-light Raw image restoration remains challenging due to overwhelming noise and severe detail loss. In this paper, we exploit the potential of the dual-exposure setting for this severely ill-posed problem. Existing methods suffer from unreliable cross-exposure alignment, resulting in degraded detail recovery and compromised color fidelity. To address these challenges, we propose RawMetaDiff, a novel generative diffusion framework that restores a high-fidelity Raw image from a short-exposure input, conditioned on a potentially misaligned long-exposure reference under the guidance of Raw metadata. At its core, we propose two complementary mechanisms: the Meta-Assistant Color Transfer (MACT) enforces color consistency by aligning global color statistics along the channel dimension, while the Meta-Normed Cross Attention (MNCA) leverages Raw metadata to establish robust cross-exposure spatial correspondences and inject shadow details. To support robust diffusion training, we first collect a 1K real-world, dual-exposure Raw dataset, namely DERaw, and then design a realistic degradation model to synthesize data that closely approximates real-world conditions. Extensive experiments on both synthetic and real-world datasets demonstrate that RawMetaDiff significantly outperforms existing methods, offering an effective new solution for extreme low-light Raw image restoration from the generative perspective.
Paperid: 3402,   Poster  
Authors: Rui Hu, Song Wu, Wen Yang, Jinjian Wu
Title: From Contrast to Consistency: Rethinking Event-based Continuous-Time Optical Flow Estimation
Abstract: Estimating continuous optical flow is a fundamental yet challenging problem in dynamic visual perception. Event-based cameras, with microsecond latency and high dynamic range, capture brightness changes asynchronously, offering a unique opportunity to model motion with fine temporal precision. However, the scarcity of dense annotations restricts the effectiveness of supervised learning, while contrast maximization (CM) frameworks, focused on sharpening the Image of Warped Events (IWE), often neglect temporal continuity and structural coherence, leading to distorted trajectories under complex motion. To overcome these challenges, we propose a hybrid-supervised framework for continuous-time optical flow estimation, grounded in the principle of Spatio-temporal Structural Consistency (STSC). This paradigm jointly enforces local structural stability and trajectory continuity, ensuring physically coherent motion across time. To further enhance representation and robustness, we design a bidirectionally complementary multi-scale architecture and employ a curriculum-guided hybrid training strategy, enabling a smooth transition from supervised point constraints to self-supervised manifold regularization. Comprehensive experiments across multiple benchmarks show that our method achieves state-of-the-art performance in both continuous-time and standard optical flow estimation, demonstrating the effectiveness of the proposed learning paradigm.
Paperid: 3403,   Poster  
Authors: Min Yang, Xinwen Zhang, Jialei Tang, Xin Zhou, Kehan Li, Zeyi Huang, Limin Wang
Title: VideoRealBench: A Chain-of-Thought Realism Evaluation Benchmark for Generated Human-Centric Videos
Abstract: With the great advancement of video generation models, a growing number of content creators and researchers are leveraging these technologies to produce large volumes of human-centric videos for content creation and customized data generation for specific tasks. Although existing video generation models are capable of producing videos with high visual quality, their inadequate understanding of video realism results in unrealistic videos. While various evaluators have emerged to assess the quality of generated videos, they are trained on low-quality generated videos and data annotations, leading to ratings that are misaligned with human preferences. They also lack interpretability due to the absence of chain-of-thought reasoning. To address these issues, we propose VideoRealBench, a comprehensive benchmark for evaluating the realism of generated human-centric videos. We leverage a rating scale designed from human preferences to score videos and provide three-step rationales, thereby creating a finely annotated dataset, VideoRealDataset, and proposing an evaluator, VideoRealEval, capable of providing reliable scores along with detailed rationales. VideoRealEval achieves a Pearson’s linear correlation coefficient (PLCC) of 57.07% and a Spearman’s rank correlation coefficient (SROCC) of 56.78% on VideoRealDataset, demonstrating closer alignment with human preferences than existing evaluators.
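For readers unfamiliar with the two agreement metrics quoted above, the following is a minimal sketch of how PLCC and SROCC are commonly computed between human ratings and evaluator scores. It is not taken from the paper; the function names and the toy score lists are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation (SROCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy data: evaluator scores are monotone but not linear in the human ratings.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
model = [1.0, 4.0, 9.0, 16.0, 25.0]
print("PLCC:", round(pearson(human, model), 4))
print("SROCC:", round(spearman(human, model), 4))
```

On this toy data SROCC is 1 because the ordering is preserved exactly, while PLCC is slightly lower because the relationship is not linear; reporting both, as the abstract does, separates ranking agreement from linear agreement.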
Paperid: 3404,   Poster  
Authors: Jiaxing Li, Jiepeng Wang, Junyao Gao, Yang Liu, Eric Li, Bo An, Hao-Xiang Guo
Title: DynamicsBoost: Dynamic Plausible Video Generation via Annotation-Free Continuation Preference Optimization
Abstract: Despite significant progress in text-to-video generation, current models still suffer from unrealistic dynamics, temporal inconsistency, and unstable semantic alignment. Existing preference alignment approaches rely on costly and often ambiguous human or VLM-based video preference annotation, which has become a major bottleneck for scaling data. To address this challenge, we propose an annotation-free preference alignment method that constructs accurate preference pairs through video continuation. We extend a pretrained video generation model into a continuation model and apply continuation with different numbers of reference frames while keeping the total video length fixed. Because generated segments are inferior to ground-truth frames, and fixed-length continuations conditioned on more reference frames contain less generated content, such continuations exhibit higher fidelity than those with fewer references, naturally inducing a preference order. We further introduce Asymmetrical DPO, which computes the preference loss on all continuation regions except the shared prefix conditioning frames and normalizes it by their length, preventing spurious preference signals from leaking into the conditioned portion. Experiments across multiple benchmarks show that our method delivers significant improvements in dynamics realism, temporal coherence, and semantic alignment over existing DPO-based approaches, while fully eliminating the need for human preference labeling or auxiliary reward models.
Paperid: 3405,   Poster  
Authors: Yining Pan, Shijie Li, Yuchen Wu, Xulei Yang, Na Zhao
Title: PanDA: Panoptic Domain Adaptation for Multimodal Perception in Autonomous Driving
Abstract: This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segmentation (mm-3DPS), aiming to improve generalization under domain shifts commonly encountered in real-world autonomous driving. A straightforward solution is to employ a pseudo-labeling strategy, which is widely used in UDA to generate supervision for unlabeled target data, combined with an mm-3DPS backbone. However, existing mm-3DPS methods rely heavily on strong cross-modal complementarity between LiDAR and RGB inputs, making them fragile under domain shifts where one modality degrades (e.g., poor lighting or adverse weather). Moreover, conventional pseudo-labeling typically retains only high-confidence regions, leading to fragmented masks and incomplete object supervision, issues that are particularly detrimental to panoptic segmentation. To address these challenges, we propose PanDA, the first UDA framework specifically designed for multimodal 3D panoptic segmentation. To improve robustness against single-sensor degradation, we introduce an asymmetric multimodal augmentation that selectively drops regions to simulate domain shifts and improve robust representation learning. To enhance pseudo-label completeness and reliability, we further develop a dual-expert pseudo-label refinement module that extracts domain-invariant priors from both 2D and 3D modalities. Extensive experiments across diverse domain shifts, spanning time, weather, location, and sensor variations, show that PanDA significantly surpasses state-of-the-art UDA baselines for 3D semantic segmentation.
Paperid: 3406,   Poster  
Authors: Ian Noronha, Heather Neave, Upinder Kaur
Title: MooCap: A Multi-View Benchmark for Cow-Object-Human Interaction and Behavior Dynamics
Abstract: Understanding animal behavior requires modeling how bodies, objects, and other agents interact over time, not simply detecting isolated actions or estimating pose frame by frame. Existing animal video datasets target pose estimation or coarse, passively observed actions, and rarely provide the structured, multi-entity interaction annotations needed to study behavioral dynamics. We introduce MooCap, a multi-view video benchmark for animal-object-human interaction understanding under controlled experimental protocols. MooCap contains 42 hours of synchronized multi-camera video from 43 individually tested cows across seven standardized interaction scenarios, including novel environment, novel object, novel human, human approach, unfamiliar conspecifics (restricted and unrestricted) and Dam reunion (restricted and unrestricted). Recordings are densely annotated with 23 fine-grained behaviors, 39 body keypoints across 157 test sessions, 4 spatial zones, and 43 subjects, describing interactions among subjects, objects, humans, and other cattle. We establish three benchmarks on MooCap: (1) dense temporal action segmentation over 1200-1500-second sequences; (2) pose-based behavior and interaction recognition from keypoint trajectories; and (3) longitudinal behavioral classification linking adult behaviors with rearing conditions. Benchmarking results reveal that state-of-the-art temporal segmentation models achieve only 66.4% frame accuracy and 30.6% F1@0.5, with performance degrading further during interaction-heavy segments. Overall, MooCap bridges multi-view pose estimation, multi-entity tracking, and structured behavioral protocols to enable interaction-aware models for animal behavior analysis.
Paperid: 3407,   Poster  
Authors: Dae-Hyeon Park, Mina Baek, Jeong-Hun Ha, Chan-Seop Park, Jamshidjon Ganiev, Seung-Hwan Bae
Title: MVLM: Template-Free Tracking via Vision–Language Margin Confidence and Memory-Gated Tracking
Abstract: We introduce a new template-free tracking paradigm based solely on natural language, capable of tracking an arbitrary object and seamlessly switching to a new target without box initialization. Our key idea is to localize an object via vision-language (VL) correlation. However, using the correlation alone is brittle under large search regions due to spatial uncertainty and ambiguous VL saliency. To resolve these, we propose MVLM, a memory-based vision-language margin confidence that integrates vision–language correlation, encoder prediction, and temporal memory. MVLM dynamically gates the search region, switching between compact ROI (Region of Interest) search and global re-localization, to reduce spatial uncertainty. Theoretically, we derive bounds that connect the MVLM score to tracking probability, characterizing mis-localization within ROI and ROI-exclusion probabilities. Through extensive evaluation, we validate our theorems and achieve state-of-the-art performance on several benchmarks (TNL2K, LaSOT, OTB99 and MGIT) using only language guidance.
Paperid: 3408,   Poster  
Authors: Rotem Gatenyo, Ohad Fried
Title: Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
Abstract: We study zero-shot 3D alignment of two given meshes from a short text prompt describing their spatial relation, an essential capability for content creation and scene assembly. Earlier approaches primarily rely on geometric alignment procedures, while recent work leverages pretrained 2D diffusion models to model language-conditioned object-object spatial relationships. In contrast, we directly optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients via a differentiable renderer, without training a new model. Our framework augments language supervision with geometry-aware objectives: a variant of the soft Iterative Closest Point (ICP) term to encourage controlled surface attachment and a penetration loss to discourage interpenetration. A phased schedule strengthens contact constraints over time, camera control concentrates views on the interaction region, and randomized restarts improve robustness. To enable evaluation, we curate a benchmark of 50 (mesh pair, prompt) cases spanning diverse categories and relations, and compare against baselines. Across the benchmark, our method yields semantically faithful and physically plausible alignments, improving CLIP similarity while reducing intersection volume.
Paperid: 3409,   Poster  
Authors: YIMIN WEI, Aoran Xiao, Hongruixuan Chen, Junshi Xia, Naoto Yokoya
Title: MM-OVSeg: Multimodal Optical–SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing
Abstract: Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical–SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities: optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision–language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. All datasets and code will be publicly released.
Paperid: 3410,   Poster  
Authors: Debasmit Das, Munawar Hayat, Fatih Porikli
Title: DEVA: Fine-tuning Multimodal Large Language Models for Visual Perception Tasks
Abstract: Fine-tuning large language models (LLMs) using reinforcement learning (RL) objectives has gained traction, especially in scenarios where labeled data is limited. Building on its success in the language domain, recent efforts have extended RL-based fine-tuning to multimodal tasks. Visual-RFT, for instance, applied Group Relative Policy Optimization (GRPO) to fine-tune multimodal LLMs (MLLMs) across various visual perception benchmarks, achieving notable improvements over standard supervised fine-tuning (SFT). However, its scope was limited by a narrow evaluation of RL adaptation strategies. In this work, we expand the landscape by introducing new RL-based baselines on the same benchmarks and conducting a deeper analysis of GRPO’s training dynamics. We identify key limitations, such as reduced generation diversity, constrained policy exploration, and suboptimal reward formulation and aggregation. To address these, we propose DEVA: a framework that enhances Diversity via a flow-based training objective, encourages broader policy Exploration through global entropic regularization, and leverages alignment Volume as a non-verifiable reward combined with harmonic Aggregation. Applied to GRPO and other RL methods, DEVA delivers consistent gains in both quantitative (+5 to +13 points) and qualitative metrics. We further provide visualizations, ablations, and analyses to unpack the contributions of each component in our framework.
Paperid: 3411,   Poster  
Authors: Jiacheng Pi, Zhiguo Yang, Xingxing Huang, Dongsheng Xu, Ruizhi Zhong, Wenjie Ruan
Title: Jailbreaking Vision-Language Models via Dissonance-Guided Suffix Optimization and Image–Phrase Injection
Abstract: The integration of vision and language in Vision-Language Models (VLMs), while enabling multimodal capabilities, inherently expands their attack surface. Among existing white-box jailbreak methods, suffix-optimization-based approaches often rely on gradient approximations over discrete token spaces, yielding insufficient guidance and causing optimization to stagnate in local optima, while image-perturbation-based ones frequently exhibit poor cross-model transferability. In this work, we introduce DGSIP, a Dissonance-Guided Suffix Optimization and Image–Phrase Injection framework. DGSIP leverages predictive dissonance between the target model and an unaligned model to identify tokens suppressed by safety alignment, using them as a more effective signal than gradient-based cues for suffix optimization. It further reinforces the attack by jointly optimizing the content and presentation of phrases embedded in images to leverage VLMs’ cross-modal sensitivity. Our extensive experiments demonstrate that DGSIP outperforms prior baselines across multiple safety benchmarks and a range of open-source VLMs (e.g., MiniGPT-4, InstructBLIP and LLaVA). More importantly, compared to baselines, our method exhibits much stronger transferability to commercial black-box VLMs, such as GPT-4o-Mini, Gemini 2.0 Flash and Qwen 2.5-VL. Based upon DGSIP, we empirically reveal critical vulnerabilities in the safeguard mechanisms of current VLMs, highlighting the urgent need for more robust defense strategies.
Paperid: 3412,   Poster  
Authors: Yufeng Cheng, wenxu wu, Shaojin Wu, Mengqi Huang, Fei Ding, Qian HE
Title: Scaling Multi-Identity Consistency for Image Customization via Multi-to-Multi Matching Paradigm
Abstract: Recent advancements in image customization exhibit a wide range of application prospects due to stronger customization capabilities. However, since humans are more sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, limiting the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework, designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With a multi-to-multi matching paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and unleashes multi-identity consistency for existing image customization methods in general. To facilitate the training of UMO, we develop a customization dataset with multi-reference images, consisting of both synthesised and real parts. Additionally, we propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only improves identity consistency significantly, but also reduces identity confusion on several image customization methods, setting a new state-of-the-art among open-source methods along the dimension of identity preservation.
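To illustrate what treating identity matching as a global assignment problem can mean in general, here is a small sketch that compares per-face greedy matching with an exhaustive one-to-one assignment. It is not UMO's implementation: the similarity matrix, the function name `best_assignment`, and the brute-force search are all assumptions chosen to keep the example dependency-free (a practical system would more likely use the Hungarian algorithm).

```python
from itertools import permutations

def best_assignment(sim):
    """Exhaustively find the one-to-one matching with maximal total similarity.

    sim[i][j] is a hypothetical similarity between generated face i and
    reference identity j; the matrix is assumed to be square and small.
    """
    n = len(sim)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(sim[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm), best_score

# Hypothetical similarities: faces 0 and 1 both look most like reference 0.
sim = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
    [0.10, 0.30, 0.70],
]

# Greedy matching picks the best reference per face independently,
# so two faces can collapse onto the same identity.
greedy = [max(range(len(row)), key=lambda j: row[j]) for row in sim]
matching, total = best_assignment(sim)
print("greedy:", greedy)
print("global:", matching, round(total, 2))
```

In this toy case the greedy choice maps two generated faces to the same reference, which is one way identity confusion can show up, whereas the global assignment keeps the matching one-to-one.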
Paperid: 3413,   Poster  
Authors: Zihui Wang, Yuhang Fu, Mengmeng Du, Zhimin Yuan, Yachen Liu, Weisheng Liao, Kaiyu Wang, Zheng Wang
Title: FedRAC: Rolling Submodel Allocation for Collaborative Fairness in Federated Learning
Abstract: Collaborative fairness in federated learning ensures that clients are rewarded according to their contributions, thereby fostering long-term participation among clients. However, existing methods often under-reward low-contributing clients in the early training stage and neglect critical issues, such as consistency across local models or unequal neuron training frequencies in the aggregated model, both of which lead to degraded performance. To address these issues, we propose FedRAC, a novel Federated learning framework employing Rolling submodel Allocation for Collaborative fairness, without compromising the global model performance. First, we design a dynamic reputation calculation module with a theoretical fairness guarantee to generate reputations matching clients’ contributions. It adjusts their reputations dynamically during training, ensuring low-contributing clients access better models in the early stages for adequate training. Second, we propose a rolling submodel allocation module that assigns high-performance submodels to clients with high reputations. This module prioritizes low-frequency neurons during allocation and is supported by theoretical convergence guarantees, ensuring that all neurons in the global model are fully trained. Extensive experiments are conducted on three public datasets to confirm the advantages of our method in terms of fairness and model accuracy.
Paperid: 3414,   Poster  
Authors: Tianbo Pan, Xingyi Yang, Xinchao Wang
Title: Merge3D: Efficient 3D Multimodal LLMs via Joint 2D-3D Token Merging
Abstract: Multimodal Large Language Models (MLLMs) incorporating 3D geometry demonstrate significant power in 3D scene understanding. Their primary bottleneck, however, is the substantial computational burden associated with processing multi-view, lengthy visual token sequences. To surmount this challenge, we propose Merge3D, a geometry-aware token merging framework that integrates both 3D geometry and 2D semantic information. Conventional 2D compression methods, which rely solely on semantic signals, prove inadequate for 3D tasks, as they tend to discard spatially critical tokens and damage grounding performance. Merge3D bridges the modalities with a Semantic–Geometric Token Merger (SemGeo Merger): 2D attention is used to select semantically salient dominant tokens, while a hybrid 2D+3D similarity assigns and aggregates contextual tokens from spatially coherent 3D neighborhoods. This preserves 3D structural priors and inter-frame correspondences under aggressive compression. Merge3D achieves up to 70% visual token reduction and up to ~3× inference speedup, while retaining strong performance on 3D grounding, captioning, and spatial reasoning benchmarks such as Scan2Cap, CV-Bench, and BLINK.
Paperid: 3415,   Poster  
Authors: Denys Iliash, Jiayi Liu, Egor Fokin, Qirui Wu, Ali Mahdavi Amiri, Manolis Savva, Angel Xuan Chang
Title: Artiverse: A Diverse and Physically Grounded Dataset for Articulated Objects
Abstract: We present Artiverse, a diverse and physically grounded dataset of high-quality articulated 3D objects designed for realistic functional modeling and simulation. Artiverse contains 5.4K human-authored objects across a broad range of 88 categories, aggregated from multiple 3D static repositories. Objects are annotated with functional parts, interior structures, realistic kinematic relationships including multi-DoF joints, and physical attributes such as metric scale, material, and mass. We develop a semi-automated annotation pipeline that combines few-shot segmentation, geometric reasoning, vision-language model inference, and multi-stage human verification to achieve high-quality and efficient annotation, reducing manual annotation time by over 30%. We demonstrate the value of Artiverse on tasks of part mobility analysis, articulated object generation, and physics-based interaction. Artiverse provides a data resource to advance functional understanding for articulated objects.
Paperid: 3416,   Poster  
Authors: Wei Feng, Yiwen Jiang, Sijin Zhou, Zongyuan Ge
Title: Beyond the Static World: Continual Category Discovery under Visual Drift
Abstract: Generalized Category Discovery (GCD) aims to identify both known and novel classes from unlabeled data with the aid of labeled examples. While promising, most existing GCD methods rely on simultaneous access to labeled and unlabeled datasets, an assumption often impractical in real-world deployments. Continual Category Discovery (CCD) relaxes this requirement by adapting a pre-trained model to streaming unlabeled data, yet it typically assumes domain-consistent data distributions. This places a strong limitation on its applicability. In this work, we study Open Continual Category Discovery (OCCD), where the model must robustly discover previously unseen concepts from real-world data streams that may originate from heterogeneous and shifting domains. To address this, we propose an adaptive framework built on three key ideas. First, we propose a weight-aware separation module, which leverages partial unbalanced optimal transport for instance probability modeling and employs binary response spectrum quantization to generate cues for distinguishing known and unknown categories, enabling automatic sample separation. Second, for known categories, we introduce a cross-domain semantic alignment module that incorporates adversarial learning to perform adaptive prototype matching, thereby enhancing robustness against domain shifts. Finally, for unknown categories, we design a category topology consistency constraint that preserves semantic relationships between known and novel classes during distribution shifts. Experiments show our approach excels at discovering new categories while maintaining strong performance on known ones in evolving domains.
Paperid: 3417,   Poster  
Authors: Siddhant Gole, Akash Pal, Amit More, S Divakar Bhat, Subhasis Chaudhuri, Biplab Banerjee
Title: Hyperbolic Prototype Learning with Uncertainty-Aware Consistency for Continual Test-Time Segmentation
Abstract: Continual Test-Time Adaptation (CTTA) for semantic segmentation is vital for deploying vision models in dynamic environments with persistent domain shifts. Existing methods often degrade over time as self-supervised updates amplify early prediction errors. We attribute this fragility to a geometric limitation: Euclidean feature spaces, with polynomial volume growth, lead to distorted semantic representations and crowded, unstable decision boundaries. We propose HyperProtoSeg, a hyperbolic prototypical segmentation network that learns geometrically optimal class prototypes in the Poincaré ball. Leveraging the exponential expansion of hyperbolic space, it enforces large and uniform inter-class margins with low distortion, yielding well-separated and curvature-stable embeddings. For robust online adaptation, we introduce Hyperbolic Boundary Consistency Adaptation (HBCA), which partitions pixels by cross-view consistency into confident “core” and uncertain “boundary” sets. HBCA applies geodesic distance minimization for confident regions and a novel Hyperbolic Directional Consistency Loss for uncertain ones, preventing error amplification. Experiments on challenging synthetic-to-real benchmarks (Cityscapes to ACDC, IDD to IDD-AW, SHIFT) show that HyperProtoSeg + HBCA achieves average improvements of 1.94%, 4.02%, and 1.24%, respectively, over state-of-the-art CTTA methods under severe structural shifts.
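As background for the Poincaré-ball geometry mentioned above, the sketch below evaluates the standard geodesic distance on the unit Poincaré ball with curvature -1, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). It only illustrates the textbook formula; the curvature value, the function name, and the sample points are assumptions, and nothing here reproduces HyperProtoSeg or HBCA.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball (curvature -1)."""
    def sq_norm(w):
        return sum(c * c for c in w)

    diff = [a - b for a, b in zip(u, v)]
    arg = 1.0 + 2.0 * sq_norm(diff) / ((1.0 - sq_norm(u)) * (1.0 - sq_norm(v)))
    return math.acosh(arg)

# The same Euclidean gap (0.05) costs far more hyperbolic distance near the boundary.
near_center = poincare_distance((0.0, 0.0), (0.05, 0.0))
near_boundary = poincare_distance((0.90, 0.0), (0.95, 0.0))
print(round(near_center, 4), round(near_boundary, 4))
```

The growth of distance toward the boundary is the "exponential expansion" property the abstract appeals to: there is much more room to separate class prototypes near the edge of the ball than in a Euclidean space of the same dimension.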
Paperid: 3418,   Poster  
Authors: Huanjing Yue, Shangbin Xie, Cong Cao, Qian.Wu Qian.Wu, Lei Zhang, Zhao Lei, Jingyu Yang
Title: SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras
Abstract: RAW images preserve superior fidelity and rich scene information compared to RGB, making them essential for tasks in challenging imaging conditions. To alleviate the high cost of data collection, recent RGB-to-RAW conversion methods aim to synthesize RAW images from RGB. However, they overlook two key challenges: (i) the reconstruction difficulty varies with pixel intensity, and (ii) multi-camera conversion requires camera-specific adaptation. To address these issues, we propose SpiralDiff, a diffusion-based framework tailored for RGB-to-RAW conversion with a signal-dependent noise weighting strategy that adapts reconstruction fidelity across intensity levels. In addition, we introduce CamLoRA, a camera-aware lightweight adaptation module that enables a unified model to handle diverse ISP characteristics. Extensive experiments on four benchmark datasets demonstrate the superiority of SpiralDiff in RGB-to-RAW conversion quality and its downstream benefits in RAW-based object detection.
Paperid: 3419,   Poster  
Authors: Shiyu Qin, Yongkang Lu, Yimin Zhou, Jiawei Li, Yifan Ren, Xue Yuerong, Shu-Tao Xia, Bin Chen
Title: FreqSIC: Frequency-aware Stereo Image Compression with Bi-directional Checkerboard Context Model
Abstract: Stereo image compression is essential for a wide range of 3D vision applications. Recent methods have demonstrated strong capabilities in eliminating inter-view redundancy and enabling compact entropy coding via spatial-domain stereo transformation and advanced autoregressive entropy models. However, these approaches often suffer from high-frequency information loss and incur considerable coding latency. To overcome these limitations, we propose a novel frequency stereo context transfer (FSCT) module. Unlike spatial-domain methods, the FSCT module separately captures inter-view redundancy in high- and low-frequency components and dynamically balances their contributions to preserve reconstruction quality. In addition, we replace the conventional autoregressive framework with a checkerboard strategy and integrate the FSCT module to model inter-view priors, enabling faster and more efficient entropy coding. Extensive experiments demonstrate that our method achieves state-of-the-art rate-distortion performance among existing stereo image compression approaches, while also attaining the lowest coding latency.
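The checkerboard strategy mentioned above replaces position-by-position autoregressive decoding with two parallel passes. The following sketch shows only the generic two-pass checkerboard layout known from learned image compression; which parity acts as the anchor set, and the helper names, are assumptions, and the bi-directional stereo variant of the paper is not modeled.

```python
def checkerboard_mask(height, width):
    """1 marks anchor positions (decoded in pass 1), 0 marks non-anchors (pass 2)."""
    return [[(r + c) % 2 for c in range(width)] for r in range(height)]

def neighbours_are_anchors(mask, r, c):
    """True if every in-bounds 4-neighbour of (r, c) is an anchor position."""
    h, w = len(mask), len(mask[0])
    nbrs = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return all(mask[i][j] == 1 for i, j in nbrs if 0 <= i < h and 0 <= j < w)

mask = checkerboard_mask(4, 6)
for row in mask:
    print(" ".join(str(v) for v in row))

# Every non-anchor is surrounded by already-decoded anchors, so all
# non-anchors can be entropy-decoded in a single parallel second pass.
ok = all(
    neighbours_are_anchors(mask, r, c)
    for r in range(4) for c in range(6) if mask[r][c] == 0
)
print("non-anchors fully conditioned on anchors:", ok)
```

Because each pass touches half of the latent positions at once, the number of sequential decoding steps drops from one per position to two in total, which is the source of the latency reduction such schemes aim for.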
Paperid: 3420,   Poster  
Authors: Yan Di, Yuheng Li, Yaoxing Wang, Mengge Liu, Shan Gao, Xiangyang Ji
Title: PAMotion: Physics-Aware Motion Generation for Full-Body Interaction with Multiple Objects
Abstract: We present PAMotion, a physics-aware diffusion framework for generating realistic full-body human interactions with multiple objects. Existing diffusion-based methods that jointly synthesize human and object motions often struggle to capture the intricate physical interactions, especially those involving complex hand–object contacts. To address this issue, we begin with our key observation: in everyday, slow-motion scenarios, object accelerations inherently reveal the underlying physical interactions. If an object’s acceleration aligns with gravity, it is likely in free motion with no physical contact from the human or other objects; otherwise, it must be in contact, directly or indirectly, with the human body. Building on this intuition, PAMotion jointly models full-body human motion, object motion, and their corresponding accelerations, enforcing physical plausibility through a physics-aware interaction loss. In this loss, we softly penalize violations of consistency between object acceleration and human-object contact states. PAMotion follows a coarse-to-fine pipeline: we first synthesize global torso and object translations, then conditionally refine hand motions and object rotations, achieving both high-level motion-text consistency and low-level physical fidelity. Experiments on two challenging datasets, HIMO and ParaHome, demonstrate that PAMotion achieves state-of-the-art performance in generating realistic, physically consistent full-body manipulation sequences involving multiple objects. Code and trained models will be released soon.
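The acceleration observation above can be made concrete with a small numerical check. The sketch below estimates object acceleration from sampled positions with central finite differences and flags frames whose acceleration deviates from gravity. It is an illustration of the stated intuition only, not PAMotion's loss: the z-up convention, the units, the tolerance `tol`, and all function names are assumptions.

```python
import math

GRAVITY = (0.0, 0.0, -9.81)  # assumption: z-up world frame, metres and seconds

def accelerations(positions, dt):
    """Central second differences: a_t = (p[t+1] - 2 p[t] + p[t-1]) / dt^2."""
    acc = []
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        acc.append(tuple((n - 2.0 * c + p) / dt ** 2 for p, c, n in zip(prev, cur, nxt)))
    return acc

def likely_in_contact(acc, tol=0.5):
    """Flag a frame as 'in contact' when acceleration deviates from gravity by more than tol."""
    deviation = math.sqrt(sum((a - g) ** 2 for a, g in zip(acc, GRAVITY)))
    return deviation > tol

dt = 1.0 / 30.0
# Hypothetical trajectories: a dropped object and an object resting on a table.
falling = [(0.0, 0.0, 2.0 - 0.5 * 9.81 * (k * dt) ** 2) for k in range(6)]
resting = [(0.3, 0.1, 0.8) for _ in range(6)]

print([likely_in_contact(a) for a in accelerations(falling, dt)])
print([likely_in_contact(a) for a in accelerations(resting, dt)])
```

The resting object has zero acceleration, which differs from gravity, so it is flagged as supported by something; this matches the abstract's "directly or indirectly" wording, since a table or a hand both count as contact.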
Paperid: 3421,   Poster  
Authors: Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, Mohamed Elhoseiny
Title: Time Blindness: Why Video-Language Models Can’t See What Humans Can?
Abstract: Recent advances in vision–language models (VLMs) have made impressive strides in understanding spatiotemporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. Sample dataset and code are available in the supplementary material.
Paperid: 3422,   Poster  
Authors: Kang Wu, Lei Yu, Junwei Luo, Bo Dang, Junjian Zhang, Xiangyuan Cai, Hongwei Hu, Jingdong Chen, Yansheng Li
Title: SkySense-VITA: Towards Universal In-context Segmentation of Multi-modal Remote Sensing Imagery
Abstract: While recent foundation models for remote sensing (RS) segmentation have shown notable progress, they still face significant challenges, struggling to process diverse multi-modal inputs, synergize complementary prompt types, and leverage semantic hierarchies. To address these limitations, we introduce SkySense-VITA, a unified in-context segmentation model, which synergistically processes both optical and SAR imagery using visual, textual, or fused prompts. Based on a novel prompt-and-prediction decoupling strategy, we propose the VITA-Former and VITA-Decoder to decouple multi-modal prompt fusion from the prediction process, allowing the model to flexibly support visual-only, textual-only, and fused prompt modes. We train SkySense-VITA with a progressive two-stage strategy: a first stage of Image-Level Alignment Pretraining featuring optical-SAR alignment, and a second stage of Pixel-Level In-context Pretraining using Semantic Granularity Annealing (SGA), a coarse-to-fine curriculum that enables robust hierarchical learning. To support this training, we also introduce our new large-scale, multi-modal Sky-VT-300k dataset. Extensive experiments show SkySense-VITA establishes a new state-of-the-art (SOTA) on 18 datasets, with an average performance lead of over 10% mIoU. Code, models, and data will be released upon acceptance.
Paperid: 3423,   Poster  
Authors: Wang Jiarui, Huiyu Duan, Juntong Wang, Xiongkuo Min
Title: FVBench: Benchmarking Deepfake Video Detection Capability of Large Multimodal Models
Abstract: As generative models rapidly evolve, the realism of AI-generated videos has reached new levels, posing significant challenges for detecting the authenticity of videos. Existing deepfake detection techniques generally rely on training datasets with limited generation methods and content diversity, which limits their generalization ability on more realistic content, particularly that produced by the latest generative models. Recently, large multimodal models (LMMs) have demonstrated remarkable zero-shot performance across a variety of vision tasks. Yet, their ability to discern deepfake videos remains largely untested. To this end, we propose FVBench, a comprehensive deepfake video benchmark designed to advance video deepfake detection. It includes: (i) extensive content diversity, with over 120K videos covering real, AI-edited, and fully AI-generated categories, (ii) comprehensive model coverage, with fake videos generated and edited by 42 state-of-the-art video synthesis and editing models, and (iii) a deepfake video detection benchmark for LMMs, which comprehensively explores the deepfake video detection capabilities of LMMs. The FVBench dataset and evaluation code will be publicly available upon publication.
Paperid: 3424,   Poster  
Authors: Simone Mosco, Daniel Fusaro, Alberto Pretto
Title: Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation
Abstract: Understanding the surrounding environment is fundamental in autonomous driving and robotic perception. Distinguishing between known classes and previously unseen objects, the task of anomaly segmentation, is crucial in real-world environments. However, research in the 3D field remains limited, with most existing approaches applying post-processing techniques from 2D vision. To address this shortcoming, we propose a new efficient approach that directly operates in the feature space, modeling the feature distribution of inlier classes to constrain anomalous samples. Moreover, the only publicly available 3D LiDAR anomaly segmentation dataset contains simple scenarios, with few anomaly instances, and exhibits a severe domain gap due to its sensor resolution. To bridge this gap, we introduce a set of mixed real-synthetic datasets for 3D LiDAR anomaly segmentation, built upon established semantic segmentation benchmarks, with multiple out-of-distribution objects and diverse, complex environments. Extensive experiments demonstrate that our approach achieves state-of-the-art and competitive results on the existing real-world dataset and the newly introduced mixed datasets, respectively, validating the effectiveness of our method and the utility of the proposed datasets. Code and datasets are available at [REMOVED DUE TO ANONYMOUS SUBMISSION].
Paperid: 3425,   Poster  
Authors: Jiayang Sun, Pin Wang, Hongbo Wang, Xinyue Liu, Huaibo Huang, Ran He
Title: Towards Fine-Grained Attribution: Instance-Aware Preference Optimization for Aligning Diffusion Models
Abstract: Direct Preference Optimization has achieved remarkable success in aligning diffusion models with human feedback. However, existing methods heavily rely on image-level preferences, which suffer from sparse rewards in the spatial dimension. This creates a fundamental misalignment: while an image may be globally preferred, it can contain locally inferior instances. Applying the same positive preference to these areas thus unfairly credits distracting regions while penalizing informative ones, leading to suboptimal performance and inefficient learning. To resolve this issue, we propose IAPO, an Instance-Aware Preference Optimization that introduces instance-level credit assignment to advance alignment from image-level to instance-level. We first construct a high-quality instance-level preference dataset by automatically identifying and relabeling corresponding instances in image pairs using vision-language models and object detection models. Leveraging this fine-grained dataset, we design a novel instance alignment loss with a dynamic reweighting mask that modulates instance-level loss within annotated bounding boxes, suppressing distractors to enforce fine-grained human preference alignment. Extensive experiments demonstrate that our method not only achieves state-of-the-art performance in multiple benchmarks but also attains higher training efficiency due to fine-grained instance-level preference labels.
Paperid: 3426,   Poster  
Authors: Ruoke Yan, Mingjia Yang, Xinfeng Zhang, Haocheng Tang, Qian Yin, Zhipin Deng, Kai Zhang, Li Zhang, Siwei Ma
Title: GauMVC: Generative Decoupled Gaussian Representation for Human-centric Multi-view Video Compression
Abstract: Human-centric multi-view video has a clear semantic structure: a static background and dynamic human motion. We propose a generative compression framework that explicitly decouples these components. The background is modeled once with 3D Gaussian Splatting, while the human is represented by a personalized Gaussian avatar reconstructed from a sparse set of key views that are transmitted only once and driven by compact per-frame pose parameters from the Skinned Multi-Person Linear (SMPL) model. The encoder sends only three elements: the background, the key views, and the SMPL parameters, enabling high-fidelity multi-viewpoint synthesis at dramatically reduced bitrates. This shifts compression from low-level redundancy removal to semantics-aware generative modeling. Experiments across multiple human-centric datasets demonstrate superior rate–distortion performance, particularly for long and densely captured sequences, and naturally enable semantic editing.
Paperid: 3427,   Poster  
Authors: Nan Yang, Julian Straub, Fan Zhang, Richard Newcombe, Jakob Engel, Lingni Ma
Title: LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World
Abstract: Tracking 3D human motion from egocentric, multi-camera devices is challenged by severe egomotion and partial visibility or occlusions. Existing methods are designed for monocular video, often recorded from static or slowly moving cameras, and cannot easily leverage multi-view, calibrated and localized input. This makes them brittle and prone to fail on dynamic egocentric captures. We propose LAMP (Localization Aware Multi-camera People Tracking): a novel, simple framework to solve this via early disentanglement of observer and target motion. LAMP introduces a two-step process: First, we leverage the device's known 6-DoF pose and calibration to convert detected 2D body keypoints from all cameras over a temporal window into a unified 3D world reference frame. Second, an end-to-end-trained Transformer model fits 3D human motion directly to this spatio-temporal ray cloud in world coordinates. This "lift-then-fit" approach makes it possible to learn and leverage a natural prior over world-space human motion, and provides an elegant framework to flexibly incorporate information from multiple, temporally asynchronous, partially observing, and moving cameras. LAMP achieves state-of-the-art results on monocular benchmarks, while significantly outperforming baselines for our targeted egocentric multi-camera setting.
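Illustrative note (not from the paper): the "lift" step described above can be sketched as back-projecting each detected 2D keypoint into a ray in world coordinates. The sketch below assumes an ideal pinhole camera with intrinsics K and a camera-to-world pose (R_wc, t_wc); the function name and argument names are hypothetical, and the actual LAMP pipeline may use a different camera model and ray parameterization.

```python
import numpy as np

def keypoints_to_world_rays(keypoints_px, K, R_wc, t_wc):
    """Back-project 2D pixel keypoints to rays in the world frame.

    Assumes a pinhole camera (hypothetical simplification):
      keypoints_px: (N, 2) pixel coordinates (u, v)
      K:            (3, 3) camera intrinsics
      R_wc, t_wc:   camera-to-world rotation (3, 3) and translation (3,)
    Returns ray origins (N, 3) and unit ray directions (N, 3).
    """
    keypoints_px = np.asarray(keypoints_px, dtype=float)
    K = np.asarray(K, dtype=float)
    R_wc = np.asarray(R_wc, dtype=float)
    t_wc = np.asarray(t_wc, dtype=float)

    # Homogeneous pixel coordinates (u, v, 1).
    ones = np.ones((keypoints_px.shape[0], 1))
    pixels_h = np.hstack([keypoints_px, ones])

    # Direction in the camera frame: K^{-1} p, applied row-wise.
    dirs_cam = pixels_h @ np.linalg.inv(K).T

    # Rotate into the world frame and normalize to unit length.
    dirs_world = dirs_cam @ R_wc.T
    dirs_world /= np.linalg.norm(dirs_world, axis=1, keepdims=True)

    # Every ray from this camera starts at the camera center.
    origins = np.broadcast_to(t_wc, dirs_world.shape).copy()
    return origins, dirs_world
```

Stacking the rays from all cameras and all frames in a temporal window would give one possible form of the spatio-temporal ray cloud that the abstract says the Transformer fits motion to.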
Paperid: 3428,   Poster  
Authors: Xiya Shen, Qinglin Zhao, Li Feng
Title: AD-GBC: Anisotropic Granular-Ball Skip-Connection Refiner for UNet-Based Medical Image Segmentation
Abstract: Prototype or region-attention modules have recently improved medical image segmentation but still suffer from two fundamental limitations: 1) they represent each semantic concept as a point or isotropic region, failing to capture the inherently anisotropic geometry of real feature distributions; and 2) many rely on non-differentiable clustering or one-way kernel weighting, which restricts their ability to form coherent region-level representations. We address these issues with the Anisotropic Differentiable Granular-Ball (AD-GBC) module, which generalizes prototypes into learnable geometric regions parameterized by a center and an anisotropic vector scale. AD-GBC aggregates local features into region-level semantics and redistributes the refined representation back to pixels in a fully differentiable manner, enabling geometry-aware refinement within modern UNet-style architectures. Two geometric regularizers, a Wasserstein-based diversity loss and a radius–dispersion consistency loss, prevent center collapse and encourage stable, well-formed region geometry. AD-GBC yields consistent improvements across four widely used medical segmentation benchmarks (BUSI, GlaS, CVC-ClinicDB, ISIC17) when integrated into two strong backbones (Rolling-UNet and U-KAN), demonstrating that the proposed geometric region formulation generalizes well across different imaging conditions.
Paperid: 3429,   Poster  
Authors: Zixuan Chen, Xiangrong Feng, Jieqi Shi, Lin Shao, Jing Huo, Yang Gao
Title: AGiLe: Learning Robust Long-Horizon Manipulation via Affordance-Grounded Bidirectional Latent Planning
Abstract: The robust execution of long-horizon manipulation tasks remains a central challenge in embodied intelligence, necessitating both coherent high-level planning and reliable low-level control. Existing approaches often encounter two critical limitations: the accumulation of prediction errors in subgoal planning, leading to compounding deviations over time; and the planning-execution gap, where high-level abstract plans fail to be effectively grounded in the continuous perception-action space. To address these challenges, we propose a novel unified framework, Affordance-Grounded Bidirectional Latent Planning (AGiLe). AGiLe introduces a bidirectional latent planning mechanism that jointly optimizes a backward planner and a forward critic. The backward planner generates goal-directed subgoals from the final objective, while the forward critic assesses their reachability, thereby ensuring temporal robustness through sustained consistency in long-horizon planning. Furthermore, AGiLe bridges the planning-execution gap by leveraging affordance as structural guidance, grounding abstract subgoals into dense, pixel-level visual affordances that drive action. This approach enhances spatial robustness, enabling the system to effectively adapt to semantic and visual distractors. Extensive empirical evaluations across both simulation and real-world settings confirm that AGiLe significantly outperforms strong baselines, achieving an 8.5% improvement over prior state-of-the-art methods in simulation and demonstrating its effectiveness and robustness in long-horizon manipulation tasks.
Paperid: 3430,   Poster  
Authors: Hao Guo, Liyuan Deng, Yongkang Dai, Ruohan Wang, Jiahao Li, Yunpeng Bai, Yilei Shi
Title: BrepVGAE: Variational Graph Autoencoder with Unified Latent Representation for B-rep
Abstract: Due to the heterogeneity of faces and edges in B-rep, conventional graph-based representations are incapable of establishing a unified formulation for faces and edges, thereby constraining the capabilities of B-rep generative models. We propose B-rep Variational Graph Auto Encoding (BrepVGAE), the first variational graph autoencoder framework capable of holistically encoding and decoding boundary representations of B-rep models. First, we represent both geometric faces and edges as nodes in a graph representation. We then design a sparse graph autoencoder to aggregate the complete B-rep structure into a compact global latent vector. Next, we construct a decoder that employs set-based generation, which uses bilinear layers to reconstruct adjacency relationships, i.e., topology, from a single latent vector. Afterwards, the same decoder generates node features for all faces and edges through learnable query vectors and cross-attention mechanisms. Finally, a two-stage training strategy ensures effective coupling of geometry and topology throughout. Comprehensive experiments demonstrate that BrepVGAE significantly outperforms existing methods in reconstruction accuracy, topological validity, and generative diversity. This validates the feasibility and efficacy of decoding complete CAD geometric-topological distributions from a unified latent representation, while also offering novel insights for CAD part retrieval and feature recognition domains.
Paperid: 3431,   Poster  
Authors: Dahu Shi, Chengshen He, Shaochen Zhang, Bo Qian, Xiaochen Quan, Wencong Zhang, Xing Wei
Title: Omni-AD: A Large-scale and Versatile Benchmark for Industrial Anomaly Detection
Abstract: Industrial Anomaly Detection (IAD) has attracted significant attention and witnessed rapid development. However, the advancement in this field is hindered by two key issues: the performance saturation of existing benchmarks, limiting discriminative evaluation of different IAD methods, and the absence of benchmarks tailored to assess recent multimodal large language models (MLLMs) in anomaly detection. To this end, we present Omni-AD, a comprehensive IAD benchmark featuring: (i) Large scale: The dataset consists of approximately 35K images (6× larger than MVTec) with 150 product categories (10× larger than MVTec) spanning 16 industrial sectors, delivering unprecedented diversity in terms of both category and image scale compared with existing datasets. (ii) Versatility: The benchmark supports both conventional unsupervised and emerging MLLM-based IAD evaluation protocols. The latter is achieved by defining three subtasks of progressive difficulty, with two structured as visual question answering (VQA) and one as visual grounding. (iii) Challenge: Extensive experimental results of state-of-the-art methods reveal that the Omni-AD benchmark is more challenging than existing benchmarks, which can drive the future development of the IAD field.
Paperid: 3432,   Poster  
Authors: Tong Xu, Hailong Shi, Xingyu Gao
Title: SCoRe: Salience-Coverage Reduction for Vision Token Pruning in Vision-Language Models
Abstract: The heavy computational burden of Large Vision-Language Models (LVLMs) stems primarily from the lengthy visual token sequences generated by their vision encoders. To mitigate this, recent work has shifted towards pruning tokens within the vision encoder. However, we observe that these methods predominantly rely on a suboptimal decoupled heuristic. This heuristic is conceptually flawed: it is prone to sampling collapse, fails to fundamentally eliminate token redundancy, and tends to systematically discard secondary yet important semantic clusters. To address this limitation, this paper formalizes visual token pruning as a unified Representativeness Optimization problem. We introduce SCoRe (Salience-Coverage Reduction), a unified optimization method theoretically grounded in the Weighted k-Center Problem. SCoRe constructs the final token set greedily: at each iteration, it chooses the token that maximizes the current set's unified representativeness score, thereby optimizing global representativeness. Extensive experiments demonstrate that SCoRe achieves State-of-the-Art (SOTA) performance across multiple benchmarks. Notably, with negligible computational overhead, our method reduces tokens by 94.4% while retaining 95% of the full performance.
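Illustrative note (not from the paper): the abstract does not define the unified representativeness score, so the sketch below substitutes a common greedy heuristic for weighted k-center style selection. Each step picks the token whose salience times distance to the nearest already-selected token is largest. The function name, the Euclidean distance, and the product form of the score are all assumptions made for illustration.

```python
import numpy as np

def greedy_salience_coverage(features, salience, k):
    """Greedily pick k token indices balancing salience and coverage.

    Hypothetical scoring rule (an assumption, not SCoRe's actual score):
      score_i = salience_i * distance(token_i, nearest selected token)
    features: (N, D) token embeddings, salience: (N,) non-negative weights.
    """
    features = np.asarray(features, dtype=float)
    salience = np.asarray(salience, dtype=float)
    n = features.shape[0]
    k = min(k, n)
    if k <= 0:
        return []

    # Seed with the single most salient token.
    selected = [int(np.argmax(salience))]
    # Distance from every token to its nearest selected token so far.
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)

    while len(selected) < k:
        score = salience * min_dist
        score[selected] = -np.inf  # never re-pick a selected token
        nxt = int(np.argmax(score))
        selected.append(nxt)
        # Update nearest-selected distances with the new token.
        new_dist = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected
```

With two tight clusters of tokens, a plain top-k by salience can take both picks from the more salient cluster, whereas this rule spends the second pick on the other cluster, which is the kind of "secondary yet important" coverage the abstract refers to.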
Paperid: 3433,   Poster  
Authors: Xinwang Chen, Xiuxing Li, Qing Li, Ziyue Zhuang, Yutong Wu, Ziyu Li, Zhuo Wang, Kai Li, Jianye Hao, Xia Wu
Title: Human-like Abstract Visual Reasoning via Understanding and Solving Reasoning Loop
Abstract: Abstract visual reasoning benchmarks such as ARC-AGI evaluate the ability to infer generalizable transformation rules from a few graphical demonstrations, a capability where current deep learning models severely underperform. Mainstream large language models achieve only 15.8% (DeepSeek-R1) and 34.5% (o3-mini-high) test accuracy. The core reason lies in their static processing of task examples: unlike humans, who iteratively refine their understanding of examples while constructing solutions, these models lack mechanisms for dynamically aligning understanding and solving. We address this gap with the Understanding and Solving Reasoning Loop (USRL) framework. The architecture comprises two explicitly interacting modules: an Understanding Module (UM) that encodes and refines rule representations of examples, and a Solving Module (SM) that generates a draft solution informed by these evolving contexts. Through recurrent interaction, the model continuously aligns its draft solution with its understanding of the task examples. Furthermore, we introduce an adaptive reasoning halting mechanism that autonomously terminates the reasoning loop based on the consistency between the generated draft solution and the rule representations. With only 7M parameters, our model achieves 47.2% accuracy on ARC-AGI-1, significantly outperforming both DeepSeek-R1 and o3-mini-high. This reveals that neurocognitive principles offer an effective pathway for abstract reasoning, with implications extending to compositional generalization and structured problem-solving.
Paperid: 3434,   Poster  
Authors: Hugo Blanc, Jean-Emmanuel Deschaud, Alexis Paljic
Title: Hermite Radial Basis Function for Surface Reconstruction via Differentiable Rendering
Abstract: Recent advances in novel view synthesis have enabled differentiable rendering methods to reconstruct 3D scenes directly from images. Algorithms such as 3D Gaussian Splatting and RayGauss use local basis functions to represent radiance fields, enabling fast, high-quality rendering of real-world scenes. However, these methods lack an exact geometric representation of the scene. In this work, inspired by Hermite Radial Basis Function (HRBF) implicits, we introduce a global implicit function constructed from local RBFs and their derivatives to represent surfaces. The proposed formulation enables learning scene geometry through differentiable rendering of an implicit function. By leveraging local basis functions, it achieves both an efficient geometric representation and fast rendering, using a bounding volume hierarchy (BVH) to accelerate intersections with the local basis functions. The implementation of our approach will be made publicly available upon the paper’s publication.
Paperid: 3435,   Poster  
Authors: Zhanheng Nie, Chenghan Fu, Daoze Zhang, Junxian Wu, Wanxian Guan, Pengjie Wang, Jian Xu, Bo Zheng
Title: DART: Dynamic ModAlity-balanced Multimodal RepresenTation Learning for E-commerce Product Understanding
Abstract: The rapid growth of e-commerce calls for multimodal models that comprehend rich visual and textual product information. Although recent multimodal large language models (MLLMs) for product understanding exhibit strong capability in representation learning for e-commerce, they still face three challenges: (i) the modality imbalance induced by modality mixed training; (ii) underutilization of the intrinsic alignment relationships among visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose DART, a Dynamic modAlity-balanced multimodal RepresenTation learning framework for e-commerce product understanding. DART comprises: (1) a Modality-driven Mixture-of-Experts (MoE) module that adaptively processes input samples by their modality composition, enabling Multimodal Joint Learning to mitigate the modality imbalance; (2) a Dual-level Alignment method to better leverage semantic alignment properties inside individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further introduce CoIN, a Co-augmented multImodal represeNtation benchmark for e-commerce representation learning and evaluation. Experiments show that DART delivers state-of-the-art zero-shot performance on CoIN and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of improved multimodal alignment of DART.
Paperid: 3436,   Poster  
Authors: Congpei Qiu, Zhaoyu Hu, Wei Ke, Zhuotao Tian, Yanhao Wu, Tong Zhang
Title: UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
Abstract: Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models like DINOv2 (49.1%), while zero-shot segmentation accuracy improves by up to 22%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.
Paperid: 3437,   Poster  
Authors: Mingfang Zhang, Jingjing Pan, Ashutosh Kumar, Rajat Saini, Mustafa Erdogan, Hsuan-Kung Yang, Caixin Kang, Yifei Huang, Yoichi Sato, Quan Kong
Title: CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
Abstract: Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence needed to rigorously evaluate this capability. To address this gap, we introduce CaST-Bench, a benchmark for Causal Chain-Grounded Spatio-Temporal Video Reasoning. CaST-Bench presents complex causal questions that require models to identify and localize a chain of multiple pieces of spatio-temporal evidence. Through a human-AI collaborative pipeline, we construct a high-quality dataset of 2,066 questions over 1,015 videos, with causal chains annotated by temporal segments and bounding-box tracks. Furthermore, we design a comprehensive evaluation suite with novel metrics that assess not only answer correctness but also the capability for reasoning grounded in visual evidence. This grounding is crucial for improving accuracy by mitigating spurious correlations and for enhancing user trust by making models more transparent. Our experiments show that current VLMs struggle with causal questions, largely due to their limited ability to construct precise and grounded causal chains. This highlights an important direction for improving future VLMs.
Paperid: 3438,   Poster  
Authors: Kota Shimomura, Hidehisa Arai, Tsubasa Takahashi, Takayoshi Yamashita, Hironobu Fujiyoshi
Title: P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction
Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit representation enabling fast, high-fidelity rendering, making it a promising foundation for closed-loop simulators and perception models in autonomous driving. However, conventional 3DGS implicitly assumes consistent exposure and tone mapping across views. Real driving data violates this assumption due to heterogeneous camera pipelines and dynamic outdoor illumination, baking exposure discrepancies and sensor noise into the radiance field and producing artifacts and inconsistent illumination, especially in static backgrounds crucial for realistic simulation. These issues are amplified in autonomous driving, where sparse viewpoints, varying exposures, and outdoor lighting interact, while prior work mainly targets dynamic-object reconstruction and overlooks cross-view photometric consistency. To address this limitation, we introduce P2GS, a physically consistent Gaussian Splatting framework that jointly decomposes a view-invariant linear HDR radiance field, per-view exposure scales, and tone-mapping functions from only LDR images without HDR supervision. P2GS employs a unified optimization strategy grounded in the physical image-formation process, enforcing relative-exposure consistency and HDR-domain radiance regularization. This yields a radiance field robust to inter-camera illumination differences while preserving the real-time efficiency of standard 3DGS. Experiments across real and simulated driving environments show that P2GS matches or surpasses prior methods in LDR reconstruction while providing substantially improved photometric consistency, reliable exposure normalization, and physically coherent illumination across diverse scenes.
Paperid: 3439,   Poster  
Authors: Hengqi Liu, Wanting Zhou, Longteng Kong, Fangxiang Feng, Lei Ren, Chen Wei, Xiaojie Wang
Title: CoV-Align: Efficient Fine-grained Cross-Modal Alignment with Cohesive Visual Semantics Priority
Abstract: Cross-modal alignment aims to learn semantically consistent latent representations across diverse modalities. Prevailing methods rely on a text-guided aggregation paradigm to achieve fine-grained alignment, but they suffer from redundant patch-word correlations and high computational costs. To address these issues, we propose CoV-Align, an effective and efficient fine-grained cross-modal alignment framework with cohesive visual semantics priority. Through a semantically convergent attention mechanism, it progressively aggregates meaningful visual patches in a text-free manner. We design a coarse visual semantic feature extractor that integrates deformable attention and consist assign attention to group patches with semantic consistency. A cohesive and discriminative feature optimization is presented to enhance intra-semantic cohesion and inter-semantic discriminability of visual region features, resulting in explicit improvements in cross-modal alignment. Extensive experiments demonstrate that CoV-Align achieves state-of-the-art performance on the Flickr30K and MS-COCO benchmarks. Notably, it delivers a 3–5× computational speedup compared to prior approaches, offering compelling advantages for large-scale multi-modal tasks.
Paperid: 3440,   Poster  
Authors: Mengxin Zhang, Yulin Wang, Chen Luo, Yongzhe Li, Yijun Zhou
Title: KASALv2: Fully Automatic 3D Rotational Symmetry Classification and Axis Localization
Abstract: Rotational symmetry is an important prior in 6D pose estimation, improving pose accuracy and ensuring the consistency of symmetry-aware evaluation metrics. However, current symmetry annotations for 3D objects are still largely manual or semi-automatic, often requiring predefined symmetry types or rotational orders and thus limiting scalability. This work introduces a fully automatic and reference-free framework that performs symmetry-type classification, rotational-order identification, and full-axis localization across all eight canonical 3D rotational symmetry types. The method localizes a dominant high-order axis, infers its rotational order through self-consistency analysis, and reconstructs the complete symmetry structure under a hierarchy-guided geometric formulation. A texture-aware extension further models appearance-induced reductions in rotational order while preserving axis orientations. Extensive experiments on idealized and real-world datasets demonstrate strong accuracy and generalization, achieving 94.75% accuracy on 438 symmetric objects in GSO. Training FoundationPose with these priors improves accuracy by up to 1.0% across five BOP datasets, indicating that automatically estimated rotational priors can provide quantitative gains in downstream 6D pose estimation.
Paperid: 3441,   Poster  
Authors: Weidong Tang, Zhiyuan Liang, Xinyan Wan, Chen Zhu, Zhaopan Xu, Pengfei Zhou, Yan Song, Yang You, Wangbo Zhao
Title: Efficient Video Object Segmentation and Tracking with Recurrent Dynamic Submodel
Abstract: The large vision foundation model, SAM2, has achieved remarkable performance in video object segmentation and tracking (VOST). However, its effectiveness is hindered by significant computational overhead. While model pruning is a widely used strategy to address this issue, traditional static and input-agnostic pruning approaches fall short in managing the diverse and complex nature of video data effectively. A promising alternative is dynamic networks, yet they often struggle to translate theoretical computational reductions into actual acceleration. Furthermore, both static and dynamic approaches typically focus on visual features of individual frames while neglecting the temporal correlations between them, limiting their performance in handling complex video streams. To address these challenges, we propose Recurrent Dynamic Submodel (RDS), a dynamic architecture that adaptively selects submodel blocks for each frame. Specifically, it has a lightweight Prediction-aware Router (PAR), which leverages both the segmentation mask from the previous frame and the visual features of the current frame to make routing decisions, enabling the submodel to explicitly capture the temporal nature of video data. Additionally, to reduce the cost of adapting the dynamic submodel, we introduce an Importance-aware LoRA (I-LoRA), tuning parameters only in the most critical blocks. Extensive experiments on various benchmarks demonstrate the effectiveness of our approach. For example, it achieves a 1.3× speedup on the DAVIS 2017 dataset with less than 1% performance degradation, while introducing only 3% (6.7M) trainable parameters and requiring only 0.003% (6.7k) of the SAM2 training data.
Paperid: 3442,   Poster  
Authors: Mingyun Jeong, Seongro Yoon, Francois Bremond, Donghyeon Cho
Title: 3D Gaussian Splatting at Arbitrary Resolution with Compact Proxy Anchors
Abstract: Despite achieving high-quality rendering, 3D Gaussian Splatting suffers from aliasing when the rendering resolution changes, as it is typically trained at a fixed resolution. To address this limitation, we introduce a method that enables the model to generate resolution-adaptive 3D Gaussians under arbitrary resolution changes. In particular, built upon Scaffold-GS, we enhance the anchor feature representation by incorporating a resolution-embedding to encode continuous resolution information. From these enhanced anchor features, a pixel coverage gate dynamically forms resolution-adaptive 3D Gaussians. Furthermore, we drastically reduce storage requirements by selecting a compact subset of proxy anchors and designing a residual anchor predictor that reconstructs the unselected leaf anchors based on the proxy anchors, enabling faithful scene representation without compromising visual fidelity. As a result, our method provides continuous and alias-free rendering across resolutions while maintaining practical scalability and memory efficiency. Extensive experiments across diverse resolution ranges demonstrate that our approach achieves an optimal balance between fidelity and memory, enabling practical arbitrary-resolution view synthesis even in resource-constrained settings.
Paperid: 3443,   Poster  
Authors: Yin Wang, Hao Lu, Zixuan Wang, Zhen Qin, Li Kuang, Mengchu Zhou, Shuiguang Deng
Title: GaussianMatch: Semi-Supervised Regression with Pseudo-Label Filtering via Multi-View Gaussian Consistency
Abstract: Semi-Supervised Regression (SSR) is essential in domains such as sentiment analysis and healthcare, where labeled data is limited but unlabeled data is plentiful. Despite its practical importance, SSR remains underexplored due to the lack of effective pseudo-labeling strategies for continuous outputs. Unlike classification, regression lacks inherent confidence measures, making it harder to filter and trust pseudo-labels. This limitation permits low-quality pseudo-labels to propagate during training without proper validation, significantly amplifying prediction errors in semi-supervised regression frameworks. In this work, we propose GaussianMatch, a novel SSR framework enabling high-quality pseudo-label filtering, which selects reliable pseudo-labels through multi-view prediction consistency under feature-space smoothness assumptions. Our framework introduces two key innovations: 1) a Gaussian Consistency Filter (GCF) that quantifies prediction consistency across weakly augmented views through Gaussian similarity scoring, retaining pseudo-labels only when all predictions fall within a confidence interval; 2) Adaptive Gaussian Standard Deviation Smoothing (AGDS) that enhances GCF's robustness through a Bayesian-regularized curriculum that phases confidence intervals from warm-up conservative bounds to progressively tightened thresholds. The use of AGDS ensures stable and reliable pseudo-label filtering throughout training. Extensive experiments demonstrate that GaussianMatch performs strongly across varying data conditions, showing notable robustness under extreme label scarcity. For instance, it outperforms the state of the art on UTKFace with only 30 labels, reducing error by 15.36% and improving the Coefficient of Determination by 50.21%.
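Illustrative note (not from the paper): the Gaussian Consistency Filter is described only at a high level, so the sketch below shows one plausible reading. The pseudo-label for each unlabeled sample is taken as the mean prediction over its augmented views, each view is scored with a Gaussian kernel of its deviation from that mean, and the pseudo-label is kept only if every view clears a similarity threshold. The function name, the use of the mean, and the parameters sigma and min_similarity are assumptions; the adaptive schedule for sigma (AGDS) is not modeled.

```python
import numpy as np

def gaussian_consistency_filter(view_preds, sigma=1.0, min_similarity=0.6):
    """Filter regression pseudo-labels by multi-view agreement.

    view_preds: (N, V) predictions for N samples under V augmented views.
    sigma, min_similarity: hypothetical hyperparameters (assumptions).
    Returns (pseudo_labels, keep_mask), both of length N.
    """
    view_preds = np.asarray(view_preds, dtype=float)

    # Candidate pseudo-label: mean prediction across views.
    pseudo_labels = view_preds.mean(axis=1)

    # Gaussian similarity of each view's prediction to the pseudo-label.
    deviation = view_preds - pseudo_labels[:, None]
    similarity = np.exp(-(deviation ** 2) / (2.0 * sigma ** 2))

    # Keep a sample only if all of its views agree closely enough.
    keep_mask = (similarity >= min_similarity).all(axis=1)
    return pseudo_labels, keep_mask
```

Requiring every view to pass, rather than the average, mirrors the abstract's condition that all predictions fall within the confidence interval; shrinking sigma over training would tighten that interval.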
Paperid: 3444,   Poster  
Authors: Tianyang Dai, Ming Chang, Yan Chen, Yang Hu
Title: rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training
Abstract: Unsupervised remote photoplethysmography (rPPG) promises to leverage unlabeled video data, but its potential is hindered by a critical challenge: training on low-quality "in-the-wild" videos severely degrades model performance. A critical step missing here is to assess the suitability of the videos for rPPG model learning before using them for the task. Existing video quality assessment (VQA) methods are mainly designed for human perception and not directly applicable to the above purpose. In this work, we propose rPPG-VQA, a novel framework for assessing video suitability for rPPG. We integrate signal-level and scene-level analyses and design a dual-branch assessment architecture. The signal-level branch evaluates the physiological signal quality of the videos via robust signal-to-noise ratio (SNR) estimation with a multi-method consensus mechanism, while the scene-level branch uses a multimodal large language model (MLLM) to identify interferences like motion and unstable lighting. Furthermore, we propose a two-stage adaptive sampling (TAS) strategy that utilizes the quality score to curate optimal training datasets. Experiments show that by training on large-scale, "in-the-wild" videos filtered by our framework, we can develop unsupervised rPPG models that achieve a substantial improvement in accuracy on standard benchmarks.
Paperid: 3445,   Poster  
Authors: Xuanxuan Zhang, ShuHui Shi, Tianxiang Zhang, Zhetao Guo, Zixuan Huang, You Li
Title: TopoMA: Topology-Guided Multi-Agent Dense RGB 3D Reconstruction via Distributed Inference
Abstract: Multi-agent 3D reconstruction, as a key technology for large-scale VR/AR, robot swarms, and digital twins, has attracted growing attention. Recent end-to-end 3D reconstruction methods achieve strong performance in single-agent scenarios, but they are difficult to directly extend to multi-agent collaborative settings, where they often suffer from unstable tracking, excessive memory consumption, and frequent loop-closure failures, thus failing to meet real-time and large-scale deployment requirements. To address these issues, we propose TOPOMA, a real-time end-to-end 3D reconstruction framework tailored for multi-agent collaboration. TOPOMA explicitly models the spatial topological structure of the scene and tightly couples it with end-to-end representation learning, thereby jointly solving core challenges such as inter-agent spatial alignment and submap fusion. Concretely, we introduce topology skeleton modeling and optimization, decentralized loop closure, and topology-guided residual transport, and build upon them a fully distributed inference architecture in which each agent can independently store, reconstruct, and incrementally optimize its map while collaborating through lightweight topological information. Extensive experiments demonstrate that, compared with existing methods, TOPOMA achieves consistently higher trajectory accuracy, reconstruction quality, robustness, and topological consistency, showing superior adaptability and scalability.
Paperid: 3446,   Poster  
Authors: Huatian Zhang, Zhendong Mao, Lei Zhang, Yongdong Zhang
Title: Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models
Abstract: Direct Preference Optimization (DPO) has proven to be an effective solution for mitigating hallucination in Multimodal Large Language Models (MLLMs) by learning from preference pairs. One of its key challenges lies in how to transfer the sequence-level preference into fine-grained supervision on visual fidelity. To safeguard vision-related tokens that are prone to hallucination, existing methods typically allocate training emphasis according to the model's self-assessed visual sensitivity signals. However, such sensitivity, estimated by a model still under training, introduces self-referential bias: reinforcing already well-learned visual cues while neglecting hard-to-perceive but critical details, thereby limiting deeper alignment. In this work, we propose an Uncertainty-aware Exploratory Direct Preference Optimization (UE-DPO) method for MLLMs, which enables the model to uncover its cognitive deficiencies and actively explore for self-correction, guided by token-level epistemic uncertainty. Specifically, we first quantify the uncertainty from the model's failure to ground token predictions in the given image. Then, based on an uncertainty-aware exploration intensity, we encourage more learning pressure on visually deficient tokens in preferred samples, and alleviate the over-penalization of beneficial knowledge in dispreferred samples. Further, we provide a theoretical justification for our method, and extensive experiments on hallucination benchmarks demonstrate its effectiveness and robustness. All code will be released.
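Illustrative sketch (not the UE-DPO formulation): for orientation, the standard sequence-level DPO loss for one preference pair is -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]). The helper `weighted_seq_logp` is a hypothetical hook showing one place where token-level emphasis could enter; the weights used in the paper are not given in the abstract, and uniform weights recover plain DPO.

```python
import math


def dpo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probs.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def weighted_seq_logp(token_logps, weights=None):
    """Sum token log-probs with optional per-token weights (hypothetical hook).

    With weights=None every token counts equally, i.e. the usual sequence log-prob.
    """
    if weights is None:
        weights = [1.0] * len(token_logps)
    return sum(w * lp for w, lp in zip(weights, token_logps))
```

When policy and reference agree on both responses the margin is zero and the loss equals log 2; the loss falls below that as the policy prefers the chosen response more strongly than the reference does.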
Paperid: 3447,   Poster  
Authors: Rohan Choudhury, Jean Dandurand, Kai Qiu, Kshitij Madhav Bhat, Kartik Sharma, Liza Dahiya, Yizhou Zhao, Souraja Kundu, Chun-Hsien Lin, Kris Kitani, Laszlo Jeni
Title: FPSBench: A Benchmark for Video Understanding at High Frame Rates
Abstract: Modern video-language models are typically trained on videos downsampled to low frames-per-second (FPS), and the most commonly used evaluation benchmarks are designed for low-FPS input as well. To address this shortcoming, we present FPS-Bench, a large video question-answering benchmark designed to evaluate VLMs’ ability to understand video at high frame rates. We introduce a new metric, the minimum frames-per-second (minFPS), which measures the minimum frame rate required to solve a given question. While existing benchmarks require <1 minFPS, we rigorously curate more than 1000 questions from diverse video sources and manually verify minFPS for each example, leading to a benchmark that requires watching videos at 7 FPS on average to solve. Our evaluation of several state-of-the-art VLMs shows that they are severely lacking, achieving QA accuracy of 30% in the FPS-Bench multiple-choice task, while humans achieve 72% accuracy. We believe that FPS-Bench will serve as a valuable tool for improving frontier-level VLMs and will release all data and code.
Paperid: 3448,   Poster  
Authors: Xiaopei Zhu, Guanning Zeng, Zhanhao Hu, Jun Zhu, Xiaolin Hu
Title: Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern
Abstract: Visible-thermal (RGB-T) object detection is a crucial technology for applications such as autonomous driving, where multimodal fusion enhances performance in challenging conditions like low light. However, the security of RGB-T detectors, particularly in the physical world, has been largely overlooked. This paper proposes a novel approach to RGB-T physical attacks using adversarial clothing with a non-overlapping RGB-T pattern (NORP). To simulate full-view (0°–360°) RGB-T attacks, we construct 3D RGB-T models for humans and adversarial clothing. NORP is a new adversarial pattern design using distinct visible and thermal materials without overlap, avoiding the light reduction in overlapping RGB-T patterns (ORP). To optimize the NORP on adversarial clothing, we propose a spatial discrete-continuous optimization (SDCO) method. We systematically evaluated our method on RGB-T detectors with different fusion architectures, demonstrating high attack success rates both in the digital and physical worlds. Additionally, we introduce a fusion-stage ensemble method that enhances the transferability of adversarial attacks across unseen RGB-T detectors with different fusion architectures.
Paperid: 3449,   Poster  
Authors: Jiaqi Liu, Zihan Tan, Guancheng Wan, Wenke Huang, He Li, Mang Ye
Title: FedSDR: Federated Graph Learning with Structural Noise Detection and Reconstruction
Abstract: Federated Graph Learning (FGL) has emerged as a principled framework for decentralized training of Graph Neural Networks (GNNs) while preserving data privacy. In subgraph-FL scenarios, however, structural noise arising from data collection and storage can damage the GNN message-passing scheme of clients, leading to conflicts in collaboration. Existing approaches exhibit two critical limitations: 1) Globally, they fail to identify corrupted clients, causing destructive knowledge inconsistencies. 2) Locally, the global GNN performs poorly on these clients due to structural noise, limiting their ability to benefit from federated collaboration. To address these challenges, we propose FedSDR, a spectra-based FGL framework against high-structural-noise scenarios. Specifically, Structural Noise-Aware Aggregation (SNAA) introduces a noise evaluation metric to detect corrupted clients and reduce their contributions, thereby mitigating the impact of noise on the global GNN. Furthermore, Robust Local Structure Reconstruction (RLSR) leverages the knowledge from the healthy global model to repair locally corrupted graph structures. Extensive experiments demonstrate that FedSDR outperforms state-of-the-art methods across various scenarios under structural noise.
Paperid: 3450,   Poster  
Authors: Caleb Zheng, Eli Shlizerman
Title: 2nd Match: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching
Abstract: Diffusion models achieve remarkable performance across diverse generative tasks in computer vision, but their high computational cost remains a major barrier to deployment. Model pruning offers a promising way to reduce inference cost and enable lightweight diffusion models. However, pruning leads to quality degradation due to reduced capacity. A key limitation of existing pruning approaches is that pruned models are finetuned using the same objective as the dense model (denoising score matching). Since the dense model is accessible during finetuning, it warrants a more effective approach for knowledge transfer from the dense to the pruned model. Motivated by this, we propose 2ndMatch (2ndM), a general-purpose finetuning framework that introduces a 2nd-order Jacobian Matching loss inspired by Finite-Time Lyapunov Exponents. 2ndM teaches the pruned model to mimic the sensitivity of the dense teacher, i.e., how to respond to small perturbations over time, through scalable random projections. The framework is architecture-agnostic and applies to both U-Net- and Transformer-based diffusion models. Experiments on CIFAR-10, CelebA, LSUN, ImageNet, and MSCOCO demonstrate that 2ndM reduces the performance gap between pruned and dense models, substantially improving output quality.
Paperid: 3451,   Poster  
Authors: Fatemeh Nazarieh, Zhenhua Feng, Diptesh Kanojia, Josef Kittler, Muhammad Awais
Title: SyncDreamer: Controllable and Expressive Avatar Generation Beyond the Talking Head
Abstract: Generating realistic and expressive audio-driven talking avatars remains a central challenge in digital human synthesis. Existing methods often depend on intermediate representations such as pose estimations for natural body motion, which restricts flexibility and adds visual distortions. Moreover, most audio-driven approaches rely on discrete emotion classifiers or text labels to regulate facial expression, reducing complex affective dynamics to coarse categories such as happy, sad, or angry. Such categorical supervision fails to capture the continuous and fine-grained speech dynamics (rhythm, energy, intensity), resulting in limited synchronization and emotionally shallow motion. To overcome these limitations, we present SyncDreamer, a unified Diffusion Transformer framework that generates identity-preserving and emotionally expressive talking avatars from only a single image, speech audio, and text prompt. We propose a visual adapter with Attention Localization Loss to maintain identity fidelity, further incorporating an audio dynamics encoder for rhythm- and emotion-aware motion, and an RL-based Cross-Modal Prompt Enhancer grounding textual cues in visual context for fine-grained motion control. Extensive experiments on portrait and full-body benchmarks demonstrate state-of-the-art performance in realism, synchronization accuracy, and semantic controllability, establishing a scalable foundation for expressive digital avatars in interactive and creative applications.
Paperid: 3452,   Poster  
Authors: Kai Hu, Weichen Yu, Li Zhang, Alexander Robey, Andy Zou, Haoqi Hu, Chengming Xu, Matt Fredrikson
Title: Omni-Attack: Adversarial Attacks on Open-Ended VQA in Black-Box Multimodal LLMs
Abstract: Multimodal large language models (MLLMs) have achieved remarkable success across diverse applications, from autonomous driving to document understanding. As these models are deployed in safety-critical contexts, understanding their adversarial robustness becomes crucial. However, current evaluations focus primarily on simple tasks like coarse-grained classification, and employ inconsistent evaluation protocols, hindering rigorous comparison of attack methods. We introduce AdvRobustBench, a comprehensive adversarial robustness benchmark for MLLMs comprising 1,000 examples across visual question answering (VQA) and optical character recognition (OCR) tasks, drawn from widely used MLLM benchmarks (MMBench, MMStar, OCRBench-v2). We further propose Omni-Attack, a novel transfer-based black-box attack method that addresses key challenges in attacking open-ended question-answering systems. Our approach introduces (i) a target-construction pipeline that generates question-conditioned textual and visual targets to provide stronger optimization signals, and (ii) a location-aware attack strategy for OCR that enables spatially precise perturbations. Extensive experiments demonstrate that Omni-Attack achieves strong targeted attack success rates (up to 71.8% on GPT-4.1 at ε = 8/255) across both proprietary models (GPT-4.1, Claude 3.7, Gemini 2.0) and open-source MLLMs, revealing significant vulnerabilities in current multimodal systems. Our benchmark and findings establish a foundation for developing more robust MLLMs.
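Illustrative sketch (not the Omni-Attack method): a perturbation budget of 8/255 is conventionally read as an L-infinity bound on per-pixel change for images scaled to [0, 1]. The abstract does not name the norm, so that reading is an assumption here, as are the function name and the flat-list image representation. The projection step that keeps a perturbed image within such a budget can be written as:

```python
def project_linf(original, perturbed, eps=8 / 255):
    """Clip a perturbed image to the eps-ball around the original.

    Both images are flat lists of floats in [0, 1]; the result also stays in [0, 1].
    """
    out = []
    for x, x_adv in zip(original, perturbed):
        lo = max(x - eps, 0.0)  # lowest allowed value for this pixel
        hi = min(x + eps, 1.0)  # highest allowed value for this pixel
        out.append(min(max(x_adv, lo), hi))
    return out
```

Iterative attacks typically apply a projection of this kind after every update so that the final image never differs from the original by more than the budget at any pixel.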
Paperid: 3453,   Poster  
Authors: Junrong Guo, Shancheng Fang, Yadong Qu, Hongtao Xie
Title: Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. VFLM is capable of performing adaptive reflective generation, which leverages visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. This is achieved through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model's iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms code-only baselines, advanced MLLMs, and existing layout models, establishing visual feedback as critical for design-oriented MLLMs. The code will be publicly available in the future.
Paperid: 3454,   Poster  
Authors: Mohaiminul Al Nahian, Abeer Almalky, Sabbir Ahmed, Abdullah Al Arafat, Mamshad Nayeem Rizve, Adnan Rakin Rakin
Title: Unleashing Stealthy Backdoor Pandemic by Infecting a Single Diffusion Model
Abstract: The remarkable success of modern Deep Neural Networks (DNNs) can be primarily attributed to having access to compute resources and high-quality labeled data, which is often costly and challenging to acquire. Recently, text-to-image Diffusion Models (DMs) have emerged as powerful data generators to augment training datasets. Machine learning practitioners often utilize off-the-shelf third-party DMs for generating synthetic data without domain-specific expertise or adaptation. Such a practice leads to a novel and insidious threat: a diffusion model infected with a backdoor can effectively spread it into a large number of downstream models, causing a backdoor pandemic. To achieve this for the first time, we propose Eidolon, designed and optimized to stealthily transfer the backdoor injected into a single diffusion model into a virtually infinite number of downstream models without any active attacker role in the downstream training tasks. The proposed Eidolon not only makes the attack stealthier and more effective, but also enforces a strict threat model for injecting a backdoor into the downstream model compared to conventional backdoor attacks. We propose four necessary tests that a successful backdoor attack on the diffusion model should pass to cause a backdoor pandemic. Our evaluation across a wide range of benchmark datasets and model architectures shows that only our attack successfully passes these tests, causing a widespread pandemic across many downstream classifiers.
Paperid: 3455,   Poster  
Authors: Abhiroop Chatterjee, Susmita Ghosh, Ashish Ghosh, Emmett Ientilucci
Title: CASPA: Graph-Structured Concept Anchors for Modality-Agnostic Adaptation in Vision–Language Models
Abstract: Recent advances in vision–language models (VLMs) have revealed both the promise and the rigidity of large-scale pretraining. Despite their impressive zero-shot generalization, existing adaptation paradigms (whether prompt tuning, adapter injection, or fine-tuning) remain class-specific, modality-biased, and structure-agnostic. These design choices limit reasoning-level transfer across tasks. To this end, we rethink adaptation as a shared conceptual structure rather than a per-class specialization. We propose CASPA (Concept-Anchored Semantic Prompt Adapter), a dual-anchor semantic adapter that jointly learns shared text and image anchors as a bidirectional conceptual interface between modalities. Each class learns a soft association distribution over these anchors, producing compositional representations that enable parameter sharing and semantic reuse. To further align visual and textual reasoning spaces, CASPA employs Semantic Cross-Consistency Regularization (S-XCR), enforcing geometric and semantic agreement between text- and image-conditioned anchor mixtures. To the best of our knowledge, this is the first work to jointly model graph-structured semantic adaptation and cross-modal regularization for unified, reasoning-level vision–language alignment. CASPA is evaluated across four adaptation regimes: base-to-novel generalization, few-shot learning under data scarcity, cross-data transfer, and backbone-agnostic few-shot evaluation. Evaluated on eleven diverse visual recognition datasets, it matches or outperforms several state-of-the-art methods.
Paperid: 3456,   Poster  
Authors: Lixin Xue, Chengwei Zheng, Georgios Paschalidis, Chen Guo, Manuel Kaufmann, Juan Jose Zarate, Dimitrios Tzionas
Title: RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos
Abstract: Reconstructing people, objects, and their interactions in 3D is a longstanding and fundamental goal for intelligent systems. Often the input is RGB video from a moving camera, making the task ill-posed: depth is ambiguous, humans and objects occlude each other, and camera and object motion entangle to create apparent motion. Most prior work addresses humans or objects in isolation, ignoring their interplay, or assumes known 3D shapes or cameras, which is impractical for real-world applications. We develop RHINO (Reconstructing Human Interactions with Novel Objects), a novel three-step framework that recovers in 3D a human, a novel (unseen) manipulated object, and a static scene in a common world frame from a monocular RGB video. First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and apparent motion of a manipulated object from foreground pixels, and a coarse scene shape and camera motion from background pixels. Second, we estimate a human in the camera frame via an off-the-shelf method, and subtract the camera motion from apparent motion to extract the object motion; this registers the human, object, and coarse scene shapes into a common world frame. Third, we refine shapes using a compositional neural field with per-component signed-distance fields. The latter further enables differentiable contact priors that attract surfaces while penalizing interpenetration, improving the physical plausibility of the final reconstruction. For evaluation, we capture a new dataset of handheld monocular videos synchronized with a volumetric 4D capture stage, providing ground-truth shape and camera motion. RHINO outperforms state-of-the-art baselines on novel-view synthesis and 4D reconstruction. Ablations show that each stage contributes substantially. We will release our code and data to foster future research.
Paperid: 3457,   Poster  
Authors: Sungik Choi, Hankook Lee, Jaehoon Lee, Robin Kim, Stanley Jungkyu Choi, Moontae Lee
Title: A Debiased Reconstruction-based Framework for Training-Free Detection of AI-Generated Images
Abstract: As recent AI models have successfully generated high-resolution photorealistic images, it has become socially important to detect whether an image is generated by AI. Since training data for the detection task is often not available due to the diversity of generative models, training-free detection approaches have been practically considered. A common approach is to utilize the image-level reconstruction error from the latent diffusion model (LDM). However, we find this score suffers from instance-specific biases, particularly in images with simple backgrounds. To this end, we propose a novel image-level debiasing score function that cancels out background contribution by normalizing the reconstruction error on the augmented images with similar background information. To be specific, we show that rotation and low-pass filtering are effective augmentation strategies. To promote generalization to broader generative models, we newly explore latent-level reconstruction error as an additional training-free signal. However, we observe that the latent-level score also suffers from latent-specific bias. To mitigate this, we introduce a rotation-based latent-level debiasing score based on the normalization of the rotated latent. We combine the aforementioned scores into a single debiasing score, RDD, which achieves state-of-the-art training-free detection performance across diverse generative models. Furthermore, our framework can be robust to corruption of the examined images.
Paperid: 3458,   Poster  
Authors: Jebastin Nadar, Simone Foti, Tolga Birdal
Title: PoseD-Flow: Versatile and Guided Flow Matching Model of Human Pose
Abstract: Generative pose priors have recently emerged as a powerful tool for inference under occlusion or noise. Yet today’s strongest generative paradigm, flow matching, remains unused for human pose due to two fundamental barriers: the absence of a pretrained flow prior and the non-Euclidean nature of articulated poses. We overcome both by introducing PoseD-Flow, a novel framework to unify Riemannian Flow Matching (RFM) with training-free guidance for 3D human pose recovery. PoseD-Flow is composed of two contributions: (i) PoseRFM, the first RFM model of human pose, defined directly on the product manifold of joint rotations, and (ii) Riemannian D-Flow, a principled guidance mechanism that, by differentiating through its ODE sampling dynamics, conditions PoseRFM at inference without any task-specific training. Our theoretical analysis shows that the induced dynamics are shaped by data covariance and manifold curvature, yielding a bias toward realistic poses. Across pose completion, denoising, and inverse kinematics, PoseD-Flow establishes a new state of the art, particularly under noise, occlusion, and partial observations.
Paperid: 3459,   Poster  
Authors: Panagiotis Filntisis, George Retsinas, Radek Danecek, Vanessa Sklyarova, Petros Maragos, Timo Bolkart
Title: Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence
Abstract: Recent learning-based face reconstruction and registration frameworks such as ToFu and TEMPEH have shown that dense correspondence between facial scans and a common topology can be learned directly from images. However, these approaches still depend on precomputed registrations obtained through iterative optimization pipelines that often require manual verification and correction by human annotators. We introduce MOCHI (Multi-view Optimizable Correspondence of Heads from Images), a fully differentiable and registration-free alternative. Instead of relying on optimization-based registrations, we employ a pseudo-linear inverse kinematic solver in conjunction with dense 2D keypoints produced by a tracker trained only on synthetic data to directly enforce a common face topology at the vertex level. We further find that the commonly used point-to-surface distance can lead to unstable training and artifacts, and instead use pointmap- and normal-based losses that provide smoother gradients, more stable optimization, and improved reconstruction results. Additionally, we introduce at inference a brief test-time-optimization scheme which can further refine the results of the network, resulting in registrations that outperform traditional labor-intensive pipelines. Despite removing external registrations, our extensive experimental results show that MOCHI surpasses the previous state-of-the-art in reconstruction accuracy and visual fidelity. The code and the model will be made public.
Paperid: 3460,   Poster  
Authors: Haoyue Tan, Shengnan Wang, Yulin Qiao, Juncheng Zhang, Youhui Bai, Ping Gong, Zewen Jin, Cheng Li
Title: AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation
Abstract: Video diffusion transformers (DiTs) suffer from prohibitive inference latency due to quadratic attention complexity. Existing sparse attention methods either overlook semantic similarity, or fail to adapt to heterogeneous token distributions across layers, leading to model performance degradation. We propose AdaCluster, a training-free adaptive clustering framework that accelerates the generation of DiTs while preserving accuracy. AdaCluster applies an angle-similarity-preserving clustering method to query vectors for higher compression, and designs a Euclidean-similarity-preserving clustering method for keys, covering cluster number assignment, threshold-wise adaptive clustering, and efficient critical cluster selection. Experiments on CogVideoX-2B, HunyuanVideo, and Wan-2.1 on a single A40 GPU demonstrate speedups of 1.67x to 4.31x with negligible quality degradation.
Paperid: 3461,   Poster  
Authors: Xiaoqian Cheng, Dong Xiao, Husen Li, Zheng Liu, Renjie Chen
Title: Routing on Demand: DSNet for Efficient Progressive Point Cloud Denoising
Abstract: Point cloud denoising is a critical preprocessing step for enhancing the reliability and accuracy of 3D perception systems. Most existing progressive denoising methods rely on fixed iterative pipelines that process all regions uniformly, resulting in redundant computation and oversmoothing of geometric details when handling point clouds with non-uniform noise distributions. To overcome these limitations, we introduce Dynamic Skip Net (DSNet), a novel progressive denoising framework that adaptively determines the optimal denoising path for each local patch based on its noise characteristics. DSNet incorporates a noise discriminator that quantifies local noise intensity by analyzing normal similarity, and a reverse monotonic decision function that maps this measure to an appropriate denoising module. Furthermore, we propose a Path-Selective Iteration mechanism that dynamically re-evaluates the restoration state and re-plans the denoising route at each stage, enabling cross-stage skipping to minimize unnecessary computation. Extensive experiments on multiple benchmarks demonstrate that DSNet achieves state-of-the-art performance in noise suppression, geometric fidelity, and computational efficiency. Our code and models will be made publicly available on GitHub.
Paperid: 3462,   Poster  
Authors: Yitong Qin, Lihua Zhou, Jiwei Wei, Ran Ran, Shiyuan He, Zeyu Ma, Shuaifeng Li, Nianxin Li, Heng Tao Shen
Title: ViTPrompt: Training-Free Prompt Refinement with Visual Tokens for Open-Vocabulary Detection
Abstract: Test-Time Adaptive Object Detection (TTAOD) aims to maintain detection performance under distribution shifts without retraining. While recent vision-language models enable open-vocabulary detection, existing TTAOD methods, whether closed-set or open-vocabulary, focus exclusively on improving classification confidence and largely overlook the degradation of bounding box localization. To address this critical gap, we propose ViTPrompt (Visual Token-Prompting), a training-free framework that jointly refines both bounding boxes and class scores at test time. Our key insight is to augment the original text prompt with instance-aware visual tokens extracted from high-confidence detections in an initial forward pass; this enriched prompt is then used in a second inference stage, where the cross-modal decoder leverages the enhanced semantic context to produce more accurate box coordinates and classification logits. ViTPrompt requires no backpropagation, parameter updates, or external memory, making it highly efficient for real-time deployment. Experiments on multiple out-of-distribution benchmarks demonstrate that ViTPrompt achieves state-of-the-art performance, delivering consistent improvements in both localization accuracy and classification fidelity, and establishing itself as a holistic solution for open-vocabulary TTAOD.
Paperid: 3463,   Poster  
Authors: Zhilu Yang, Mingcheng Li
Title: Factorize, Reconstruct, Enhance: A Unified Framework for Multimodal Sentiment Analysis
Abstract: Multimodal Sentiment Analysis (MSA) aims to comprehensively and robustly interpret human emotions by integrating information from verbal, visual, and acoustic modalities. However, the performance of existing models is often hampered by two key challenges: insufficient multilayer semantic extraction inherent to modalities, and static feature fusion. Therefore, this paper proposes a Multifactor Factor-Decoupling and Semantics-enhanced Fusion Framework for accurate multimodal sentiment analysis. First, each modality is decomposed into three orthogonal subspaces based on a multidimensional information separation mechanism, which is regulated by a contrast constraint for subspace separation, an information gain constraint for maximizing the capture of task-relevant features, and a pairwise constraint for ensuring complementary subspaces. Subsequently, a variational purification strategy is introduced to further ensure the semantic integrity of each sentiment representation. Finally, the fusion module computes the adaptive fusion weights in parallel using multiple orthogonal factors such as sample-level modality saliency, global subspace type importance, and feature-level internal attention. Extensive experiments on three datasets demonstrate the effectiveness of the proposed method.
Paperid: 3464,   Poster  
Authors: Xiaodong Wang, Zhirong Wu, Langling Huang, Yuxi Zheng, Peixi Peng
Title: Incentivizing Versatile Video Reasoning in MLLMs via Data-Efficient Reinforcement Learning
Abstract: Multimodal Large Language Models (MLLMs) have made great progress in video understanding tasks. However, when it comes to understanding complex or lengthy videos, MLLMs tend to overlook details or produce hallucinations. To alleviate these issues, recent work has attempted to leverage reinforcement learning (RL) to boost models' deep linguistic reasoning about complex videos. These methods have two main problems. First, the RL frameworks they use suffer from unstable training and high training costs, making it difficult to train satisfactory video reasoning models. Second, a purely linguistic reasoning process cannot easily guarantee the reliability of visual information. To alleviate these problems, we propose to use multimodal elements for reasoning, and we design a novel framework to build and enhance versatile video reasoning capabilities in MLLMs. We carefully design a multi-task cold start and multi-task reinforcement learning to improve the model's visual perception and proficiency in multiple capabilities. In the inference phase, we leverage multimodal reasoning and dynamic sampling to further improve performance. We verify the efficiency of the framework on a base MLLM (Qwen2-VL-7B-Base). Through a cold start with 3k data and reinforcement learning training with 5k data, combined with the inference design, our final model significantly outperforms the base model on seven public video benchmarks, approaching or even surpassing state-of-the-art instruct models such as Qwen2.5-VL-7B-Instruct trained with large-scale data.
Paperid: 3465,   Poster  
Authors: Xuanhang Chang, Zhonghao Yang, Cheng Zhuo, YU LI
Title: MaxMark: High-Capacity Diffusion-Native Watermarking via Robust and Invertible Latent Embedding
Abstract: Diffusion-native watermarking provides a more secure and reliable way to trace images from latent diffusion models (LDMs) by embedding information directly into the generative process. However, existing methods suffer from a fundamental limitation: their embedding capacity is extremely small. We introduce MaxMark, a high-capacity watermarking framework that supports embedding rich watermark messages into generated images. MaxMark uses two components: a robust watermark embedding module that enhances the secret message and places it into reliable regions of the latent noise, and a distribution transformation module that maps the watermarked latent back to an approximate Gaussian, ensuring compatibility with the diffusion process and preserving image fidelity. The distribution transformation is implemented with an invertible neural network (INN), whose exactly reversible structure enables precise recovery and efficient training. Experiments show that MaxMark surpasses prior methods in capacity, robustness, and imperceptibility, achieving up to a 46% improvement in bit accuracy for large watermark payloads.
Paperid: 3466,   Poster  
Authors: Yalan Qin, Hanzhou Wu
Title: Learning Anchor in Dual Orthogonal Space for Fast Multi-view Clustering
Abstract: Large-scale multi-view clustering aims to explore the complementary and consistent information among different views in an efficient manner. Despite their impressive performance, existing methods perform anchor learning in only a single space, under orthogonal or other constraints derived from the multi-view data, which leads to undesired anchors. Anchors can simultaneously occur in multiple spaces, and the complementary information among these spaces can be exploited for learning anchors. Meanwhile, most existing works neglect the space whose basis is the anchored cluster centers when learning anchors. In this work, we propose Learning Anchor in Dual Orthogonal Space for Fast Multi-view Clustering (DOSFMVC). DOSFMVC conducts anchor learning in a dual orthogonal space, aiming to utilize the complementary information between the two spaces to produce high-quality anchors. DOSFMVC introduces the consensus anchored cluster centers as the basis of the extra space, together with a clustering indicator of anchors based on this basis, into anchor learning. Anchor learning and partitioning are integrated into a unified model, where the final cluster assignment can be used directly as the clustering result. Extensive experiments confirm the superiority of our method compared with some state-of-the-art methods on several benchmark datasets.
Paperid: 3467,   Poster  
Authors: Jiarui Wu, Yujin Wang, Ruikang Li, Fan Zhang, Mingde Yao, Tianfan Xue
Title: InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space
Abstract: Language-guided photo retouching aims to adjust color and tone while preserving geometry and texture. Recently, diffusion-based retouching has shown superior visual quality, but it often struggles with fidelity, due to its generative nature, and with efficiency, because of its iterative sampling process. In this work, we propose an efficient and fidelity-preserving retouching method using bilateral space manipulation, which is both compact and content-decoupled. Specifically, instead of directly editing pixels or image latents, our model predicts a low-resolution bilateral grid of affine transforms, which are sliced using a learned guidance map and then applied to the full-resolution image. This approach yields both high fidelity and improved efficiency. To retain strong priors of a pretrained generative model, we distill a multi-step diffusion model into our bilateral grid framework using Variational Score Distillation, complemented by a prompt alignment loss to guide instruction-following behavior. Additionally, we introduce a new benchmark and evaluate our method across multiple dimensions: fidelity, instruction following, and efficiency. Compared to the latest retouching methods, like Gemini-2.5-Flash (Nano-Banana), our method can avoid content drift, significantly reduce latency, and generate visually pleasing edits, while maintaining a high level of fidelity.
Paperid: 3468,   Poster  
Authors: Yixin Xiong, Ke Wang, Tongtong Cheng, Chunhui Liu, Kai Liu
Title: BEV-CAR: Enhancing Monocular Bird’s Eye View Segmentation with Context-Aware Rasterization
Abstract: Bird’s Eye View (BEV) semantic segmentation is essential for autonomous driving and mobile robotics, yet it still faces significant challenges in accurately segmenting foreground objects and efficiently estimating layout categories obscured by objects. To address these issues, we propose BEV-CAR, a Context-Aware Rasterization method that rasterizes the BEV representation without any coordinate transformations. By optimising each ray and incorporating depth features, BEV-CAR effectively addresses the challenges posed by object occlusions and varying environmental conditions. It ensures robust performance across diverse scenarios, particularly improving the accuracy of foreground object segmentation and layout estimation in occluded areas. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that BEV-CAR achieves state-of-the-art (SOTA) performance. More importantly, the rasterization technique in this paper does not introduce additional computational overhead during the inference process, making it suitable for practical deployment in real-world scenarios. Code and technical appendix are available in supplementary material.
Paperid: 3469,   Poster  
Authors: Yuanpeng Tu, Yunpeng Chen, Xinyu Zhang, Chao Liao, Hengshuang Zhao
Title: Temporal Equilibrium MeanFlow: Bridging the Scale Gap for One-Step Generation
Abstract: MeanFlow is a powerful few-step generative framework that can be trained from scratch, but its performance degrades significantly when the one-step loss uses a large portion of training data. This stems from a temporal scale imbalance: gradients from different stages of generation contribute unevenly, leading to unstable optimization that is evident in blurry samples and high FID scores. The core issue is a conflict between two opposing forces: terms that amplify variance over long time spans and strong constraints needed near the start of generation, which a fixed sampling strategy cannot reconcile. To resolve this, we propose Temporal Equilibrium MeanFlow (TEMF), which balances these competing demands through two simple yet effective components: (1) a temporal equilibrium weighting function that equalizes gradient influence across all time scales, and (2) a dynamic boundary scheduler that gradually shifts training focus from stabilizing early steps to refining the full trajectory as training progresses. Without changing the model architecture, TEMF retains true one-step generation with classifier-free guidance, achieving a state-of-the-art FID of 2.62 on ImageNet 256×256, the best result among diffusion- and flow-based one-step methods.
Paperid: 3470,   Poster  
Authors: Mincheol Kwon, MINSEUNG LEE, Seonga Choi, Miso Choi, Kyeong-Jin Oh, Hyunyoung Lee, Park Cheonyoung, Yongho Song, Seunghyun Park, Jinkyu Kim
Title: Focus, Don’t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding
Abstract: Large Vision-Language Models (LVLMs) have shown strong performance across various multimodal tasks by leveraging the reasoning capabilities of Large Language Models (LLMs). However, processing visually complex and information-rich images, such as infographics or document layouts, requires these models to generate a large number of visual tokens, leading to significant computational overhead. To address this, we propose PinPoint, a novel two-stage framework that first identifies instruction-relevant image regions and then refines them to extract fine-grained visual features for improved reasoning and efficiency. Central to our approach is the Instruction-Region Alignment, which localizes relevant regions using both visual input and textual instructions. We further introduce new annotations that provide richer ground-truth supervision for instruction-relevant regions across challenging VQA benchmarks: InfographicVQA, MultiPageDocVQA, and SinglePageDocVQA. Experimental results show that PinPoint not only achieves superior accuracy compared to existing methods but also reduces computational overhead by minimizing irrelevant visual tokens. Our code and datasets will be publicly released to facilitate future research.
Paperid: 3471,   Poster  
Authors: Zeyu Jiang, Lai Man Po, XUYUAN XU, Yexin Wang, Guoping Gong, Haoxuan Wu, Chenbo Yan, Kun Li, Yuyang Liu
Title: OrionEdit: Bridging Reference and Source Images for Generalized Cross-Image Editing
Abstract: Multimodal image synthesis has achieved remarkable progress in producing visually coherent results, yet most editing methods still rely on semantic instructions, which is less direct than using visual guidance. Recently, a new paradigm has emerged that focuses on "editing one image from another", enabling more direct and interpretable manipulation through reference exemplars. In this work, we formalize this paradigm as cross-image editing, which modifies a source image under the guidance of one or more references, encompassing subject replacement, style transfer, image completion, and other reference-to-source tasks. To address this, we introduce OrionEdit, a unified framework that regulates visual attribute transfer through two key mechanisms: (1) a symmetric orthogonal subspace update that partitions image features into branch-specific subspaces, mitigating feature entanglement and preserving subject identity; and (2) a reverse-causal attention mechanism with an information-flow mask that enforces unidirectional dependencies in the latent space. Built on standard diffusion backbones, OrionEdit enables zero-shot editing with multiple references and yields consistent gains over open-source baselines, rivaling proprietary models in fidelity and disentanglement.
Paperid: 3472,   Poster  
Authors: Kaiyang Lan, Ying Cui, Chenchen Jing, Jianwei Zheng, Dongyan Guo
Title: Beyond explicit language: plug-and-play visual-to-Linguistic modeling towards general object tracking
Abstract: Natural language provides valuable auxiliary information for enhancing visual object tracking. While existing vision-language tracking methods explicitly leverage linguistic descriptions to aid tracking, they suffer from two critical limitations: they cannot dynamically adapt descriptions to the moving target and changing context, and their strong dependency on language input may cause failure when text is unavailable. To address these issues, we design a simple yet effective plug-and-play module that leverages linguistic assistance implicitly, without requiring explicit language input. The proposed textual inversion module converts visual features from template and search regions into text tokens in the CLIP text embedding space. It effectively inverts visual representations into linguistic forms, integrating contextual information from both the template and search regions. The linguistic cues are then injected into the visual feature space via a multi-layer semantic injection mechanism. The design enhances the completeness of cross-modal feature representations and the accuracy of inter-modal semantic alignment, thus enabling dynamically updated linguistic information guidance for general object tracking. Extensive experiments demonstrate the effectiveness of our proposed method. We integrate the proposed module into several advanced trackers, including MCITrack, DUTrack, and SeqTrack, and evaluate on both visual and vision-language tracking datasets. By training only the newly introduced module and the corresponding decoder, the proposed approach achieves significant performance gains with minimal computational overhead. Code will be made publicly available.
Paperid: 3473,   Poster  
Authors: Zheng-Hui Huang, Zhixiang Wang, Yu-Lun Liu, Yung-Yu Chuang
Title: Reflection Separation from a Single Image via Joint Latent Diffusion
Abstract: Single-image reflection separation remains challenging due to its ill-posed nature, especially under extreme conditions with strong or subtle reflections. Existing methods often struggle to recover both layers in glare or weak-reflection scenarios, because of insufficient information. This paper presents the first diffusion model explicitly fine-tuned for this task, leveraging generative diffusion priors for robust separation. Our method simultaneously generates transmission and reflection layers through a unified diffusion model, incorporating a novel cross-layer self-attention mechanism for better feature disentanglement. We further introduce a disjoint sampling strategy to iteratively reduce interference between the layers during diffusion and a latent optimization step with a learned composition function for improved results in complex real-world scenarios. Extensive experiments show our approach achieves superior separation performance on multiple real-world benchmarks and surpasses state-of-the-art methods in both quantitative metrics and perceptual quality.
Paperid: 3474,   Poster  
Authors: Chentao Song, He Zhang, Yuan Haolei, Haozhe Lin, Jianhua Tao, Hongwen Zhang, Tao Yu
Title: MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images
Abstract: We introduce MetricHMSR (Metric Human Mesh and Scene Recovery), a novel approach for metric human mesh and scene recovery from monocular images. Due to unrealistic assumptions in the camera model and inherent challenges in metric perception, existing approaches struggle to achieve human pose and metric 3D position estimation through a unified module. To address this limitation, MetricHMSR incorporates camera rays to comprehensively encode both the bounding box information and the intrinsic parameters of perspective projection. We then propose a Human Mixture-of-Experts (MoE), in which the model dynamically routes image features and ray features to task-specific experts for specialized understanding of different data aspects, enabling a unified framework that simultaneously perceives the local pose and the global 3D position. Based on these results, we further refine the existing monocular metric depth estimation method to achieve more accurate results, ultimately enabling the seamless overlay of humans and scenes in 3D space. Comprehensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on both human mesh and scene recovery.
Paperid: 3475,   Poster  
Authors: Yixin Fan, He Zhao, Yuxin Hou, Changhua Zhou, Zihao Liu, Peng Wang, Lu ChengLong, Xu Zhang, Wei Wang
Title: DiGraphHal-Bench: Evaluating Multimodal Large Language Models on Complex Directed Graphs
Abstract: While prior research on Multimodal Large Language Model (MLLM) hallucinations has primarily examined cross-modal inconsistencies in natural images, hallucination over complex graph structures remains underexplored. Concurrently, there is a lack of robust evaluation for fine-grained reasoning integrating structural, visual, and semantic information. To address these gaps, we present DiGraphHal-Bench, the first large-scale Visual Question Answering (VQA) benchmark for evaluating both hallucination phenomena and fine-grained reasoning of MLLMs on real-world directed graphs. DiGraphHal-Bench comprises high-quality procedural graphs from over six distinct domains and is organized around a taxonomy of four high-level capabilities and twelve fine-grained tasks. To ensure benchmark fidelity, we propose a novel two-stage automatic data curation pipeline that reconciles the trade-off between data scale and quality, thereby guaranteeing reliable evaluation. Experiments reveal that state-of-the-art MLLMs hallucinate notably in fine-grained graph reasoning. Although SFT substantially mitigates these hallucinations and strengthens complex reasoning, performance remains far from optimal. Ablation studies highlight the importance of fundamental capabilities for integrative reasoning, and our benchmark provides a foundation for advancing robust multi-modal graph understanding.
Paperid: 3476,   Poster  
Authors: Xiaofan Que, Dingrong Wang, Xumin Liu, Qi Yu
Title: InsCal: Calibrated Multi-Source Fully Test-Time Prompt Tuning for Object Detection
Abstract: Test-time prompt tuning (TPT) has emerged as a powerful technique for adapting pre-trained vision-language models (VLMs) to diverse downstream tasks, including image classification and visual reasoning. With the rise of text-driven object detectors, we extend TPT to object detection, unlocking new capabilities for cross-domain adaptation. However, a critical challenge in TPT is the inherent miscalibration caused by entropy minimization: domain shifts often lead to incorrect predictions, and enforcing high confidence exacerbates miscalibration, ultimately degrading performance. To tackle this, we introduce InsCal, a novel framework designed to enhance cross-domain object detection through three key innovations: (1) extending TPT to a multi-source paradigm, enabling knowledge aggregation across diverse domains; (2) reducing domain gaps via a novel text-driven style transfer strategy that aligns features to the source domain without requiring reference images; and (3) refining the entropy minimization objective with instance-specific calibration, ensuring robust and well-calibrated adaptation. Our approach not only mitigates miscalibration but also significantly improves cross-domain object detection performance, setting a new benchmark for test-time adaptation in VLMs.
Paperid: 3477,   Poster  
Authors: Rafi Ibn Sultan, Hui Zhu, Xiangyu Zhou, Chengyin Li, Prashant Khanduri, Marco Brocanelli, Dongxiao Zhu
Title: WalkGPT: Grounded Vision–Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation
Abstract: Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision–Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. We introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the final image tokens by aggregating them along text tokens across spatial hierarchies, and a Calibrated Text Projector (CTP), guided by a Region Alignment Loss, that maps language embeddings into segmentation-aware representations. These components enable fine-grained grounding and depth inference without user-provided cues or anchor points, allowing the model to generate complete and realistic navigation guidance. We also introduce PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance. The source code and dataset are included in the supplementary materials.
Paperid: 3478,   Poster  
Authors: Timo Lüddecke, Jan Frederik Meier, Jan van Delden, Alexander Ecker
Title: LiDeRe: A Lightweight Readout for Fast and Data-Efficient Dense Prediction
Abstract: Parameter-efficient fine-tuning (PEFT) methods have recently gained popularity for applying deep neural networks on small datasets as they reduce overfitting, simplify deployment, and enable fast training. We demonstrate that for dense image prediction tasks, a well-designed and lightweight dense readout on top of a frozen large backbone can surpass state-of-the-art PEFT methods in both efficiency and accuracy. Our parameter-efficient readout module combines interpolation and attention for fine-grained dense prediction. It integrates seamlessly with a wide range of pretrained vision backbones such as DINOv3. We achieve competitive or superior performance in semantic segmentation, object detection, pose estimation and semantic contour prediction, offering an efficient alternative to current PEFT techniques. Code: https://to.be.released
Paperid: 3479,   Poster  
Authors: Jiale Huang, Shangfei Wang
Title: SAQN: Semantic-based Adaptive Query Network for 3D Referring Expression Segmentation
Abstract: 3D Referring Expression Segmentation (3D-RES) aims to segment objects in point clouds according to language descriptions. Unlike common practices in 2D that utilize learnable query embeddings, recent 3D-RES methods typically generate queries directly from 3D points. However, this direct coupling of queries to raw point clouds introduces new challenges: an impractically large number of queries derived from massive point cloud data and a reliance on non-deterministic sampling algorithms. In this paper, we propose a Semantic-based Adaptive Query Network (SAQN), which introduces a novel query strategy for 3D-RES. Instead of generating queries from points, SAQN employs a learnable query vector for each semantic class. This approach drastically reduces the number of queries while maintaining the advantage of avoiding Hungarian matching through implicit class alignment. Additionally, to address potential cross-object ambiguity within semantic classes, we introduce supplementary queries that are adaptively fused with each class query to disambiguate and enrich representations. Comprehensive experiments show that SAQN achieves state-of-the-art performance while reducing the number of queries.
Paperid: 3480,   Poster  
Authors: Umar Marikkar, Syed Sameed Husain, Muhammad Awais, Sara Atito
Title: Building Robust Vision Encoders for Cross-Dataset Evaluation in Immunofluorescent Microscopy
Abstract: Immunofluorescence (IF) images reveal detailed information about structures and functions at the subcellular level. However, unlike RGB images, IF datasets pose challenges for deep learning models due to their inconsistencies in channel count and configuration, stemming from varying staining protocols across laboratories and studies. Although existing approaches build channel-adaptive models for training, they do not perform evaluations across IF datasets with unseen channel configurations. To address this, we first introduce a biologically informed view of cellular image channels by grouping them into either context or concept, where we treat the context channels as a reference for the concept channels in the image. We leverage this view to propose Channel Conditioned Cell Representations (C3R), a framework that learns representations that transfer well to both in-distribution (ID) and out-of-distribution (OOD) datasets, which contain the same and different channel configurations, respectively. C3R is a two-fold framework comprising a channel-adaptive encoder architecture and a masked knowledge distillation training strategy, both built around the context-concept principle. We find that C3R outperforms existing benchmarks on both ID and OOD tasks, while yielding state-of-the-art results on frozen encoder evaluation on the CHAMMI benchmark. Our method opens a new pathway for cross-dataset generalization between IF datasets, with no need for retraining on unseen channel configurations.
Paperid: 3481,   Poster  
Authors: Yichen Xie, Chensheng Peng, Mazen Abdelfattah, Yihan Hu, Jiezhi Yang, Eric Higgins, Ryan Brigden, Masayoshi Tomizuka, Wei Zhan
Title: RAYNOVA: Geometry-Free Auto-Regressive 4D World Modeling with Unified Spatio-Temporal Representation
Abstract: World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RayNova, a geometry-free world model that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RayNova constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RayNova achieves state-of-the-art multi-view video generation results on nuScenes, while offering higher throughput and strong controllability under diverse input conditions, generalizing to novel views and camera configurations without explicit 3D scene representation.
Paperid: 3482,   Poster  
Authors: Zhuojiang Cai, Zhenghui Sun, Feng Lu
Title: GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global–Local Feature Fusion
Abstract: We present GazeOnce360, a novel end-to-end model for multi-person gaze estimation from a single tabletop-mounted upward-facing fisheye camera. Unlike conventional approaches that rely on forward-facing cameras in constrained viewpoints, we address the underexplored setting of estimating the 3D gaze direction of multiple people distributed across a 360° scene from an upward fisheye perspective. To support research in this setting, we introduce MPSGaze360, a large-scale synthetic dataset rendered using Unreal Engine, featuring diverse multi-person configurations with accurate 3D gaze and eye landmark annotations. Our model tackles the severe distortion and perspective variation inherent in fisheye imagery by incorporating rotational convolutions and eye landmark supervision. To better capture fine-grained eye features crucial for gaze estimation, we propose a dual-resolution architecture that fuses global low-resolution context with high-resolution local eye regions. Experimental results demonstrate the effectiveness of each component in our model. This work highlights the feasibility and potential of fisheye-based 360° gaze estimation in practical multi-person scenarios.
Paperid: 3483,   Poster  
Authors: YINGKAI YANG, Chaoqi Chen, Hui Huang
Title: Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation
Abstract: Test-Time Adaptation (TTA) aims to mitigate distributional shifts between training and test domains during inference time. However, existing TTA methods fall short in the realistic scenario where models face both continually changing domains and the simultaneous emergence of unknown semantic classes, a challenging setting we term Open-set Continual Test-Time Adaptation (OCTTA). The coupling of domain and semantic shifts often collapses the feature space, severely degrading both classification and out-of-distribution detection. To tackle this, we propose DOmain COmpensation (DOCO), a lightweight and effective framework that robustly performs domain adaptation and OOD detection in a synergistic, closed loop. DOCO first performs dynamic, adaptation-conditioned sample splitting to separate likely ID from OOD samples. Then, using only the ID samples, it learns a domain compensation prompt by aligning feature statistics with the source domain, guided by a structural preservation regularizer that prevents semantic distortion. This learned prompt is then propagated to the OOD samples within the same batch, effectively isolating their semantic novelty for more reliable detection. Extensive experiments on multiple challenging benchmarks demonstrate that DOCO outperforms prior CTTA and OSTTA methods, establishing a new state-of-the-art for the demanding OCTTA setting.
Paperid: 3484,   Poster  
Authors: Zhirong Shen, Rui Huang, Jiacheng Liu, Chang Zou, Peiliang Cai, Shikang Zheng, zhengyi shi, Liang Feng, Linfeng Zhang
Title: Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models
Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art image and video generation performance, but sampling remains expensive due to repeated transformer forward passes over many timesteps. Feature caching offers a training-free way to accelerate inference by reusing or forecasting hidden representations, yet recent forecasting-based methods derive their coefficients from hand-crafted formulas (e.g., Taylor expansion), which ultimately reduce to fixed linear combinations of a few historical features. Such fixed coefficients are suboptimal and fragile under aggressive skipping. In this paper, we first show that existing forecasting-based caching methods can be unified in a common linear form, and then analyze DiT feature trajectories, finding that for most denoising steps the current feature can be reconstructed from past features with projection fidelity above 0.95, indicating that accurate linear prediction is feasible. Motivated by this, we propose L2P (Learnable Linear Predictor), a simple data-driven caching framework that replaces hand-designed coefficients with learnable per-timestep weights trained on a small set of cached trajectories using a mean-squared error loss, converging in about 20 seconds on a single GPU. Extensive experiments on state-of-the-art DiTs demonstrate that L2P consistently outperforms existing caching baselines: on FLUX.1-dev, L2P achieves a 4.55× FLOPs reduction and 4.15× latency speedup with a PSNR of 31.459, and on Qwen-Image and Qwen-Image-Lightning, it maintains high visual fidelity even under up to 7.18× acceleration, where prior methods suffer from noticeable quality degradation. These results show that learning linear predictors is a practical and effective alternative to designing increasingly complex forecasting formulas for efficient diffusion model inference.
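Editorial illustration (not from the paper): the idea of replacing fixed forecasting coefficients with weights fitted to data under a mean-squared error objective can be sketched in a few lines. This is a minimal sketch under stated assumptions: it uses a closed-form least-squares solve in place of whatever training procedure the authors use, and the function names, the `(k, d)` feature layout, and the toy data are all hypothetical.

```python
import numpy as np


def fit_linear_predictor(history, target):
    """Return weights w (shape (k,)) minimizing ||w @ history - target||^2.

    history: (k, d) array of k past feature vectors (hypothetical layout).
    target:  (d,) array, the feature to be forecast.
    """
    # Least squares on the transposed system: history.T @ w ~= target.
    w, *_ = np.linalg.lstsq(history.T, target, rcond=None)
    return w


def predict(history, w):
    """Forecast a feature as a linear combination of past features."""
    return w @ history


# Toy data (assumption): the target is an exact linear mix of three past features.
rng = np.random.default_rng(0)
history = rng.normal(size=(3, 64))
true_w = np.array([0.5, -1.0, 1.5])
target = true_w @ history

w = fit_linear_predictor(history, target)
```

On this toy data the fitted weights recover the mixing coefficients, which a fixed formula such as a truncated Taylor expansion would only match by coincidence.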
Paperid: 3485,   Poster  
Authors: Yang Zepeng, Junxuan Bai, Hao Li, Ju Dai, Junjun Pan, Yongfeng Yin, Bin Li
Title: VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
Abstract: The rapid advances in deep learning have significantly enhanced the accuracy of multimodal 3D human pose estimation (HPE). However, the state-of-the-art (SOTA) HPE pipelines still rely on Transformers, whose quadratic complexity makes real-time processing for long sequences impractical. Mamba addresses this issue through selective state-space modeling, enabling efficient sequence processing without sacrificing representational power. Nevertheless, it struggles to capture complex spatial dependencies in multimodal settings. To bridge this gap, we propose VIMCAN, a hybrid architecture that combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention, and performs robust visual–inertial fusion and human pose estimation between RGB keypoints and wearable IMU data. By leveraging Mamba’s dynamic parameterization for temporal modeling and Attention for spatial dependency extraction, VIMCAN achieves superior accuracy, with mean per-joint position errors (MPJPE) of 17.2 mm on TotalCapture and 45.3 mm on 3DPW. VIMCAN outperforms prior Transformer-based and other SOTA approaches while supporting real-time inference at over 60 frames per second on consumer-grade hardware. The source code will be available on GitHub.
Paperid: 3486,   Poster  
Authors: Qiang Qi, Wenqi Shang, Meifang Wang, Xiao Wang
Title: D2FANet: Enhancing Video Object Detection with Dual-Domain Feature Aggregation Network
Abstract: Accurately capturing and aggregating spatiotemporal information has become crucial for video object detection. Previous methods mainly perform feature aggregation in the spatiotemporal domain, treating all regions indiscriminately and overlooking both their relative importance and the frequency characteristics that capture periodic motion patterns. This limits the capability of these methods to capture dynamic interactions and adapt to complex scene variations. In this paper, we propose a novel Dual-Domain Feature Aggregation Network (D2FANet) for video object detection, which, to the best of our knowledge, is the first work to introduce frequency-domain feature aggregation into the video object detection task. By collaboratively modeling spatiotemporal and frequency information, our D2FANet enhances motion awareness and temporal consistency, thereby improving detection accuracy. First, we develop a frequency-domain feature aggregation module that decomposes frame features into high- and low-frequency distributions and reinforces object query representations through aggregating multi-scale frequency features. Second, we design a spatiotemporal-domain feature aggregation module that leverages an importance guidance mechanism to dynamically emphasize regions of different importance and reinforces object query representations via guiding the aggregation of spatiotemporal features. Experiments on the ImageNet VID and EPIC-KITCHENS datasets demonstrate that D2FANet achieves state-of-the-art performance. The code will be made available.
Paperid: 3487,   Poster  
Authors: Dongsheng Li, Xinyuan Guo, Huijie Zhang, Pingting Hao, Qiushi Xia
Title: Quota-Calibrated Fine-Grained Alignment with Context-Aware Marginals for Text-based Person Retrieval
Abstract: The core challenge in Text-based Person Retrieval (TPR) lies in establishing fine-grained, many-to-many semantic alignment between textual words and visual regions. Existing methods predominantly rely on pointwise similarity or attention mechanisms, implicitly assuming matches are independent and balanced. Consequently, under conditions of attribute overlap and substantial background noise, these methods often misallocate matching weights to non-discriminative regions or words, resulting in ambiguous matching outcomes. To address this, we propose QC-Align, a quota-calibrated fine-grained alignment framework guided by context-aware marginals. Specifically, we propose a Context-Aware Marginal Estimator (CAME) that dynamically assigns "matching quotas" to each word and visual region, and we then employ a Quota-Calibrated Transport (QCT) objective to explicitly constrain the matching quality each word and region can carry, thereby jointly optimizing the many-to-many correspondence between text and vision under these constraints. Notably, QC-Align is a parameter-free, plug-and-play training regularizer that requires no fine-grained annotations and incurs no inference overhead. Experiments on multiple mainstream person retrieval benchmarks demonstrate that QC-Align consistently improves baseline model performance, with greater gains and better interpretability in few-shot and cross-domain scenarios.
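A transport objective constrained by per-word and per-region quotas can be illustrated with entropic optimal transport. The sketch below is not the QCT objective from the paper, whose exact form the abstract does not give; it uses the standard Sinkhorn iteration as a stand-in. The function name `sinkhorn_plan`, the toy cost matrix, and the quota values are all hypothetical.

```python
import math

def sinkhorn_plan(cost, word_quota, region_quota, eps=0.5, iters=200):
    """Entropic optimal transport (standard Sinkhorn iteration).

    cost[i][j]   : dissimilarity between word i and region j (toy values here)
    word_quota   : target row sums (assumed to come from some marginal estimator)
    region_quota : target column sums; must have the same total as word_quota
    Returns a plan whose row and column sums approximate the quotas.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost -> high affinity.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the quotas.
        u = [word_quota[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [region_quota[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy usage: 2 words, 3 regions (hypothetical numbers).
cost = [[0.1, 0.9, 0.5], [0.8, 0.2, 0.4]]
plan = sinkhorn_plan(cost, word_quota=[0.7, 0.3], region_quota=[0.5, 0.3, 0.2])
```

Unlike independent pointwise similarity, the plan cannot give every word full weight on the same region, because each row and column has a fixed budget.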
Paperid: 3488,   Poster  
Authors: Wenbin Tan, Jiawen Lin, Yuan Xie, Yachao Zhang, Yanyun Qu
Title: UZ3DVG: Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions
Abstract: Zero-Shot 3D Visual Grounding (Zero-Shot 3DVG) aims to localize target objects in 3D scenes from natural language descriptions without relying on instance-wise description annotations. Existing methods rely on extra 2D images during inference and/or require multi-turn interactions with large language models (LLMs) or vision-language models (VLMs), which increase latency, computational cost, and deployment complexity. To overcome these limitations, we propose Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions (UZ3DVG), which is fed with 3D point clouds and textual descriptions only during inference and does not depend on external models. This is a new training paradigm: a VLM is employed solely to produce object-wise descriptions (pseudo labels) and reasoning chains for training a lightweight 3DVG model with robust spatial reasoning. Specifically, the introduced Open-Vocabulary Multi-Source Spatial Annotation and Reasoning Chain Generator processes RGB-D images or 3D-projected 2D images from open-world scenes to generate spatial pseudo-labels and reasoning chains for training. Then, we propose Reasoning Chain Distillation, which transfers reasoning knowledge extracted by a large teacher network to a lightweight student network. To represent both global and local geometric relationships, the Geometry-Aware Spatial Modeling (GeoSM) module is introduced to align textual reasoning with 3D spatial structures. Experiments show that UZ3DVG achieves SOTA zero-shot performance on ScanRefer and NR3D, with inference speeds up to 7.7 FPS, approximately 38 times faster than SOTA methods.
Paperid: 3489,   Poster  
Authors: Ayushi Mehrotra, Dipkamal Bhusal, Michael Clifford, Nidhi Rastogi
Title: H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers
Abstract: Feature attribution methods explain the predictions of deep neural networks by assigning importance scores to individual input features. However, most existing methods focus solely on marginal effects, overlooking feature interactions, where groups of features jointly influence model output. Such interactions are especially important in image classification tasks, where semantic meaning often arises from pixel interdependencies rather than isolated features. Existing interaction-based methods for images are either coarse (e.g., superpixel-only) or fail to satisfy core interpretability axioms. In this work, we introduce H-Sets, a novel two-stage framework for discovering and attributing higher-order feature interactions in image classifiers. First, we detect locally interacting pairs via input Hessians and recursively merge them into semantically coherent sets; segmentation from Segment Anything (SAM) is used as a spatial grouping prior but can be replaced by other segmentations. Second, we attribute each set with IDG-Vis, a set-level extension of Integrated Directional Gradients that integrates directional gradients along pixel-space paths and aggregates them with Harsanyi dividends. While Hessians introduce additional compute at the detection stage, this targeted cost consistently yields saliency maps that are sparser and more faithful. Evaluations across VGG, ResNet, DenseNet and MobileNet models on ImageNet and CUB datasets show that H-Sets generates more interpretable and faithful saliency maps than existing methods.
Paperid: 3490,   Poster  
Authors: Wenjie Mu, Zhan Li, Chuanzhou Su, Xuanyi Shen, Ziniu Liu, Fan Lu, Yujian Mo, Junqiao Zhao, Tiantian Feng, Chen Ye, Guang Chen
Title: MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene
Abstract: Generalizable Neural Radiance Fields (GeNeRF) enable high-quality scene reconstruction from a limited number of views and can generalize to unseen scenes. However, in real-world environments, transient distractors disrupt structural consistency across views, leading to deviated supervision signals and degraded reconstruction quality. Existing distractor-free NeRF methods rely on per-scene optimization and estimate uncertainty from per-view reconstruction errors to remove distractors, but this is unreliable for GeNeRF, because it may misjudge inconsistent static structures from source views as distractors. To address this issue, we propose MU-GeNeRF, a multi-view uncertainty-guided distractor-aware GeNeRF method that aims to alleviate GeNeRF's robust modeling challenges in dynamic scenes with transient distractors. We explicitly decompose distractor awareness into two complementary uncertainty modeling tasks: source-view uncertainty, serving as a transferable prior during the feed-forward process, captures structural inconsistencies across source views caused by viewpoint changes or dynamic factors; target-view uncertainty focuses on observation anomalies caused by transient changes to infer the spatial distribution of distractors. These two uncertainties are integrated into a heteroscedastic reconstruction loss that guides adaptive supervision weighting, boosting the model's capability to detect and suppress distractors, and enabling more robust geometric modeling. To our knowledge, this is the first attempt to explore GeNeRF modeling in scenes with transient distractors. Extensive experiments demonstrate that our method not only outperforms existing GeNeRF approaches but also rivals the performance of scene-specific distractor-free NeRFs.
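The way a heteroscedastic loss turns predicted uncertainty into adaptive supervision weights can be illustrated with the standard Gaussian negative log-likelihood. The sketch below is not MU-GeNeRF's loss: the abstract does not say how source-view and target-view uncertainty are combined, so summing the two variances in `combined_variance` is purely an assumption, and both function names are hypothetical.

```python
import math

def combined_variance(source_var, target_var):
    # Assumption for illustration only: treat the two uncertainties as
    # independent variances and add them per pixel.
    return [s + t for s, t in zip(source_var, target_var)]

def heteroscedastic_loss(residuals, variances):
    """Mean Gaussian negative log-likelihood (additive constant dropped).

    Each squared residual is weighted by 1 / (2 * variance), so pixels with
    high predicted uncertainty (likely distractors) contribute less, while
    the log-variance term penalizes predicting high uncertainty everywhere.
    """
    terms = [r * r / (2.0 * v) + 0.5 * math.log(v)
             for r, v in zip(residuals, variances)]
    return sum(terms) / len(terms)

# Toy usage: the same residual costs less where uncertainty is high.
var = combined_variance([0.05, 0.5], [0.05, 0.5])   # -> [0.1, 1.0]
confident = heteroscedastic_loss([1.0], [var[0]])
uncertain = heteroscedastic_loss([1.0], [var[1]])
```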
Paperid: 3491,   Poster  
Authors: Weiyu Zhao, Ru Li, Jiaqi Liu, Sizhe Zhao, Qinglin Liu, Shengping Zhang
Title: QueryMe: Query-Driven Open-Vocabulary 3D Object Affordances Grounding from Multimodal Evidence
Abstract: Open-vocabulary 3D object affordance grounding aims to identify functional regions of objects given arbitrary semantic descriptions. However, existing methods often rely on fixed training categories and geometric priors, lacking geometric invariance and analogical reasoning capabilities. Since there exists a significant domain gap when transferring affordance knowledge learned from 2D images to 3D point clouds, existing methods struggle to generalize well to objects with diverse shapes or unseen categories, and fail to perform effective category reasoning. To address these challenges, we propose QueryMe, a Query-driven framework that learns from Multimodal evidence spaces to achieve open-vocabulary 3D affordance grounding. The proposed approach projects human-object interaction images into 3D space, employs an Adaptive Spatial Attention module to focus on key interaction regions, and introduces a multimodal query structure to retrieve geometrically consistent functional parts within the point cloud, effectively fusing visual, linguistic, and geometric cues. Leveraging attention-based query mechanisms, our method adaptively localizes affordance regions and performs analogical reasoning through geometric similarity, thereby exhibiting strong generalization to unseen scenes and objects. Experimental results demonstrate that QueryMe consistently outperforms state-of-the-art approaches, with the AUC improving by 4.19% over previous work on unseen affordance grounding tasks.
Paperid: 3492,   Poster  
Authors: Mingrui Zhu, Fengzhi Wang, Xin Wei, Jun Wang, Nannan Wang, Xinbo Gao
Title: When Lines Meet Textures: Spatial-Frequency Aligned Diffusion Features for Cross-Sparsity Correspondence
Abstract: Establishing accurate correspondence between sparse line representations and rich textured imagery remains a formidable challenge. While diffusion features excel in semantic correspondence, they struggle to bridge the fundamental gap between abstract sketches and texture-rich photographs. We identify two critical disparities: spatial domain misalignment from structural abstraction differences, and frequency domain inconsistencies from texture density variations. Based on this analysis, we propose SFA-DIFT, a novel approach that learns spatial-frequency aligned diffusion features for robust cross-modal correspondence. Unlike previous methods focusing solely on spatial alignment, our key innovation performs dual-domain alignment by learning unified clean diffusion features while strategically aggregating low-frequency components in the frequency domain. This comprehensive spatial-frequency alignment enables equitable understanding between sparse abstractions and rich textures. To validate our approach, we extend the existing sketch-photo correspondence dataset (PSC6K) by generating multi-style textured imagery, creating MS-PSC6K, a comprehensive correspondence benchmark. Extensive experiments demonstrate that SFA-DIFT achieves state-of-the-art performance, delivering average improvements of 0.87% on PCK@1, 2.20% on PCK@5, and 0.95% on PCK@10 over previous best methods, validating the effectiveness and robustness of our dual-domain alignment approach.
Paperid: 3493,   Poster  
Authors: Jiaxin Ai, Yukang Feng, Fanrui Zhang, Jianwen Sun, Zizhen Li, Chuanhao Li, Yifan Chang, Wenxiao Wu, Ruoxi Wang, Mingliang Zhai, Kaipeng Zhang
Title: ProSoftArena: Evaluating Hierarchical Capabilities of Multimodal Agents in Professional Software Environments
Abstract: Multimodal agents are making rapid progress on general computer-use tasks, yet existing benchmarks remain largely confined to browsers and basic desktop applications, falling short in professional software workflows that dominate real-world scientific and industrial practice. To close this gap, we introduce ProSoftArena, a benchmark and platform specifically for evaluating multimodal agents in professional software environments. We establish the first capability hierarchy tailored to agent use of professional software and construct a benchmark of 437 realistic work and research tasks spanning 6 disciplines and 13 core professional applications. To ensure reliable and reproducible assessment, we build an executable real-computer environment with an execution-based evaluation framework and uniquely incorporate a human-in-the-loop evaluation paradigm. Extensive experiments show that even the best-performing agent attains only a 24.4% success rate on L2 tasks and completely fails on L3 multi-software workflows. In-depth analysis further provides valuable insights for addressing current agent limitations and for more effective design principles, paving the way to build more capable agents in professional software settings. We will release all data and code to foster research in this critical area.
Paperid: 3494,   Poster  
Authors: Hyeonjeong Park, Peixi Xiong, Xiaoqian Ruan, Dian Jia, Pei Yu, Wei Tang
Title: RARE: Learn to RAnk and REtrieve for Monocular 3D Object Detection
Abstract: Monocular 3D object detection from a single RGB image remains difficult because of two fundamental challenges: the ill-posed nature of 3D localization, where multiple plausible configurations can correspond to the same 2D observation, and unreliable confidence estimation that fails to reflect true localization accuracy. Existing methods predict deterministic 3D boxes that often collapse to implausible mean estimates and rely on absolute confidence scores that are highly sensitive to localization errors. This paper introduces RARE, a unified framework that addresses both challenges through learning to rank and retrieve. RARE formulates confidence estimation as a ranking problem, learning to order detections by their relative quality rather than regressing absolute values. It provides more robust and stable confidence estimates that are less sensitive to localization uncertainty. Building on this improved confidence estimator, RARE learns to construct a query set for each object that predicts multiple diverse and plausible 3D configurations, and retrieves the top-ranked prediction. It explicitly models the multimodal nature of monocular 3D perception and produces more plausible localizations. Extensive experiments demonstrate the effectiveness of RARE. We will make the code publicly available.
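Treating confidence as a ranking problem rather than a regression can be illustrated with a generic pairwise logistic ranking loss followed by top-1 retrieval. The abstract does not specify RARE's actual loss or query-set construction, so this is a sketch under assumptions; `pairwise_ranking_loss`, `retrieve_top`, and the toy quality values (for example, 3D IoU with the ground truth) are hypothetical.

```python
import math

def pairwise_ranking_loss(scores, qualities):
    """Logistic loss over all ordered pairs (i, j) where i is truly better than j.

    Only the relative order of scores matters: adding a constant to every
    score leaves the loss unchanged, unlike regressing absolute confidence.
    """
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if qualities[i] > qualities[j]:
                total += math.log1p(math.exp(-(scores[i] - scores[j])))
                pairs += 1
    return total / pairs if pairs else 0.0

def retrieve_top(candidates, scores):
    """Return the candidate with the highest ranking score."""
    best = max(range(len(candidates)), key=lambda k: scores[k])
    return candidates[best]

# Toy usage: three candidate depths for one object, with hypothetical qualities.
candidates = ["depth=12.1m", "depth=14.8m", "depth=17.3m"]
qualities = [0.9, 0.5, 0.1]
well_ordered = pairwise_ranking_loss([2.0, 1.0, 0.0], qualities)
mis_ordered = pairwise_ranking_loss([0.0, 1.0, 2.0], qualities)
chosen = retrieve_top(candidates, [2.0, 1.0, 0.0])
```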
Paperid: 3495,   Poster  
Authors: Xiaofeng Cong, Yu-Xin Zhang, Hao Shen, Yeying Jin, Junming Hou, Jie Gui
Title: SDUIE: Semi-Supervised Diffusion for Underwater Image Enhancement with Quant-Text Dual Control
Abstract: Underwater images often exhibit dominant blue-green hues due to wavelength-dependent light attenuation. While existing enhancement methods have achieved promising performance, they typically overlook the subjective nature of visual preferences. To address this gap, we propose SDUIE, a level-aware Semi-supervised Diffusion framework for Underwater Image Enhancement that enables dual control through both quantitative and textual inputs. SDUIE-Quant allows continuous, numerical adjustment of enhancement levels via low-rank adaptation weight merging within a dual-branch diffusion model. This model comprises a supervised branch trained on synthetic underwater-terrestrial pairs and a self-supervised branch designed to preserve the natural hues of real-world underwater scenes. Building on this, SDUIE-Text introduces intuitive, language-guided control by aligning semantic prompts with visual enhancement effects, leveraging the learned fusion weights. This dual-modality design offers both precise control and flexible, user-preferred enhancement. Experimental results demonstrate that SDUIE achieves state-of-the-art results while better preserving the aesthetic qualities often missed by conventional methods. The source code will be made publicly available.
Paperid: 3496,   Poster  
Authors: Yuhan Wang, Zihan Li, Han Liu, Simon Arberet, Martin F. Kraus, Yuyin Zhou, Florin-Cristian Ghesu, Dorin Comaniciu, Ali Kamen, Riqiang Gao
Title: Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
Abstract: Voxel-wise dose prediction is a critical yet challenging task in practical radiotherapy (RT) planning, as bespoke models trained from scratch often struggle to generalize across diverse clinical settings. Meanwhile, generative models trained on billion-scale datasets from vision domains have achieved impressive performance. Herein, we propose DiffKT3D, a unified Any2Any 3D diffusion framework that leverages prior knowledge from pretrained video diffusion models for efficient and clinically meaningful dose prediction. To enable flexible conditioning across multiple clinical modalities (CT, anatomical structures, body, beam settings, etc.), we introduce an Any2Any conditional paradigm utilizing modality-specific embeddings without cross-attention overhead. Further, we design a novel reinforcement learning (RL) post-training mechanism guided by a clinically informed scorecard explicitly tailored to institutional treatment preferences. Compared with the winner of the GDP–HMM challenge, DiffKT3D sets a new state of the art in dose prediction by reducing voxel-level MAE from 2.07 to 1.93. In addition, DiffKT3D achieves superior image quality and preference match. These results demonstrate that transferring diffusion priors via modality-aware conditioning and clinically aligned RL post-training provides a robust and generalizable solution for RT planning across various clinical scenarios.
Paperid: 3497,   Poster  
Authors: Xingkui Zhu, Dingkang Liang, Cheng Chen, Daoxin Zhang, Lv Hanxiang, Zhe Xu, Yao Hu, Xiang Bai
Title: OneSparse: A Unified Framework for Sparse Activation Layers in Vision Models
Abstract: Sparse activation layers, primarily Mixture-of-Experts (MoE) and memory-based modules, are a central approach for scaling large models and are gaining traction in vision tasks. Despite conceptual similarities, these paradigms have evolved independently, hindering systematic comparison and the development of modules that exploit their complementary strengths. To bridge this gap, we propose OneSparse, a unified framework that reformulates MoE and memory modules under a common abstraction. This enables their systematic comparison and integration, revealing a continuous design space. Guided by this abstraction, we design the Nexus Layer, which features two key innovations: a unified routing mechanism that merges the efficiency of memory retrieval with MoE's load balancing to ensure stable and scalable token assignment, and an adaptive processing strategy where memory modules sketch coarse representations while expert modules refine critical regions. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate that our Nexus Layer establishes a new performance-efficiency frontier, surpassing representative sparse baselines on convolutional and transformer architectures. These results validate the power of the OneSparse framework to unify and integrate complementary sparse paradigms and underscore the potential of hybrid sparse modeling in vision.
Paperid: 3498,   Poster  
Authors: Youhan Sun, Jiahua Rao, Kangrui Du, Jiancong Xie, Yuedong Yang
Title: Predicting Spatial Transcriptomics from Histology Images via High-Order Multi-Cell Interaction Modeling
Abstract: Spatial transcriptomics (ST) links gene expression to tissue architecture and enables predicting spatial expression from H&E-stained whole-slide images (WSIs). However, existing spot- or slide-level predictors focus on single-spot features or pairwise relations, failing to capture high-order, many-to-many cross-cell interactions. As a result, they miss synergistic and antagonistic effects among multiple neighboring cells. Here, we introduce MCToGene, a scalable and accurate framework that explicitly models multi-cell interactions via many-body attention with hierarchical coupling to predict spatial gene expression. MCToGene employs a many-body attention module to encode high-order, many-to-many cross-cell dependencies, enabling context-aware microenvironment modeling. To mitigate the combinatorial burden of many-body modeling, we design a hierarchical interaction module that couples pairwise and many-body representations for feature aggregation and prediction, preserving many-body expressiveness while controlling computation and memory. On HEST-1k and STImage-1K4M, MCToGene surpasses state-of-the-art baselines with a 7.85% relative improvement. Ablations confirm that explicit high-order, many-to-many modeling drives these gains, and visualizations demonstrate that multi-cell interactions are essential for biologically coherent spatial predictions.
Paperid: 3499,   Poster  
Authors: Jinpeng Liu, Yukang Xu, Yutong Li, Xingyu Liu
Title: TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos
Abstract: Reconstructing humans and their surrounding environments in a globally consistent 4D space is essential for comprehensive perception. However, prior works typically assume single-view inputs or decouple humans, scenes, and cameras, making them unable to recover coherent geometry, stable motion, and physically aligned trajectories. These limitations motivate us to introduce a new task: unified human–scene–camera reconstruction from multi-view videos, which aims to jointly estimate dynamic humans, static scenes, and camera poses in one global coordinate frame. We propose TROPHIES (Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos), a unified framework tailored for this task. TROPHIES features a Human Branch that models humans through temporal and spatial reasoning, and a Scene Branch that reconstructs static geometry with human-aware attention. A global alignment and optimization module couples both branches by enforcing scale consistency, contact priors, and cross-view temporal coherence. Experiments on EgoHuman and EgoExo4D demonstrate that TROPHIES achieves globally aligned, physically plausible 4D reconstructions and consistently outperforms existing paradigms in both global fidelity and human–scene consistency.
Paperid: 3500,   Poster  
Authors: Avery Gump, Connor Andrew Henley, Sungjin Cheong, Akarsh Prabhakara, Mohit Gupta
Title: Ghosts in the Point Clouds: De-glaring LiDAR in the Transient Domain
Abstract: Modern LiDARs are rapidly transitioning from bulky, mechanically scanned systems to ultra-compact, low-cost, solid-state arrays. This miniaturization, while enabling scalability, affordability, and camera-like data structures, introduces a new and severe failure mode: internal-multipath glare. When light from a bright or retroreflective surface reflects and scatters within the LiDAR, light that should reach a single pixel spreads across the pixel array. The resulting artifacts create phantom objects, obscure real ones, and produce safety-critical “ghosts in the point clouds.” This paper introduces a physically grounded sensing model and algorithmic techniques for addressing this effect. We show that internal glare can be represented as a linear, scene-independent operator, the Transient Glare Spread Function (TGSF), acting on the raw transient histogram cube. This formulation enables simple, training-free inversion directly in the measurement domain, before nonlinear point-cloud formation. We develop exact and approximate de-glare algorithms that are general, computationally efficient, and compatible with existing LiDAR data-processing pipelines. Using experiments with real single-photon LiDAR hardware, we demonstrate suppression of severe glare artifacts at millisecond latency, establishing de-glare as a practical, lightweight preprocessing step for next-generation LiDAR systems.
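Because the glare is modeled as a linear, scene-independent operator on the transient histograms, it can in principle be undone without training. The toy sketch below illustrates that idea only: the glare model in `apply_glare` (each pixel leaks a fixed fraction into every other pixel, per time bin) is a hypothetical stand-in for the TGSF, and the Richardson fixed-point iteration in `deglare` is a generic linear-inversion technique, not the paper's exact or approximate algorithm.

```python
def apply_glare(x, leak):
    """Toy linear glare operator on a [pixel][time_bin] histogram array.

    Assumption: each pixel receives `leak` times the photon counts of all
    other pixels in the same time bin. A real TGSF would be calibrated.
    """
    n_pix, n_bins = len(x), len(x[0])
    return [[x[p][t] + leak * sum(x[q][t] for q in range(n_pix) if q != p)
             for t in range(n_bins)]
            for p in range(n_pix)]

def deglare(y, leak, iters=50):
    """Invert the operator with Richardson iteration: x <- x + (y - G x).

    Converges when the leakage is weak, i.e. leak * (n_pix - 1) < 1.
    """
    x = [row[:] for row in y]
    for _ in range(iters):
        gx = apply_glare(x, leak)
        x = [[x[p][t] + (y[p][t] - gx[p][t]) for t in range(len(y[0]))]
             for p in range(len(y))]
    return x

# Toy usage: pixel 0 sees a bright return in bin 1; pixel 2 sees nothing.
clean = [[0.0, 100.0, 0.0], [0.0, 0.0, 4.0], [0.0, 0.0, 0.0]]
measured = apply_glare(clean, leak=0.05)   # pixel 2 now has a "ghost" in bin 1
restored = deglare(measured, leak=0.05)
```

Running the inversion on the raw histograms, before peak detection, matters because point-cloud formation is nonlinear and would turn the leaked counts into a spurious depth.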
Paperid: 3501,   Poster  
Authors: Bin Pu, Xusheng Liang, Xinpeng Ding, Jinlin Wu, Zhen Lei, Shengli Li, Kenli Li, Jiawei Ma
Title: F$^2$-Assist: Multi-Phase Fetal Growth Forecast and Report Generation from Ultrasound Examination
Abstract: Forecasting fetal growth from sequential ultrasound examinations is essential for personalized prenatal care. Existing medical vision-language models (MLLMs) are limited to single-phase/organ evaluations and qualitative reasoning, neglecting longitudinal history and precise continuous biometric values. To address this gap, we introduce the novel task of multi-phase fetal growth forecasting and report generation. To support this task, we first present GrowthFetus, the largest multi-phase, multi-organ fetal ultrasound dataset to date, containing 9,280 examinations from 2,000 fetuses. Based on this dataset, we propose F^2-Assist, a unified MLLM framework with three key components: (i) a Cross-Phase Organ Alignment module for heterogeneous multi-organ feature fusion across phases, (ii) a History-Aware Temporal Encoding module for modeling irregular temporal dynamics, and (iii) a Growth Parameter Adapter that encodes continuous biometric values as differentiable tokens for numerically precise reasoning. Extensive experiments show that F^2-Assist achieves temporally coherent predictions and clinically consistent reports, significantly outperforming state-of-the-art MLLMs. Our study establishes a practical framework for longitudinal ultrasound analysis, bridging growth forecasting and report generation in a unified model.
Paperid: 3502,   Poster  
Authors: Yang Li, Jia-Li Yin, Luojun Lin, Wei Lin
Title: Transform to Transfer: Boosting Adversarial Attack Transferability on Vision-Language Pre-training Models
Abstract: Visual-Language Pre-training (VLP) models, while achieving state-of-the-art performance on various multimodal tasks, exhibit significant vulnerability to multimodal adversarial examples. In black-box attack scenarios of VLP models, a key challenge lies in the limited transferability of these adversarial examples. Existing methods to enhance transferability often suffer from an excessive dependence on the source model and a reliance on limited and fixed transformation techniques. To overcome these limitations, we propose a novel Transform to Transfer Attack (TTA) method. Our approach introduces a learnable transformation mechanism that adaptively selects optimal combinations of transformations to maximize input diversity, and incorporates integrated gradients to mitigate over-fitting on the source model, thereby refining the attack optimization process. Extensive experiments demonstrate that TTA achieves outstanding attack performance in downstream tasks, outperforming current state-of-the-art attack methods across different VLP architectures.
Paperid: 3503,   Poster  
Authors: Hanjie Xu, Yuanxing Duan, Qiyu Dai, Ge Li, Baoquan Chen, He Wang
Title: Layered 4D-Rotor Gaussian Splatting: A Compressed Representation for Long Dynamic Scenes
Abstract: We address the challenge of reconstructing long dynamic scenes from multi-view videos in a storage-efficient manner. Recent advances in Gaussian Splatting and its extensions to dynamic scenes have demonstrated impressive visual quality, but remain limited by short duration (<10 s), large storage size (>500 MB), and high GPU VRAM usage. To overcome these limitations, we introduce Layered 4D-Rotor Gaussian Splatting (L4DRotorGS), a novel compressed representation designed for long dynamic scenes. Our approach integrates a layered 4D representation, efficient training, and effective compression into a unified framework. Specifically, 4D Gaussians are first organized into layers based on their temporal extents and then partitioned into discrete temporal buckets. This structure allows for selective access and rendering of only the necessary subsets of 4D Gaussians, substantially reducing GPU memory requirements. To further compress the representation, we apply a series of techniques (Factorized Covariance Quantization, Layered Compression, and Residual Codebook Quantization), achieving a compression ratio of up to 22.3× while preserving high visual fidelity. We implement a highly optimized C++/CUDA framework for efficient training, compression, and real-time rendering, achieving over 500 FPS on an RTX 3090 GPU. Extensive experiments demonstrate the superior storage efficiency, visual quality, and rendering speed of L4DRotorGS, consistently outperforming prior methods in both quantitative metrics and perceptual quality on real-world long dynamic scenes.
Paperid: 3504,   Poster  
Authors: Chenxu Peng, Chenxu Wang, Yimian Dai, Yongxiang Liu, Ming-Ming Cheng, Xiang Li
Title: RoadGIE: Towards A Global-Scale Aerial Benchmark for Generalizable Interactive Road Extraction
Abstract: Accurate road segmentation from aerial imagery is fundamental to many geospatial applications. However, existing datasets often suffer from limited scene diversity, low semantic granularity, and poor structural continuity, restricting their generalization across environments. To address these challenges, we introduce WorldRoadSeg-360K, the largest and most diverse road segmentation dataset to date, comprising 366,947 high-resolution images collected from 38 countries and 223 cities across various terrains and continents. WorldRoadSeg-360K serves as a comprehensive benchmark and reveals key challenges in handling diverse and structurally complex scenes. Automated approaches often struggle to preserve road connectivity, while current interactive methods lack efficient, topology-sensitive tools for real-world road editing. To this end, we present RoadGIE, establishing a novel interactive paradigm for road extraction in remote sensing. Unlike prior point- or box-based prompting strategies, RoadGIE supports connectivity-aware prompts, including clicks and scribbles, which inherently align with the topology of road networks. To improve structural consistency and mitigate performance degradation during iterative interactions, RoadGIE integrates an expert-guided prompting strategy and adapts the skeleton-based recall loss for interactive scenarios. Meanwhile, to alleviate user intent ambiguity, RoadGIE introduces a topo-semantic instantiation during training to enhance interaction stability and consistency. RoadGIE achieves state-of-the-art performance in both segmentation accuracy and topological consistency on WorldRoadSeg-360K and other benchmarks, while maintaining efficient operation with only 3.7 million parameters and real-time processing capabilities.
Paperid: 3505,   Poster  
Authors: Yiyan Zhu, Menghao Zhang, Haifeng Sun, Pengfei Ren, Xianao Chu, Chenye Xu, Hong Tan, Jinghan Wang, Qi Qi, Jingyu Wang
Title: Alert-CLIP: Abnormality-aware Latent-Enhanced Representation Tuning of CLIP for Video Anomaly Detection
Abstract: With the rise of pretrained vision–language models such as CLIP, performing video anomaly detection (VAD) through cross-modal reasoning has become an emerging trend. However, we observe that CLIP still suffers from weak abnormality awareness: normal and abnormal descriptions are highly entangled in the text embedding space, causing video features to assign nearly indistinguishable similarity scores to both types of prompts. To address this issue, we propose Alert-CLIP, an abnormality-aware latent–enhanced tuning framework that tailors CLIP for VAD. Alert-CLIP introduces a multi-level alignment strategy: (1) video–label alignment, which reshapes the semantic space to establish a coarse-level foundation for abnormality awareness; (2) region–text alignment, which explicitly associates anomaly-related regions with their detailed descriptions to strengthen fine-grained perception; and (3) region–semantic alignment, which further contrasts anomalous regions against multiple hard negative samples, enhancing abnormality-aware discrimination. Extensive experiments on four benchmarks demonstrate that Alert-CLIP consistently surpasses vanilla CLIP across supervised, zero-shot, and open-vocabulary settings, providing a solid foundation for future CLIP-based VAD research.
Paperid: 3506,   Poster  
Authors: Yixiao Song, Qingyong Li, Wen Wang, Zhicheng Yan
Title: PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting
Abstract: Unsupervised point cloud segmentation is critical for embodied intelligence and autonomous driving, as it mitigates the prohibitive cost of dense point-level annotations required by fully supervised methods. Integrating 2D pre-trained models such as SAM to supplement semantic information is a natural choice, yet this approach faces a fundamental mismatch between discrete 3D points and continuous 2D images. This mismatch leads to inevitable projection overlap and complex modality alignment, resulting in compromised semantic consistency across 2D-3D transfer. To address these limitations and achieve semantic-consistent segmentation, this paper proposes PointGS, a simple yet effective pipeline for unsupervised 3D point cloud segmentation. PointGS leverages 3D Gaussian Splatting as a unified intermediate representation to bridge the discrete-continuous domain gap. Input sparse point clouds are first reconstructed into dense 3D Gaussian spaces via multi-view observations, filling spatial gaps and encoding occlusion relationships to eliminate projection-induced semantic conflation. Multi-view dense images are rendered from the Gaussian space, with 2D semantic masks extracted via SAM, and semantics are distilled to 3D Gaussian primitives through contrastive learning to ensure consistent semantic assignments across different views. The Gaussian space is aligned with the original point cloud via two-step registration, and point semantics are assigned through nearest-neighbor search on labeled Gaussians. Experiments demonstrate that PointGS outperforms state-of-the-art unsupervised methods, achieving +0.9% mIoU on ScanNet-V2 and +2.8% mIoU on S3DIS, highlighting its effectiveness in label-free 3D segmentation.
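The final step, assigning point semantics by nearest-neighbor search over labeled Gaussians, can be sketched as below. This assumes the Gaussian centers are already registered to the point cloud's coordinate frame; the function name `assign_point_labels` and the toy data are hypothetical, and a real pipeline would likely use a spatial index such as a k-d tree rather than brute force.

```python
def assign_point_labels(points, gaussian_centers, gaussian_labels):
    """Give each 3D point the label of its nearest Gaussian center.

    Brute-force O(N * M) search using squared Euclidean distance; adequate
    for illustration only.
    """
    labels = []
    for p in points:
        nearest = min(
            range(len(gaussian_centers)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(p, gaussian_centers[k])),
        )
        labels.append(gaussian_labels[nearest])
    return labels

# Toy usage: two labeled Gaussians, three query points.
centers = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
center_labels = ["floor", "chair"]
points = [(0.1, 0.0, -0.1), (0.9, 1.2, 1.0), (0.4, 0.4, 0.4)]
point_labels = assign_point_labels(points, centers, center_labels)
```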
Paperid: 3507,   Poster  
Authors: Jianhao Zheng, Liyuan Zhu, Zihan Zhu, Iro Armeni
Title: WildPose: A Unified Framework for Robust Pose Estimation in the Wild
Abstract: Estimating camera pose in dynamic environments is a critical challenge, as most visual SLAM and SfM methods assume inputs from static environments. While recent dynamic-aware methods exist, they are often not unified: semantic-based approaches are brittle, per-sequence optimization methods fail on short sequences, and other learned models sometimes perform poorly on static-only scenes. We present WildPose, a unified monocular pose estimation framework that is robust in dynamic environments while maintaining state-of-the-art performance on static and low-ego-motion datasets. Our key insight is to connect two powerful paradigms in modern 3D vision: the rich perceptual frontend of feed-forward models and the end-to-end optimization of differentiable bundle adjustment (BA). We achieve this by enhancing the differentiable BA pipeline in two ways. First, we introduce a new 3D-aware update operator by integrating a frozen, pre-trained MASt3R feature backbone and training the operator's subsequent layers on a diverse curriculum of static and dynamic data. Second, we propose a high-capacity motion mask detector that leverages rich, multi-level 3D-aware features from the same frozen backbone. Extensive experiments show WildPose consistently outperforms prior methods across a wide variety of benchmarks, including dynamic (Wild-SLAM, Bonn), static (TUM, 7-Scenes), and low-ego-motion (Sintel) datasets.
Paperid: 3508,   Poster  
Authors: Yibin Zhao, Yihan Pan, Jun Nan, Liwei Chen, Jianjun YI
Title: FSFSplatter: Geometrically Accurate Reconstruction with Free Sparse-view Images within 2 minutes
Abstract: Gaussian Splatting has become a leading reconstruction technique, known for its high-quality novel view synthesis and detailed reconstruction. However, most existing methods require dense, calibrated views. Reconstruction from free sparse-view images often leads to poor surfaces due to limited overlap and overfitting. We introduce FSFSplatter for fast, geometrically accurate reconstruction from free sparse-view images. Our method integrates end-to-end dense Gaussian scene initialization and geometry-enhanced scene optimization. Specifically, FSFSplatter employs a large transformer to encode multi-view images and generates a dense and geometrically consistent Gaussian scene initialization via a batch-based self-splitting Gaussian head. It eliminates local floaters through contribution-based pruning and mitigates overfitting by leveraging depth and multi-view feature supervision, along with differentiable camera parameters, all within 2 minutes. FSFSplatter outperforms current state-of-the-art methods on the widely used DTU, Replica, and BlendedMVS datasets.
Paperid: 3509,   Poster  
Authors: Zhifang Liao, Junhao Li, HaoKang Ding, Yucheng Song
Title: DARC: Dual Adjustment Reasoning with Counterfactuals for Trustworthy Chest X-ray Classification
Abstract: Despite their impressive performance in multi-label classification of chest X-ray images (CXR), deep learning models are widely plagued by two types of spurious correlations: feature confounding arising from pathological co-occurrence and shortcut learning triggered by non-pathological visual confounders. These non-causal dependencies severely undermine the interpretability and robustness of models in real-world clinical settings. To address these challenges, we propose the Dual Adjustment Reasoning with Counterfactuals for Trustworthy Chest X-ray Classification (DARC) framework, the first to synergistically decouple both types of confounding sources from a causal mechanism perspective. At the data level, we construct CheXconf, the first pixel-level annotation dataset of non-pathological visual confounders in CXR, comprising 40,213 annotated instances across 11 categories. This provides a solid foundation for accurately modeling these confounders. At the methodological level, we design a novel dual-stream causal learning architecture. Its Global Stream leverages the back-door adjustment criterion with CheXconf to explicitly block spurious paths from non-pathological confounders. Concurrently, the Local Stream employs counterfactual reasoning, constrained by anatomical priors, to disentangle the visual coupling of co-occurring pathologies. Experiments on large-scale public benchmarks demonstrate that our method achieves significant improvements in task performance, interpretability, and robustness. All code and datasets will be made publicly available upon publication of this paper.
Paperid: 3510,   Poster  
Authors: Yang Li, Yuchen Liu, Haoyu Lu, ZhiqiangXia ZhiqiangXia, Hongzhen Wang, Kaiyang Han, Changpeng Yang, Jinyang Wu, Jiaming Xu, Runyu Shi, Ying Huang
Title: GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents
Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are largely English-centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also focus on isolated skills such as GUI grounding or offline agent, lacking a unified and fine-grained framework to assess the full capability chain from perception to execution. To address this gap, we introduce GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents, built entirely on real-device environments. GUI-CEval spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates both atomic abilities and end-to-end performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi-stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi-agent systems show that while models such as Qwen2.5-VL and UI-TARS perform competitively, most MLLMs still exhibit clear weaknesses in reflective decision-making and post-action self-evaluation, limiting their reliability in real-world interactions. We hope GUI-CEval provides a reproducible and interpretable benchmark to guide capability diagnosis and advance the development of Chinese mobile GUI agents.
Paperid: 3511,   Poster  
Authors: Wenjie Chang, Tianle Ding, Wenfei Yang, Tianzhu Zhang
Title: SunFaded: Illumination-Aware Gaussian Splatting for Dark Scenes with Camera-Mounted Active Lighting
Abstract: Gaussian Splatting has emerged as a popular 3D representation technique, but still struggles with appearance inconsistencies, especially in dark scenes that require active illumination (e.g., camera flashes or co-moving light sources) to capture usable images, leading to dramatic local appearance fluctuations. While existing methods mainly focus on modeling global appearance changes for in-the-wild scenes, such as those caused by different times of day or weather conditions, they fail to handle the severe variations present in dark scenes with moving light sources. In this paper, we propose a novel Gaussian Splatting–based approach for constructing scene representations in dark scenes where active light sources are rigidly attached to the camera and move together with it. Within this framework, we introduce an illumination-weighted loss function that drives the representation toward the underlying unlit scene. Furthermore, instead of adjusting the illumination of each individual Gaussian as in prior work, we employ a tile-based shading scheme that operates directly on the rendered images, greatly reducing computational cost while explicitly separating illumination from intrinsic scene appearance. Additionally, we refine the learned Gaussian representation by combining the recovered unlit scene appearance with an advanced geometric prior model, which significantly improves geometric accuracy. Experimental results demonstrate that our method achieves superior reconstruction quality in challenging environments compared to state-of-the-art techniques.
Paperid: 3512,   Poster  
Authors: Wenxuan Wang, Chenglei Wang, Chengzhi Yan, Xuelin Qian, Yanning Zhang
Title: Robustness Under Data Scarcity: Few-Shot Continual Adversarial Training for Evolving Threats
Abstract: Deep learning models remain highly vulnerable to evolving adversarial attacks. While existing continual adversarial training approaches often assume abundant adversarial data at each stage, real-world scenarios frequently involve limited data availability. This paper addresses the setting of Few-shot Continual Adversarial Training, where only a small number of adversarial examples are available per stage, presenting major challenges in achieving robust generalization and mitigating catastrophic forgetting. To tackle these challenges, we propose a novel continual adversarial training framework that incorporates three key components: (i) an Adversarial Margin loss that explicitly pushes clean samples away from decision boundaries to enhance feature discrimination; (ii) a Gaussian mixture model Prototype Replay strategy that synthesizes representative pseudo-features to preserve knowledge of past adversarial domains; and (iii) a Multi-Domain Balanced loss that guides updates to stabilize learning across diverse attack distributions. Extensive experiments on ImageNet-1K and CIFAR-100 demonstrate that our approach consistently outperforms state-of-the-art methods in both clean and robust accuracy across a variety of adversarial settings. The code will be released.
Paperid: 3513,   Poster  
Authors: Rui Tian, Mingfei Gao, Haiming Gang, Jiasen Lu, Zhe Gan, Yinfei Yang, Zuxuan Wu, Afshin Dehghan
Title: UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in RL
Abstract: We present UniGen-1.5, a unified multimodal large language model (MLLM) for advanced image understanding, generation and editing. Building upon UniGen, we comprehensively enhance the model architecture and training pipeline to strengthen the image understanding and generation capabilities while unlocking strong image editing ability. In particular, we propose a unified Reinforcement Learning (RL) strategy that improves both image generation and image editing jointly via shared reward models. To further enhance image editing performance, we propose a light Edit Instruction Alignment stage that significantly improves the comprehension of editing instructions, which is essential for the success of the RL training. Experimental results show that UniGen-1.5 demonstrates competitive understanding and generation performance. Specifically, UniGen-1.5 achieves overall scores of 0.89 on GenEval and 4.31 on ImgEdit, surpassing state-of-the-art models such as BAGEL and reaching performance comparable to proprietary models such as GPT-Image-1.
Paperid: 3514,   Poster  
Authors: Alex Wang, Zhiwei Dong, Qicheng Bai, Chenshi Zhang, Yujie Yi, Guang Dai, Yong Liu, Mengmeng Wang
Title: DynBridge: Bridging Imagination and Control through Interaction Dynamics for Robot Manipulation
Abstract: Recent generative models allow robots to generate future visual outcomes for action guidance, yet most still address imagination and control independently, resulting in visually coherent rollouts but physically inconsistent behaviors. While structural priors enhance spatial grounding, these methods remain visually correlation-driven rather than causally informed, overlooking the bidirectional coupling between robot actions and the evolving environment. We formalize the coupling as interaction dynamics, which specify where environmental changes occur and how actions cause them. Based on this formulation, we introduce DynBridge, an end-to-end framework that unifies imagination and control through the shared dynamics representation. Specifically, DynBridge realizes this via three components: (1) an Interaction Dynamics Generator that forecasts interaction dynamics via joint trajectory generation and action prediction; (2) an Action-Conditioned Dynamics Aggregator that integrates dynamics under control signals; and (3) a Dynamics-Guided Action Predictor that leverages the aggregated dynamics to produce executable, context-aware actions. Results demonstrate that DynBridge consistently outperforms prior methods on simulated and real-world benchmarks without external pretraining.
Paperid: 3515,   Poster  
Authors: Wenqi Jia, Ruifan Li, Pengyue Lin, Fangxiang Feng, Zhanyu Ma, Xiaojie Wang
Title: Small Object, Great Challenge: A Benchmark for Small Object Visual Grounding
Abstract: The task of visual grounding (VG) aims to locate or segment objects in images based on referring expressions. Existing research on VG primarily focuses on large objects. However, images often contain objects at various scales. Although large objects are usually the visual focus, small objects sometimes carry crucial information. To bridge the gap, we propose a novel benchmark for small object visual grounding, i.e., SoVG. Specifically, we introduce an automatic pipeline using MLLMs to build a benchmark dataset. Our pipeline is built on the popular dataset COCO. Thus, we obtain our RefCOCOs dataset. The visual objects in our RefCOCOs occupy on average 1/50 of the area of an entire image, whereas those in classic VG datasets occupy 1/5. Furthermore, we propose SoVG-Net with a hierarchical textual infusion module for the novel SoVG task. Finally, we conduct extensive experiments using classic datasets with our RefCOCOs. The results show that our dataset is useful for advancing VG research, and our proposed SoVG-Net is a strong baseline. Our dataset and code will be made publicly available after review.
Paperid: 3516,   Poster  
Authors: Lei Yao, Yong Chen, YUEJIAO SU, Yi Wang, Moyun Liu, Lap-Pui Chau
Title: HAMMER: Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding
Abstract: Humans commonly reason about object affordance through observed interactions in images or videos, and once formed, such knowledge can be generalized to novel objects. Inspired by this principle, we advocate for a novel framework that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding, namely HAMMER. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we aggregate the interaction intention depicted in the reference image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring it thoroughly excavates object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement and introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed corrupted benchmark reveal the superiority and robustness of our method in seen and unseen scenarios compared to existing approaches. The code and models will be made publicly available.
Paperid: 3517,   Poster  
Authors: Zheng Wang, Haoran Chen, Haoxuan Qin, Zhipeng Wei, Tianwen Qian, Cong Bai
Title: Think, Then Verify: A Hypothesis–Verification Multi-Agent Framework for Long Video Understanding
Abstract: Long video understanding is challenging due to dense visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to accumulate semantic drift and correlation-driven errors. We argue that long-video reasoning should begin not with reactive retrieval, but with deliberate task formulation: the model must first articulate what must be true in the video for each candidate answer to hold. This thinking-before-finding principle motivates VideoHV-Agent, a framework that reformulates video question answering as a structured hypothesis–verification process. Based on video summaries, a Thinker rewrites answer candidates into testable hypotheses, a Judge derives a discriminative clue specifying what evidence must be checked, a Verifier grounds and tests the clue using localized, fine-grained video content, and an Answer agent integrates validated evidence to produce the final answer. Experiments on three long-video understanding benchmarks show that VideoHV-Agent achieves state-of-the-art accuracy while providing enhanced interpretability, improved logical soundness, and lower computational cost.
Paperid: 3518,   Poster  
Authors: Yan Li, Yuzhu Shi, Kan Zhou, Shu Zhang, Diqi He, Dingwen Zhang, Junwei Han
Title: Few-Shot Hybrid Incremental Learning: Continually Learning under Data Scarcity and Task Uncertainty
Abstract: The increasing complexity of real-world deployment requires intelligent agents to effectively adapt to non-stationary data streams with stochastic increments under data scarcity. We formally define this challenge as the Few-Shot Hybrid Incremental Learning (FSHIL) paradigm, which reveals a critical stability-plasticity dilemma. Existing strategies struggle to address this dilemma: representation freezing in few-shot incremental learning can mitigate overfitting under data scarcity but leads to insufficient representation plasticity, while architecture expansion in hybrid incremental learning provides plasticity for adaptation but results in overfitting under few-shot conditions. To address this, we propose the Conditional Meta-Expanding Mixture-of-Experts (CME-MoE), which balances the feature-level stability-plasticity trade-off through conditional expert reuse and a meta-expansion mechanism. Furthermore, recognizing the multi-domain manifestation in the latent space, we introduce the Self-Expanding Prototype Classifier (SEPC), which expands classification on demand to model complex domain-shifted decision boundaries. The proposed method outperforms existing state-of-the-art methods in three few-shot incremental learning settings across five mainstream datasets, effectively addressing data scarcity and task uncertainty, and providing a robust solution for real-world continual learning.
Paperid: 3519,   Poster  
Authors: Haiwei Wu, Kemou Li, Yuanman Li, Jiantao Zhou
Title: Editprint: General Digital Image Forensics via Editing Fingerprint with Self-Augmentation Training
Abstract: Digital image forensics can ensure information credibility in tasks like camera source identification (CSI), synthetic image detection (SID), and social network provenance (SNP). These tasks typically rely on image processing history clues left by in-camera operations, post-capture editing, or synthetic generation. However, most existing forensic methods have obvious limitations: 1) they often only focus on camera-specific traces (e.g., the well-known PRNU), and 2) they demand a substantial amount of annotated training data. To address these constraints, we propose Editprint, a novel general forensic feature that captures highly diverse in- and out-camera processing history clues with minimal unlabeled training data. Ideally, we expect that any images undergoing the same imaging, editing, and transmission processes would yield identical Editprints, and vice versa. To model the in- and out-camera operations, we devise an online editing pool based on self-augmentation strategies. Requiring only minimal (e.g., 10) training data, the editing pool can simulate massive (e.g., 10^7) editing chains and traces arising from the in-camera processing and the subsequent out-camera operations. To ensure that Editprint exhibits high discriminative capabilities across various editing chains, we propose using textual descriptions of these chains as labels and supervising their Editprints through language-guided contrastive learning. Extensive experiments show Editprint outperforms existing self-supervised forensics, particularly in non-camera applications such as SNP and SID. We hope that Editprint will inspire the forensic community and serve as a novel benchmark for self-supervised forensics.
Paperid: 3520,   Poster  
Authors: Chaolang Li, Pengwen Dai, Jingyu Li, Siyuan Yao, Yuchen Jiang, Zhuoran Zheng
Title: DyFCLT: Dynamic Frequency-Decoupled Cross-Modal Learning Transformer for Multimodal Tiny Object Detection
Abstract: Multimodal tiny object detection plays a critical role in real-world applications. However, detecting tiny objects remains challenging due to environmental complexities. While recent methods leverage spatial multi-scale representations or frequency-domain enhancements, most focus solely on visible images and overlook complementary multimodal frequency cues. This paper explores how to effectively harness cross-modal frequency information for infrared–visible tiny object detection. Through frequency characteristic analysis, we observe that tiny objects exhibit rich mid- and high-frequency energy across both modalities, motivating the design of a Dynamic Frequency-decoupled Cross-modal Learning Transformer (DyFCLT). Our approach introduces a Dynamic Frequency-Band Decoupled Cross-Modal Attention (DFCA) mechanism to extract frequency components and let them interact across modalities. To suppress noise while enhancing foreground signals, a Selective Smoothing Enhancement (SSE) strategy is proposed, which smooths background interference and guides multi-scale feature fusion. DFCA and SSE collaborate to achieve synergistic enrichment and refinement of cross-modal features. Extensive experiments on two tiny-object benchmarks and one general-scale benchmark demonstrate that DyFCLT sets new state-of-the-art results, outperforming prior leading methods by significant margins and exhibiting strong generalization across scales and scenarios.
Paperid: 3521,   Poster  
Authors: Sunghyun Park, Jeongho Kim, Hyoungwoo Park, Debasmit Das, Sungrack Yun, Munawar Hayat, Jaegul Choo, Fatih Porikli, Seokeon Choi
Title: Memory-Efficient Fine-Tuning Diffusion Transformer via Dynamic Patch Sampling and Block Skipping
Abstract: Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I) generation quality, enabling high-quality personalized content creation. However, fine-tuning these models requires substantial computation and memory, limiting practical deployment under resource constraints. To tackle these challenges, we propose a memory-efficient fine-tuning framework called DiT-BlockSkip, integrating timestep-aware dynamic patch sampling and block skipping by precomputing residual features. Our dynamic patch sampling strategy adjusts patch sizes based on the diffusion timestep, then resizes the cropped patches to a fixed lower resolution. This approach reduces forward and backward memory usage while allowing the model to capture global structures at higher timesteps and fine-grained details at lower timesteps. The block skipping mechanism selectively fine-tunes essential transformer blocks and precomputes residual features for the skipped blocks, significantly reducing training memory. To identify vital blocks for personalization, we introduce a block selection strategy based on cross-attention masking. Evaluations demonstrate that our approach achieves competitive personalization performance qualitatively and quantitatively, while reducing memory usage substantially, moving toward on-device feasibility (e.g., smartphones, IoT devices) for large-scale diffusion transformers.
Paperid: 3522,   Poster  
Authors: Xuewei Cao, Jiayue Yang, Zhiwen Zeng, Yanyong Zhang, Yan Xia
Title: C-LaV: Conditional Latent Velocity Field Denoising for Weather-Robust LiDAR Place Recognition
Abstract: LiDAR-based place recognition is highly sensitive to rain, snow, and fog, where scattering and attenuation distort geometric structure and intensity. We tackle this problem with Conditional Latent Velocity Field (C-LaV) denoising, which restores weather-robust representations before retrieval. Single-sweep point clouds are projected into three-channel bird’s-eye-view (BEV) images and encoded with a frozen DINOv2-based BEV transformer to obtain a semantically anchored latent space shared across weather conditions. On this manifold, a conditional Flow Matching model learns a velocity field whose probability-flow ordinary differential equation (ODE) deterministically transports noisy latents toward their clear-weather counterparts. From the denoised manifold, a Sinkhorn Aggregation of Local Descriptors (SALAD) head produces compact global descriptors optimized with a truncated Smooth-AP loss. We also establish a unified adverse-weather benchmark with 3 m frame spacing and shared evaluation thresholds across KITTI, NCLT, and Boreas datasets. Under this protocol, C-LaV improves Recall@1 by 17.5% on NCLT snow and 21.5% on Boreas, achieving state-of-the-art weather robustness. Our dataset and code will be publicly available.
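Editor's sketch (not from the paper): the idea of deterministically transporting a noisy latent along a velocity field by solving an ODE can be illustrated with fixed-step Euler integration. Everything here is an assumption for illustration: `euler_transport` and `toy_velocity` are hypothetical names, the conditioning input is omitted, and the closed-form toy field stands in for the learned network, which the abstract does not specify.

```python
def euler_transport(z0, velocity, steps=10):
    """Integrate dz/dt = velocity(z, t) from t=0 to t=1 with Euler steps."""
    z = list(z0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(z, t)
        # One explicit Euler update per latent dimension.
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


# Toy stand-in for a learned velocity field: it points from the current
# latent toward a known "clear-weather" latent along a straight-line path.
clean = [1.0, -2.0, 0.5]


def toy_velocity(z, t):
    return [(c - zi) / (1.0 - t) for c, zi in zip(clean, z)]


noisy = [0.2, 0.3, -1.0]
denoised = euler_transport(noisy, toy_velocity, steps=10)
```

With this straight-line toy field the Euler steps land on `clean` up to floating-point error; a learned field would generally only approximate its target, and the step count would trade accuracy against latency.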
Paperid: 3523,   Poster  
Authors: Xinshun Wang, Peiming Li, Ziyi Wang, Zhongbin Fang, Zhichao Deng, Songtao Wu, Xiangtai Li, Mengyuan Liu
Title: Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation
Abstract: Human motion analysis tasks, such as temporal 3D pose estimation, motion prediction, and motion in-betweening, play an essential role in computer vision. However, current paradigms suffer from severe fragmentation. First, the field is split between "perception" models that understand motion from video but only output text, and "generation" models that cannot perceive from raw visual input. Second, generative MLLMs are often limited to single-frame, static poses using dense, parametric SMPL models, failing to handle temporal motion. Third, existing motion vocabularies are built from skeleton data alone, severing the link to the visual domain. To address these challenges, we introduce Superman, a unified framework that bridges visual perception with temporal, skeleton-based motion generation. Our solution is twofold. First, to overcome the modality disconnect, we propose a Vision-Guided Motion Tokenizer. Leveraging the natural geometric alignment between 3D skeletons and visual data, this module pioneers robust joint learning from both modalities, creating a unified, cross-modal motion vocabulary. Second, grounded in this motion language, a single, unified MLLM architecture is trained to handle all tasks. This module flexibly processes diverse, temporal inputs, unifying 3D skeleton pose estimation from video (perception) with skeleton-based motion prediction and in-betweening (generation). Extensive experiments on standard benchmarks, including Human3.6M, demonstrate that our unified method achieves state-of-the-art or competitive performance across all motion tasks. This showcases a more efficient and scalable path for generative motion analysis using skeletons. Code is in the supplementary material.
Paperid: 3524,   Poster  
Authors: Sun Siyi, Jinliang Lin, Juanjuan Weng, Zhihui Liu, Shaozi Li, Zhiming Luo
Title: COPE: Consistent Occlusion and Prompt Enhancement Network for Occluded Person Re-identification
Abstract: Occlusion presents two critical challenges for person re-identification (Re-ID): feature interference and information loss. While existing efforts have explored occlusion-aware data augmentation and feature reconstruction to mitigate these issues, the former often fails to address erroneous matches caused by similar occlusion patterns and background distractions, whereas the latter typically introduces significant computational overhead. To overcome these limitations, we propose a Consistent Occlusion and Prompt Enhancement (COPE) network. COPE incorporates a Cross-Identity Consistent Occlusion (CICO) module that applies identical occlusions across different identities and encourages feature similarity in the same occluded regions across different identities to reduce occlusion feature interference. A Prompt Background Filling (PBF) module leverages vision-language alignment to generate foreground heatmaps and performs random background filling, enhancing feature robustness under varying backgrounds. Additionally, a lightweight Prompt Similarity Scoring (PSS) module refines retrieval similarity by utilizing prompt-guided reliability scores. Extensive experiments on both occluded and holistic Re-ID benchmarks demonstrate that COPE consistently outperforms existing methods. Notably, it achieves 82.4% Rank-1 accuracy and 76.4% mAP on the challenging Occluded-Duke dataset.
Paperid: 3525,   Poster  
Authors: xucong wang, Pengkun Wang, Zhe Zhao, Liheng Yu, Rui Mao, Yang Wang
Title: LOREAL: Mitigating Low-Resolution Challenges in Vision-Language Models with Attribute-driven Prompt Self-Distillation
Abstract: Prompt Learning (PL) has emerged as a parameter-efficient technique for adapting Vision-Language Models (VLMs) to downstream tasks. However, almost all existing PL methods are primarily designed and evaluated on well-curated datasets, overlooking a critical post-deployment phenomenon, i.e., the intrinsic connection between input resolution and storage-memory consumption. Specifically, to satisfy the stringent storage-memory constraints on edge devices, models are often limited to low-resolution inputs (e.g., ≤ 224×224 for CLIP-ViT/B-16) and generate fewer tokens (with the position embedding resized), which poses a unique challenge to performance robustness. To tackle this issue, we propose LOREAL, an efficient prompt self-distillation framework that learns resolution-invariant representations by excavating attribute semantics. At the heart of LOREAL is a dual-student architecture, i.e., two student models fed with inputs at different resolutions synergistically learn from each other. Building upon this, we contextualize the students' prompts with resolution-invariant attributes queried from the LLM, then leverage cross-modality meta-nets to generate attribute semantics. These meta-nets are bridged between the different encoders of the two students, wherein we introduce Low-Level Distillation (LLD) and High-Level Distillation (HLD) to facilitate the learning of more cross-resolution representations. Extensive experiments show that LOREAL significantly improves VLMs' performance and robustness under varied resolution settings, underscoring its practical utility.
Paperid: 3526,   Poster  
Authors: Qiongjie Cui, Pan Zhou, Jingjing Chen, Na Zhao
Title: Anatomical Domain Shifts: Test-time Heterogeneous Adaptation for 3D Human Pose Prediction
Abstract: The research frontier in human pose prediction (HPP) is advancing toward continual test-time adaptation (TTA), where models must self-adapt to dynamic test distributions. To date, the homeostatic continual TTA remains the sole viable solution, which isolates the model parameters and updates domain-sensitive ones. Despite mitigating full-body domain gaps, it ignores human anatomical heterogeneity (domain shifts often localize to specific regions). This anatomy-agnostic approach forces uniform parameter adaptation across kinematically distinct segments, causing over-adaptation of stable regions and under-adaptation of shift-prone articulations. To address this, we introduce TT-HA, a novel Test-Time Heterogeneous Adaptation that implicitly estimates domain changes for anatomical segments and adapts the corresponding parameters. Building on human anatomy, TT-HA partitions parameters into five anatomical subsets using Fisher information matrix-based parameter uncertainty analysis. During testing, TT-HA uses the instance normalization statistics and Earth Mover's Distance (EMD) to quantify segment-wise domain changes, dynamically determining which segment-specific parameters to adapt and to what extent. When substantial domain shifts are detected, TT-HA restores only affected segments to source-trained values, ensuring robust adaptation without full parameter resetting; minor shifts trigger the fine-tuning of corresponding parameters while preserving remaining ones. Experiments show TT-HA's superior full-body accuracy with greater limb error decrease than prior methods, proving its anatomically-targeted efficacy.
Paperid: 3527,   Poster  
Authors: Yifan Yang, Juntuo Wang, Yuming Qiao, Xudong Zhang, Chunyang Yu, Yan Li, Xiao Lin, Liang Luo, Dan Meng
Title: JoPPO: Hierarchical Photography Assessment via Contrastive Joint Conditional Probabilistic Reinforcement Learning
Abstract: "The value of a photograph lies not in what it contains, but in what it is about." (John Szarkowski) With the advancement of Vision-Language Models (VLMs), employing VLM-as-a-Judge for visual evaluation has become a widely adopted metric in vision research. However, existing VLM-as-a-Judge approaches suffer from biased scoring outcomes with low discrimination and lack the capacity for unified multi-attribute compositional assessment. To address these limitations, we propose a novel training paradigm, termed JoPPO (Joint Probabilistic Policy Optimization), that enables VLMs to learn ranking under compositional assessment constraints. We evaluate JoPPO on image aesthetics as a testbed, a task requiring nuanced understanding of multiple attributes including composition, lighting, color and geometry. Training follows two stages: (1) Supervised Fine-Tuning (SFT) on a synthetic composition dataset produced by an automated data generation pipeline to instill compositional priors; and (2) Contrastive Joint Conditional Probabilistic Reinforcement Learning: building upon the GRPO algorithm, we introduce JoPPO, which computes the reward based on the expected win rate of total scores derived from the conditional distribution of fine-grained attribute scores within batches, effectively enhancing the model’s discriminative ability in composite evaluation. Across standard aesthetic benchmarks, our method achieves consistent improvements in ranking consistency, demonstrating strong zero-shot generalization.
Paperid: 3528,   Poster  
Authors: Kun yuan, Min Woo Sun, Zhen Chen, Alejandro Lozano, Xiangteng He, Shi Li, Nassir Navab, Xiaoxiao Sun, Nicolas Padoy, Serena Yeung
Title: From Panel to Pixel: Zoom-In Vision–Language Pretraining from Biomedical Scientific Literature
Abstract: There is growing interest in biomedical vision-language models trained on scientific literature. However, most pipelines compress rich multi-panel figures and long captions into coarse figure-level pairs, discarding the fine-grained correspondences clinicians rely on when zooming into local structures. We introduce Panel2Patch, a data pipeline that mines hierarchical structure from multi-panel, marker-heavy biomedical figures and their surrounding text, and converts them into multi-granular supervision. Given figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs aligned image-text pairs at the figure, panel, and region levels, preserving local semantics instead of treating each figure as a single sample. Built on this corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases in a shared embedding space. Applying Panel2Patch to a small subset of literature figures yields substantially better performance than prior pipelines, demonstrating that exploiting hierarchical figure structure can provide more effective supervision with less pretraining data.
Paperid: 3529,   Poster  
Authors: Tianze Xia, Yongkang Li, Lijun Zhou, Jingfeng Yao, Kaixin Xiong, Haiyang Sun, Bing Wang, Kun Ma, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang
Title: DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
Abstract: World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasts with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing the best-performing prior work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.
Paperid: 3530,   Poster  
Authors: Paul Roetzer, Anders Johan Thunberg, Zorah Lähner, Florian Bernard
Title: Fast Markov Random Field Optimisation for Topologically Noisy 3D Shape Matching
Abstract: In many real-world applications of non-rigid shape matching, the shapes are subject to topological noise (i.e., varying genus). In this paper, we propose a novel formulation based on Markov Random Fields (MRF) that can handle these cases with topological noise. The solutions to our optimisation problem can be approximated efficiently using the alpha expansion algorithm, which gives rise to theoretical approximation guarantees. In particular, we cast non-rigid 3D shape matching as a multi-labelling problem in which each triangle of the source shape is assigned a label that represents the matching to a specific surface element on the target shape. We propose a novel pairwise term that imposes that our matching prefers solutions in which neighbouring triangles on the source shape remain close on the target shape. Further, by exploiting the specific structure of our label space, we show that the alpha expansion algorithm can be customised to gain significant speed-ups, while maintaining its approximation guarantees. We test our formalism on various shape matching datasets including settings in which shapes have topological artefacts.
Paperid: 3531,   Poster  
Authors: Bao Truong, Quang Nguyen, Baoru Huang, Jinpei Han, Van Nguyen, Ngan Le, Minh-Tan Pham, Doan Hien, Anh Nguyen
Title: SIGMA: A Physics-Informed Benchmark for Gas Chimney Understanding in Seismic Images
Abstract: Seismic images reconstruct subsurface reflectivity from field recordings, guiding exploration and reservoir monitoring. Gas chimneys are vertical anomalies caused by subsurface fluid migration. Understanding these phenomena is crucial for assessing hydrocarbon potential and avoiding drilling hazards. However, accurate detection is challenging due to strong seismic attenuation and scattering. Traditional physics-based methods are computationally expensive and sensitive to model errors, while deep learning offers efficient alternatives, yet lacks labeled datasets. In this work, we introduce SIGMA, a new physics-informed dataset for gas chimney understanding in seismic images, featuring (i) pixel-level gas-chimney masks for detection and (ii) paired degraded and ground-truth images for enhancement. We employed physics-based methods that cover a wide range of geological settings and data acquisition conditions. Comprehensive experiments demonstrate that SIGMA serves as a challenging benchmark for gas chimney interpretation and benefits general seismic understanding.
Paperid: 3532,   Poster  
Authors: Xuan Wang, Guiguang Ding, Jungong Han
Title: PACT: Phase-Like Transition Constraints in Adapter-Based Continual Learning of Vision-Language Models
Abstract: Continual Learning (CL) enables Vision-Language Models (VLMs) to acquire new capabilities while retaining prior knowledge, for example, by employing task-specific adapters. Existing CL approaches typically optimize these adapters to convergence, often with (near-)orthogonality constraints to reduce interference; however, isolating adapters in orthogonal subspaces can suppress cross-task transfer and sharing. To address this problem, we provide a new perspective based on PAC-Bayesian analysis: once the per-task optimization has converged, adapters should be further shaped to satisfy Phase-like trAnsition ConsTraints (PACT), a two-part formulation that (i) specifies a phase-like transition relation among adapters and (ii) imposes explicit constraints that enforce this relation. Under PACT, adapter dynamics resemble the phase transition of water: the system gravitates toward either a “frozen” (history-preserving, tightly constrained) or a “melted” (task-adaptive, free) regime, while moving between them smoothly rather than via hard thresholds. We operationalize PACT by coupling stability and plasticity regularizers within a two-branch Vision Transformer (ViT), seeding adapters with a Stable Adapter Initialization (SAI), and introducing a Prior Anchoring (PA) mechanism, thereby inducing phase-like adapter dynamics. Across diverse CL settings, PACT surpasses state-of-the-art methods while reducing the number of trainable parameters by 36.96% relative to standard adapter-based baselines. Our code will be released publicly.
Paperid: 3533,   Poster  
Authors: Hankyeol Lee, WOOYEOL BAEK, Seongdo Kim, Jongyoo Kim
Title: REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement
Abstract: Recent generative models have shown strong performance in generating diverse 3D assets from 2D images, a fundamental research topic in computer vision and graphics. However, these models still struggle to generate voluminous 3D assets when the input is a flat image that provides limited 3D cues. We introduce REVIVE 3D, a two-stage, plug-and-play pipeline for generating voluminous 3D assets from flat images. In Stage 1, we construct an Inflated Prior by inflating the foreground silhouette to recover global volume and superimposing part-aware details to capture local structure. In Stage 2, 3D Latent Refinement injects Gaussian noise into the Inflated Prior's latent and then denoises it, guided by the prior's geometric cues and the backbone's pretrained 3D knowledge. By initializing the process with the encoded latent of a source mesh instead of the prior, the framework also supports 3D editing conditioned on an edited image. To quantify volume and surface flatness, we propose Compactness and Normal Anisotropy. We validate Compactness and Normal Anisotropy through a user study, showing that these metrics align with human perception of volume and quality. We show that REVIVE 3D achieves state-of-the-art performance on a challenging flat image dataset, based on extensive qualitative and quantitative evaluations.
Paperid: 3534,   Poster  
Authors: Shehreen Azad, Vibhav Vineet, Yogesh Rawat
Title: StreamReady: Learning *What* to Answer and *When* in Long Streaming Videos
Abstract: Streaming video understanding often involves time-sensitive scenarios where models need to answer exactly when the supporting visual evidence appears: answering before the evidence appears reflects speculation, while answering after it has passed reduces real-time utility. To capture this behavior, we introduce a readiness-aware formulation of streaming video understanding with the Answer Readiness Score (ARS), a timing-aware objective with asymmetric early and late penalties. When combined with correctness, ARS defines an effective accuracy that measures not just whether a model is right, but whether it answers at the appropriate moment. Building on this formulation, we introduce StreamReady, a framework to unify temporal reasoning with on-time answering through a lightweight readiness mechanism that decides if sufficient evidence has been observed before responding. To evaluate this capability, we further introduce ProReady-QA, a benchmark with annotated answer evidence windows and proactive multi-turn questions across local and global contexts. StreamReady achieves superior performance on ProReady-QA, and consistently outperforms prior methods across eight additional streaming and offline long-video benchmarks, demonstrating robust and broadly generalizable video understanding capability.
Paperid: 3535,   Poster  
Authors: Xihang Qiu, Yuhao Fang, Qing Zhou, Bin Zhai, Jialong Hong, Wanpeng Zhang, Yao Lu, Ye Zhang, Chun Li
Title: Beyond Missing Modalities: Hypergraph Conditioned Diffusion for Uncertainty-Aware Multimodal Emotion Recognition
Abstract: Multimodal Emotion Recognition in Conversations (MERC) aims to understand emotions expressed in each utterance by effectively integrating audio, text, and visual modalities. However, in real-world scenarios, unavoidable missing modalities often degrade multimodal interpretation performance. To address this, we propose Hypergraph Diffusion and Evidence Fusion based Emotion Recognition (HyperEF), a novel framework designed to mitigate challenges arising from incomplete modalities in MERC. Specifically, to mitigate performance degradation caused by modality absence, we propose a Masked Hypergraph Attention (MHGAT) conditioned diffusion model to recover the features of missing modalities in the latent space. To ensure semantic consistency between recovered and available modalities within the same utterance, we introduce MHGAT, which captures high-order semantic information from available modalities to guide the diffusion model’s denoising process. Furthermore, to disentangle and model the complex uncertainties inherent in MERC, we propose Dual Channel Evidence Fusion (DCEF), which estimates uncertainty at both the feature-source level and the discriminative level, thereby achieving adaptive evidence fusion. Extensive comparative experiments and interpretability analyses demonstrate the superior performance of our model in emotion recognition, as well as the contribution of each module within the model.
Paperid: 3536,   Poster  
Authors: Xingjian Jiang, Lishun Wang, Ping Wang, Xin Yuan
Title: DetectSCI: Toward Object-Guided ROI Reconstruction for High-Resolution Video Snapshot Compressive Imaging
Abstract: Video snapshot compressive imaging (SCI) offers a promising alternative to high-speed cameras by encoding multiple frames into a single 2D measurement. However, SCI requires algorithms to reconstruct the high-speed video, and as resolution increases, reconstruction becomes computationally expensive and memory-intensive. Much of this cost is wasted on recovering large background regions that contain little useful information, highlighting the need for selective, object-driven reconstruction. Existing object detectors struggle to perform accurately on SCI measurements due to the spatial–temporal aliasing introduced by coded exposure. To address this challenge, we propose DetectSCI, the first framework enabling object-guided region-of-interest (ROI) reconstruction for high-resolution SCI. Its internal detector comprises two key components: an encoder built from weight-sharing Mamba-Implicit Modules (MIM) for progressive feature refinement, and a Frequency Mamba (FM) module dedicated to frequency-aware query selection. MIM enhances features via multi-scale dilated convolutions and implicit representations, while FM restores discriminative details by decomposing and reweighting frequency bands. Experiments on the SportsMOT dataset show that DetectSCI achieves 80.9 Average Precision (AP), surpassing the best CNN-based detector by at least 2.8 AP and the best Transformer-based detector by at least 4.1 AP, while maintaining comparable efficiency. Code will be released.
Paperid: 3537,   Poster  
Authors: Takeshi Noda, Yu-Shen Liu, Zhizhong Han
Title: 3D Gaussian Splatting with Self-Constrained Prior for High Fidelity Surface Reconstruction
Abstract: Rendering 3D surfaces has been revolutionized within the modeling of radiance fields through either 3DGS or NeRF. Although 3DGS has shown advantages over NeRF in terms of rendering quality or speed, there is still room for improvement in recovering high fidelity surfaces through 3DGS. To resolve this issue, we propose a self-constrained prior to constrain the movement of 3D Gaussians, aiming for more accurate depth rendering. Our self-constrained prior is a TSDF grid fused by the rendered depth during the learning of 3D Gaussians. The prior measures a band on both sides of the estimated surface for imposing more specific constraints on the right 3D Gaussians, such as removing 3D Gaussians outside the band, encouraging larger opacity for Gaussians near the center of the band or smaller opacity for Gaussians near the boundary of the band. We regularly update the prior by fusing more recent depth images which are usually more accurate, and progressively narrow the band to tighten the constraint on Gaussian movements. We justify our idea and report our superiority over the state-of-the-art methods in evaluations on widely used benchmarks.
Paperid: 3538,   Poster  
Authors: Chaohu Liu, Shida Wang, Yubo Wang, Linli Xu
Title: TableMix: Enhancing Multimodal Table Reasoning in MLLMs from a Data-Centric Perspective
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising progress in table reasoning from visual table inputs. Despite their ability to capture rich visual cues such as color and layout, MLLMs still underperform compared to text-only models. We argue that a major limitation lies in the pre-training process, which inadvertently weakens the model’s intrinsic reasoning ability and consequently hinders the effectiveness of reinforcement fine-tuning on table reasoning tasks. In this paper, we introduce TableMix, a novel framework that tackles this challenge from a data-centric perspective. At the core of TableMix is a principled data mixing strategy. Specifically, TableMix constructs a hybrid dataset that combines: (1) multimodal table reasoning data to improve task-specific reasoning, (2) text-only mathematical reasoning data to revive the model’s logical competence, and (3) simple multimodal perception data to preserve visual grounding. Recognizing the non-uniform difficulty of mixed data, we further propose a Difficulty-Aware Reward Shaping (DRS) mechanism, which enables the Group Relative Policy Optimization (GRPO) algorithm to adaptively reward concise reasoning for easy problems while encouraging more elaborate reasoning for complex ones, thereby reducing redundant computation and errors. Extensive experiments show that TableMix markedly enhances the reasoning ability of MLLMs, outperforming strong multimodal baselines and even rivaling state-of-the-art text-only models.
Paperid: 3539,   Poster  
Authors: Guangkai Xu, Hua Geng, Huanyi Zheng, Songyi Yin, Yanlong Sun, Hao Chen, Chunhua Shen
Title: Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
Abstract: Recent advancements in feed-forward architectures for visual geometry estimation have achieved significant progress. Interestingly, per-frame visual geometry estimation approaches typically exhibit weaker multi-frame consistency but demonstrate superior per-frame accuracy compared to multi-frame algorithms. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals three key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that enables high-resolution geometry estimation. These contributions are integrated into CFG, a model that simultaneously generates precise and coherent geometric representations from diverse input perspectives at high resolutions. Comprehensive testing across multiple benchmarks for point cloud reconstruction, video depth estimation, and camera pose/intrinsic parameter estimation confirms CFG's superior performance, establishing it as a state-of-the-art solution for visual geometry tasks.
Paperid: 3540,   Poster  
Authors: Qingji Dong, Hang Dong, Mingqin Chen, Rui Zhang, Yitong Wang
Title: DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer
Abstract: Large-scale pre-trained diffusion models have been extensively adopted for real-world image Super-Resolution because of their powerful generative priors through textual guidance. However, when super-resolving high-resolution images with a patch-wise inference strategy, most existing diffusion-based SR methods tend to suffer from over-generation, due to the misalignment between the global prompt from the LR image and the incomplete semantic information of local patches during each inference step. On the other hand, most existing methods also fail to generate detailed texture in local patches due to the overemphasis on global generation capabilities in network designs and training strategies. To address these issues, we present DreamSR, a novel SR model that suppresses local over-generation and improves fine-detail synthesis, thereby achieving visually faithful results with ultra-high-quality details. Specifically, we propose a dual-branch MM-ControlNet, where the ControlNet generates local textual features with patch-level prompts while the pre-trained DiT provides global textual features with global prompts, thereby mitigating over-generation and ensuring semantic consistency across patches. We also design a comprehensive training strategy with stage-specific data processing pipelines and a Receptive-Field Enhancement strategy, enhancing the model’s capability to capture patch information and effectively restore local textures. Extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods, providing high-quality SR results.
Paperid: 3541,   Poster  
Authors: Feng Ye, Kai Zhang, Li zhang, Chuanmin Jia
Title: High Resolution Neural Video Coding with Bi-directional Confidence-Guided Reference Information Modeling
Abstract: Exploiting bidirectional context prediction has long been recognized as a key direction for improving compression efficiency in neural video coding. However, existing neural B-frame codecs still exhibit limited performance gains, particularly in high-resolution videos with large motion, where optical flow estimation becomes unreliable and balanced prediction fusion introduces distortions. To address these challenges, we present the first High-Resolution bi-directional neural video coding method, termed HR-NVC, which non-uniformly integrates confidence-guided predictive cues from both temporal directions to achieve more reliable and efficient compression. Specifically, we propose Spatio-Temporal Anchored Motion Estimation, which introduces virtual anchor frames and low-resolution priors to significantly improve estimation robustness under large displacements. We further design a Hierarchical Motion Representation that converges multi-scale motion with temporal references, enabling compact and adaptive modeling of motion reliability across resolutions. Finally, a Bi-Contextual Asymmetric Harmonization module performs confidence-guided fusion of bidirectional references, effectively suppressing unreliable contexts and restoring structural consistency near occlusion and scene transition regions. Notably, our model is the first end-to-end-optimized video codec evaluated on 4K-resolution videos, establishing a new benchmark for higher-resolution NVC and achieving state-of-the-art performance among neural B-frame codecs.
Paperid: 3542,   Poster  
Authors: Chengzhi Li, Heyan Huang, Ping Jian, Zhen Yang, Yaning Tian, Zhongbin Guo
Title: Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability
Abstract: Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon has recently drawn the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervene on the potential factors behind the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to enhance the model's temporal resolution capability, thereby improving the logical consistency of its temporal understanding. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method even achieves performance improvements in general video temporal grounding tasks, suggesting that temporal logic consistency is an important factor in temporal understanding.
Paperid: 3543,   Poster  
Authors: wang duanchu, Junjie Yang, Haoran Gong, Jing Liu, Di Wang
Title: CompetitorFormer: Mitigating Query Conflicts for 3D Instance Segmentation via Competitive Strategy
Abstract: Transformer-based approaches have recently become the dominant paradigm for 3D instance segmentation. These methods typically employ a multi-layer decoder that iteratively refines a set of learnable queries into instance mask predictions. However, we observe that multiple queries often target the same instance simultaneously, leading to fragmented masks for a single object. We define this phenomenon as inter-query competition, which slows convergence and limits segmentation accuracy. To address this problem, we present CompetitorFormer, a novel framework designed for Transformer-based methods. Our method mitigates inter-query competition by explicitly modeling the competitive relationships among queries. Specifically, we introduce a Query Competition Layer before each decoder stage to construct a dynamic competitive landscape, allowing each query to perceive its relative importance. In addition, the proposed Relative Relationship Encoding and Rank Cross-Attention modules enhance both self-attention and cross-attention by prioritizing dominant queries. Extensive experiments show that our approach converges faster and achieves superior performance on the ScanNetV2, ScanNet++V2, ScanNet200, and S3DIS datasets.
Paperid: 3544,   Poster  
Authors: Ruichi Zhang, Chikai Shang, jiacheng yang, Mengke Li, Yang Zhou, Junlong Gao, Yang Lu
Title: CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning
Abstract: Long-tailed distributions are common in real-world recognition tasks, where a few head classes have many samples while most tail classes have very few. Recently, fine-tuning foundation models for long-tailed learning has gained attention due to their excellent performance. However, most existing methods focus solely on mitigating long-tailed distribution bias while overlooking concept confusion caused by the long-tailed distribution. In this paper, we study this problem and attribute it to the mutual exclusivity of single-label supervision under long-tailed distributions, which suppresses feature sharing among related classes and amplifies the dominance of head classes, leading to disrupted inter-class discriminability. To address this, we propose CUE, Concept-aware mUlti-label Expansion, which introduces multi-label concept signals to preserve disrupted inter-class relationships. Specifically, CUE constructs concept sets by (i) extracting instance-level visual cues from zero-shot CLIP and (ii) generating class-level semantic cues with an LLM; the two cues are incorporated via separately weighted Binary Logit-Adjustment (BLA) auxiliary losses and jointly optimized with the baseline Logit-Adjustment (LA) loss. In experiments on several long-tailed benchmarks, CUE achieves balanced and strong performance, surpassing recent state-of-the-art methods. The code is available in the supplementary materials.
Paperid: 3545,   Poster  
Authors: Yiqian Chang, Qinghong Ye, Haoran Xu, Jianing Li, Dongyang Ma, Xuan Wang, Wei Zhang, Yonghong Tian, Peixi Peng
Title: MER-Tracker: Towards High-Speed 3D Point Tracking via Multi-View Event-RGB Hybrid Cameras
Abstract: This paper proposes the first task for high-speed 3D point tracking using multi-view Event-RGB hybrid cameras. We design a cuboid observation device comprising 4 RGB cameras (30 fps) and 2 Event cameras to synchronously capture high-speed motions, and propose MER-Tracker, a high-frame-rate 3D point-tracking network that fuses the complementary strengths of the two modalities. We first extract 2D motion-change features from the RGB and Event modalities separately, then apply linear interpolation and anchor sampling to fuse the discrete RGB 3D features and continuous Event 3D features after 3D lifting, and finally employ a LoRA-tuned Transformer based on temporal correlation to predict the high-frame-rate 3D point trajectories over fast motions, accomplishing high-speed 3D point tracking. To verify the effectiveness of our method, we construct both real-world and simulated high-speed motion datasets. Experiments on these datasets show that our method achieves accurate high-speed 3D point tracking at a high frame rate (150 fps), outperforming state-of-the-art methods.
Paperid: 3546,   Poster  
Authors: Chengcan Qian, Dong Nie, Geng Chen, Daoqiang Zhang, Xuyun Wen
Title: Simple-ViLMedSAM: Simple Text Prompts Meet Vision-Language Models for Medical Image Segmentation
Abstract: Medical image segmentation is challenging due to limited annotated data, high labeling costs, and substantial image heterogeneity. Although large-scale vision foundation models (e.g., SAM) have shown great potential in this field, existing SAM-based methods typically rely on expert-defined geometric prompts or complex clinical text prompts, which limits their generalizability across diverse medical image segmentation tasks. To overcome these challenges, we propose Simple-ViLMedSAM, a CLIP-SAM integration framework that enables high-accuracy segmentation in zero-shot and few-shot settings using only simple text queries, that is, using only basic anatomical or disease-related text labels. At its core is an Implicit Pos-Prompter (IPP), which generates attribution maps containing implicit positional cues to replace traditional geometric prompts. IPP incorporates a multi-modal information bottleneck and an affinity-based refinement strategy to ensure high-quality guidance from CLIP-SAM interactions. To further enhance segmentation, we introduce a Bidirectional Interaction Decoder (BID) that employs bidirectional cross-attention to align IPP’s positional maps with SAM's pixel-level features. By jointly modeling global semantics and local details, BID significantly improves segmentation accuracy. Extensive experiments on four public datasets demonstrate that Simple-ViLMedSAM consistently outperforms existing methods in both zero-shot and few-shot medical image segmentation tasks, using only simple text queries. The code will be publicly available upon acceptance.
Paperid: 3547,   Poster  
Authors: Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, Gao Huang, Lewei Lu
Title: SenseSearch: Empowering Vision-Language Models with High-Resolution Agentic Search-Reasoning via Reinforcement Learning
Abstract: Vision-Language Models (VLMs) are limited by static knowledge and insufficient fine-grained visual analysis, hindering their performance on knowledge-intensive and visually complex tasks. While recent research has explored VLMs that employ external tools like search or cropping to enhance model performance, these models typically employ tools in isolation and lack the ability to coordinate multiple tools effectively. To address this gap, we propose SenseSearch, the first agentic VLM for search-reasoning that supports adaptive multi-tool coordination via reinforcement learning (RL). Specifically, SenseSearch dynamically integrates image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. We first construct a high-quality cold-start dataset to instill basic tool-usage behaviors. In the subsequent RL stage, we introduce the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to enhance tool invocation and reasoning ability. To comprehensively evaluate agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseSearch achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks, outperforming baselines by 19.18% on HR-MMSearch. SenseSearch provides a promising path toward agentic VLMs with effective and robust tool invocation capabilities. All code and data will be publicly released.
Paperid: 3548,   Poster  
Authors: Yang Xu, Yiwei Bao, Feng Lu
Title: Through the Frequency Lens: Cross-Domain Generalisable Gaze Estimation with Adaptive Modulation
Abstract: Deep learning-based gaze estimation methods often exhibit significant performance degradation on unseen target domains. Through systematic frequency-domain analysis, we reveal that face images contain frequency components with distinct contributions: some facilitate cross-domain generalization while others introduce domain-specific interference that impedes it, with both components varying across datasets and constituting a key source of domain gap. Based on these observations, we propose the Frequency-Guided Adaptive Learning framework (FGAL), a novel framework enhancing domain generalization without accessing target domain data. The FGAL consists of two complementary modules: the Adaptive Interference Suppression Module (AISM) and the Spectrum Diversification Module (SDM). AISM adaptively suppresses sample-specific interfering frequency components through learnable modulation maps, while SDM diversifies frequency distribution patterns to enhance robustness against cross-domain variations. Experiments demonstrate that FGAL achieves substantial improvements, outperforming baselines by up to 28.2% and state-of-the-art methods by up to 19.5% across multiple cross-domain settings, demonstrating our framework's potential for broader domain generalization tasks.
Paperid: 3549,   Poster  
Authors: Ziyi Wang, ZhangYang ZhangYang, Guijian Tang, Chao Zhang, Shibo Zhang, Xueqiong Li, Shaowu Yang
Title: RAG-TP: A General Framework for Vehicle Trajectory Prediction via Retrieval-Augmented Generation
Abstract: Vehicle trajectory prediction is a critical technology for safe and efficient autonomous driving. However, its generalization and scalability have long been hindered by a heavy reliance on real-time, online priors. To break this bottleneck, we introduce RAG-TP, a general framework that reframes the problem from relying on uncertain online perception to retrieving from a large-scale, structured, offline knowledge base. The core of RAG-TP is to enhance predictions at inference time by dynamically querying a pre-built, heterogeneous knowledge base rich with scene topologies and motion patterns, using the retrieved historical experiences as priors. We further design a dynamic fusion module based on a learnable Mixture-of-Experts (MoE), which intelligently weights and integrates the multi-source retrieved knowledge via cross-attention to generate a high-density context for the final multi-modal trajectory decoding. By decoupling online inference from offline knowledge, this retrieval-augmented approach grounds predictions in a vast structured database, thereby mitigating model hallucination and compensating for unreliable priors to significantly enhance robustness and domain adaptation. Extensive experiments demonstrate that RAG-TP achieves excellent performance in both map-based and map-free settings, surpassing existing map-free methods while achieving performance comparable to state-of-the-art (SOTA) map-based models. It demonstrates significant advantages, particularly in cross-domain and zero-shot generalization tasks. Our work provides a promising and effective technical pathway toward building more scalable and robust prediction systems for autonomous driving.
Paperid: 3550,   Poster  
Authors: Kaichen He, Zihao Wang, Muyao Li, Anji Liu, Yitao Liang
Title: CrossAgent: Bridging Cross-level Actions into One Agentic Model via Reinforcement Learning
Abstract: Autonomous end-to-end agents are increasingly required to operate in environments where actions are not derived directly from the environment's raw actions but instead selected from higher-level action spaces. These actions are then mapped to the corresponding low-level interactions with the environment through controllers. In existing research, the action space is typically predefined. However, in practice, the optimal action space is context-dependent and difficult to determine in advance. For example, in complex domains such as Minecraft, relying solely on low-level raw actions or high-level planning actions is insufficient to handle the wide range of open-ended tasks, which vary in complexity and time horizons. The effective granularity of the control inevitably varies depending on the situation. To address this challenge, we propose CrossAgent, which introduces a novel adaptive action-space selection framework. CrossAgent is built through two stages of reinforcement learning fine-tuning: cold-start single-step reinforcement learning and multi-step reinforcement learning. Within Minecraft, we define three complementary action spaces: motion, grounding, and raw action, each with distinct advantages and limitations. Our framework enables agents to dynamically switch among these spaces and balance task rewards against reasoning costs. Experiments on over 30 diverse tasks in Minecraft demonstrate that CrossAgent exhibits strong long-horizon planning, precise execution, generalization, and efficiency, significantly outperforming fixed-action baselines. These results highlight the critical role of dynamic action-space adaptation in the development of generalist agents capable of tackling open-ended environments.
Paperid: 3551,   Poster  
Authors: Deng Maijie, Yuhua Li, Yixiong Zou, Yao Wu, Chenru Ma
Title: Is Bin Generation Indispensable? A Bin-Generation-Free Dataset Quantization via Semantic Perspective
Abstract: Dataset quantization has recently emerged as a promising solution for mitigating the computational and memory challenges of large-scale datasets. However, existing approaches rely on a bin generation step that is computationally expensive and inefficient for large-scale datasets. Moreover, a fixed drop ratio in its patch dropping step fails to adapt to the diverse redundancy levels across samples, which degrades the representational quality of the quantized coreset. To address these limitations, we present Bin-Generation-Free Dataset Quantization (BGFDQ), a fully restructured framework that incorporates a simple yet effective KNN-based neighbor identification and neighbor-aware coreset selection strategy. We theoretically demonstrate that the proposed selection strategy achieves superior sampling efficiency compared to bin-generation-based methods. Additionally, we introduce an adaptive patch dropping strategy to further enhance the quality of the quantized dataset. Extensive experiments on four image classification benchmarks show that BGFDQ consistently outperforms state-of-the-art baselines. In particular, we achieve up to 5% validation accuracy improvement on CIFAR-100. Moreover, our framework successfully scales to datasets containing up to 10^5 same-class samples while existing bin-generation-based approaches fail due to memory constraints. Code is available at https://anonymous.4open.science/r/BGFDQ-F093.
Paperid: 3552,   Poster  
Authors: Tan Junwen, Jinglin Liang, Hongyuan Chen, Shuangping Huang
Title: VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
Abstract: Though rectified flow models have achieved remarkable performance in image, video, and 3D generation, their practical deployment is challenged by slow inference speeds. Previous acceleration methods rely on caching and reusing, neglecting the growing mismatch between static cached values and evolving input, which reduces the fidelity of generated content. This work proposes Velocity Decomposition and Estimation (VDE), a training-free acceleration method that shifts the paradigm from caching-and-reusing to decomposing-and-estimating. VDE periodically anchors the model's state with a full forward pass and estimates subsequent outputs analytically. VDE first decomposes the model's velocity output into components parallel and orthogonal to the input, then exploits the temporal predictability of the components' coefficients and the consistency of the orthogonal direction for precise, input-adaptive estimation at each timestep. Extensive experiments on image and video generation tasks demonstrate that VDE achieves 2.04× to 3.22× acceleration with minimal loss in visual quality. For example, in image generation, VDE achieves a 2.21× speedup while preserving nearly identical visual quality, outperforming the best baseline by 19.5% in SSIM, 30.3% in PSNR, and reducing LPIPS by 55.4%.
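Illustrative note: the parallel/orthogonal split mentioned in the abstract above can be pictured with an ordinary vector projection. The sketch below is only a minimal illustration under that assumption; the function name `decompose` and the toy vectors are hypothetical, and it is not the authors' implementation, which additionally estimates the coefficients over timesteps.

```python
def decompose(velocity, x):
    """Split `velocity` into a part parallel to `x` and a part orthogonal to `x`.

    Standard vector projection: parallel = alpha * x with
    alpha = <velocity, x> / <x, x>, and orthogonal = velocity - parallel.
    """
    dot_vx = sum(v * xi for v, xi in zip(velocity, x))
    dot_xx = sum(xi * xi for xi in x)
    if dot_xx == 0.0:
        raise ValueError("input x must be non-zero")
    alpha = dot_vx / dot_xx  # scalar coefficient of the parallel component
    parallel = [alpha * xi for xi in x]
    orthogonal = [v - p for v, p in zip(velocity, parallel)]
    return alpha, parallel, orthogonal


# Toy input and velocity (hypothetical values chosen for easy arithmetic).
x = [1.0, 2.0, 2.0]
v = [3.0, 0.0, 3.0]
alpha, par, orth = decompose(v, x)

# The two parts sum back to the velocity, and the second part is orthogonal to x.
residual = sum(o * xi for o, xi in zip(orth, x))
```

Here `alpha` is the single scalar that describes the parallel part, which is the kind of coefficient a method could track across timesteps instead of caching the full output.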
Paperid: 3553,   Poster  
Authors: jiyuan WANG, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu
Title: Beyond Generation: Advancing Image Editing Priors for Depth and Normal Estimation
Abstract: Pretrained text-to-image (T2I) generative priors have shown success in depth and normal prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features, and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, a framework that pioneers the adaptation of an advanced editing model based on the Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the "consistent velocity" training objective. We also use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demand of our tasks. Additionally, we repurpose the editor's discarded region for a cost-free joint estimation of depth and normals, which improves inference efficiency. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× more data.
Paperid: 3554,   Poster  
Authors: Yang Wang, Jiqing Zhang, Chuanyu Sun, Qianhui Liu, Huilin Ge, Ziqi Wei, Xin Yang
Title: SpikeTrack: High-performance and Energy-efficient Event-Based Object Tracking with Spiking Neural Network
Abstract: Event cameras have attracted considerable attention for object tracking due to their microsecond-level temporal resolution and wide dynamic range, yet effectively harnessing spiking neural networks (SNNs) in this domain remains challenging. In this paper, we introduce SpikeTrack, a purely spike-driven framework for single-object tracking that addresses the shortcomings of RGB-based approaches in fast-motion or target appearance change. Central to SpikeTrack is the Multi-Search-sequence-and-Single-Template (MSST) training paradigm, which captures rich temporal dependencies, alongside a Dynamic Integer Leaky Integrate-and-Fire (DI-LIF) neuron that adaptively predicts integer-valued activations based on the input features during training and converts them into spikes during inference. Our design preserves the intrinsic sparsity and fine-grained spatiotemporal acuity of event data, resulting in efficient energy consumption without sacrificing performance. Extensive evaluations on FE108, FELT, and VisEvent demonstrate that SpikeTrack exceeds the performance of state-of-the-art trackers in both accuracy and efficiency. Furthermore, ablation studies validate each module’s contribution, highlighting the practical potential of spike-driven architectures for future vision applications.
Paperid: 3555,   Poster  
Authors: Ehsan Ahmadi, Hunter Schofield, Behzad Khamidehi, Fazel Arasteh, Jinjun Shan, Lili Mou, Dongfeng Bai, Kasra Rezaee
Title: RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning
Abstract: Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism enhancement, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning.
Paperid: 3556,   Poster  
Authors: fengyu chen, Tiao Tan, Teng Li, Yuantian Quan, Qingmin Liao
Title: MARSS: Radar Semantic Segmentation via Modular Attention and State Space Models
Abstract: Radar semantic segmentation (RSS) is critical for robust perception in adverse conditions, but poses unique challenges: radar frequency maps are highly anisotropic, multi-scale, sparse, and noisy. Conventional CNN or Transformer architectures, designed for camera images, fail to account for these characteristics, leading to degraded performance. We propose MARSS (Modular Attention-enhanced Radar Semantic Segmentation), a novel framework that integrates three specialized modules to address radar-specific issues. In the encoder, the RADE module employs lightweight channel self-attention and depthwise convolutions to robustly encode noisy, anisotropic features. In intermediate layers, the RFAF module performs multi-scale feature fusion and region-level attention to isolate salient radar features. The decoder's RADM module combines state space models with axial self-attention to reconstruct segmentation masks with anisotropy- and temporality-aware context. These components collectively suppress noise, disentangle range-Doppler features, and enforce spatial-temporal consistency. On the CARRADA dataset, MARSS achieves substantially higher performance than prior RSS methods, especially for small fast-moving targets.
Paperid: 3557,   Poster  
Authors: Zhiwei Zhong, Peilin CHEN, Qiangqiang Shen, Bo Li, Shiqi Wang
Title: Dual Graph Regularized Deep Unfolding Network for Guided Depth Map Super-resolution
Abstract: Depth map super-resolution with color guidance is a fundamental task in computer vision that aims to reconstruct high-resolution depth maps by leveraging structural correlations from corresponding guidance images. Recently, with the development of deep learning techniques, the performance of guided depth super-resolution (GDSR) models has been significantly improved. However, most existing approaches rely on black-box architectures that lack theoretical interpretability. Although graph optimization has been explored to integrate model-driven and data-driven frameworks, it remains computationally expensive and struggles to preserve the intrinsic structures of the depth maps. To overcome these limitations, we propose a novel GDSR framework based on a dual graph Laplacian prior, termed LapNet, which efficiently unfolds graph optimization into a deep neural network. Specifically, we first formulate a dual graph Laplacian prior that separately models structural dependencies along the row and column dimensions of the depth maps. This formulation explicitly enforces piecewise smoothness while reducing computational complexity from $\mathcal{O}(H^3W^3)$ to $\mathcal{O}(H^3 + W^3)$ by avoiding the construction of a global affinity graph. Furthermore, we develop a deep implicit prior to extract high-frequency structural cues from the guidance image, serving as a complementary component to the manually designed prior. Finally, we integrate these complementary priors into a unified variational optimization framework, which is efficiently solved through alternating minimization and subsequently unfolded into an interpretable multi-stage deep network. Extensive experiments on both synthetic and real-world datasets demonstrate that LapNet achieves state-of-the-art performance while maintaining low computational complexity.
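Illustrative note: to give a sense of scale for the two complexity orders quoted in the abstract above, the short calculation below compares the leading terms for one assumed depth-map size. The 256×256 resolution and the function names are hypothetical choices for illustration; constant factors are ignored, so the figures indicate growth rates only, not measured runtimes.

```python
def global_graph_cost(h, w):
    """Leading term of the global-affinity-graph order, (H*W)^3 = H^3 * W^3."""
    return (h * w) ** 3


def dual_graph_cost(h, w):
    """Leading term of the row/column (dual graph) order, H^3 + W^3."""
    return h ** 3 + w ** 3


# Assumed resolution, chosen only to make the arithmetic concrete.
h = w = 256
global_cost = global_graph_cost(h, w)    # 256^6 = 2^48
dual_cost = dual_graph_cost(h, w)        # 2 * 256^3 = 2^25
ratio = global_cost // dual_cost         # 2^23

print(f"global: {global_cost:.3e}, dual: {dual_cost:.3e}, ratio: {ratio}")
```

For this assumed size the leading terms differ by a factor of 2^23, roughly 8.4 million, which is why treating rows and columns separately matters at realistic resolutions.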
Paperid: 3558,   Poster  
Authors: xulun ye, Yifan Mei, Kun Zhou, Zelei Wu, Jieyu Zhao
Title: DDSF: Robust Few-Shot Learning via Disentangled Subspaces with Determinantal Point Process
Abstract: The performance of mean-based prototypical methods in few-shot learning is frequently compromised by noise and hard positives, where entangled feature representations cause prototype instability. We present a novel "Filter-Repair-Expand" framework grounded in Determinantal Point Process (DPP) theory. The method leverages DPP as its core logic, employing it to estimate sample confidence to filter anomalous samples from the initial set, guide a diffusion process via volume-maximization to enhance the sample representation, and subsequently maximize the volume of synergistic disentangled subspaces, constructing robust and diverse prototype subspaces. Experimental results establish new state-of-the-art performance on multiple benchmarks, demonstrating significant gains in few-shot learning robustness.
Paperid: 3559,   Poster  
Authors: Zhiteng Li, Mingyuan Xia, JINGYUAN ZHANG, Zheng Hui, Haotong Qin, Linghe Kong, Yulun Zhang, Xiaokang Yang
Title: AdaSVD: Singular Value Decomposition with Adaptive Mechanisms for Large Multimodal Models
Abstract: Large Multimodal Models (LMMs) have attained impressive achievements in multimodal processing tasks, yet their massive memory demands pose major obstacles to deployment on resource-limited devices. Singular Value Decomposition (SVD) has emerged as a promising compression technique for LMMs, delivering substantial reductions in memory overhead. However, existing SVD-based methods often struggle to effectively alleviate the errors caused by SVD truncation, resulting in a noticeable performance gap when compared to the original models. Moreover, adopting a uniform compression ratio across all transformer layers fails to consider the varying importance of different layers. To tackle these challenges, we propose AdaSVD, an adaptive SVD-based LMM compression approach. Specifically, AdaSVD introduces adaComp, which adaptively compensates for SVD truncation errors by alternately updating the two singular matrices. Additionally, AdaSVD introduces adaCR, which adaptively assigns layer-specific compression ratios according to the relative importance of each layer. Comprehensive experiments across multiple LMM families show the effectiveness of AdaSVD, achieving better performance while significantly reducing memory requirements. We will make all the code and models of AdaSVD publicly available.
Paperid: 3560,   Poster  
Authors: Yan Zhao, Zhengxue Cheng, Junxuan Zhang, Dajiang Zhou, Qunshan Gu, Qi Wang, Li Song
Title: OmniZip: Learning a Unified and Lightweight Lossless Compressor for Multi-Modal Data
Abstract: Lossless compression is essential for efficient data storage and transmission. Although learning-based lossless compressors achieve strong results, most of them are designed for a single modality, leading to redundant compressor deployments in multi-modal settings. Designing a unified multi-modal compressor is critical yet challenging, as different data types vary largely in format, dimension, and statistics. Multi-modal large language models offer a promising resolution but remain too complex for practical use. Thus, we propose OmniZip, a unified and lightweight lossless compressor for multi-modal data (such as image, text, speech, tactile, database, and gene sequence). Built on a lightweight backbone, OmniZip incorporates three key components to enable efficient multi-modal lossless compression: a modality-unified tokenizer that reversibly transforms diverse data into tokens, a modality-routing context learning mechanism that enables flexible multi-modal context modeling, and a modality-routing feedforward design that further enhances the model's nonlinear representation flexibility. A reparameterization training strategy is used to enhance model capacity. OmniZip outperforms or matches other state-of-the-art compressors on multiple modalities, achieving 42%, 57%, 62%, 42%, and 53% higher compression efficiency than gzip on the CLIC-M, TouchandGo, enwik9, LibriSpeech, and WikiSQL datasets, respectively. It also supports near real-time inference on resource-constrained edge devices, reaching up to 1 MB/s on MacBook CPUs and iPhone NPUs. Our code will be released upon acceptance.
Paperid: 3561,   Poster  
Authors: Liang Cao
Title: A Polynomial Chaos Framework for Causal Discovery in Nonlinear Uncertain Systems
Abstract: In safety-critical industrial applications, accurately identifying causal relationships and quantifying uncertainty is essential for tasks such as root cause analysis, feature selection, and process optimization. Traditional causal discovery methods inadequately handle the nonlinearities and complex uncertainties prevalent in industrial sensor data. To address this, we introduce a novel causal discovery framework that integrates Polynomial Chaos Expansion (PCE) representations of stochastic noise into structural equations. This method effectively captures the complex nonlinear couplings and arbitrary noise distributions characteristic of industrial data. We rigorously prove the identifiability of causal structures under mild sparsity conditions on the chaos coefficients, significantly extending classical linear non-Gaussian acyclic model (LiNGAM) identifiability results. Extensive experiments on real-world industrial data demonstrate superior accuracy, robustness under extreme non-Gaussian noise conditions, and practical uncertainty quantification. This framework presents a principled, interpretable, and computationally feasible approach to causal analysis in nonlinear uncertain industrial environments.
Paperid: 3562,   Poster  
Authors: Jingyuan Gao, Yumeng Hu, Fei Gao, Mingjin Zhang
Title: PhysIR-Splat: Physically Consistent Thermal Infrared Radiative Transfer in 3D Gaussian Splatting
Abstract: Thermal infrared (TIR) 3D reconstruction provides geometry that is intrinsically coupled to the temperature field, even in low-light, nighttime, and smoke-obscured environments. TIR imaging measures self-emitted thermal radiation driven by object temperature and is largely independent of external illumination; therefore, simply carrying over visible-spectrum assumptions to TIR-based 3D reconstruction and novel view synthesis (NVS) often results in floating artifacts and blurred edges. In addition, radiometric inconsistency and low contrast in TIR weaken structure-from-motion (SfM) initialization, which in turn hinders subsequent 3D Gaussian Splatting (3DGS) optimization. We present PhysIR-Splat, a 3DGS framework that follows infrared radiative transfer: we explicitly model temperature, emissivity, and environmental irradiance on Gaussian primitives and, during rendering, jointly account for thermal emission, the reflected component, and atmospheric transmittance to produce physically consistent thermal synthesis. We also introduce VGGT-IR, a Transformer-based feed-forward initializer that takes TIR input with optional RGB and directly regresses camera poses and initial geometry, providing a modality-aligned and stable starting point for PhysIR-Splat. Extensive experiments demonstrate that our method significantly surpasses existing approaches in thermal reconstruction quality and cross-view consistency, effectively suppressing floating artifacts and enhancing boundary sharpness. The code will be made publicly available upon acceptance of the paper.
Paperid: 3563,   Poster  
Authors: Sheng Yu, Di-Hua Zhai, Yuanqing Xia
Title: SE(3)-Equivariance with Geometric and Topological Guidance for Category-Level Object Pose Estimation
Abstract: Object pose estimation is a key task for embodied robots, enabling them to interact with objects effectively. Category-level object pose estimation provides a way for robots to estimate the pose of unknown objects. However, estimating object pose from point clouds alone remains challenging. In this paper, we introduce SEGPose, a novel category-level object pose estimation method based on point clouds. Unlike previous methods, SEGPose leverages geometric and topological information together with SE(3)-equivariance, enhancing the network's accuracy in pose prediction. To utilize geometric and topological features, we propose a constraint-based feature extraction and 3D reconstruction method, enabling effective object shape reconstruction. We also design an SE(3)-equivariant feature prediction network to handle pose transformations consistently across viewpoints, improving pose accuracy. Experimental results on benchmark datasets show that SEGPose outperforms all current category-level pose estimation methods based on point clouds. Additionally, we apply SEGPose to robotic grasping tasks in real-world scenarios, and the results indicate that SEGPose exhibits excellent generalization capabilities.
Paperid: 3564,   Poster  
Authors: Taehun Ryu, Changwoo Kang, Kyungdon Joo
Title: From Corners to Fiducial Tags: Revisiting Checkerboard Calibration for Event Cameras
Abstract: The conventional checkerboard-based calibration for standard cameras faces fundamental limitations when applied to bio-inspired event cameras. Specifically, this stems from two challenges: (i) Events are triggered asynchronously at different timestamps along motion trajectories. Directly accumulating them on the image plane causes temporal misalignment and produces blurred edges. (ii) Checkerboard corners on event cameras show near-zero event occurrence at the corner itself. This hinders reliable corner localization and makes calibration difficult. To address these issues, we present a novel calibration framework that directly detects checkerboard corners from a raw event stream. We first mathematically analyze the absence of events at corner points. Based on this fact, we then leverage edge-driven event cues to initialize corner positions. Using the near-zero event occurrence at checkerboard corners, we gradually refine the estimated corner toward low event-density regions, achieving sub-pixel accuracy. Furthermore, we extend the corner detection to fiducial markers such as AprilTags, resulting in reliable detection even under partial visibility or occlusion. Evaluations on self-collected and public data demonstrate reliable checkerboard corner detection and stable camera calibration.
Paperid: 3565,   Poster  
Authors: Yingjie Feng, Yi Wang, Jiaze Wang, Anfeng Liu, Zhuotao Tian
Title: Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors
Abstract: Self-supervised contrastive learning has emerged as a powerful paradigm for skeleton-based action recognition by enforcing consistency in the embedding space. However, existing methods rely on binary contrastive objectives that overlook the intrinsic continuity of human motion, resulting in fragmented feature clusters and rigid class boundaries. To address these limitations, we propose TranCLR, a Transitional anchor-based Contrastive Learning framework that captures the continuous geometry of the action space. Specifically, the proposed Action Transitional Anchor Construction (ATAC) explicitly models the geometric structure of transitional states to enhance the model's perception of motion continuity. Building upon these anchors, a Multi-Level Geometric Manifold Calibration (MGMC) mechanism is introduced to adaptively calibrate the action manifold across multiple levels of continuity, yielding a smoother and more discriminative representation space. Extensive experiments on the NTU RGB+D, NTU RGB+D 120 and PKU-MMD datasets demonstrate that TranCLR achieves superior accuracy and calibration performance, effectively learning continuous and uncertainty-aware skeleton representations. Code will be made publicly available.
Paperid: 3566,   Poster  
Authors: Taichun Zhou, Zhibin Dong, Hao Tan, Siwei Wang, Xinwang Liu, En Zhu, Di Hu, Tianrui Liu, chuankun Li, Kunlun He
Title: Imbalanced View Contribution Evaluation and Refinement for Deep Incomplete Multi-View Clustering
Abstract: In real-world applications, multi-view data often suffer from missing views due to factors such as privacy protection and sensor failure. Such incomplete scenarios not only lead to partial information availability but also cause significantly imbalanced learning among views: certain "strong views" dominate the fusion process, while "weak views" contribute marginally, thereby undermining cross-view collaboration. Existing incomplete multi-view clustering methods mainly focus on "how to handle missing data", yet they largely overlook the imbalanced view contributions induced by incompleteness and their profound impact on representation learning and clustering performance. To address these issues, our paper first analyzes the data imbalance caused by missing views and the resulting disparities in view learning quality. Then, we propose a collaborative evaluation and enhancement framework (ICER) for imbalanced incomplete multi-view clustering. Specifically, we employ Shapley values to quantify the marginal contribution of each view, and incorporate imbalanced optimal transport to characterize distributional deviations across views. On this basis, we construct the view contribution imbalance metric to comprehensively evaluate cross-view collaboration and fusion quality, and design a collaboration enhancement module to explicitly reinforce inter-view cooperative optimization and feature fusion. Extensive experiments on multiple benchmark datasets demonstrate that the proposed method outperforms existing incomplete multi-view clustering approaches, validating the effectiveness and necessity of explicitly modeling and mitigating view imbalance in imbalanced incomplete scenarios.
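Illustrative note: the abstract above uses Shapley values to quantify each view's marginal contribution. The sketch below shows the textbook exact Shapley computation on a toy two-view example. The view names, the score table `TOY_SCORES`, and the function `shapley_values` are hypothetical and chosen only for illustration; how the paper defines its value function is not stated in the abstract.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal gain of each player over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical clustering-quality scores for each subset of views (made-up numbers).
TOY_SCORES = {
    frozenset(): 0.0,
    frozenset({"strong_view"}): 0.6,
    frozenset({"weak_view"}): 0.2,
    frozenset({"strong_view", "weak_view"}): 0.9,
}

phi = shapley_values(["strong_view", "weak_view"], lambda s: TOY_SCORES[frozenset(s)])
print(phi)
```

In this toy table the contributions come out near 0.65 for the strong view and 0.25 for the weak view, and they sum to the score of the full set, which is the efficiency property that makes Shapley values a natural contribution measure. Exact enumeration grows factorially with the number of views, so it is only practical for small view counts.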
Paperid: 3567,   Poster  
Authors: Minh Quan Dao, Dimitris N. Metaxas
Title: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
Abstract: Transformer architectures, particularly Diffusion Transformers (DiTs), have become widely used in diffusion and flow-matching models due to their strong performance compared to convolutional UNets. However, the isotropic design of DiTs processes the same number of patchified tokens in every block, leading to relatively heavy computation during training. In this work, we introduce a multi-patch transformer design in which early blocks operate on larger patches to capture coarse global context, while later blocks use smaller patches to refine local details. This hierarchical design can reduce computational cost by up to 50% in GFLOPs while achieving good generative performance. We also propose improved designs for time and class embeddings that accelerate training convergence. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our architectural choices.
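To illustrate why larger patches in early blocks save computation, the sketch below counts tokens for two patch sizes and compares the pairwise self-attention term, which grows quadratically with token count. The latent resolution of 32 and the patch sizes 4 and 2 are assumptions chosen for illustration; the abstract does not state the actual configuration, and this does not reproduce the reported 50% GFLOPs figure.

```python
def num_tokens(latent_size: int, patch_size: int) -> int:
    """Number of tokens when a square latent is split into square patches."""
    assert latent_size % patch_size == 0, "patch size must divide the latent size"
    return (latent_size // patch_size) ** 2


def relative_attention_cost(latent_size: int, patch_size: int, base_patch_size: int) -> float:
    """Pairwise attention cost relative to a block that uses base_patch_size."""
    n = num_tokens(latent_size, patch_size)
    n_base = num_tokens(latent_size, base_patch_size)
    return (n / n_base) ** 2


LATENT = 32  # hypothetical latent resolution
coarse = num_tokens(LATENT, 4)  # tokens seen by an early, large-patch block
fine = num_tokens(LATENT, 2)    # tokens seen by a late, small-patch block
ratio = relative_attention_cost(LATENT, 4, 2)
print(coarse, fine, ratio)
```

Only the attention-score term scales quadratically; per-token projections and MLPs scale linearly with token count, so overall savings depend on how many blocks run at the coarse resolution.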
Paperid: 3568,   Poster  
Authors: Xin Hu, Haomiao Ni, Yunbei Zhang, Jihun Hamm, Zechen Li, Zhengming Ding
Title: Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness
Abstract: Vision language models (VLMs) have achieved remarkable success in broad visual understanding, yet they remain challenged by object-centric reasoning on rare objects due to the scarcity of such instances in pretraining data. While prior efforts alleviate this issue by retrieving additional data or introducing stronger vision encoders, these methods are still computationally intensive when finetuning VLMs and do not fully exploit the original training data. In this paper, we introduce an efficient plug-and-play module that substantially improves VLMs' reasoning over rare objects by refining visual tokens and enriching input text prompts, without finetuning the VLM. Specifically, we propose to learn multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, compensating for limited training examples. These embeddings refine the visual tokens in VLMs through a lightweight attention-based enhancement module that improves fine-grained object details. In addition, we use the learned embeddings as object-aware detectors to generate informative hints, which are injected into the text prompts to help guide the VLM’s attention toward relevant image regions. Experiments on two benchmarks show consistent and substantial gains for pretrained VLMs in rare object recognition and reasoning. Further analysis reveals how our method strengthens the VLM's ability to focus on and reason about rare objects.
Paperid: 3569,   Poster  
Authors: Xu Wang, Zhiru Wang, Shiyun Xie, Chengwei Pan, Yisong Chen
Title: DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures
Abstract: 3D Gaussian Splatting achieves real-time photo-realistic rendering but struggles when training images contain transient objects that violate multi-view consistency. Existing methods face a fundamental dilemma: accurate transient detection requires well-reconstructed static scenes, yet clean reconstruction depends on reliable transient masks. This circular dependency causes persistent artifacts when both components are jointly optimized from poor initialization. We present DualSplat, a two-stage framework that sidesteps this dilemma by first generating pseudo masks from reconstruction failures, then using them to guide clean scene optimization. We observe that transient objects manifest as incomplete fragments during initial training, since they appear in only a subset of views. We consolidate these failures into pseudo masks via instance-level thresholding and a feature-residual filter guided by SAM2 boundaries. We then train a clean 3DGS model under pseudo-mask supervision, with a lightweight MLP refining masks online by progressively shifting from pseudo-priors to self-consistency as densification proceeds. Experiments on RobustNeRF and NeRF On-the-Go demonstrate that DualSplat achieves competitive performance with recent methods, with particularly strong results on scenes with high transient density.
Paperid: 3570,   Poster  
Authors: Xiang Fang, Wanlong Fang, Changshuo Wang
Title: CogniVerse: Revolutionizing Multi-modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning
Abstract: Multimodal Retrieval-Augmented Generation (MMRAG) has emerged as a powerful paradigm for enhancing Multimodal Large Language Models (MLLMs) in knowledge-intensive question answering by integrating external visual, textual, and structural knowledge. However, existing MMRAG frameworks suffer from critical limitations, including noisy and irrelevant retrieval, cross-modal semantic misalignment, lack of adaptive reasoning, and incoherent generation across local and global contexts. We introduce CogniVerse, a novel MMRAG framework that addresses these challenges through a cognitive-inspired, mathematically rigorous approach. Drawing from human-like reasoning, CogniVerse integrates three synergistic components: (1) a Cognitive Reflection Module (CRM) that dynamically assesses retrieval necessity and filters relevant multi-modal content, reducing noise and computational overhead; (2) a Multi-modal Retrieval Module that aligns embeddings in a Riemannian manifold using information geometry and refines knowledge graphs via spectral graph theory, ensuring precise and coherent retrieval; and (3) a Hierarchical Generation Module that employs an optimal transport-based loss to balance token-level accuracy and global semantic coherence. Grounded in advanced theoretical frameworks, including convergence guarantees for geometric alignment and spectral optimization, CogniVerse achieves robust cross-modal integration and adaptive knowledge utilization. Extensive experiments on benchmark multi-modal question answering datasets demonstrate that CogniVerse significantly outperforms state-of-the-art MMRAG systems in both accuracy and coherence, while reducing retrieval latency.
Paperid: 3571,   Poster  
Authors: Anni Yu, Yu-Bin Yang
Title: PRISM: Prototype-based Reasoning with Inter-modal Semantic Mining for Interpretable Image Recognition
Abstract: Prototype-based methods enhance interpretability in image recognition by establishing intermediate part prototypes to build interpretable classifiers, enabling transparent reasoning through part-level attention and reference to prototypical examples. However, existing methods typically depend on unimodal visual supervision and constrain prototypes within the visual embedding space, which inherently restricts their semantic alignment with human-interpretable concepts. In this work, we present PRISM (Prototype-based Reasoning with Inter-modal Semantic Mining), an interpretable image recognition framework that leverages natural language as an auxiliary modality to guide the learning of class-specific part prototypes. PRISM introduces an information-theoretic attribution mechanism that identifies semantically salient image regions conditioned on textual descriptions. By aligning these attribution maps with prototype activation patterns, PRISM implicitly anchors visual part prototypes to conceptually meaningful image regions, enhancing interpretability without requiring explicit concept modeling. To further enhance the distinctiveness and localization of prototypes, we introduce a spatial compactness constraint that encourages each prototype to attend to specific, non-overlapping image regions. Extensive experiments on fine-grained benchmarks demonstrate that the proposed PRISM not only improves classification performance but also provides faithful and semantically grounded visual explanations.
Paperid: 3572,   Poster  
Authors: Zhiyi Duan, Xiaoyue Zhang, Tianxing Man
Title: SafeLogo: Turning Your Logos into Jailbreak Shields via Micro-Regional Adversarial Training
Abstract: Recent Vision-Language Models (VLMs) have become increasingly susceptible to jailbreak attacks, where adversarial prompts exploit subtle manipulation to circumvent safety alignment. The diversity and adaptability of such jailbreakers necessitate a defense mechanism with strong generalization capability. However, fine-tuning large-scale VLMs is computationally expensive, and introducing excessive visual or textual defense prompts is impractical for preserving image realism and model usability. We propose SafeLogo, which tunes a logo-sized visual prompt into a universal shield against diverse jailbreak attacks through micro-regional adversarial training. We are the first to integrate min–max adversarial optimization into visual defense prompt generation. Specifically, in the outer loop, SafeLogo injects compact, bounded perturbations into extremely small image regions (≤ 2% pixel coverage), effectively preserving both visual fidelity and semantic consistency. Meanwhile, overcoming the limitations of prior defenses constrained to a single attack direction or fixed benign supervision, the inner loop dynamically generates and selects the strongest one from a variety of jailbreakers. Extensive experiments on LLaVA-1.5-13B, MiniGPT-4, and Qwen3-VL show that SafeLogo markedly lowers jailbreak success rates on MM-SafetyBench, VLGuard, and FigStep, while preserving benign performance on MM-Vet and MME.
Paperid: 3573,   Poster  
Authors: Jia Wu, Yijing Dai, Tingfeng Cao, Meiling Wu, Tao Luo, Jian Dong Zhang, Guangming Lu, Xiaoyi Zeng
Title: High-Fidelity Virtual Try-On beyond Paired Data Scarcity via Diffusion-based Cycle-Consistent Learning
Abstract: Diffusion-based virtual try-on methods rely on vast high-quality garment-person pairs, which are scarce in practice due to the high cost of data collection and preprocessing, limiting their performance in real-world scenarios. To overcome this bottleneck, we propose Cycle-Consistent Virtual Try-On (CCVTON), a diffusion-based approach that enables effective training using massive in-the-wild person images. Specifically, CCVTON introduces a Cycle-Consistent Learning (CCL) strategy that employs a single unified generative model to disentangle a garment from a person image (try-off branch) and transfer it to the same individual (try-on branch), forming a reconstruction cycle. To this end, we first warm up a Unified Diffusion Transformer (UDiT) on open-source paired data to acquire basic try-on and try-off capabilities. When adapting UDiT to in-the-wild person images, we employ a Multi-Criteria Filtering Operation to select high-quality garments disentangled from person images by the pretrained UDiT. These filtered garments are not used as inputs for CCL, but serve as soft constraints for a perceptual regularization loss, preventing the try-off branch from collapsing to trivial copying. In addition, we propose a garment-aware mask generation with a two-stage refinement process to suppress garment leakage while maintaining person consistency. Extensive experiments show that CCVTON achieves state-of-the-art results.
Paperid: 3574,   Poster  
Authors: Yubo Wang, Yan Lu, Bin Liu, Xulin Li, Jixiang Niu
Title: R$^2$TUA: Reconstruction-residual Based Targeted and Untargeted Attack Against Text-Image Person Re-Identification
Abstract: Text-Image Person Re-Identification (TI-ReID) models are widely deployed in intelligent surveillance. Built on deep neural networks and vision–language models, TI-ReID models inherit their vulnerability to adversarial attacks, posing potential security risks. Yet their security issues have received far less attention than retrieval accuracy, and the robustness of TI-ReID to adversarial attacks remains largely unexplored. To fill this gap, we propose Reconstruction-residual based Targeted and Untargeted Attack (R$^2$TUA), which takes an image and an adversarial text prompt as input and generates perturbations that make TI-ReID models incorrectly match the perturbed image to the identity described by the adversarial prompt. To precisely inject identity attributes into perturbations and achieve fine-grained targeted attack, R$^2$TUA proposes Transformer-based Gradual Multimodal Fusion (TGMF) that fuses image and adversarial prompt progressively across layers with tunable cross-modal weight. In addition, we propose a fully differentiable Soft Clamp Function (SCF), which enables us to ensure perturbations remain inconspicuous while avoiding local gradient vanishing effects that would trap training into suboptimal local minima. To further align perturbed images with the adversarial text descriptions while leading them to mismatch their original descriptions, R$^2$TUA employs Push-Pull Losses (PPLs) and matching losses during training. Extensive evaluations across multiple datasets and models demonstrate the superior untargeted attack and targeted attack performance of R$^2$TUA. It also exhibits strong adaptability and transferability against black-box models, outperforming all related attacks across multiple tasks.
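The abstract above does not give the form of its Soft Clamp Function. As a hedged illustration of the general idea, the sketch below contrasts a hard clamp, whose gradient is exactly zero outside the bound, with a tanh-based soft clamp that stays within the same bound but keeps a nonzero gradient everywhere. The tanh form, the function names, and the budget `EPS` are assumptions made here and are not the authors' SCF.

```python
import math


def hard_clamp(x: float, eps: float) -> float:
    """Standard clamp to [-eps, eps]; its derivative is 0 whenever |x| > eps."""
    return max(-eps, min(eps, x))


def soft_clamp(x: float, eps: float) -> float:
    """Smooth stand-in: eps * tanh(x / eps) never exceeds eps in magnitude."""
    return eps * math.tanh(x / eps)


def soft_clamp_grad(x: float, eps: float) -> float:
    """Derivative of soft_clamp with respect to x: 1 - tanh(x / eps)^2."""
    return 1.0 - math.tanh(x / eps) ** 2


EPS = 0.03  # hypothetical perturbation budget
x = 2 * EPS  # a value outside the budget
print(hard_clamp(x, EPS), soft_clamp(x, EPS), soft_clamp_grad(x, EPS))
```

At `x = 2 * EPS` the hard clamp saturates and passes no gradient, while the soft version still returns a small positive derivative, which is the property the abstract attributes to a fully differentiable clamp.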
Paperid: 3575,   Poster  
Authors: Wenxuan Miao, Yulin Sun, Aiyue Chen, Jing Lin, Yiwu Yao, Yiming Gan, Jieru Zhao, Jingwen Leng, Minyi Guo, Yu Feng
Title: TimeRipples: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space
Abstract: The recent surge in video generation has shown the growing demand for high-quality video synthesis using large vision models. Existing video generation models are predominantly based on the video diffusion transformer (vDiT); however, they suffer from substantial inference delay due to self-attention. While prior studies have focused on reducing redundant computations in self-attention, they often overlook the inherent spatio-temporal correlations in video streams and directly leverage sparsity patterns from large language models to reduce attention computations. In this work, we take a principled approach to accelerate self-attention in vDiTs by leveraging the spatio-temporal correlations in the latent space. We show that the attention patterns within vDiT are primarily due to the dominant spatial and temporal correlations at the token channel level. Based on this insight, we propose a lightweight and adaptive reuse strategy that approximates attention computations by reusing partial attention scores of spatially or temporally correlated tokens along individual channels. We demonstrate that our method achieves significantly higher computational savings (85%) compared to state-of-the-art techniques over 4 vDiTs, while preserving almost identical video quality (<0.06% loss on VBench).
Paperid: 3576,   Poster  
Authors: Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Shuailei Ma, Ka Leong Cheng, Wen Wang, Qingyan Bai, Yuxuan Zhang, Yanhong Zeng, Yixuan LI, Xing Zhu, Yujun Shen, Qifeng Chen
Title: MagicQuill V2: Precise and Interactive Image Editing with Layered Visual Cues
Abstract: We propose MagicQuill V2, a novel framework that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of modern diffusion models and the granular control of traditional graphics software. While state-of-the-art diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and style. To overcome this limitation, our method deconstructs creative intent into a stack of independently controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive, and powerful control over the generative process.
Paperid: 3577,   Poster  
Authors: Ufaq Khan, Umair Nawaz, Massimo Caputo, Muhammad Bilal, Junaid Qadir, Muhammad Haris Khan
Title: Keep It Frozen: Domain-Routed Conditional Residual Modulation for Multi-Domain Vision Transformers
Abstract: Medical imaging presents significant challenges due to acoustic shadows, motion blur, and indistinct boundaries. Addressing these issues is crucial for improving diagnostic accuracy. Many conventional vision models require extensive fine-tuning on task-specific data and often lose generalizability to natural-image domains. We propose DCRM-ViT, a domain-conditioned residual modulation framework for Vision Transformers that preserves general-vision capability while adapting to diverse domains. DCRM-ViT keeps the backbone frozen and augments each block with a lightweight Residual Modulation Block (RMB) whose parameters are synthesized per sample by a Domain Router (DR) and Parameter Synthesizer Network (PSN). The router outputs soft domain weights from input features, whereas the synthesizer maps these weights to low-rank residuals that modulate selected projections and, optionally, add a domain-aware bias to attention. Crucially, we learn routing and modulation via a bi-level optimization scheme: a short inner loop adapts RMB parameters to task supervision, while an outer loop updates DR, PSN, and RMB initializations/step sizes so the synthesized residuals generalize across medical and natural domains. Across fine-grained classification (Food101, SUN397, Stanford Cars) and medical segmentation (ultrasound, CT, MRI), DCRM-ViT improves over strong baselines while using modest trainable compute. The ablation studies confirm the benefits of our architectural enhancements, showing improved performance and adaptability. The results demonstrate DCRM-ViT's potential to offer high diagnostic performance with reduced computational overhead, using 4.7 GFLOPs and 0.3 training min/epoch. Our code will be publicly available upon acceptance.
Paperid: 3578,   Poster  
Authors: Mingyang Xie, Numair Khan, Tianfu Wang, Naina Dhingra, Seonghyeon Nam, Haitao Yang, Zhuo Hui, Christopher Metzler, Andrea Vedaldi, Hamed Pirsiavash, Lei Luo
Title: LaVR: Latent Space Conditioned Video Re-rendering using Large 4D Reconstruction Models
Abstract: Given a monocular video, the goal of video re-rendering is to generate views of the scene from a novel camera trajectory. Existing methods face two distinct challenges. Geometrically unconditioned models lack spatial awareness, leading to drift and deformation under viewpoint changes. On the other hand, geometrically-conditioned models depend on estimated depth and explicit reconstruction, making them susceptible to depth inaccuracies and calibration errors. We propose to address these challenges by using the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model to condition the video generation process. These latents capture scene structure in a continuous space without explicit reconstruction. Therefore, they provide a flexible representation that allows the pretrained diffusion prior to regularize errors more effectively. By jointly conditioning on these latents and source camera poses, we demonstrate that our model achieves state-of-the-art results on the video re-rendering task.
Paperid: 3579,   Poster  
Authors: Jiawei Han, Matteo Poggi, HUAN LI, Changshuo Wang, Kaiqi Liu, Wei Li
Title: Image-to-Point Cloud Feature Back-projection for Multimodal Training of 3D Semantic Segmentation
Abstract: The effective integration and utilization of multimodal data acquired from image cameras and LiDAR is of paramount importance for perception systems. This paper proposes Image-to-Point Cloud Feature Back-Projection (IPFP), a novel method for training multimodal fusion networks that back-projects aggregated image-feature centers (from non-projection-aligned image pixels) into the point-cloud feature set via the estimated depth map. Consequently, image features and point cloud features reside within the same three-dimensional space, enabling the natural enrichment of image information into the point cloud during the network forward pass. This process can be selectively enabled when desired (for instance, at training time) and turned off in the absence of multimodal data (for example, at testing time if only LiDAR sensors are available). Experimental results demonstrate that IPFP can consistently improve state-of-the-art 3D semantic segmentation models, while retaining the ability to process LiDAR-only data at inference time.
Paperid: 3580,   Poster  
Authors: Yusheng Li, Lizhi LOU, Yan Tang, Zekai Miao, shaoming zhang, Jianmei Wang
Title: LiteSense: Lifting Lightweight ToF with RGB for High-Resolution Metric Depth Estimation
Abstract: Metric depth estimation aims to recover depth maps with absolute scale, high resolution, and cross-scene consistency from visual observations. Existing approaches either rely on large-scale models or costly sensors to preserve metric accuracy and generalization, both ill-suited to resource-constrained deployment. In this paper, we propose LiteSense, a lightweight RGB-ToF fusion framework that leverages compact normalized histogram (CNH) signals together with RGB cues to achieve efficient and reliable metric depth estimation. Specifically, LiteSense leverages a U-Net-style encoder-decoder that forms an RGB-D input by concatenating RGB with upsampled ToF depth, providing explicit metric priors. To address resolution disparity and recover fine details, we introduce the Patch-wise CNH Spatial Injection (PCSI) module, which leverages zone-wise histogram measurements via cross-attention to guide high-level feature fusion. Extensively evaluated on NYUv2 and SUN RGB-D, LiteSense consistently outperforms monocular baselines and DELTAR with substantially lower computational cost, and demonstrates promising zero-shot generalization. We further introduce THDR3K, the first indoor RGB-ToF-CNH dataset, where LiteSense achieves real-world accuracy comparable to, and in challenging cases surpassing, Intel RealSense. All relevant source code and the collected dataset will be released.
Paperid: 3581,   Poster  
Authors: Zhilin Zhu, Yabin Wang, Zhiheng Ma, Yaguang Song, Yaowei Wang, Xiaopeng Hong
Title: Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation through Dynamic Style Bridging
Abstract: Continual Test-Time Adaptation (CTTA) aims to empower perception systems to handle real-world dynamic distribution shifts after deployment. However, its efficacy is limited by the scarcity and unreliability of supervision signals, leading to error accumulation and catastrophic forgetting. While existing methods predominantly follow a backward-alignment paradigm, constructing weak supervisory surrogates derived from prior knowledge, they struggle with unreliable supervision and evolving distribution shifts. To overcome this, we propose a novel forward-facilitation paradigm through a dynamic style bridging framework. Specifically, we first construct a compact, offline-generated knowledge base of semantically pure class exemplars to provide reliable content. Subsequently, to mitigate generative bias and handle evolving distribution shifts, we propose a multi-level style bridging mechanism. It dynamically transfers current target domain styles to synthetic proxies at the input, statistics, and representation levels. This process yields on-the-fly proxies that are both semantically reliable and stylistically faithful to the target data, which are then used to construct on-demand supervisory signals, effectively enabling stable and discriminative adaptation under continual shifts. Extensive experiments across standard CTTA benchmarks demonstrate consistent and substantial improvements over recent state-of-the-art methods.
Paperid: 3582,   Poster  
Authors: Yuwen Tao, Kanglei Zhou, Chang Li, Liyuan Wang
Title: Geometry-aware Cross-modal Graph Alignment for Referring Segmentation in 3D Gaussian Splatting
Abstract: Referring 3D segmentation seeks to localize and segment target objects in a 3D scene given a natural-language query, requiring joint reasoning over geometric structures and linguistic cues. Although recent progress using 3D Gaussian Splatting (3DGS) has improved rendering quality, existing methods still struggle to spatially ground textual references due to two fundamental limitations: (1) language encoders provide no explicit positional priors, weakening geometric relation modeling; and (2) cross-modal attention is self-reinforcing, causing spatial errors to propagate through the Gaussian field once misalignment occurs. To address this, we propose GeoCGA, a geometry-aware cross-modal graph alignment framework that bridges linguistic semantics with the 3DGS representation. GeoCGA introduces position-aware prompt expansion to build a semantic-spatial graph capturing relational structure in text, and constructs a Gaussian-based geometric graph encoding 3D topology. A cross-modal alignment module enforces geometric consistency between the two graphs, enabling stable and spatially grounded correspondence across views. GeoCGA consistently outperforms prior state-of-the-art methods, yielding mIoU improvements of 28.8% on Ref-LERF, 2.6% on LERF-OVS, and 3.1% on 3D-OVS. These results point to an incremental advance toward more stable and spatially consistent 3D language grounding.
Paperid: 3583,   Poster  
Authors: Anindita Ghosh, Vladislav Golyanik, Taku Komura, Philipp Slusallek, Christian Theobalt, Rishabh Dabral
Title: SceMoS: Local Scene-Aware Human Motion Synthesis by Planning with Geometry-Grounded Tokens
Abstract: Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent (“walk to the couch”) and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework that shows that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues, relying on (1) a text-conditioned autoregressive global motion planner that operates on a top-down bird’s-eye-view (BEV) image of the scene, encoded with DINOv2 features, as the scene representation, and (2) a geometry-grounded motion tokenizer, trained via a conditional VQ-VAE, that uses a 2D local scene heightmap, thus embedding surface physics directly into a discrete vocabulary. This 2D factorization reaches an efficiency-fidelity trade-off: BEV semantics capture spatial layout and affordance for global reasoning, while local heightmaps enforce fine-grained physical adherence without full 3D volumetric reasoning. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark, reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction.
Paperid: 3584,   Poster  
Authors: Zhengbo Jiao, Zifan Zhang, Shaobo Wang, Wei Wang, Bing Zhao, hu wei, Linfeng Zhang
Title: Socratic-Geo: Synthetic Data Generation and Cross-Modal Geometric Reasoning via Multi-Agent Interaction
Abstract: Multimodal Large Language Models (MLLMs) have significantly advanced vision–language understanding. However, even state-of-the-art models struggle with geometric reasoning, revealing a critical bottleneck: the extreme scarcity of high-quality image–text pairs. Human annotation is prohibitively expensive, while automated methods fail to ensure both fidelity and training effectiveness. Existing approaches either passively adapt to available images or employ inefficient random exploration with post-hoc filtering, decoupling generation from learning needs. We propose Socratic-Geo, a fully autonomous framework that dynamically couples data synthesis with model learning through multi-agent interaction. The Teacher agent generates parameterized Python scripts with reflective feedback (Reflect for solvability, RePI for visual validity), ensuring image–text pair purity. The Solver agent optimizes reasoning through preference learning, with failure paths guiding the Teacher’s targeted augmentation. Independently, the Generator learns image generation capabilities on accumulated “image–code–instruction” triplets, distilling programmatic drawing intelligence into visual generation. Starting from only 108 seed problems, Socratic-Solver achieves 49.11 on six geometric benchmarks using one-quarter of baseline data, surpassing strong baselines by 2.43 points. Socratic-Generator achieves 42.4% on GenExam, establishing new state-of-the-art for open-source models, surpassing Seedream-4.0 (39.8%) and approaching Gemini-2.5-Flash-Image (43.1%).
Paperid: 3585,   Poster  
Authors: Yang Li, Zhaxizhuoma Zhaxizhuoma, Hongru Jiang, Junjie Xia, Hongquan Zhang, Jinda Du, Yunsong Zhou, Jia Zeng, Ce Hao, Jieji Ren, Qiaojun Yu, Cewu Lu, Yu Qiao, Jiangmiao Pang
Title: Foca-VLA: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation
Abstract: Embodied intelligence for contact-rich manipulation has predominantly relied on position control, while explicit awareness and regulation of interaction forces remain under-explored, limiting stability, precision, and robustness in real-world tasks. We propose Foca-VLA, an end-to-end vision-language-action framework that equips robots with hybrid force-position control and explicit force awareness. Foca-VLA introduces force-based prompts into the VLM expert to construct force-aware task concepts across stages, and employs a cross-scale routing Mixture-of-Experts (MoE) with impedance control in the action expert to adaptively fuse these concepts with real-time interaction forces for closed-loop hybrid force-position regulation. To support learning and evaluation, we construct Foca-Dataset, containing 1,000 trajectories over 5 contact-rich tasks, including wiping, pressing, and assembling, with multi-view images, task prompts, proprioceptive state, and force signals. Extensive experiments show that Foca-VLA substantially improves success rates and reliability in contact-rich manipulation, outperforming Pi0 and Pi0.5 by 48.0% and 35.0%, respectively, across the 5 tasks, and mitigating common failure modes such as arm overload and unstable contact, thereby advancing force-aware physical intelligence in VLAs.
Paperid: 3586,   Poster  
Authors: Qingsong Xie, Luyuan Zhang, Zhao Zhang, Siyuan Li, Zhe Huang, Zhenyu Yang, Haonan Lu
Title: LacTokGen: Latent Consistency Tokenizer for 1024-pixel Image Generation by 256 Tokens
Abstract: Image tokenization has significantly advanced visual generation and multimodal modeling, particularly when paired with autoregressive models. However, current methods face challenges in balancing efficiency and quality: high-resolution image generation either requires an excessive number of tokens or compromises critical details through token reduction. To resolve this, we propose Latent Consistency Tokenizer (LacTok) that bridges discrete visual tokens with the compact latent space of pretrained latent diffusion models (LDMs), enabling efficient representation of 1024×1024 images using only 256 tokens, a 16× compression over VQGAN. LacTok integrates a transformer encoder, a quantized codebook, and a latent consistency decoder. Direct application of LDM as the decoder results in color and brightness discrepancies; thus, we convert it to a latent consistency decoder, reducing multi-step sampling to 1-2 steps for direct pixel-level supervision. To endow LacTok with text-to-image generation capabilities, we seamlessly integrate it with an autoregressive transformer, forming LacTokGen. This transformer is trained by predicting compact token sequences conditioned on text instructions. Experiments demonstrate LacTok’s superiority in high-fidelity reconstruction, with a reconstruction Fréchet Inception Distance of 10.8 on the MSCOCO-2017 5K benchmark for 1024×1024 image reconstruction. LacTokGen achieves a score of 0.73 on the GenEval benchmark and an HPSv2 of 0.304 on the MSCOCO-2017 dataset.
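The 16× figure quoted above can be checked with simple arithmetic under one assumption: that the VQGAN baseline downsamples each spatial dimension by a factor of 16, so a 1024×1024 image becomes a 64×64 token grid. That downsampling factor is an assumption made here for illustration (the abstract does not state the baseline configuration), and `grid_tokens` is a hypothetical helper name.

```python
def grid_tokens(image_size: int, downsample: int) -> int:
    """Tokens produced by a tokenizer that maps each downsample x downsample patch to one token."""
    return (image_size // downsample) ** 2


IMAGE_SIZE = 1024
VQGAN_DOWNSAMPLE = 16  # assumed baseline downsampling factor, not stated in the abstract
LACTOK_TOKENS = 256    # token budget quoted in the abstract

baseline_tokens = grid_tokens(IMAGE_SIZE, VQGAN_DOWNSAMPLE)
compression = baseline_tokens / LACTOK_TOKENS
print(baseline_tokens, compression)
```

Under this assumption the baseline needs 4096 tokens per image, and 4096 / 256 matches the 16× ratio reported in the abstract.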
Paperid: 3587,   Poster  
Authors: Yang Gao, Wuyang Li, Po-Chien Luan, Alex Alahi
Title: Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation
Abstract: Understanding dynamic 3D environments is essential for safe autonomous driving, particularly when reasoning about human-centric, nonrigid agents. However, existing self-supervised occupancy prediction frameworks predominantly assume rigid-body motion and rely on simple frame-to-frame offsets, limiting their ability to capture fine-grained deformations and maintain temporal coherence. To address this issue, we propose DeGO, a deformable Gaussian occupancy framework that unifies decoupled Gaussian deformation with factorized 4D foundation-model distillation. DeGO disentangles rigid and nonrigid motion, enabling each Gaussian primitive to evolve through both deformation and offset-based updates. In parallel, a factorized 4D distillation strategy transfers cross-camera and cross-frame knowledge from the VGGT foundation model, producing foundation-aligned features that enhance temporal consistency. Experiments on the Occ3D-NuScenes benchmark demonstrate that our method achieves state-of-the-art performance under self-supervision, delivering 13.5% gains on human-centric instances and 10.9% overall improvements. These results highlight the effectiveness of deformation-aware and foundation-guided occupancy modeling for dynamic scene understanding.
Paperid: 3588,   Poster  
Authors: Shengju Yu, Suyuan Liu, Wenhao SHAO, Siwei Wang, KE LIANG, Xihong Yang, Tiejun Li, Xinwang Liu
Title: Plug-and-Play Incomplete Multi-View Clustering via Janus-Faced Affinity Learning with Topology Harmonization
Abstract: Prevailing incomplete multi-view clustering (IMVC) approaches typically fail to account for the interference of view-exclusive artifacts when learning view-consensus representations, which can compromise the fidelity of the resulting similarity measure. Moreover, inconsistencies in anchor order across views may distort the graph structure, impairing clustering performance. The reliance on carefully tuned regularization hyper-parameters also often undermines a model's practical utility. To alleviate these issues, we propose a plug-and-play IMVC framework named PJFTH that incorporates Janus-faced affinity learning with topology harmonization. It explicitly models the exclusive-to-consensus interplay, derives a view-private graph from each view, and adaptively integrates them into a global consensus affinity according to each view's intrinsic characteristics. Furthermore, a permutation transformation with unary encoding constraints is applied to the anchor matrix, realigning the anchor topology while preserving its values. This process synchronizes anchor order prior to similarity integration and maintains the original anchor properties. Notably, all components are coupled seamlessly and optimized jointly. In addition, the provably linear overall complexity further improves scalability and practicality. Experimental results confirm that PJFTH achieves competitive performance compared with several leading methods.
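To illustrate why a permutation with unary (one-hot) constraints reorders anchors without altering their values, here is a minimal sketch. It assumes anchors are stored as the columns of a small matrix and that the permutation is given rather than learned; the names `is_permutation` and `permute_anchors` are hypothetical and not taken from PJFTH.

```python
def is_permutation(P):
    """Check the unary encoding constraint: exactly one 1 per row and per column."""
    n = len(P)
    rows_ok = all(len(r) == n and sorted(r) == [0] * (n - 1) + [1] for r in P)
    cols_ok = all(sum(P[i][j] for i in range(n)) == 1 for j in range(n))
    return rows_ok and cols_ok


def permute_anchors(A, P):
    """Right-multiply anchor matrix A (d x m, anchors as columns) by P (m x m)."""
    d, m = len(A), len(P)
    return [[sum(A[r][k] * P[k][c] for k in range(m)) for c in range(m)]
            for r in range(d)]


A = [[1, 2, 3],
     [4, 5, 6]]          # three 2-D anchors, one per column
P = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]          # hypothetical alignment: anchor 0 -> slot 1, etc.

assert is_permutation(P)
aligned = permute_anchors(A, P)
print(aligned)           # columns are reordered, entries are unchanged
```

Because every row and column of `P` contains a single 1, each output column is an exact copy of one input column, so only the order changes.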
Paperid: 3589,   Poster  
Authors: Hai Jiang, Zhen Liu, Yinjie Lei, Songchen Han, Bing Zeng, Shuaicheng Liu
Title: ZeroIDIR: Zero-Reference Illumination Degradation Image Restoration with Perturbed Consistency Diffusion Models
Abstract: In this paper, we propose a zero-reference diffusion-based framework, named ZeroIDIR, for illumination degradation image restoration, which decouples the restoration process into adaptive illumination correction and diffusion-based reconstruction while being trained solely on low-quality degraded images. Specifically, we design an adaptive gamma correction module that performs spatially varying exposure correction to generate representations corrected for illumination only, which mitigate exposure bias and serve as reliable inputs for the subsequent diffusion process. A histogram-guided illumination correction loss is introduced to regularize the corrected illumination distribution toward that of natural scenes. Subsequently, the illumination-corrected image is treated as an intermediate noisy state for the proposed perturbed consistency diffusion model to reconstruct details and suppress noise. Moreover, a perturbed diffusion consistency loss is proposed to constrain the forward diffusion trajectory of the final restored image to remain consistent with the perturbed state, thus improving restoration fidelity and stability in the absence of supervision. Extensive experiments on publicly available benchmarks show that the proposed method outperforms state-of-the-art unsupervised competitors and is comparable to supervised methods while being more generalizable to various scenes. Code will be released to facilitate future research.
Paperid: 3590,   Poster  
Authors: Maisha Maliha, Dean F. Hougen
Title: CIGMA: Causal Information-Gain Mechanistic Attribution of Attention Heads in Vision Transformers
Abstract: Vision Transformers often rely on spurious background correlations rather than foreground object features. While prior model pruning approaches focus solely on improving accuracy, they lack interpretability and fail to verify whether predictions are actually made by focusing on the main foreground object, providing no causal validation of which components drive spurious behavior. We introduce CIGMA (Causal Information-Gain Mechanistic Attribution), a general framework for explaining the internal computation of Vision Transformers. CIGMA provides a mechanistic, information-theoretic explanation by quantifying the importance of each attention head and determining whether it supports the main object or routes spurious background cues. It ranks attention heads by measuring object versus context reliance with a Jensen-Shannon-based information gain computed from the model's full predictive distributions after two complementary edits, removing the object region and removing the surrounding context. This reveals a spurious subnet that carries background signals and a complementary set of evidence-aligned heads. Evaluated on CIFAR-10, CIFAR-100, and Tiny-ImageNet across three VLM architectures (InternVL2-26B, LLaVA-1.6, LLaVA-1.5-13B), CIGMA improves accuracy by 7.6-24.8 percentage points over unmodified models while reducing background reliance by 79.5-88.1%, substantially outperforming all baselines. These results demonstrate that causal head-level interventions enable more effective spurious correlation mitigation than token pruning or retraining approaches.
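As a rough illustration of the Jensen-Shannon comparison described above, the sketch below computes the divergence between a model's predictive distribution on the full image and on each of the two edited images. The probability vectors are made up for illustration, and how CIGMA turns such divergences into a per-head information gain and ranking is not specified in the abstract, so that step is omitted.

```python
import math


def kl(p, q):
    """KL divergence in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical predictive distributions over three classes.
full = [0.70, 0.20, 0.10]         # unedited image
no_object = [0.30, 0.40, 0.30]    # object region removed
no_context = [0.65, 0.25, 0.10]   # surrounding context removed

object_reliance = js_divergence(full, no_object)
context_reliance = js_divergence(full, no_context)
print(object_reliance, context_reliance)
```

In this toy case removing the object shifts the prediction far more than removing the context, which is the pattern one would expect from a model that relies on the foreground.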
Paperid: 3591,   Poster  
Authors: Yuqi Chen, GAO JUNJIE, Pan Yongzhou, Siyuan Song, ZIXUAN ZHANG, Jiaping Xiao, Mir Feroskhan
Title: GeniNav: Generative Model Driven Image-Goal Navigation via Imagination-Guided Consistency Flow Matching
Abstract: Image-goal navigation driven by generative models has recently shown strong potential owing to their ability to perform multi-modal reasoning and stable learning in continuous control spaces. Despite their promise, current methods still face several fundamental limitations. Many rely on pre-built priors and lack explicit mechanisms for trajectory evaluation, restricting generalization and goal alignment in map-free navigation. Moreover, current generative policies often face inefficiency or temporal inconsistency, resulting in temporally unstable motion. The absence of interactive, closed-loop benchmarks further limits fair and reproducible comparison. To address these issues, we propose GeniNav, a generative image-goal navigation framework that couples a VLM-driven latent subgoal imagination module for high-level semantic guidance with Multi-Segment Consistency Flow Matching (MS-CFM) for temporally smooth and dynamically coherent motion generation. A hybrid trajectory evaluation module further integrates semantic alignment and geometric feasibility to assess goal consistency. We also introduce a closed-loop simulation benchmark with a large-scale dataset spanning 176 scenes and 491.6 km for standardized training and evaluation. Extensive experiments in simulation and on real robots demonstrate the effectiveness of our method.
Paperid: 3592,   Poster  
Authors: Yi Zhu, Hao Xiong, Lin Xiao, Ranfeng Shi, Qinying Gu, Leilei Gu
Title: SCE-Depth: A Spherical Compound Eye Framework for Wide FOV Depth Estimation
Abstract: Accurate wide-field depth estimation is highly desirable in applications such as autonomous driving, robot vision, and drone control. Biological compound eyes inspire wide Field of View (FOV) depth estimation, yet their artificial implementations face the challenge of modality misalignment. Specifically, spherical imaging data does not align with planar neural networks, diminishing learning efficiency. Herein, we propose SCE-Depth, a bio-inspired framework for spherical compound eye depth estimation, which processes spherical images natively on a HEALPix grid using a spherical neural network. This approach achieves a unified 180° FOV while avoiding the errors typically introduced by modality conversion. Additionally, we identify a depth-sensitive gradient feature from the overlapping FOVs of adjacent ommatidia. To exploit it, we introduce a spherical Sobel operator called the Spherical Gradient Feature Extractor (SGFE) and a corresponding Spherical Gradient Loss (SGL), which jointly extract gradient features on the HEALPix grid, enabling gradient-aware depth prediction. Extensive benchmark experiments demonstrate that these strategies enable SCE-Depth to substantially reduce depth estimation error compared to fisheye-based baselines, with particularly large improvements in peripheral accuracy. We also demonstrate the generalization capability of SCE-Depth to other wide FOV data modalities, such as fisheye and panoramic imagery.
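For readers unfamiliar with HEALPix, the grid divides the sphere into 12 · nside² equal-area pixels. The sketch below works out the pixel count and per-pixel solid angle for a hypothetical resolution (`nside = 64` is an assumption, not a value from the abstract) and estimates how many pixels a 180° (hemispherical) field of view covers.

```python
import math


def healpix_npix(nside: int) -> int:
    """Total number of equal-area pixels on a HEALPix sphere."""
    return 12 * nside * nside


def pixel_solid_angle(nside: int) -> float:
    """Solid angle of one pixel in steradians (full sphere is 4*pi sr)."""
    return 4 * math.pi / healpix_npix(nside)


nside = 64                                # hypothetical grid resolution
npix = healpix_npix(nside)
hemisphere_sr = 2 * math.pi               # a 180-degree FOV covers half the sphere
hemisphere_pixels = hemisphere_sr / pixel_solid_angle(nside)

print(npix, pixel_solid_angle(nside), hemisphere_pixels)
```

Because all pixels have the same area, a hemispherical view occupies about half of the grid regardless of where it points, which is one reason equal-area grids are convenient for spherical networks.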
Paperid: 3593,   Poster  
Authors: Shule Yan, Zetian Zhang, Xiao Ma, Zexuan Ji
Title: Target-Aware Invertible Encoder with Reconstruction Guidance for Infrared Small Target Detection
Abstract: Modern detectors typically deepen backbones and rely on aggressive downsampling to harvest high-level semantics, but this severely degrades low-energy infrared tiny targets via rescale-induced information loss. This work introduces InvDet, a target-aware invertible encoder that unifies information preservation and target-aware enhancement within a reconstruction-guided detection framework. An invertible pathway reconstructs the input from feature latents, exposing information loss as an optimizable quantity. To decouple detection from irrelevant reconstruction, a Target-Aware Reconstruction Modulation (TARM) module operates only in the inverse path, gating high-pass latents and applying a mild gain to low-pass features without altering the forward detection distribution. In addition, a Geometry–Content Tolerance Metric (GCTM) is proposed to focus on truly informative regions, yielding a pixel-wise weight map that gently regularizes the reconstruction branch. Our method yields state-of-the-art accuracy on five public infrared benchmarks, providing a principled pathway toward detection-friendly representation learning for scale-challenged visual regimes.
Paperid: 3594,   Poster  
Authors: Yuxuan Liang, Fan Shi, Rui Zhu, Xu Li, Xiaolei Chen, Zhe Liu, Bin Li, Xiangyang Xue
Title: Envision, Attend, Then Respond: Counterfactual Hallucination Mitigation in Large Vision-Language Models
Abstract: Large Vision-Language Models (LVLMs) often hallucinate when visual evidence conflicts with world knowledge, i.e., in counterfactual scenarios. We propose Envision-Attend-Respond (EnAR), a training-free framework that leverages visual priors to steer the model's attention toward counterfactual elements in the image. The Envision stage constructs a visual impression by invoking a diffusion prior to perform latent perturbations, yielding a prior-consistent counterpart of the input image. The Attend stage processes the original image and its visual impression through the LVLM's vision encoder to localize counterfactual elements, forming a corresponding padded input. The Respond stage performs contrastive decoding between the original and padded inputs to suppress bias and enhance visual understanding. Empirically, EnAR consistently mitigates hallucinations and improves response fidelity, achieving a 10.82% gain on VLMBias and an average 6.9% improvement on POPE, demonstrating robustness across both counterfactual and general hallucination settings. Moreover, the framework remains effective across heterogeneous LVLM architectures, offering a new perspective for hallucination governance in multimodal reasoning.
Paperid: 3595,   Poster  
Authors: Yicheng Wu, Tao Song, Zhonghua Wu, Jin Ye, Zongyuan Ge, Wenjia Bai, Zhaolin Chen, Jianfei Cai
Title: Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code
Abstract: Magnetic resonance imaging (MRI) is a powerful and versatile imaging technique, offering a wide spectrum of information about the anatomy by employing different acquisition modalities. However, in the clinical workflow, it is impractical to collect all relevant modalities due to scan time and cost constraints. Virtual full-stack scanning aims to impute missing MRI modalities from available but incomplete acquisitions, offering a cost-efficient solution to enhance data completeness and clinical usability. Existing imputation methods often depend on global conditioning or modality-specific designs, which limit their generalisability across patient cohorts and imaging protocols. To address these limitations, we propose CodeBrain, a unified framework that reformulates various "any-to-any" imputation tasks as a region-level full-stack code prediction problem. CodeBrain adopts a two-stage pipeline: (1) it learns the compact representation of a complete MRI modality set by encoding it into scalar-quantised codes at the region level, enabling high-fidelity image reconstruction after decoding these codes along with modality-agnostic common features; (2) it trains a projection encoder to predict the full-stack code map from incomplete modalities via a grading-based design for diverse imputation scenarios. Extensive experiments on two public brain MRI datasets, i.e., IXI and BraTS 2023, demonstrate that CodeBrain consistently outperforms state-of-the-art methods, establishing a new benchmark for unified brain MRI imputation and enabling virtual full-stack scanning. Code will be released.
Paperid: 3596,   Poster  
Authors: Quanyu Zhang, Zhongyi Han, Hao Sun, Yongshun Gong, Xiaoyan Wang, Yilong Yin, Shuo Li
Title: Stabilizing Feature Geometry in Noisy Pretrained Models for Robust Downstream Tasks
Abstract: Pretraining on large-scale data followed by fine-tuning has become a standard paradigm for visual models. However, noise in the pretraining data can be absorbed by the model and carried into downstream tasks, causing catastrophic inheritance, where inherited pretraining noise reduces downstream generalization. Prior studies mainly link this issue to changes in the feature spectrum, arguing that noise reduces the strength of key feature components. Following this view, they aim to improve transferability by amplifying these components. However, these approaches focus only on spectral energy and implicitly assume that the feature directions remain fixed, which does not hold in practice. In this work, we revisit this view and reveal an overlooked effect: even mild pretraining noise can cause a clear rotation of the dominant feature subspace, despite negligible spectral energy degradation. To quantitatively characterize this phenomenon, we propose using the Principal Directional Angle (PDA) to measure the directional shift between the clean and noisy models. Building on this observation, we introduce the Feature Geometry Stabilization (FGS) framework, which aims to counteract the subspace rotation revealed by PDA by enhancing the geometric stability of the feature space through the synergistic interaction of perturbation consistency, variance-activation regularization, and feature consistency distillation. Experiments across multiple visual benchmarks demonstrate the effectiveness of FGS and verify the importance of stabilizing feature geometry to mitigate catastrophic inheritance.
Paperid: 3597,   Poster  
Authors: j zg, Fang Zhang, YongXiang Hua, Bocheng Li, Wentao Zhang, Linli Xu
Title: FastHybrid: Accelerating Hybrid Autoregressive Image Generation with Lookahead and Guided Decoding
Abstract: Autoregressive (AR) models have achieved remarkable success in natural language processing, yet their application to image generation faces significant challenges. When implementing VQ-based decoders for autoregressive image generation, the generated images typically preserve semantic information but struggle with fine-grained details. Recent hybrid AR image generation approaches address these issues by integrating diffusion models as decoder heads, enabling higher-fidelity generation. However, the diffusion-based denoising process introduces significant computational overhead during inference. To accelerate hybrid AR image generation, we propose the Lookahead Decoding Strategy, which integrates the strengths of autoregressive and diffusion models by separating the process into two complementary branches: semantic prediction and detail refinement. The autoregressive branch captures high-level semantic structures while refining coarse predictions made by the parallel. Furthermore, we introduce Guided Diffusion Sampling to steer the diffusion denoising trajectory, significantly reducing the number of denoising steps. Extensive experiments demonstrate that our approach provides an effective solution for accelerating hybrid AR image generation models.
Paperid: 3598,   Poster  
Authors: Kiseok Choi, Jaemin Cho, Inchul Kim, Min H. Kim
Title: Splat-Based Metal Artifact Reduction in Cone-Beam CT via Compact Attenuation Modeling
Abstract: X-ray computed tomography (CT) suffers from severe metal artifacts when high-attenuation objects such as dental fillings or orthopedic implants are present. These artifacts originate from the polychromatic nature of X-rays, where attenuation varies strongly with photon energy and material composition, breaking the monochromatic assumption used by conventional reconstruction algorithms. Recent neural rendering approaches attempt to address this mismatch through differentiable polychromatic projection models, but they still struggle with smoothness bias, loss of fine structures, and prohibitive computation when extended to large-scale cone-beam CT. We introduce a splat-based metal artifact reduction framework that incorporates a physically grounded polychromatic forward model into a continuous Gaussian representation for cone-beam CT. Each Gaussian encodes the energy-dependent attenuation of the underlying material using a compact material parameterization, which enables efficient joint optimization of geometric and material properties without relying on a metal mask. This compact attenuation formulation captures the essential variation across biological tissues and metallic implants, allowing our model to explain metal-induced nonlinearity while preserving high-frequency structure. Experiments on simulated and real cone-beam CT scans show that our method converges significantly faster and suppresses metal artifacts more effectively than existing reconstruction and neural field-based approaches.
Paperid: 3599,   Poster  
Authors: Shangran Lin, Lu Lu, Jian Chen, Qiang Liu
Title: RAPID: Reusing Attention Sparsity with Inter-step Adaptation for Efficient Video Diffusion
Abstract: The prohibitive cost of 3D attention hinders high-quality video generation with diffusion models. Existing sparse attention methods either lack content adaptivity (static) or incur excessive overhead from per-step recalculation (dynamic). Our work challenges the necessity of this trade-off, based on a twofold empirical discovery: (1) attention patterns in video diffusion exhibit strong temporal stability, and (2) the requisite computational density progressively decays. This insight motivates RAPID, a framework that performs a one-shot attention block importance estimation early in the generation process. The resulting scores and high-fidelity sparse mask are then cached for efficient reuse, eliminating recalculation overhead. The cached scores also enable an optional, multi-stage adaptive pruning (Turbo mode) for maximum acceleration. On leading models like Wan2.1-14B and HunyuanVideo, our high-fidelity configuration surpasses all baselines across key quality metrics (PSNR, SSIM, LPIPS) under a controlled compute budget. Concurrently, its Turbo mode achieves speedups of up to 1.79× on Wan2.1-14B and 2.01× on HunyuanVideo while maintaining strong visual quality.
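The caching idea described above can be sketched in a few lines: block importance scores are estimated once, a sparse mask is derived from them, and later stages reuse the same scores with a lower density instead of recomputing them. This is only an illustration under stated assumptions; the scores, the top-k selection rule, the density schedule, and the name `topk_mask` are all hypothetical and not taken from RAPID.

```python
def topk_mask(scores, density):
    """Keep the highest-scoring fraction `density` of attention blocks."""
    k = max(1, round(len(scores) * density))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(scores))]


# Hypothetical block importance scores, estimated once early in generation.
scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.05, 0.2, 0.6]

# Mask derived once and reused at every later step.
cached_mask = topk_mask(scores, density=0.5)

# Optional multi-stage pruning: later stages reuse the cached scores
# with a decaying density (hypothetical schedule).
schedule = [0.5, 0.375, 0.25]
stage_masks = [topk_mask(scores, d) for d in schedule]

print(cached_mask)
print([sum(m) for m in stage_masks])
```

Since every stage ranks blocks by the same cached scores, the masks are nested: each later stage keeps a subset of the blocks kept by the previous one.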
Paperid: 3600,   Poster  
Authors: Jianwei Fei, Yunshu Dai, Xiaoyu Zhou, Zhihua Xia, Alessandro Piva
Title: Enabling Supervised Learning of Generative Signatures for Generalized Synthetic Image Detection
Abstract: Extracting reliable generative traces from generated images is critical for the detection of AI-generated images (AIGIs). However, a fundamental challenge exists: AIGIs inherently contain generative traces with no trace-free counterpart available, making supervised extraction of these artifacts infeasible. In this work, we overcome this through a surrogate supervision framework. We design a dynamic reconstructor that simulates diverse generative traces on real images through stochastically varied architectures and parameters. The reconstruction residuals serve as supervision to train an extractor that learns to isolate traces, i.e., generative signatures (GenSign). A detector then fuses the extracted GenSign with RGB features to distinguish real images from AIGIs. Our key insight is that sufficient architectural diversity in simulation enables effective transfer to real-world generators, resolving the absence of ground-truth GenSign. Extensive experiments across four benchmarks demonstrate state-of-the-art generalization, confirming that our simulation-based learning paradigm is capable of extracting general and transferable forensic features.
Paperid: 3601,   Poster  
Authors: Zhenyu Lu, Liupeng Li, Jinpeng Wang, Haoqian Kang, Yan Feng, Ke Chen, Yaowei Wang
Title: SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation
Abstract: While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception. Current methods, such as latent query alignment, are end-to-end yet opaque "black boxes". Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post-hoc step. To bridge this interpretability gap, we propose SegCompass, an end-to-end model that leverages a Sparse Autoencoder (SAE) to forge an explicit, interpretable, and differentiable alignment pathway. Given an image-instruction pair, SegCompass first generates a chain-of-thought (CoT) trace. The core of our method is an SAE that maps both the CoT and visual tokens into a shared, high-dimensional sparse concept space. A query codebook selects salient concepts from this space, which are then spatially grounded by a slot mapper into a multi-slot heatmap that guides the final mask decoder. The entire model is trained jointly, unifying reinforcement learning for the reasoning path with standard segmentation supervision. This SAE-driven interface provides a "white-box" connection that is significantly more traceable than latent queries and more coherent than textual readouts. Extensive experiments on five challenging benchmarks demonstrate that SegCompass matches or surpasses state-of-the-art performance. Crucially, our visual and quantitative analyses show a strong correlation between the quality of the learned sparse concepts and final mask accuracy, confirming that SegCompass achieves superior results through its enhanced and inspectable alignment. Faithful code will be released publicly.
Paperid: 3602,   Poster  
Authors: Yiwei Fu, Hui Wan, Xiao Luo, Minghua Deng
Title: STAR: Test-Time Adaptation Can Enhance Universal Prompt Learning for Vision-Language Models
Abstract: This paper studies the problem of universal test-time prompt learning for vision-language models (VLMs), which aims to enhance prompt learning for a pre-trained VLM via unlabeled target data containing out-of-distribution (OOD) samples. However, existing test-time adaptation approaches often overlook class-specific diversity in the target domain and rely on unreliable pseudo-labels due to inadequate uncertainty estimation, which may result in additional adaptation bias at test time. To this end, we propose a novel framework named Separability-aware Conjugate Optimization with Prototypical Retrieval (STAR) for universal test-time prompt learning of VLMs. The core of STAR is to incorporate a separability-aware gating mechanism into conjugate optimization for reliable pseudo-learning with OOD samples. In particular, we first compute the Fisher score to quantify the separability between in-distribution (ID) and OOD samples, which guides our soft gating mechanism for divided training. Then, we employ conjugate optimization to derive reliable pseudo-labels of unlabeled data for test-time adaptation. To further mitigate biases in OOD detection, we maintain a dynamic memory bank that stores high-confidence samples to build class-wise prototypes, which serve as queries for prototypical retrieval to calibrate OOD detection. Extensive experiments on multiple benchmarks demonstrate that STAR consistently outperforms competing baseline methods.
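The Fisher score mentioned above is a standard separability measure. A minimal sketch for one-dimensional confidence scores follows; it uses the common two-group form (squared mean gap over summed variances), which is an assumption here since the abstract does not give STAR's exact definition, and the sample scores and the name `fisher_score` are illustrative.

```python
from statistics import mean, pvariance


def fisher_score(group_a, group_b, eps=1e-12):
    """Two-group Fisher score: squared mean gap over the summed variances."""
    gap = mean(group_a) - mean(group_b)
    return gap * gap / (pvariance(group_a) + pvariance(group_b) + eps)


# Hypothetical confidence scores for presumed ID and OOD samples.
id_scores = [0.90, 0.80, 0.85]
ood_scores = [0.20, 0.30, 0.25]
well_separated = fisher_score(id_scores, ood_scores)

# Overlapping groups give a much lower score.
overlapping = fisher_score([0.60, 0.40, 0.50], [0.45, 0.55, 0.35])

print(well_separated, overlapping)
```

A high score indicates that ID and OOD samples are easy to tell apart, which is the kind of signal a soft gate could use to decide how strongly to trust the split.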
Paperid: 3603,   Poster  
Authors: Liang Qin, Min Wang, Xingyu Lu, Aowen Qiu, Wengang Zhou, Houqiang Li
Title: UAST: Unified Active Search and Tracking for Arbitrary Targets with UAVs
Abstract: Active search and tracking of arbitrary targets by Unmanned Aerial Vehicles (UAVs) in cluttered environments remains a highly challenging problem. Existing methods either construct complex modular pipelines, leading to substantial computational costs, or adopt end-to-end controllers that often fail to generalize across different targets and scenes. Moreover, search and tracking are typically treated separately despite their strong interdependence. In this paper, we present UAST, a simple yet effective mapping-free framework that unifies active search and persistent tracking using only RGB-D observations. The proposed system couples a dual-branch perception module with a Rule-Based Point Search Policy that adaptively switches between tracking and search-based recovery. A lightweight control network generates dynamically feasible trajectories directly from fused perception and UAV states. Furthermore, we introduce a training strategy with an elaborated tracking-aware visibility loss and a tailored data construction. Extensive experiments in both simulated and real-world environments show that our approach achieves higher success rates, more stable long-term tracking, and faster target search compared with existing methods, while maintaining high efficiency. The code will be released upon publication.
Paperid: 3604,   Poster  
Authors: Zihao Zhang, Aming Wu, Li Yang, Yahong Han, Jialie Shen
Title: Geometric-Aware Hypergraph Reasoning for Novel Class Discovery in Point Cloud Segmentation
Abstract: Novel Class Discovery in Point Cloud Segmentation was recently proposed, aiming to leverage knowledge from known classes to automatically segment unlabeled classes within point clouds. The core of this task lies in leveraging the geometric and semantic knowledge of multiple known classes to achieve semantic understanding and segmentation of novel classes. However, existing methods overlook the high-order associations between known and novel classes, relying solely on binary associations for class assignment and novel class reasoning, which leads to less precise semantic segmentation. To address this issue, we introduce a hypergraph structure to model high-order associations among classes, enabling collaborative reasoning from known classes to novel classes that extends beyond traditional binary relations. Additionally, existing methods focus excessively on extracting semantic information when processing point cloud data, neglecting the importance of geometric features. To address this, we introduce Geometric-Aware Prototypes, enhancing the model's ability to capture geometric spatial information. By propagating geometric information through hyperedges, our method enhances the understanding of spatial distributions across classes, improving segmentation accuracy. Significant performance improvements on the SemanticKITTI and SemanticPOSS datasets demonstrate the superiority of our method.
Paperid: 3605,   Poster  
Authors: Xingyu Liu, Pengfei Ren, Qi Qi, Haifeng Sun, Zirui Zhuang, Jianxin Liao, Jingyu Wang
Title: Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling
Abstract: Understanding hand-object interaction from monocular videos is crucial for immersive and dexterous interactions in AR/VR and robotic applications. However, existing monocular reconstruction methods primarily assume rigid grasping and static object geometry. When applied to articulated manipulations, the continuous joint rotations and frequent component deformations introduce a strong coupling between shape and motion, leading to severe ambiguity and instability in articulation optimization under monocular observation. To address this challenge, we propose a Clay-to-Stone dual-phase framework, modeling the articulated manipulation at hierarchical granularities, enabling a progression from flexible semantic exploration to structured articulation recovery. In the CLAY phase, our method performs fine-grained control over geometric deformation, guided by inter-part semantic correlation learning. As semantic and motion priors emerge, the STONE phase enforces rigid constraints to consolidate articulated structures and explicitly estimates motion parameters. Experiments on a real-world manipulation dataset show that our method achieves state-of-the-art reconstruction quality and plausible articulation modeling from monocular videos.
Paperid: 3606,   Poster  
Authors: Alexis Jensen, Pei Xu, Ioannis Karamouzas, Charles Pontonnier, Julien Pettré
Title: Push-and-Step: From RL-Based Balance Recovery to Physical Simulation of Dense Crowds
Abstract: We present a physics-based method for simulating full-body agents that recover balance by stepping or applying contact forces after being perturbed in dense crowds. While traditional 2D crowd simulations focus on navigation and social interactions in moderately dense settings, interactions in highly dense environments are predominantly physical, leading to push propagation, falls, and potential hazards. Existing models cannot capture how forces are transmitted through the body at the limb level. To address this, we use physics-based anthropomorphic simulations combined with a two-stage deep reinforcement learning framework. In the first stage, a policy is pre-trained using reference motion data and general balance rewards, enabling agents to handle a wide range of perturbations. In the second stage, an adaptive phase refines the policy to allow socially aware interactions, using hand contacts for stabilization guided by an online heuristic targeting neighbors' shoulders based on mechanical efficiency and collision risk. Ablation studies validate the training framework and reward components, and simulations reproduce trends observed in empirical studies of push propagation. Our method scales to large populations, offering new opportunities to study safety and collective behavior in dense crowds.
Paperid: 3607,   Poster  
Authors: Wang Changshuo, Jiangming Wang, Ke-Yue Zhang, Taiping Yao, Shouhong Ding, Shunli Wang, Ran Yi, Lizhuang Ma
Title: Beyond [CLS] Token: Query-Driven Token-Level Forgery Purification for Generalizable Deepfake Detection
Abstract: We revisit the feature learning process of state-of-the-art deepfake detectors that leverage ViT-based vision foundation models and discover that the [CLS] token, commonly adopted for detection, suffers from the Pre-trained Information Bias (PIB), i.e., it tends to mainly focus on global semantics due to the knowledge dominated by pre-trained model parameters, while struggling to emphasize subtle local forgery cues. To overcome this limitation, one potential way is to incorporate token-level features to form a new detection-specific token. To this end, we propose the Query-Driven Token-Level Forgery Purification (QTFP) framework, enabling the model to better capture local forgery traces without losing useful pre-trained priors. Specifically, we first introduce randomly initialized, learnable query tokens independent of the backbone and prior knowledge, which can effectively aggregate multi-patch evidence into a global token for detection. To make query tokens focus on meaningful regions, we propose a theoretical fake-likelihood contrastive learning loss, which employs a weighting strategy to highlight significant fake regions while diminishing the impact of real-like patches. Using SNR theory, we verify that the designed weight is both reliable and informative. To further maintain useful authentic information, a real-attention alignment constraint is applied to query tokens. These designs go beyond relying solely on the [CLS] token by jointly reorganizing real and fake information across all tokens, which successfully enhances detector robustness. Extensive experiments on diverse datasets demonstrate the effectiveness of our method.
Paperid: 3608,   Poster  
Authors: Mingxuan Zhou, Shuang Li, Yutang Zhang, Jing Geng, Yirui Shen, Jingxuan Kang, Fuzhen Zhuang, Shuigen Wang
Title: Thermal Diffusion Matters: Infrared Spatial-Temporal Video Super-Resolution through Heat Conduction Priors
Abstract: Infrared video acquisition inherently suffers from low spatial resolution and limited frame rates due to the physical constraints of thermal imaging sensors. These limitations make infrared video enhancement uniquely challenging, as it requires restoring spatial details and temporal continuity from highly undersampled thermal signals. To address this challenge, we propose `THERIS`, a unified THERmal-physics-inspired framework for Infrared spatial-temporal video Super-resolution. Grounded in the physical principles of thermal diffusion, `THERIS` leverages heat conduction dynamics that govern the spatiotemporal evolution of infrared pixel intensities. Specifically, the proposed Thermal Diffusion Interpolation Module (TDIM) treats temporal feature sequences as one-dimensional heat fields and performs frequency-domain diffusion to synthesize temporally coherent intermediate frames. Building on this foundation, the Thermo-Aware State Space Module (TSSM) refines spatiotemporal representations through learnable spectral filtering and selective state-space modeling, while maintaining consistency guided by the thermodynamic prior inherited from TDIM. Additionally, a Temperature Field Modeling Loss is introduced to enforce adherence to the heat conduction equation, promoting temporal coherence and spatial stability in the generated results. Extensive experiments demonstrate that `THERIS` achieves state-of-the-art performance while producing visually coherent results. To facilitate further research in the infrared video processing domain, we also introduce IRVAL, a high-resolution dataset comprising 108,512 video frames at 512×512 resolution.
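Illustrative note (not from the paper): the idea of treating a temporal sequence as a one-dimensional heat field and diffusing it in the frequency domain can be sketched with the textbook spectral solution of the heat equation u_t = alpha * u_xx on a periodic domain, where each Fourier coefficient is damped by exp(-alpha * omega^2 * tau). The function names, the periodic boundary, and the plain DFT are assumptions made for this sketch; the actual TDIM design is not specified in the abstract.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def diffuse(signal, alpha, tau):
    """Evolve a periodic 1D field under u_t = alpha * u_xx for time tau.

    Hypothetical helper: each frequency component is damped by
    exp(-alpha * omega^2 * tau), the spectral form of the heat kernel.
    """
    n = len(signal)
    damped = []
    for k, coeff in enumerate(dft(signal)):
        # Use a signed frequency index so the filter is symmetric
        # and a real input yields a real output.
        f = k if k <= n // 2 else k - n
        omega = 2.0 * math.pi * f / n
        damped.append(coeff * math.exp(-alpha * omega * omega * tau))
    return [v.real for v in idft(damped)]


# A single hot spot in an otherwise cold sequence spreads out over time.
field = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
smoothed = diffuse(field, alpha=1.0, tau=1.0)
```

In this sketch the zero-frequency term is left untouched, so the total "heat" of the sequence is conserved while sharp temporal changes are smoothed.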
Paperid: 3609,   Poster  
Authors: Boyang Guo, Liang Li, Linpeng Linpeng, Yuhan Gao, Xichun Sheng, Chenggang Yan
Title: Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
Abstract: Prompt learning has emerged as an efficient alternative to fine-tuning pre-trained vision-language models (VLMs). Despite its promise, current methods still struggle to maintain tail-class discriminability when adapting to class-imbalanced datasets. In this work, we propose cluster-aware neural collapse prompt tuning (CPT), which enhances the discriminability of tail classes in prompt-tuned VLMs without sacrificing their overall generalization. First, we design a cluster-invariant space by mining semantic assignments from the pre-trained VLM and mapping them to prompt-tuned features. This computes cluster-level boundaries and restricts the constraints to local neighborhoods, which reduces interference with the global semantic structure of the pre-trained VLM. Second, we introduce neural-collapse–driven discriminability optimization with three losses: textual Equiangular Tight Frame (ETF) separation loss, class-wise convergence loss, and rotation stabilization loss. These losses work together to shape intra-cluster geometry for better inter-class separation and intra-class alignment. Extensive experiments on 11 diverse datasets demonstrate that CPT outperforms SOTA methods, with stronger performance on long-tail classes and good generalization to unseen classes. We will release all source code.
Paperid: 3610,   Poster  
Authors: Wei Xiang, Yexinrui WU, Xinli Chen, Xinran Li, Shi Chen
Title: UI-Lens: Assessing General MLLMs’ Potential to Automate UI Display Quality Assurance
Abstract: User Interface (UI) display defect detection poses challenges far beyond UI understanding, requiring fine-grained element boundary understanding, missing-content detection, and reasoning about sequential interface semantic consistency. However, the capabilities of multimodal large language models (MLLMs) and vision-language models (VLMs) for detecting UI defects in realistic, complex interfaces have not been systematically validated. To fill this gap, we present UI-Lens, the first multi-dimensional UI display detection benchmark for Chinese-language UI scenarios. The dataset comprises 4,759 pages meticulously annotated by design experts, covering six core display defect categories. We conduct a systematic evaluation of 10 mainstream models (8 closed-source, 2 open-source). Results show clear shortcomings in current models: for tasks requiring fine-grained element boundary understanding, performance is near random, with task-average F1 scores of 20.36% and 31.21% on Text Overflow and Container Overlap, respectively; for sequential interface semantic consistency (e.g., Text Inconsistency), the task-average F1 score is only 10.61%, indicating severe underperformance. We release UI-Lens to catalyze research toward more robust UI display defect detection with fine-grained boundary awareness in realistic, complex interfaces.
Paperid: 3611,   Poster  
Authors: Can Zhang, Gim Hee Lee
Title: SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation
Abstract: Existing methods for category-level object articulation from a single 3D observation often rely on dense supervision, multi-frame inputs, or CAD templates, and still struggle to disentangle geometry from articulation or to recover explicit joint parameters. We propose SCAPO, a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states from a single RGB-D observation without ground-truth labels or category-specific models. SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose and align diverse instances into a shared canonical space. On this aligned shape, a joint-aware blend-skinning module is then designed to model part motion. We learn this representation through cycle reconstruction between observed and canonical shapes and cross-space alignment with a learnable canonical template that decouples shared category geometry from instance-specific residual shape. Experiments on synthetic and real articulated-object datasets show that SCAPO recovers consistent part structure and accurate articulation parameters and outperforms all self-supervised baselines. Our source code will be made publicly available.
Paperid: 3612,   Poster  
Authors: Chujie Wang, Jianyu Lu, Zhiyuan Luo, Xi Chen, Chu He
Title: OVOD-Agent: A Markov–Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection
Abstract: Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision–language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual Chain of Thought with explicit actions. OVOD’s lightweight nature makes LLM-based management unsuitable; instead, we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent’s state, memory, and interaction dynamics. A Bandit module generates exploration signals under limited supervision, helping the agent focus on uncertain regions and adapt its detection policy. We further integrate Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization, forming a closed loop from Bandit exploration to RM learning. Experiments on COCO and LVIS show that OVOD-Agent improves existing OVOD baselines and outperforms prior methods on novel categories, demonstrating strong generalization and scalability.
Paperid: 3613,   Poster  
Authors: Xiaolei Wang, Yuexin Wang, Tianhong Dai, Huihui Bai, Yao Zhao, Jimin Xiao
Title: Hunting Normality from Query Sample via Residual Learning for Generalist Anomaly Detection
Abstract: Generalist Anomaly Detection (GAD) seeks to overcome the domain-specific limitations of traditional anomaly detection by training a unified model that can generalize to unseen classes. A promising GAD strategy involves using residual features to create a class-invariant space. However, existing methods that directly model the distribution of residuals face unpredictable risks: there is inconsistency between residual and instance features, i.e., subtle defects may yield small residuals (false negatives), or normal feature residuals could be large due to the diversity of normality (false positives). To address these limitations, we propose a novel residual-based learning framework that re-purposes residuals as a guide to learn instance-level normality, rather than modeling their distribution directly. Our framework features two new attention-based modules: Residual Feature Learning (RFL), which uses learnable proxies to capture diverse patterns from the residual features, and Normality Learning from Support (NLS), which leverages these residual proxies to aggregate query-related normality proxies from the support instance features. These dynamically generated normality proxies are then used to hunt for normality within the query patch features, enabling accurate anomaly localization. Extensive experiments on GAD benchmarks demonstrate the effectiveness of our method. The code will be made publicly available.
Paperid: 3614,   Poster  
Authors: Anjie Le, Can Peng, Yuyuan Liu, Alison Noble
Title: POUR: A Provably Optimal Method for Unlearning Representation via Neural Collapse
Abstract: In computer vision, machine unlearning aims to remove the influence of specific visual concepts or training images without retraining from scratch. Studies show that existing approaches often modify the classifier while leaving internal representations intact, resulting in incomplete forgetting. In this work, we extend the notion of unlearning to the representation level, deriving a three-term interplay between forgetting efficacy, retention fidelity, and class separation. Building on Neural Collapse theory, we show that the orthogonal projection of a simplex Equiangular Tight Frame (ETF) remains an ETF in a lower-dimensional space, yielding a provably optimal forgetting operator. We further introduce the Representation Unlearning Score (RUS) to quantify representation-level forgetting and retention fidelity. Building on this, we introduce POUR (Provably Optimal Unlearning of Representations), a geometric projection method with closed-form (POUR-P) and feature-level unlearning variants under a distillation scheme (POUR-D). Experiments on CIFAR-10/100 and PathMNIST demonstrate that POUR achieves effective unlearning while preserving retained knowledge, outperforming state-of-the-art unlearning methods on both classification-level and representation-level metrics. Code will be released upon acceptance of the paper.
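Illustrative note (not from the paper): the geometric claim that an orthogonal projection of a simplex ETF remains an ETF can be checked numerically in a small case. The sketch below assumes one particular projection, namely dropping one class vector and projecting the remaining ones onto its orthogonal complement; whether this matches the POUR operator exactly is an assumption. A simplex ETF of K unit vectors has pairwise cosine -1/(K-1), so after removing one class the remaining K-1 vectors should have pairwise cosine -1/(K-2). All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def simplex_etf(k):
    """K unit vectors in R^K with pairwise cosine -1/(K-1).

    Built by centering the standard basis and normalizing.
    """
    vecs = []
    for i in range(k):
        v = [(1.0 if j == i else 0.0) - 1.0 / k for j in range(k)]
        norm = math.sqrt(dot(v, v))
        vecs.append([c / norm for c in v])
    return vecs


def project_out(vecs, forget):
    """Drop vecs[forget] and project the rest onto its orthogonal complement."""
    u = vecs[forget]  # unit vector of the class to forget
    kept = []
    for i, v in enumerate(vecs):
        if i == forget:
            continue
        c = dot(v, u)
        kept.append([x - c * y for x, y in zip(v, u)])
    return kept


def pairwise_cosines(vecs):
    """Cosine similarity for every unordered pair of vectors."""
    out = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            a, b = vecs[i], vecs[j]
            out.append(dot(a, b) / math.sqrt(dot(a, a) * dot(b, b)))
    return out


etf = simplex_etf(5)
projected = project_out(etf, forget=0)
print(pairwise_cosines(etf)[0], pairwise_cosines(projected)[0])
```

For K = 5 the pairwise cosine moves from -1/4 before projection to -1/3 after it, which is the simplex-ETF value for four classes.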
Paperid: 3615,   Poster  
Authors: Md. Borhan Uddin, Arif Raza, Zhiliang Lin, Lu Wang, Jianqiang Li, Jie Chen
Title: Reliable Policy Transfer for Safety-Aware End-to-End Driving with Deep Reinforcement Learning
Abstract: End-to-End (E2E) Reinforcement Learning (RL) for autonomous driving still struggles with safety and generalization under distribution shift, as perception-heavy encoders, sparse rewards, and ad hoc uncertainty handling often yield brittle closed-loop behavior. This work introduces a unified Deep RL (DRL) framework addressing key gaps: causal ego-centric state design, dense differentiable rewards, joint uncertainty estimation with entropy gating, and control-level policy transfer. An ego-centric relational graph encodes agent influence via uncertainty-weighted attention over kinematics, lane geometry, and semantics, producing a compact control state. A multi-objective differentiable reward stabilizes optimization by shaping safety, progress, and comfort with an uncertainty term. Aleatoric and epistemic uncertainty, captured through per-edge heteroscedastic variance and a critic ensemble, are aggregated into a calibrated confidence signal that modulates policy entropy for risk-aware exploration. A causal-semantic transfer objective aligns actions, attention, and uncertainty statistics across domains, combined with meta-learned initialization for few-shot adaptation. In closed-loop urban driving across varied towns, traffic, and weather, the framework improves success rate, reduces infractions per kilometer, and achieves higher time-to-conflict with lower lateral deviation and comfort cost compared to strong baselines.
Paperid: 3616,   Poster  
Authors: Qin Li, Wenbo Zhang, Limei Liu, Han Peng, Junfeng Yang, Guanying Xu
Title: SeD-UD: An Influence-Driven and Hierarchically-Decoupled Information Bottleneck for Multimodal Intent Recognition
Abstract: Multimodal intent recognition (MIR) is hindered by substantial redundancy and noise originating from text, speech, and visual inputs, which weakens feature distinctiveness and ultimately harms recognition performance. Although recent approaches based on the information bottleneck (IB) principle mitigate this issue via feature compression and reconstruction to obtain compact and noise-reduced representations, they still encounter two major drawbacks. First, conventional IB employs a fixed bottleneck dimension, making it unable to accommodate sample-dependent variations in redundancy and noise. Second, simultaneously handling redundancy and noise within a single compression process leads to incomplete feature purification. In this paper, we propose a novel framework named SeD-UD, which incorporates influence-driven input-adaptive bottleneck (IDAB) modules following a hierarchically-decoupled IB strategy. Given a redundancy/noise influence factor, IDAB dynamically adjusts dimensions and selects the optimal parameters for compression and reconstruction, thereby achieving the best trade-off between information preservation and interference suppression. The IB strategy performs hierarchically-decoupled processing of redundancy and noise via separated de-redundancy and unified denoising based on IDAB modules. Extensive experiments on benchmark datasets show SeD-UD outperforms current state-of-the-art models.
Paperid: 3617,   Poster  
Authors: Chaonan Ji, Jinwei Qi, Sheng Xu, Peng Zhang, Bang Zhang
Title: FaceDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment
Abstract: Existing facial reenactment methods struggle with a trade-off between expressiveness and fine-grained controllability. Holistic facial reenactment models often sacrifice granular control for expressiveness, while methods designed for control may struggle with fidelity and robust disentanglement. Instead of treating facial motion as a monolithic signal, we explore an alternative compositional perspective. In this paper, we introduce FaceDirector, a novel framework that reframes face reenactment as a hierarchical composition task, achieving high-fidelity and controllable results. We employ a Hierarchical Motion Disentanglement and Composition strategy, deconstructing facial motion into a Spatial Layer for physical movements and a Semantic Layer for emotional content. The Spatial Layer comprises: (i) global head pose, managed via a dedicated representation and injection pathway; (ii) spatially separated local facial expressions, distilled from cropped facial regions and purged of emotional cues via an Emotion-Filtering Module leveraging an information bottleneck. The Semantic Layer contains a derived global emotion. The disentangled components are then recomposed into an expressive motion latent. Furthermore, we engineer the framework for real-time performance through a suite of optimizations, including diffusion distillation, causal attention and VAE acceleration. FaceDirector achieves streaming, high-fidelity, controllable 512x512 face reenactment at 20 FPS with an end-to-end latency of 800 ms on a single 5090 GPU.
Paperid: 3618,   Poster  
Authors: Qian Li, Rao Fu, Jiangtao Li, Fan Liu
Title: Distilling Unsigned Distance Function for Surface Reconstruction from 3D Gaussian Splatting
Abstract: Unsigned distance fields (UDFs) are well suited for representing open surfaces, but learning them from multi-view images is challenging because ground-truth surfaces are unavailable for supervision in most cases and the gradient of a UDF is undefined on the underlying surface. Prior methods optimize UDFs with global objectives and apply gradient-based priors ignoring the non-differentiability for queries on the target surface, which leads to unstable training and over-smoothing of fine details. We address these issues by distilling a patch-based UDF prior, trained on synthetic ground-truth algebraic surfaces with closed-form expressions, into a lightweight student UDF inside the Gaussian optimization process. We design a band-limited knowledge distillation strategy that leverages a pretrained patch-based UDF predictor to provide reliable near-surface UDF supervision, enabling stable student training and the recovery of high-frequency geometric details. In addition, we introduce a visibility- and geometry-aware confidence weighting that modulates teacher influence, further steering the student toward accurate surfaces in ambiguous or weakly constrained regions. Extensive experiments on various datasets demonstrate that our approach consistently improves reconstruction accuracy while maintaining competitive efficiency compared to existing UDF- and SDF-based methods.
Paperid: 3619,   Poster  
Authors: Junjie Chen, Junwei Lin, Ren Hong, Shengjie Liu, Yuming Fang, Feng Qian, Yifan Zuo
Title: Learning and Aligning Click-Aware Shape Prior for Interactive Amodal Instance Segmentation
Abstract: Amodal instance segmentation aims to segment both visible and occluded regions of object instances, which is challenging due to the lack of inference support under occlusion. Most existing methods employ prior knowledge about the object mask (shape prior) to support the amodal estimation, but the shape prior is not always compatible with object instances in the test stage. In this paper, we explore the task of interactive amodal segmentation, where a few user clicks are available for better segmenting the complete masks of object instances. For this task, we propose a novel framework based on learning and aligning a click-aware shape prior. Specifically, we propose to learn a click-aware shape prior with a triplet loss, which forces the retrieved shape priors to have higher IoU with the ground truth of the target instance and thus facilitates the prediction. Besides, considering the inevitable mismatch between the shape prior and the target instance, we propose to adaptively align the shape prior with deformable attention. Overall, our model can make full use of the interactive clicks to retrieve and align shape priors, and thus can estimate more complete masks. Extensive experiments on three benchmark datasets (i.e., KINS, D2SA and COCOA-cls) demonstrate the effectiveness of our method.
Paperid: 3620,   Poster  
Authors: Kaibing Yang, Yucheng Wang, Tingzhang Luo
Title: Assignment-Driven Hash Learning in a Hyper-Semantic Space for On-the-Fly Category Discovery
Abstract: On-the-fly Category Discovery (OCD) aims to dynamically identify both known and emerging unknown categories from streaming data, using supervision from only a limited set of labeled classes. Despite recent progress, our empirical analysis reveals fundamental limitations: existing methods suffer from cascading feature-to-hash degradation and severe space monopolization by known classes, fundamentally hindering novel category discovery. To address these coupled challenges, we introduce a principled two-stage framework. We first construct a Hyper-Semantic Space with dual geometric subspaces: a Derived Subspace employing parent–derived prototype augmentation to capture intra-class diversity and enhance inter-class discrimination, and a Calibrated Subspace synthesized through cross-prototype interpolation to impose distributional constraints and prevent representational collapse. Within this geometrically-constrained space, we perform Assignment-Driven Hash Learning, where Flexible Prototype Assignment (FPA) models intra-class variations and enhances inter-class separation, alongside Binary Hash Regularization (BHR) to enforce compact and discriminative hash representations. Our framework serves as a plug-and-play module, consistently improving state-of-the-art OCD methods across fine-grained benchmarks. Code will be released upon acceptance.
Paperid: 3621,   Poster  
Authors: Hongcan Xiao, Xinyue Xiao, Yilin Wang, Yue Zhang, Yonggang Qi
Title: 3DrawAgent: Teaching LLM to Draw in 3D with early relative experience
Abstract: Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D Bézier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that tailors the recently proposed Generalized Reward Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, i.e., each pair consisting of a relatively better and worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model’s 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bézier sketches from textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for training-free 3D sketch intelligence.
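Illustrative note (not from the paper): a 3D Bézier curve, the drawing primitive named in the abstract, is fully determined by its control points and can be evaluated with de Casteljau's algorithm. The sketch below is a generic evaluation routine; the function names and the cubic example are assumptions, and the abstract does not state the curve degree or the stroke format the agent uses.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve of any degree at parameter t in [0, 1].

    Uses de Casteljau's algorithm: repeatedly interpolate between
    consecutive control points until a single point remains.
    """
    pts = [tuple(p) for p in control_points]
    while len(pts) > 1:
        pts = [
            tuple((1 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(pts, pts[1:])
        ]
    return pts[0]


def sample_curve(control_points, n):
    """Return n evenly spaced (in t) points along the curve, n >= 2."""
    return [bezier_point(control_points, i / (n - 1)) for i in range(n)]


# Hypothetical cubic stroke in 3D, given as four (x, y, z) control points.
stroke = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (2.0, 2.0, 1.0), (3.0, 0.0, 1.0)]
polyline = sample_curve(stroke, 16)
```

A sampled polyline like this is one plausible form in which a stroke could be rendered for scoring or feedback.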
Paperid: 3622,   Poster  
Authors: Di Yang, Yaohui Wang, Shuai Shao, Francois Bremond, Jiangtao Wang
Title: PRISM: Learning a Shared Primitive Space for Transferable Skeleton Action Representation
Abstract: Real-world human action understanding remains challenging due to long-tailed label distributions, compositional motion patterns, and viewpoint variations. Existing skeleton-based methods often lack a structured and transferable representation of motion, and task-specific models for generation, classification, and detection are usually trained independently, resulting in fragmented pipelines and limited cross-task generalization. We present PRISM, a PRImitive-centric Skeleton Modeling framework that learns a shared motion representation from a motion generation objective and transfers it to perception tasks. PRISM represents each action sequence as a trajectory in a primitive coefficient space, which captures how a set of learned atomic motion primitives contribute to the observed motion. A structured decomposition module learns this representation in a physically grounded and view-invariant manner via motion generation. Instead of enforcing joint or unified training across tasks, PRISM provides a single primitive-centric representation that can be sequentially transferred to downstream classification and frame-wise detection through lightweight task heads. This representation introduces structure, compositionality, and improved generalization across distinct supervisions. PRISM consistently improves performance on long-tailed and multi-label datasets and enables interpretable reasoning over compositional and rare actions. Extensive experimental results show that the structured primitive space serves as a transferable and robust foundation for diverse action understanding tasks in real-world datasets.
Paperid: 3623,   Poster  
Authors: Guannan Lai, Da-Wei Zhou, Zhenguo Li, Han-Jia Ye
Title: The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation
Abstract: Continual Test-Time Adaptation (CTTA) aims to enable models to adapt online to unlabeled data streams under distribution shift without accessing source data. Existing CTTA methods face an efficiency–generalization trade-off: updating more parameters improves adaptation but severely reduces online inference efficiency. An ideal solution is to achieve comparable adaptation with minimal feature updates; we call this minimal subspace the golden subspace. We prove its existence in a single-step adaptation setting and show that it coincides with the row space of the pretrained classifier. To enable online maintenance of this subspace, we introduce the sample-wise Average Gradient Outer Product (AGOP) as an efficient proxy for estimating the classifier weights without retraining. Building on these insights, we propose Guided Online Low-rank Directional adaptation (GOLD), which uses a lightweight adapter to project features onto the golden subspace and learns a compact scaling vector while the subspace is dynamically updated via AGOP. Extensive experiments on classification and segmentation benchmarks, including autonomous-driving scenarios, demonstrate that GOLD attains superior efficiency, stability, and overall performance.
Paperid: 3624,   Poster  
Authors: Junbin Xiao, Jiajun Chen, Tianxiang Sun, Xun Yang, Angela Yao
Title: MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question Answering
Abstract: Long streaming video QA remains challenging due to growing visual tokens and limited reasoning length of large language models (LLMs). KV-caching stores the Key-Value (KV) of the historical tokens via LLM prefill and enables training-free and more efficient streaming QA. However, existing methods cache every one or two frames, causing redundant memory usage and losing fine-grained spatial details within frames or temporal contexts across frames. This paper proposes MuKV, a method that features a multi-grained KV cache compression module and a semi-hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at patch-, frame-, and segment-levels. The multiple levels of granularity preserve both local cues and global temporal context, while maintaining efficiency with a dual-signal token compression mechanism guided by self-attention and frequency. For online QA, MuKV designs a semi-hierarchical retrieval method to retrieve relevant KV caches for answer generation. Experiments on long-streaming VideoQA benchmarks show that MuKV significantly improves answer accuracy without sacrificing memory or online QA efficiency. Moreover, our compression mechanism alone brings consistent benefits across answer accuracy, memory, and QA efficiency over baselines, demonstrating its effectiveness.
Paperid: 3625,   Poster  
Authors: Yikang Zhang, Rui Fan
Title: VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes
Abstract: 3D Gaussian splatting (3DGS) has demonstrated impressive performance in synthesizing high-fidelity novel views. Nonetheless, its effectiveness critically depends on the quality of the initialized point cloud. Specifically, achieving uniform and complete point coverage over the underlying scene structure requires overlapping observation frustums, an assumption that is often violated in unbounded, dynamic urban environments. Training Gaussian models with partially initialized point clouds often leads to distortions and artifacts, as camera rays may fail to intersect valid surfaces, resulting in incorrect gradient propagation to Gaussian primitives associated with occluded or invisible geometry. Additionally, existing densification strategies simply clone and split Gaussian primitives from existing ones and are incapable of reconstructing geometry from missing structures. To address these limitations, we propose VAD-GS, a 3DGS framework tailored for geometry recovery in challenging urban scenes. Our method identifies unreliable geometry structures via voxel-based visibility reasoning, selects informative supporting views through diversity-aware view selection, and recovers missing structures via multi-view stereo reconstruction. This design enables the generation of new Gaussian primitives guided by reliable geometric priors, even in regions lacking initial points. Extensive experiments on the Waymo and nuScenes datasets demonstrate that VAD-GS outperforms state-of-the-art 3DGS approaches and significantly improves the quality of reconstructed geometry for both static and dynamic objects. Source code will be released upon publication.
Paperid: 3626,   Poster  
Authors: Jielun Huang, Chi-Man Pun, Guoheng Huang
Title: RevINN: An End-to-End Invertible Neural Network for Reversible Adversarial Examples Generation
Abstract: Recent studies have shown that Reversible Adversarial Examples (RAE) can mislead unauthorized deep neural networks while remaining usable for authorized users, effectively preventing image data leakage. Existing RAE methods rely on reversibly embedding perturbation information into the original adversarial examples to enable restoration. However, this two-stage process often results in RAEs with inferior attack effectiveness and visual quality compared to the original versions. To solve these challenges, we propose a novel end-to-end Invertible Neural Network for Reversible Adversarial Examples Generation (RevINN), which directly generates RAEs in one stage by scrambling the intrinsic frequency information of images. Specifically, our RevINN consists of the Cross-Frequency Modulation Attack (CFMA) module and the High-Frequency Perturbation Enhancement (HFPE) module. CFMA selectively exchanges discriminative information between low- and high-frequency wavelet components to achieve adversariality. To fully alter high-frequency semantics, HFPE innovatively employs a tri-branch structure for fine-grained modulation among high-frequency subbands, enhancing perturbation strength. Finally, the modified components are recomposed into RAEs via the inverse wavelet transform. Our RevINN is optimized with adversarial, perceptual, and invertible losses, and can restore images based on the reversibility of the wavelet operations and network modules. Extensive experiments demonstrate that our RevINN achieves state-of-the-art RAE generation quality. The code will be released to the public.
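Illustrative note (not from the paper): the restoration property described above rests on the wavelet transform being exactly invertible. The sketch below shows this for a single-level orthonormal Haar transform on a 1D signal of even length. The choice of Haar, the 1D setting, and the function names are assumptions for illustration; the abstract does not say which wavelet RevINN uses, and the CFMA and HFPE modules are not modeled here.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_forward(x):
    """Split an even-length signal into low- and high-frequency bands."""
    low = [(x[i] + x[i + 1]) * _S for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) * _S for i in range(0, len(x), 2)]
    return low, high


def haar_inverse(low, high):
    """Recombine the two bands into the original-length signal."""
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) * _S, (a - d) * _S])
    return out


signal = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 8.0]
low, high = haar_forward(signal)

# Untouched bands reconstruct the input (up to floating-point rounding).
restored = haar_inverse(low, high)

# Modifying only the high-frequency band changes the recomposed signal;
# undoing the same modification before the inverse transform restores it.
perturbed = haar_inverse(low, [2.0 * d for d in high])
```

Because every step is invertible, a receiver who can undo the band-level modification can recover the original signal, which is the general mechanism the abstract relies on.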
Paperid: 3627,   Poster  
Authors: Jianghan Xia, Hong Song, Jinfu Li, Yucong Lin, Shihan Ma, Jingfan Fan, Danni Ai, Tianyu Fu, Deqiang Xiao, Jian Yang
Title: RegionFuse: Region-Adaptive Pixel Distribution Learning for Infrared and Visible Image Fusion
Abstract: Infrared and Visible Image Fusion (IVIF) aims to combine complementary information from infrared and visible images to overcome the limitations of a single modality. While existing methods typically employ fixed or sample-adaptive fusion paradigms where fusion weights are static or derived from global pixel distributions, they often overlook spatial inconsistencies in pixel distribution within images, leading to suboptimal performance. To address this issue, we propose RegionFuse, a Region-Adaptive Pixel Distribution Learning Network for IVIF, which dynamically generates fusion weights based on local pixel distributions to construct a region-wise adaptive fusion paradigm. RegionFuse introduces a Mixture of Region Attention (MoRA) mechanism, which assigns each region to several specialized experts, enabling region-level feature interaction tailored to specific local distributions. Furthermore, we design a Region Feature Compression Module (RFCM) and place it after each MoRA to enhance informative regions and suppress redundant ones. Extensive experiments on various benchmarks demonstrate the superiority and robustness of RegionFuse, especially in handling non-uniform pixel distributions. Evaluations on NIR-VIS and downstream tasks further confirm its generalizability and practical utility.
Paperid: 3628,   Poster  
Authors: Jun Young Kim, Joo Jeon, Sangyeon Ahn, Yoonseo Park, Yong Oh, Bogyeong Kim, Sung In Cho
Title: Gradient Knows Best: Mixed-Precision Quantization via Gradient-Guided Bit Allocation for Super-Resolution
Abstract: Although deep learning-based image super-resolution (SR) models have achieved remarkable progress in reconstruction quality, their high computational and memory demands make them unsuitable for lightweight platforms. To address this issue, various quantization techniques have been introduced. Among them, mixed-precision quantization (MPQ) introduces a layer-wise bit-width allocation to balance computational efficiency with reconstruction quality. However, existing MPQ methods based on post-training quantization (PTQ) for SR models face two critical limitations. First, quantization sensitivity estimation using static statistics fails to capture the accurate quantization error induced by each layer, resulting in suboptimal bit allocation. Second, removing batch normalization (BN) to preserve high-frequency details leads to scale inconsistencies across activations, making fixed quantization ranges insufficient to accurately represent their distribution. Therefore, we propose a novel PTQ-based MPQ framework tailored for SR models. Our method estimates the quantization sensitivity of weights and activations by leveraging gradients of the objective function with respect to bit-widths, enabling adaptive layer-wise bit allocation and fast convergence. Additionally, we introduce a dynamic activation range normalization that alleviates the distributional imbalance caused by the absence of BN, ensuring stable quantization under fixed range constraints. Our method outperforms existing PTQ-based methods by 1.26 dB in peak signal-to-noise ratio (PSNR) on the Urban100 dataset and reduces quantization time by 1.9× for 3-bit quantization of EDSR ×4.
Paperid: 3629,   Poster  
Authors: Suyi Jiang, Gim Hee Lee
Title: PhysHO: Physics-Based Dynamic 3D Gaussian Human and Object from Monocular Video
Abstract: Physically plausible reconstruction of human–object dynamics from a single video remains underexplored in physics-based methods. Most prior approaches omit human-generated internal actuation by assuming motion driven solely by gravity and simple contacts. They also rely on idealized constitutive laws that underfit heterogeneous and anisotropic materials. We introduce PhysHO, which tightly couples SMPL-driven Linear Blend Skinning (LBS) with a Material Point Method (MPM) simulator to address these gaps. Our key insight is to use LBS as an interpretable actuation prior and MPM to propagate those forces through contact under physical constraints. Concretely, we derive targeted actuation with a PD controller guided by LBS trajectories and gate it per particle via a learnable LBS-impact factor so that only particles inside the SMPL volume are directly actuated. We model real materials with residual neural constitutive laws layered on expert elastic–plastic models and conditioned per particle to capture heterogeneity and anisotropy. We stabilize monocular learning with structure-preserving 3D flow supervision and a progressive, loss-balanced training schedule. PhysHO reconstructs observed dynamics with high fidelity, predicts future motion, and simulates outcomes under novel human actions. Experimental results demonstrate robust human-driven interactions beyond gravity-only scenes. Our code will be released upon acceptance.
Paperid: 3630,   Poster  
Authors: Chandrakanth Gudavalli, Tajuddin Manhar Mohammed, Abhay Yadav, Ananth Bhaskar, Hardik Prajapati, Cheng Peng, Rama Chellappa, Shivkumar Chandrasekaran, B.S. Manjunath
Title: WRIVINDER: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery
Abstract: Aligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth–based metric cues to produce a stable zenith-view rendering that can be directly matched to satellite context for metrically accurate camera geo-localization. To support systematic evaluation of this task, which lacks suitable benchmarks, we also release MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments. Together, Wrivinder and MC-Sat provide a first comprehensive baseline and testbed for studying geometry-centered cross-view alignment without paired supervision. In zero-shot experiments, Wrivinder achieves sub-30m geolocation accuracy across both dense and large-area scenes, highlighting the promise of geometry-based aggregation for robust ground-to-satellite localization. The MC-Sat dataset and Wrivinder codebase will be publicly released.
Paperid: 3631,   Poster  
Authors: Kihwan Yoon, Juyeon Shin, Jeongheum Kang, Sijung Kim, Minyong Jeon
Title: XPaintNet: An eXtreme Lightweight Framework for Stereoscopic Conversion without Inpainting Network
Abstract: With the rapid growth of stereoscopic 3D devices, real-time stereoscopic conversion has become increasingly essential. However, most existing approaches rely on depth estimation, forward warping, and heavy inpainting networks, resulting in high computational cost and artifacts near occlusion boundaries. Diffusion-based models have also been explored, but they suffer from iterative sampling and geometric inconsistency, making them unsuitable for real-time deployment. To address these issues, we propose Bi-Warp, a simple yet effective approach that synthesizes the right view without an inpainting network by leveraging warping operations. Our approach estimates backward flow, approximates the corresponding forward flow, and generates two candidate right views via bidirectional warping. A learnable mask adaptively fuses the candidates, preserving left–right geometric consistency. Building on Bi-Warp, we introduce XPaintNet, a lightweight network that achieves visual quality comparable to state-of-the-art methods while maintaining real-time performance of over 100 FPS at 2K resolution.
Paperid: 3632,   Poster  
Authors: Hao Wu, Xiangyang Luo, Hao Wang, Jiawei Zhang, Yi Zhang, Jinwei Wang
Title: IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation
Abstract: With the rapid advancement of diffusion models, talking face generation has made remarkable progress. However, existing diffusion-based methods still require task-specific fine-tuning and large-scale audiovisual datasets, resulting in high computational costs that hinder accessibility for resource-constrained researchers and small teams. To address this, we propose a fine-tuning-free paradigm that directly performs talking face generation using the pretrained weights of Stable Diffusion and IP-Adapter. This backbone leverages the visual embedding capability of IP-Adapter to mine lip-related semantics from the pretrained Stable Diffusion. To address the challenges of identity drift, synchronization errors, and temporal instability, we also design three trainable-parameter-free components: 1) the Structurist, which explicitly disentangles and reassembles lip and appearance features to mitigate identity drift and appearance distortion; 2) the Structure Controller, which adaptively refines embeddings based on quasi-monotonic motion trends for precise lip synchronization; and 3) the Noise Sensor, which introduces a Gaussian prior to detect and suppress flicker and jitter artifacts and enhance temporal consistency. Experimental results show that our method outperforms existing SOTA approaches in both lip-sync accuracy (at least 0.16 gain in PCLD) and visual fidelity (at least 0.7 improvement in FID), establishing a novel fine-tuning-free diffusion framework for talking face generation.
Paperid: 3633,   Poster  
Authors: Yuena Lin, Haichun Cai, Yi Shan, Hao Wei, Yongjian Deng, Zhen Yang, Gengyu Lyu
Title: DF$^2$-VB: Dual-level Fuzzy Fusion with View-specific Boosting for Multi-view Multi-label Classification
Abstract: Multi-view multi-label classification (MVMLC) aims to utilize both consensus and complementarity information to predict potentially relevant labels for samples. Existing MVMLC approaches typically focus on either feature-level fusion, which integrates complementary features for more expressive representations, or decision-level fusion, which aggregates view-specific predictions to exploit label supervision more effectively. In fact, relying solely on feature-level fusion often underutilizes label information and limits discriminability of learned representations, whereas pure decision-level fusion pays insufficient attention to view representation expressiveness and thus constrains classification performance. To address these limitations, we propose DF$^2$-VB, a dual-level fusion framework that jointly exploits complementary strengths to mitigate their respective weaknesses by integrating feature- and decision-level fusion. At the feature level, a Fuzzy Dynamic Fusion (FDF) module maps consensus features into a more compatible fuzzy feature space, where essential features are identified and redundant features are suppressed to further fuse an expressive consensus representation and boost view-specific predictions for decision-level fusion. At the decision level, a View-specific Boosting (VB) strategy adaptively measures the importance of samples and view-specific predictions to strengthen the utilization of supervision for facilitating the discriminability in feature-level fusion. Complementarily, FDF and VB jointly reinforce the model expressiveness and discriminability for reliable predictions. Extensive experiments on multiple public datasets verify the superiority of our strategy over advanced MVMLC models.
Paperid: 3634,   Poster  
Authors: Gaowei Zhang, Lihe Zhang
Title: DLVP-CLIP: Enhancing Fine-Grained Zero-Shot Anomaly Detection via Dynamic Local Visual Prompting
Abstract: Zero-shot anomaly detection (ZSAD) aims to utilize auxiliary data to train models for generalized learning of unseen categories, which has important application value in fields such as industrial quality inspection and medical diagnosis. Although methods based on CLIP show potential, their pre-training objective of focusing on overall semantic alignment between images and text makes the model insensitive to local details, which is inherently contradictory to the need for fine-grained local features in anomaly detection. Existing improvement methods rely on predefined text prompt frameworks to perceive local information, but struggle to effectively address the issue of insufficient local perception. To address this, this paper proposes a dynamic local visual prompting method based on CLIP (DLVP-CLIP). DLVP dynamically identifies and extracts local visual features from key regions in images as prompt tokens using the Semantic-Aware Local Feature Selector (SLFS) module, and utilizes the multi-modal local prompt (MLoP) module to jointly optimize representations in both visual and textual spaces, achieving more precise cross-modal alignment. Additionally, the high-low frequency decomposition module (HFD) is introduced to separate and process global structural and local textural information via wavelet transformation, thereby enhancing detail perception. Extensive experiments on 13 anomaly detection datasets demonstrate that DLVP-CLIP achieves outstanding ZSAD performance on datasets from the industrial and medical domains.
Paperid: 3635,   Poster  
Authors: Hyunsoo Cha, Wonjung Woo, Byungjun Kim, Hanbyul Joo
Title: Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
Abstract: We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front–back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment–posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.
Paperid: 3636,   Poster  
Authors: Xinkuan Qiu, Meina Kan, Zhenliang He, Yongbin Zhou, Shiguang Shan
Title: Revisiting Visual Corruptions in LVLMs: A Shape–Texture Perspective on Model Failures
Abstract: Large vision–language models (LVLMs) are highly vulnerable to visual corruptions, substantially compromising their reliability and limiting real-world deployment. Prior work has attributed this degradation primarily to insufficient visual grounding and overreliance on language priors. However, these explanations often overlook the heterogeneous nature of corruptions, which perturb model perception in fundamentally different ways. We revisit this problem from a corruption-centric perspective and show that diverse corruptions can be organized along two complementary perceptual dimensions, shape and texture, which induce distinct failure modes. To address them, we propose Shape–Texture Dual-Path Contrastive Decoding (ST-CD), a training-free inference framework that constructs complementary contrastive views to diagnose and correct shape- and texture-induced biases through adaptive fusion. Experiments across multiple LVLMs and robustness benchmarks demonstrate that ST-CD consistently improves robustness under heterogeneous corruptions, suggesting that leveraging the complementarity between shape and texture provides a general and effective principle for building robust multimodal models.
Paperid: 3637,   Poster  
Authors: Xinlong Li, Di Lin, Shaoyiyi Gao, Yaxuan Liu, Jixian He, Jiaxin Li, Ruonan Liu, Qing Guo, Kairui Yang, Wei Feng
Title: HOPS: Hierarchical Open-vocabulary Part Segmentation with Attention-Aware Filtering and Affinity-Guided Enhancement
Abstract: Open-vocabulary part segmentation (OVPS) aims to segment objects into fine-grained parts while generalizing to unseen categories. Existing VLM-based methods face two challenges: (1) object over-segmentation, caused by overly broad semantic activations, and (2) part under-segmentation, resulting from weak fine-grained perception. To address these issues, we propose HOPS, a two-stage framework for hierarchical open-vocabulary part segmentation. HOPS introduces a bidirectional semantic–structural attention fusion mechanism that integrates CLIP’s semantic alignment with DINO’s structural perception. In the object segmentation stage, the Attention-Aware Filtering Module (AFM) refines cross-modal similarity maps via semantic–structural attention to suppress object over-segmentation. In the part segmentation stage, the Affinity-Guided Enhancement Module (AEM) iteratively propagates part responses to progressively expand activation regions, effectively mitigating part under-segmentation. Experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet demonstrate that HOPS achieves state-of-the-art performance with superior generalization.
Paperid: 3638,   Poster  
Authors: Bo Sun, Peixi Peng, Guang Tan, Haoran Xu, Yaokun Li, Yiqian Chang, Shuaixian Wang, Luntong Li
Title: Resolving the Stability-Plasticity Dilemma in Reinforcement Learning via Complementary Continual Critics
Abstract: This paper proposes the Continual Dual-Critic with Cross-Attention (CD-CCA) framework for visual reinforcement learning to address the plasticity-stability conflict. Our method introduces continual learning techniques into the visual RL architecture, constructing two complementary critics using Continual Backpropagation (CBP) and Elastic Weight Consolidation (EWC): one for maintaining representational plasticity for rapid environmental adaptation, and the other for preserving knowledge stability to prevent catastrophic forgetting. Furthermore, we design a cross-attention-based fusion mechanism that balances the value estimates from the dual critics according to observation characteristics. Experimental results on DeepMind Control and CARLA benchmarks show that CD-CCA effectively mitigates representation drift and policy degradation. Compared to existing visual RL methods, our approach exhibits enhanced robustness and adaptability in non-stationary environments and long-horizon decision-making tasks, providing a new architectural paradigm for the advancement of continual reinforcement learning.
Paperid: 3639,   Poster  
Authors: Chengan Che, Chao Wang, Xinyue Chen, Sophia Tsoka, Luis Carlos Garcia Peraza Herrera
Title: A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett–Luce Ranking
Abstract: Procedural activities, ranging from routine cooking to complex surgical operations, are highly structured as a set of actions conducted in a specific temporal order. Despite their success on static images and short clips, current self-supervised learning methods often overlook the procedural nature that underpins such activities. We expose the lack of procedural awareness in current SSL methods with a motivating experiment: models pretrained on forward and time-reversed sequences produce highly similar features, confirming that their representations are blind to the underlying procedural order. To address this shortcoming, we propose PL-Stitch, a self-supervised framework that harnesses the inherent temporal order of video frames as a powerful supervisory signal. Our approach integrates two novel probabilistic objectives based on the Plackett-Luce (PL) model. The primary PL objective trains the model to sort sampled frames chronologically, compelling it to learn the global workflow progression. The secondary objective, a spatio-temporal jigsaw loss, complements the learning by capturing fine-grained, cross-frame object correlations. Our approach consistently achieves superior performance across five surgical and cooking benchmarks. Specifically, PL-Stitch yields significant gains in surgical phase recognition (e.g., +11.4 pp k-NN accuracy on Cholec80) and cooking action segmentation (e.g., +5.7 pp linear probing accuracy on Breakfast), demonstrating its effectiveness for procedural video representation learning.
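Note: the chronological sorting objective described above can be illustrated with the standard Plackett-Luce negative log-likelihood. This is a minimal sketch under stated assumptions, not the PL-Stitch implementation: the function name `plackett_luce_nll`, the convention that a higher score means "earlier in time", and the toy scores are all hypothetical, and the abstract does not specify how PL-Stitch parameterizes or samples frames.

```python
import math

def plackett_luce_nll(scores):
    """Negative log-likelihood of the ordering given by list position.

    scores[k] is the (hypothetical) model score for the frame that truly
    comes k-th in time; a higher score means "should be placed earlier".
    """
    nll = 0.0
    for k in range(len(scores)):
        remaining = scores[k:]      # frames that have not been placed yet
        m = max(remaining)          # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in remaining))
        nll -= scores[k] - log_z    # log-probability that frame k is picked next
    return nll

aligned = plackett_luce_nll([3.0, 2.0, 1.0])    # scores agree with the true order
reversed_ = plackett_luce_nll([1.0, 2.0, 3.0])  # scores contradict the true order
uniform = plackett_luce_nll([0.0, 0.0, 0.0])    # uninformative scores
```

With uninformative scores the loss equals the log of the number of possible orderings (log 3! for three frames), and it decreases as the scores come to agree with the true temporal order.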
Paperid: 3640,   Poster  
Authors: Zehan Zhang, Yaoyi Li, Neng Zhang, Jia Cai
Title: Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving
Abstract: Learning-based motion planners, despite recent progress, often suffer from temporal inconsistency. Small perturbations across frames can accumulate into unstable trajectories, degrading comfort and safety in closed-loop driving. Several methods attempt to inject history as a static conditioning signal to stabilize outputs, only to induce the planner to copy historical patterns instead of adapting to environment contexts. To address this limitation, we propose Diffusion Forcing Planner (DFP), a diffusion-based planning framework driven by history-guided control. Specifically, DFP decomposes the full trajectory into history, current, and future segments, and assigns independent noise levels to each segment. The model jointly denoises the historical and the future segments, enforcing a heterogeneous joint diffusion process. At inference, classifier-free guidance (CFG) is applied to steer future sampling using annealed history in a controllable manner. Closed-loop evaluation and comprehensive ablations on nuPlan show that DFP achieves competitive performance while producing continuous, stable, and controllable motion plans in complex driving scenarios.
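Note: the classifier-free guidance step mentioned above follows a standard combination rule, sketched here in plain Python. This shows only the generic CFG formula; how DFP anneals history or makes the guidance time-dependent is not stated in the abstract, so the function name `cfg_combine`, the list-of-floats representation, and the example values are assumptions for illustration.

```python
def cfg_combine(uncond, cond, weight):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward (or past) the conditional one by `weight`."""
    return [u + weight * (c - u) for u, c in zip(uncond, cond)]

# Hypothetical denoiser outputs for one trajectory step, without and with
# history conditioning.
pred_without_history = [0.0, 1.0]
pred_with_history = [1.0, 1.0]

no_guidance = cfg_combine(pred_without_history, pred_with_history, 0.0)
plain_cond = cfg_combine(pred_without_history, pred_with_history, 1.0)
amplified = cfg_combine(pred_without_history, pred_with_history, 2.0)
```

A weight of 0 ignores the history condition, 1 reproduces the conditional prediction, and values above 1 extrapolate beyond it, which is the usual knob for trading adherence to the condition against diversity.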
Paperid: 3641,   Poster  
Authors: Fengbei Liu, Sunwoo Kwak, Nusrat Binta Nizam, Ilan Richter, Ashley Beecy, Jayant Raikhelkar, Deborah Estrin, Mert Sabuncu
Title: RNED: Rotary Number Encoding and Decoding for Quantitative Medical VLM Analysis
Abstract: Vision-Language Models (VLMs) are increasingly adopted for medical applications, but their clinical utility is limited by a core weakness in quantitative reasoning. This limitation affects tasks ranging from regression of lesion sizes to prediction of bounding-box coordinates and stems from the discrete tokenization schemes underlying Large Language Models (LLMs). To address this, we propose Rotary Number Encoding and Decoding (RNED), a principled method for embedding continuous numerical values directly in the representation space of a VLM. Analogous to rotary position encoding, RNED represents a scalar by applying a number-specific rotation matrix to a dedicated numeric token embedding. This norm-preserving transformation maintains ordinal structure over a wide numerical range and integrates seamlessly with pretrained model weights. For decoding, we introduce a robust score-matching–based scheme to recover continuous values from hidden states in the presence of stochastic noise. We evaluate RNED on two quantitative tasks: radiological measurement estimation and medical visual grounding. On both internal and public benchmarks, RNED consistently outperforms existing VLM baselines. Together, these results show that RNED offers a robust, generalizable solution for numerical reasoning in medical VLMs, enabling models that are both quantitatively reliable and clinically applicable. We will release code for experiments on public datasets.
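Note: the encoding idea above (rotating a dedicated token embedding by a number-specific rotation) can be sketched by analogy with rotary position encoding. This is an illustration under assumptions, not the RNED implementation: the geometric frequency schedule is borrowed from RoPE, the names `rotary_number_encode` and `base_freq` are hypothetical, and the score-matching decoder is not shown.

```python
import math

def rotary_number_encode(base, value, base_freq=10000.0):
    """Rotate consecutive pairs of `base` by angles proportional to `value`.

    Each 2-D pair uses its own frequency (assumed RoPE-style schedule), so
    the result is a block-diagonal rotation of the base embedding.
    """
    d = len(base)
    assert d % 2 == 0, "embedding dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = value * base_freq ** (-i / d)   # angle for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = base[i], base[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

numeric_token = [1.0, 0.0, 0.5, 0.5]            # toy "dedicated numeric token"
enc_zero = rotary_number_encode(numeric_token, 0.0)
enc_small = rotary_number_encode(numeric_token, 1.5)
enc_large = rotary_number_encode(numeric_token, 42.0)
```

Because every pair is rotated rather than scaled, the encoded vector keeps the norm of the base embedding for any scalar, which matches the norm-preserving property the abstract highlights.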
Paperid: 3642,   Poster  
Authors: Zhipeng Liu, Chunbo Luo
Title: CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection
Abstract: Vision–language models (VLMs) enable text-guided object detection but degrade severely under cross-view scenarios where ground and aerial viewpoints differ in altitude, scale, and spatial layout. These geometric changes introduce systematic complexity variations between viewpoints, e.g., ground-view images contain dense and highly occluded structures, while aerial images are sparse and globally organized. Fixed VLM fusion mechanisms cannot handle this discrepancy. We propose CrossVL, a framework combining Complexity-Aware Pathway Aggregation (CPA) and Paired Curriculum Learning (PCL) to enhance cross-view detection with VLMs. CPA estimates scene complexity from multimodal statistics and routes visual features through multiple pathways to obtain view-specific representations. PCL leverages semantic consistency of synchronized ground–aerial pairs to provide stable early supervision and then gradually shifts toward randomized sampling. On MAVREC, CrossVL improves Florence-2’s aerial mAP from 58.66% to 61.03% and reduces the ground-aerial performance gap from 8.63pp to 6.65pp, while also achieving a 3.3× reduction in variance across random seeds. CPA provides stable complexity-aware feature aggregation, and PCL enhances optimization dynamics. Together, they demonstrate that coordinated architectural and training adaptations are crucial for robust cross-view VLM detection.
Paperid: 3643,   Poster  
Authors: Weiwei Duan, Luping Ji, Shipeng Lei, Sicheng Zhu, Jianghong Huang, Mao Ye
Title: CHAL: Causal-guided Hierarchical Anomaly-aware Learning for Moving Infrared Small Target Detection
Abstract: Infrared small target detection is a highly specialized category of object detection that must cope with tiny target sizes and cluttered backgrounds. Almost all existing methods are target-centered, directly learning target features from backgrounds. However, because target signals are weak, they often struggle to capture stable features, and sometimes cannot even distinguish real targets from background confounders. To overcome these problems, we take the opposite perspective and propose the first Causal-guided Hierarchical Anomaly-aware Learning (CHAL) framework. Breaking with the target-centered paradigm, it focuses on background learning and treats targets as anomalies in the background. Specifically, a spatio-temporal neural field is designed to model background evolution patterns from a generative perspective. Meanwhile, a hierarchical anomaly-aware learning scheme is proposed to decompose anomaly discovery. Furthermore, to block the spurious correlations often caused by background confounders and to enhance true target causality, a causal-guiding mechanism is designed. Experiments on three infrared datasets verify the effectiveness and superiority of CHAL. It also adapts well to visible-light scenarios. Source code will be released.
Paperid: 3644,   Poster  
Authors: Guojun Xu, Mingyang Zhang, Jianwen Xiang, Cheng Tan, Yanchao Yang, Junwei Zhou
Title: Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates
Abstract: Distributed Image Compression (DIC) is crucial for multi-view transmission, especially when operating at extremely low bitrates (< 0.1 bpp). Its core challenge is effectively utilizing side information to achieve high-quality reconstruction under strict bitrate budgets. However, existing DIC approaches struggle to exploit global context and object-level details from side information, leading to local blurring and the loss of fine details in the reconstruction. To address these limitations, we propose a Multimodal DIC framework (MDIC), which, for the first time, brings multimodal side information into the DIC paradigm, effectively preserving fine-grained local details and enhancing global perceptual quality in reconstructed images. Specifically, we introduce a text-to-image diffusion-based decoder conditioned on textual side information extracted from correlated images to capture shared global semantics. Moreover, we design a feature-mask generator, supervised by a multimodal fine-grained alignment task, to strengthen the exploitation of visual side information. The generated mask serves two purposes: it first guides the extraction of fine-grained details from losslessly transmitted side information to preserve the semantic consistency of reconstructed details; then, it regulates the extraction of clustered feature representations from the quantized VQ-VAE embeddings, compensating for category information lost under the extreme compression of the primary image. Extensive experiments on the widely used KITTI Stereo and Cityscapes datasets demonstrate that MDIC achieves state-of-the-art perceptual quality at extremely low bitrates.
Paperid: 3645,   Poster  
Authors: Qingan Zhang, Wensheng Li, Chengying Gao
Title: IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors
Abstract: Applying 3D Gaussian Splatting to inverse rendering, especially for relightable assets under high-illuminance conditions, remains challenging. Strong specular highlights and complex reflections complicate material-light disentanglement, often baking in shadows and losing specular detail. To address this, we introduce IR-HGP, a framework that achieves robust disentanglement using three synergistic modules. First, a Hybrid Visibility Decomposition module ensures physical visibility consistency. Second, a Generative Illumination Field Prior module infers detailed, high-dynamic-range environmental lighting. Finally, a Physics-Aware Radiance Correction module stabilizes optimization and mitigates illumination artifacts. Our framework achieves SOTA material recovery and relighting performance, outperforming existing methods under challenging illumination conditions. It reconstructs the view-dependent “shiny” appearance of reflective surfaces in real time, surpassing the limits of prior 3DGS-based inverse rendering methods.
Paperid: 3646,   Poster  
Authors: Muyang Li, Yucheng Liu, Jianbo Ma, Elliot Osborne, Bo Han, Tongliang Liu
Title: Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
Abstract: Vision-Language Models (VLMs) have enhanced traditional LLMs with visual capabilities through the integration of vision encoders. While recent works have explored various combinations of vision encoders and LLMs, a principled understanding of what makes a vision encoder suitable for VLM alignment is still lacking. In this paper, we systematically investigate this question via comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing encoders with the largest size or highest zero-shot accuracy, consistently fail to identify optimal models. In fact, these metrics show only weak to moderate correlation with VLM performance. This intriguing finding raises a fundamental question: what factors of vision encoders matter in VLMs? Through comprehensive analysis, we identify that the structural similarity across modalities plays a crucial but previously overlooked role in vision-encoder selection, which we measure using the Gromov-Wasserstein distance as a proxy. From a theoretical perspective, we show that the learnability of cross-modality mapping can be provably associated with the Gromov-Wasserstein distance. Empirical verification on 60+ full VLM training runs shows that our proposed inference-only metric performs significantly better than alternative model selection strategies and exhibits a much stronger correlation with final VLM performance, thereby enabling efficient and effective prediction of VLM performance before full training.
Paperid: 3647,   Poster  
Authors: Wenxi Li, Jingchen Huang, Chenyang Lyu, Mo-Ran Liu, Haozhe Lin, Guiguang Ding, Yuchen Guo
Title: ElasticFormer: Detecting Objects in HRW Shots via Elastic Computing Vision Transformer
Abstract: Recent advances in gigapixel-level imaging have brought High-Resolution Wide (HRW) shots to the forefront of research. However, these images present significant challenges: extreme sparsity of foreground, gigapixel-level resolutions and diverse target counts. This makes traditional close-up detectors inaccurate and slow as they are overwhelmed by the background. Although previous research has explored sparse backbones, their fixed sparsity patterns lack the adaptability required to handle diverse target numbers. To address this, we introduce ElasticFormer, a sparse backbone that dynamically allocates computational resources based on foreground proportion. After scoring windows based on variance, the proposed ElasticSelector module predicts the foreground proportion for top-k selection. The mechanism guides the model to select target-containing windows, scaling resources in areas where objects are clustered. We introduce a novel loss function combined with a 3-phase training strategy for ElasticSelector, allowing it to function properly when bounding box annotations are missing. A WSOD study is carried out on PASCAL VOC 2007 to evaluate its extensibility. Further, ElasticNet is created to verify its backbone-agnostic nature. In experiments on the PANDA gigapixel benchmark, ElasticFormer reduces backbone FLOPs by 80% while achieving a significant improvement in AP_50 when compared to fixed-ratio sparse methods.
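Editorial note: the variance-scored top-k window selection described above might look roughly like the following minimal sketch. It is not the authors' ElasticSelector: the function name `select_windows` is hypothetical, windows are flat lists of pixel intensities, and the foreground proportion, which the paper predicts with a learned module, is simply passed in as a number.

```python
import math
from statistics import pvariance


def select_windows(windows, foreground_ratio):
    """Return indices of the top-k windows ranked by pixel variance.

    windows: list of flat lists of pixel intensities (hypothetical input format).
    foreground_ratio: fraction of windows to keep, in (0, 1]; in the paper
        this value is predicted, here it is assumed to be given.
    """
    if not 0.0 < foreground_ratio <= 1.0:
        raise ValueError("foreground_ratio must be in (0, 1]")
    # Score each window by its population variance (flat background scores near 0).
    scores = [pvariance(w) for w in windows]
    # k scales with the foreground proportion instead of being a fixed ratio.
    k = max(1, math.ceil(foreground_ratio * len(windows)))
    ranked = sorted(range(len(windows)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    demo = [[0, 0, 0, 0], [0, 10, 0, 10], [5, 5, 5, 6], [1, 9, 2, 8]]
    print(select_windows(demo, 0.5))
```

The point of the sketch is only that k follows the foreground proportion, so a crowded scene keeps more windows than a sparse one.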
Paperid: 3648,   Poster  
Authors: Xiang Yang, Feifei Li, Mi Zhang, Geng Hong, Xiaoyu You, Min Yang
Title: SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers
Abstract: Recent Text-to-Image (T2I) models based on rectified-flow transformers (e.g., SD3, FLUX) achieve high generative fidelity but remain vulnerable to unsafe semantics, especially when triggered by multi-token interactions. Existing mitigation methods largely rely on fine-tuning or attention modulation for concept unlearning; however, their expensive computational overhead and design tailored to U-Net-based denoisers hinder direct adaptation to transformer-based diffusion models (e.g., MMDiT). In this paper, we conduct an in-depth analysis of the attention mechanism in MMDiT and find that unsafe semantics concentrate within interpretable, low-dimensional subspaces at head level, where a finite set of safety-critical heads is responsible for unsafe feature extraction. We further observe that perturbing the Rotary Positional Embedding (RoPE) applied to the query and key vectors can effectively modify some specific concepts in the generated images. Motivated by these insights, we propose SafeRoPE, a training-free and fine-grained safe generation framework for MMDiT. Specifically, SafeRoPE first constructs head-wise unsafe subspaces by decomposing unsafe embeddings within safety-critical heads, and computes a Latent Risk Score (LRS) for each input vector via projection onto these subspaces. We then introduce head-wise RoPE perturbations that can suppress unsafe semantics without degrading benign content or image quality. SafeRoPE combines head-wise LRS and RoPE perturbation to perform risk-specific head-wise rotation on embeddings, enabling precise suppression of unsafe outputs while maintaining generation fidelity. Extensive experiments demonstrate that SafeRoPE achieves SOTA performance in balancing effective harmful content mitigation and utility preservation for safe generation with MMDiT.
Paperid: 3649,   Poster  
Authors: Muxi Chen, Zhaohua Zhang, Chenchen Zhao, Mingyang Chen, Wenyu Jiang, Tianwen Jiang, Jianhuan Zhuo, Yutang Yutang, Qiuyong Xiao, Jihong Zhang, Qiang Xu
Title: $\textbf{FailureAtlas}$: Mapping the Failure Landscape of T2I Models via Active Exploration
Abstract: Static benchmark-driven evaluation has provided a valuable foundation for analyzing Text-to-Image (T2I) models. However, the fixed and predetermined prompt sets in benchmarks inherently limit diagnostic depth, making it difficult to uncover the full landscape of models' systematic failures or isolate their root causes. We argue for a complementary paradigm: active exploration, and introduce FailureAtlas, the first framework designed to autonomously explore and map the vast failure landscapes of T2I models at scale. Unlike benchmarks that evaluate a fixed prompt set, FailureAtlas performs guided exploration in the input space, framing error discovery as a structured search for minimal, failure-inducing concepts. While this is a computationally explosive problem, we make it tractable with novel acceleration techniques. When applied to Stable Diffusion models, our method uncovers hundreds of thousands of previously unknown error slices (e.g., over 247,000 in SD1.5 alone) and provides the first large-scale evidence linking these failures to data scarcity in the training set. By providing a principled and scalable engine for deep model auditing, FailureAtlas establishes a new, diagnostic-first methodology to guide the development of more robust generative AI.
Paperid: 3650,   Poster  
Authors: Chenxi Du, Yongheng Deng, Jiani Liu, Yujia Zhang, Xi Chen, Ju Ren
Title: CoIn: Coverage and Informativeness-Guided Token Reduction for Efficient Large Multimodal Models
Abstract: Large Multimodal Models (LMMs) have shown remarkable success in visual understanding tasks. LMMs encode visual and textual inputs into tokens, which are then processed by Large Language Models (LLMs). However, the large number of visual tokens poses a major bottleneck for inference efficiency and memory usage. Reducing visual tokens is a promising training-free solution, but existing methods remain limited. Importance-based approaches suffer from poor generalization, are incompatible with kernel-level inference optimizations, and only consider information from a single modality. Diversity-based strategies typically focus on pairwise token redundancy and treat all tokens as equally important. Recent attempts to sequentially combine importance and diversity criteria still fail to address the intrinsic drawbacks of their underlying metrics. To address these limitations, we reformulate visual token reduction as an optimal subset selection problem jointly guided by two complementary objectives: informativeness and coverage. Informativeness is quantified through per-token intrinsic saliency and visual–textual alignment, while coverage is enforced via a volume-based subset selection criterion that ensures global representativeness in the visual feature space. This joint formulation effectively integrates visual saliency, cross-modal alignment, and global coverage in an end-to-end token selection process, yielding a computationally efficient, model-agnostic framework compatible with modern inference accelerators. Extensive experiments demonstrate that CoIn substantially reduces computation and memory cost while maintaining strong task performance. We will release our code once accepted.
Paperid: 3651,   Poster  
Authors: Zeyu An, Wanyu Lin, Feng Tan, Shujun Wang
Title: MMCP-GEN: A Modality-Extensible Diffusion Language Model for Conditional Protein Sequence Generation
Abstract: Recent advances in diffusion-based language models (DLMs) have shown remarkable potential for de novo protein design. However, enabling controllable protein generation requires integrating diverse biological conditions, such as structure, functions, and chemical interactions, each represented in distinct modalities. Existing approaches often either support a single condition or treat multiple conditions through separate modality-specific encoders. This isolation limits cross-modal interaction, reduces generation quality, and complicates the incorporation of new conditions without retraining or redesigning the backbone. To address these limitations, we introduce MMCP-GEN, a DLM for Multi-Modal, Multi-Condition Protein sequence GENeration. MMCP-GEN establishes a new paradigm for controllable protein generation under complex multimodal constraints. Its core is a modality-composable and extensible conditioning mechanism that fuses heterogeneous biological conditions via learnable queries and modality-indicator heads, enabling disentangled, extensible, and cross-modal condition integration without retraining the backbone. A joint generation–and–scoring objective further aligns sequence recovery with structural fidelity. Empirically, MMCP-GEN achieves state-of-the-art performance across structure-, function-, and ligand-conditioned tasks, improving sequence recovery by up to 5% and outperforming attentive baselines in diverse functional annotation tasks. These results establish MMCP-GEN as a general and high-fidelity framework for controllable protein generation.
Paperid: 3652,   Poster  
Authors: Zhicong Tang, Jingye Chen, Zhao Zhang, Mohan Zhou, Yuchi Liu, Yifan Pu, Yalong Bai, Ethan Smith, Yuhui Yuan
Title: Masked Region Transformer for Layered Image Generation and Editing at Scale
Abstract: Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of the generated visual content, analogous to word-level editing in natural language. Despite its importance, this remains an underexplored area at scale. To address this gap, we present the Masked Region Transformer, a 20B-parameter diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make three key technical contributions. First, we unify three complementary tasks (text-to-layers, image-to-layers, and layers-to-layers) within a shared masked region diffusion framework, where selective token masking enables flexible cross-modal generation and fine-grained layer-wise editing. Second, we design an efficient conditional diffusion decoder that incorporates Gated DeltaNet and gated attention mechanisms, enhancing visual fidelity while maintaining computational efficiency. Third, we introduce an overflow-aware canvas layer to handle boundary inconsistencies and support semi-transparent background synthesis, enabling complete editable layer generation beyond visible canvas boundaries. Additionally, we apply distribution matching distillation to achieve one-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches across all three tasks, establishing a new benchmark for region-aware transparent image generation.
Paperid: 3653,   Poster  
Authors: Weicheng Du, Wenjia Meng, Zhengzhe Zhang, Yilong Yin, Xiankai Lu
Title: TSTM: Temporal Segmentation for Task-related Mask in Visual Reinforcement Learning Generalization
Abstract: Achieving strong policy generalization to unseen environments remains a core challenge in visual reinforcement learning, and segmenting task-relevant regions to mitigate the influence of irrelevant visual cues has emerged as a promising direction. However, existing methods rely solely on the current observation, lack temporal information, and fail to exploit preceding observations, leaving learned policies susceptible to task-irrelevant background variations and ultimately degrading policy performance. In this paper, we propose temporal segmentation for task-relevant masks in visual reinforcement learning, named TSTM, which extracts task-relevant regions from sequential observations by exploiting temporal information, thereby producing more reliable masks and improving policy generalization. TSTM introduces a temporal segmentation network with an encoder-temporal-decoder architecture, where a convolutional LSTM module captures temporal dependencies across observations. To reduce inference overhead, we further develop a lightweight student network as an efficient substitute for the teacher network. The resulting task-relevant masks are encoded by a CNN-based encoder, and invariant representation learning is employed to improve robustness by enforcing consistency between representations of the original and augmented observation sequences. With these task-relevant representations, we train an actor-critic agent to learn a policy with strong generalization capability. Experimental results demonstrate that TSTM achieves superior generalization performance over existing state-of-the-art methods on most visual RL tasks.
Paperid: 3654,   Poster  
Authors: Shuilian Yao, Qi Jia, Qi Jia, Zhang pengshuo, Lili Sun, Weimin Wang, Yanmei Zhu, Bo Zhang, Xin Fan
Title: Dual-Level Hypergraph Generation for Addressing Feature Scarcity in Whole-Slide Image Classification
Abstract: Lymph node metastasis diagnosis in pathological images is a highly challenging four-class classification task, comprising macrometastasis, micrometastasis, isolated tumor cells (ITC), and negative lesions. Unlike conventional classification settings, this four-class scenario simultaneously suffers from inter-class and intra-slide scarcity of minority information. Existing approaches based on CNNs or GNNs primarily emphasize node-level feature learning, making it difficult to capture high-order feature interactions and topological dependencies among cells, while also overlooking the representational insufficiency induced by class scarcity. To address these challenges, we propose a dual-level generative framework that integrates class-prompt priors with high-order structural modeling to enhance the representation capacity of minority classes. At the hypergraph level, we develop a prompt-guided hierarchical hypergraph variational autoencoder (HGVAE) capable of generating diverse and topologically consistent hypergraph representations for minority classes. At the hypernode level, we introduce an anchor-diffusion mixup strategy to enrich the minority node features of high-attention positive anchor nodes. Extensive experiments on the four-class NIMM dataset, as well as TCGA datasets, demonstrate that the proposed framework effectively alleviates feature scarcity and significantly boosts the classification performance of minority classes.
Paperid: 3655,   Poster  
Authors: Arnav Chavan, Nahush Lele, Udbhav Bamba, Sankalp Dayal, Aditi Raghunathan, Deepak Gupta
Title: S2D: Selective Spectral Decay for Quantization Friendly Conditioning of Neural Activations
Abstract: Activation outliers in large-scale transformer models pose a fundamental challenge to model quantization, creating excessively large ranges that cause severe accuracy drops during quantization. We empirically observe that outlier severity intensifies with pre-training scale (e.g., progressing from CLIP to the more extensively trained SigLIP and SigLIP2). Through theoretical analysis as well as empirical correlation studies, we establish the direct link between these activation outliers and dominant singular values of the weights. Building on this insight, we propose Selective Spectral Decay (S^2D), a geometrically-principled conditioning method that surgically regularizes only the weight components corresponding to the largest singular values during fine-tuning. Through extensive experiments, we demonstrate that S^2D significantly reduces activation outliers and produces well-conditioned representations that are inherently quantization-friendly. Models trained with S^2D achieve up to 7% improved PTQ accuracy on ImageNet under W4A4 quantization and 4% gains when combined with QAT. These improvements also generalize across downstream tasks and vision-language models, enabling the scaling of increasingly large and rigorously trained models without sacrificing deployment efficiency.
Paperid: 3656,   Poster  
Authors: Yinan Deng, Kejia Hu, Ye Chen, Jianyu Dou, Jiahui Wang, Jingyu Zhao, Haojia Ao, Yi Yang, Yufeng Yue
Title: Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning
Abstract: Scalable robot learning is hindered by the high cost of acquiring diverse, high-quality embodied data. Existing data generation approaches partially mitigate this issue but typically depend on hard-to-access hardware and labor-intensive manual effort, with limited generalization to diverse scene configurations. To overcome these limitations, we propose Video2Robo, a framework that generates high-quality and diverse robot data directly from a single human demonstration video, enabling seamless deployment on physical robots. At its core, Video2Robo leverages 3D Gaussian Splatting (3DGS) as a powerful scene representation, enabling high-fidelity rendering and explicit 3D scene editing. The framework tracks temporally consistent motion trajectories of task-relevant objects from raw video footage and identifies key task skills, guiding robots to execute tasks in a kinematically plausible manner under novel object arrangements. Furthermore, by augmenting backgrounds, textures, lighting, and camera views, Video2Robo enhances the diversity of generated data. Extensive evaluations in both simulation and real-world environments demonstrate that policies trained on Video2Robo data achieve superior generalization and transfer performance.
Paperid: 3657,   Poster  
Authors: Yudi Xie, Zhongao Zhou, Bin Yang, Zhenghan Chen, Mang Ye
Title: Towards Cross-Modal Preservation, Consistency and Alignment for Privacy-Preserving Visible-Infrared Person Re-Identification
Abstract: Privacy-preserving Person Re-Identification (PP-ReID) addresses the core privacy-utility trade-off in Re-ID by retrieving a person across multiple non-overlapping cameras while applying anonymization techniques to protect sensitive information. However, prior PP-ReID studies are confined to single-modality visible scenarios, whereas 24-hour surveillance systems require robust cross-modal visible-infrared (VI) capabilities. Extending PP-ReID to the cross-modal VI setting is therefore crucial for 24-hour surveillance. Accordingly, we introduce a new task: Privacy-Preserving Visible-Infrared Person Re-Identification (PP-VI-ReID). This task presents two severe challenges: 1) Crude anonymization strategies destroy identity-critical information and disrupt cross-modal alignment. 2) The anonymization process creates inconsistent distortions across modalities: it disrupts color-based textures in visible images while obscuring thermal contours in infrared images. This inconsistency, combined with the modality gap, forms a Mixed Gap. To overcome these challenges, we propose a framework, the Precise Privacy-preserving and Alignment Network (PPA), with two components: 1) A Keypoint-Preserving Regularization (KPR) module leverages human pose as a prior to guide structure-aware anonymization, preserving essential body features. 2) A Differential Consistency-guided Modality Alignment (DCMA) module treats anonymization perturbations not as varying noise, but as a stable, learnable offset, facilitating robust alignment between raw and anonymized features across modalities. Experiments on SYSU-MM01 and RegDB validate our framework, establishing a strong baseline for this task. The source code will be released.
Paperid: 3658,   Poster  
Authors: Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Ekaterina Tolstaya, Sarah Tang, Brandyn White, Benjamin Sapp, Mingxing Tan, Jyh-Jing Hwang, Dragomir Anguelov
Title: Rare-E2E: Rare Events Dataset for End-to-End Driving in Challenging Long-tail Scenarios
Abstract: Vision-based end-to-end (E2E) driving has garnered interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios paired with existing open-loop evaluation metrics that fall short in capturing the multimodal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Rare Events Dataset for End-to-End Driving (Rare-E2E). Rare-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that are rare in daily life, occurring with a frequency of less than 0.03%. Each segment in Rare-E2E includes the high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate E2E driving performance on these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional distance-based metrics, RFS measures how closely a predicted trajectory matches rater-annotated trajectory preference labels. Rare-E2E includes rater preference labels for validation, and a separate held-out test set is used for the 2025 Rare-E2E benchmark leaderboard.
Paperid: 3659,   Poster  
Authors: Yuchen Qin, Yizhi Zhou, Junxiao Wang, Xin Xie, Heng QI
Title: FedCART: Tackling Long-Tailed Distributions in Federated Adversarial Training via Classifier Refinement
Abstract: Growing privacy and security demands in the real world have spurred interest in adversarially robust Federated Learning (FL). While Adversarial Training (AT) is a well-established defense in centralized learning, its extension to the federated setting, known as Federated Adversarial Training (FAT), faces significant challenges due to data heterogeneity across clients. Existing FAT methods have made significant contributions, but they typically assume a balanced global data distribution, an assumption that rarely holds true in practice due to the prevalence of long-tailed distributions. This work first identifies and diagnoses the severe performance degradation of FAT under long-tailed data, attributing it to skewed feature representations and impaired classifier discriminability. To address this, we propose FedCART, a novel FAT framework that decouples the model into a shared feature extractor and a dual-classifier structure. On the client side, a representation-alignment loss enhances adversarial robustness, while gradient-based class prototypes are extracted for classifier calibration. On the server side, models and prototype sets are aggregated to synthesize balanced virtual features, enabling the re-training of an auxiliary classifier to mitigate long-tailed bias. Extensive experiments demonstrate that FedCART significantly improves both accuracy and robustness, outperforming state-of-the-art FAT methods. To the best of our knowledge, this is the first work to systematically investigate and address FAT under long-tailed distributions, representing a significant step toward practical adversarial robustness in FL. Our code will be publicly available upon acceptance.
Paperid: 3660,   Poster  
Authors: Jiaqi Chen, Qinfu Xu, Liyuan Pan
Title: DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions
Abstract: Human Action Recognition (HAR) is a fundamental computer vision task with diverse real-world applications. Practical deployments often involve low-light environments and unconstrained 6-DoF camera motion, conditions that degrade visual quality, disrupt temporal coherence, and compromise reliability of existing methods. Event cameras, with high low-light sensitivity and microsecond-level temporal resolution, paired with an inertial measurement unit (IMU), present a promising solution. However, current research faces two key challenges: absence of a benchmark integrating low-light conditions, 6-DoF motion, and synchronized IMU data; and lack of effective motion compensation techniques. To address these, we propose Event–IMU Stabilized HAR (EIS-HAR), with two modules. The first is an EIS module that reduces motion blur via a non-linear warping function to reconstruct a motion-compensated input. The second is a HAR module with a four-stage hybrid architecture to efficiently extract spatiotemporal features for accurate action recognition. To alleviate data scarcity, we introduce DarkShake-DVS, the first large-scale event-based HAR benchmark that includes 18,041 real-world clips captured in low light and intense 6-DoF motion, supplemented by synchronized IMU data. Extensive experiments on three datasets demonstrate consistent superiority of EIS-HAR over state-of-the-art methods.
Paperid: 3661,   Poster  
Authors: Patrick Rim, Kevin Harris, Braden Copple, Shangchen Han, Xu Xie, Ivan Shugurov, Sizhe An, He Wen, Alex Wong, Tomas Hodan, Kun He
Title: SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild
Abstract: Accurate 3D understanding of human hands and objects during manipulation remains a significant challenge for egocentric computer vision. Existing hand-object interaction datasets are predominantly captured in controlled studio settings, which limits both environmental diversity and the ability of models trained on such data to generalize to real-world scenarios. To address this challenge, we introduce a novel marker-less multi-camera system that allows for nearly unconstrained mobility in genuinely in-the-wild conditions, while still having the ability to generate precise 3D annotations of hands and objects. The capture system consists of a lightweight, back-mounted, multi-camera rig that is synchronized and calibrated with a user-worn VR headset. For 3D ground-truth annotation of hands and objects, we develop an ego-exo tracking pipeline and rigorously evaluate its quality. Finally, we present SHOW3D, the first large-scale dataset with 3D annotations that show hands interacting with objects in diverse real-world environments, including outdoor settings. Our approach significantly reduces the fundamental trade-off between environmental realism and accuracy of 3D annotations, which we validate with experiments on several downstream tasks. The dataset will be publicly released upon acceptance.
Paperid: 3662,   Poster  
Authors: Songyuan Yang, Guijian Tang, Kun Hu, Haotian Wang, Shixuan Liu, Wenjing Yang, Long Lan, Huibin Tan
Title: Unstitching the Chimera: Frame-Level Risk and Train-Free Mitigation for Video Hallucination
Abstract: Hallucination limits the reliability of multimodal large language models (MLLMs), and it is particularly damaging in video where errors manifest as distorted narratives rather than single-frame mistakes. We introduce a frame-first study of Chimera Hallucination: the model stitches visual segments that exist in space and time but do not belong to the same event chain, producing a spurious continuous story. We introduce CH-Risk, a single-forward, reference-free risk estimate tailored to this failure mode. CH-Risk combines two complementary signals: SegCoverage@$\alpha$ (SCR@$\alpha$) measures how many event segments are needed to cover most text-to-frame support, exposing long-range stitching; Alignment with Early Temporal Pathway (AETP) measures rank consistency between support and the temporal pathway formed in early–middle layers, exposing stage mismatch. To turn risk into correction, we further propose CH-M(itigation), a train-free two-stage intervention. Segment-aligned Stage-Aligned Frame Routing (sSAFR) re-weights frames before the mid-layer softmax to route attention toward a small set of pathway-aligned segments. Residual Token Calibration (RTC) then stabilizes token usage within selected segments. Extensive experiments across 9 benchmarks and 6 VideoLLMs show that CH-Risk can predict Chimera and that CH-M consistently reduces hallucination and improves task accuracy with negligible overhead (sub-5% latency, sub-2.5% memory, $\approx$1% FLOPs).
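Editorial note: one plausible reading of the segment-coverage signal described above is "the smallest number of event segments whose summed support reaches a fraction alpha of the total". The sketch below implements that reading only; the abstract does not give the exact definition, and the function name `seg_coverage`, the input format, and the greedy largest-first ordering are all assumptions.

```python
from collections import defaultdict


def seg_coverage(frame_support, frame_segment, alpha=0.9):
    """Count segments needed to cover a fraction alpha of total support.

    frame_support: non-negative text-to-frame support score per frame (assumed input).
    frame_segment: event-segment id per frame, same length as frame_support.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    # Aggregate frame-level support into per-segment mass.
    per_segment = defaultdict(float)
    for score, seg in zip(frame_support, frame_segment):
        per_segment[seg] += score
    total = sum(per_segment.values())
    if total <= 0:
        return 0
    # Add segments from heaviest to lightest until the target mass is reached.
    covered = 0.0
    for count, mass in enumerate(sorted(per_segment.values(), reverse=True), start=1):
        covered += mass
        if covered >= alpha * total - 1e-12:  # small tolerance for float rounding
            return count
    return len(per_segment)


if __name__ == "__main__":
    # Support concentrated in one segment vs. spread over four segments.
    print(seg_coverage([4, 4, 1, 1], [0, 0, 1, 2], alpha=0.8))
    print(seg_coverage([1, 1, 1, 1], [0, 1, 2, 3], alpha=0.75))
```

Under this reading, a low count means the answer is grounded in one or two segments, while a high count suggests support stitched together from many unrelated segments.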
Paperid: 3663,   Poster  
Authors: Wenxue Cui, Hualin Li, Yuhang Qin, Yifu Xu, Xiaopeng Fan, Debin Zhao
Title: Beyond Single Solution: Multi-Hypothesis Deep Unfolding Network for Image Compressive Sensing
Abstract: Recent deep unfolding networks (DUNs) have advanced Compressive Sensing (CS) by effectively integrating iterative optimization with deep learning architectures. However, most CS approaches confine their inference to a single solution space, neglecting the inherent ill-posedness of CS problems, which permits multiple plausible candidate hypotheses. In this paper, a novel Multi-Hypothesis Collaborative Deep Unfolding CS Network (MHC-DUN) is proposed, which explicitly models and leverages multiple hypotheses by jointly optimizing across diverse solution spaces. Specifically, following the Proximal Gradient Descent algorithm, MHC-DUN jointly performs gradient descent and proximal mapping within this multi-hypothesis paradigm. i) For gradient descent, a well-designed AlphaNet is introduced to dynamically predict spatially varying step sizes for all hypotheses, enabling collaborative gradient updates across multiple solutions. ii) For the proximal operator, a sophisticated multi-hypothesis collaborative proximal mapping module is designed, which leverages both intra-hypothesis and inter-hypothesis correlation priors to jointly refine multiple solutions. To enable end-to-end training, a novel composite loss function is designed, which balances measurement fidelity, hypothesis diversity, and reconstruction accuracy, encouraging exploration of complementary solutions while maintaining reconstruction fidelity. Experimental results reveal that the proposed CS method outperforms existing CS networks.
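Editorial note: the single-hypothesis Proximal Gradient Descent iteration that deep unfolding networks typically unroll is sketched below in its textbook form. The symbols are assumptions rather than the paper's notation: \Phi is the sampling matrix, y the measurements, \psi the regularizer, \lambda its weight, and \rho^{(k)} the step size at stage k.

```latex
\min_{x}\ \tfrac{1}{2}\,\lVert \Phi x - y \rVert_2^2 + \lambda\,\psi(x)

r^{(k)} = x^{(k-1)} - \rho^{(k)}\,\Phi^{\top}\bigl(\Phi x^{(k-1)} - y\bigr)

x^{(k)} = \operatorname{prox}_{\lambda \rho^{(k)} \psi}\bigl(r^{(k)}\bigr)
        = \arg\min_{x}\ \tfrac{1}{2}\,\lVert x - r^{(k)} \rVert_2^2 + \lambda \rho^{(k)}\,\psi(x)
```

Per the abstract, MHC-DUN departs from this baseline in two ways: the scalar step size is replaced by spatially varying step sizes predicted by AlphaNet, and the proximal step is replaced by a learned module that refines several hypotheses jointly.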
Paperid: 3664,   Poster  
Authors: Shaoqian Wang, Jiadai Sun, Bosen Hou, Qiang Wang, Bin Fan, Bo Li, Bin Lu, Yuchao Dai
Title: SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors
Abstract: Learning-based Multi-View Stereo (MVS) methods have become the mainstream in the field, relying on the construction of cost volumes through multi-view feature similarity computation and regularization. However, existing methods depend heavily on photometric consistency across views, leading to poor performance in challenging regions, such as weakly textured or non-Lambertian surfaces. To overcome this limitation, we propose SPE-MVS, a novel MVS framework enhanced with Spatial Position Encoding (SPE). The SPE represents the 3D positional information of pixels in each image within a unified metric space, constructed using monocular depth priors. We integrate the SPE alongside image data as input and introduce a Photometric-Spatial Hybrid Feature Extractor, along with an SPE-enhanced cost volume construction module. These components incorporate spatial position-based similarity computation, substantially improving robustness in challenging areas. Furthermore, we propose a Monocular Depth-guided Enhancement (MDGE) module that enhances depth probability distributions using monocular depth priors, thereby further boosting the depth estimation performance. Extensive experiments demonstrate that our method significantly improves reconstruction quality in difficult regions and achieves state-of-the-art (SOTA) performance on multiple benchmarks.
Paperid: 3665,   Poster  
Authors: Jiawei Zhang, Kaizhe Hu, Yingqian Huang, Yuanchen Ju, Zhengrong Xue, Huazhe Xu
Title: AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Affordance Correspondence
Abstract: Despite the recent success of modern imitation learning methods in robot manipulation, their performance is often limited to specific object shapes due to the constrained data diversity. Leveraging powerful 3D generative models and vision foundation models (VFM), the proposed AffordGen framework overcomes this limitation by utilizing the semantic correspondence of meaningful keypoints across large-scale 3D meshes to generate new robot manipulation trajectories. This large-scale, affordance-aware dataset is then used to train a robust, closed-loop visuomotor policy, combining the semantic generalizability of affordances with the reactive robustness of end-to-end learning. Experiments in simulation and the real world show that policies trained with AffordGen achieve high success rates and enable zero-shot generalization to truly unseen objects, significantly improving data efficiency in robot learning.
Paperid: 3666,   Poster  
Authors: Yongshan Zhang, Xiaohuan Lin, Lefei Zhang, Zhihua Cai
Title: Orthogonal Spatial-Aware Multi-View Anchor Graph Clustering for Incomplete Remote Sensing Data
Abstract: Multi-view clustering for remote sensing data has received increasing attention by leveraging diverse data representations to enhance Earth observation. Existing methods are primarily developed under the assumption that each pixel is fully observed across all views. No prior work has investigated the more practical yet challenging scenario where some views suffer from partially missing data. To bridge this gap, this paper presents the first study on clustering incomplete remote sensing data, termed orthogonal spatial-aware multi-view anchor graph clustering (OSMAGC). Specifically, spatial-aware anchors and multi-scale anchor graphs are initially constructed by exploiting the superpixel-based texture characteristics of each view. Based on these, multi-scale anchor graph learning is performed through view weighting and matrix factorization on incomplete data. Structure-aligned consensus feature learning is achieved by aligning the multi-scale graph structures within a shared latent space. To ensure spatial continuity and smoothness, orthogonal spatial-aware regularization is imposed in both horizontal and vertical directions. These three modules are jointly optimized through a well-designed optimization algorithm in a mutually reinforcing manner. Extensive experiments on four benchmark datasets validate the effectiveness and efficiency of our proposed method over the state-of-the-art competitors.
Paperid: 3667,   Poster  
Authors: Dazhong Shen, Jingjing Gu, Qiang Zhou, Meng Zhao, Ying Sun
Title: LiDAR-to-4DRadar Diffusion Bridge via Cross-Modal Alignment and Translation in Latent Space
Abstract: Millimeter-wave radar’s all-weather capability makes it increasingly vital for autonomous perception. However, the high cost of radar data collection drives the need for data generation to augment radar datasets. Existing works mainly target partial radar representations, e.g., 2D or 3D slices, leading to information loss and limited downstream performance. To overcome these issues, we introduce the novel task of LiDAR-to-4DRadar translation, which generates complete 4D radar tensors, with three spatial and one Doppler axes, guided by LiDAR data that preserve spatial and semantic consistency. We propose a novel diffusion bridge model in an aligned LiDAR-4DRadar latent space, namely L2RLDB, to tackle this task. Specifically, first, a key-voxel-aware VAE compresses high-dimensional, noisy radar tensors into a compact latent space, while enabling precise numerical reconstruction and key-voxel identification. Second, to bridge the cross-modal gap between sparse 3D LiDAR and dense 4D radar, we develop a patch-wise contrastive learning module to align LiDAR latents with radar semantically and spatially. Finally, we formulate the translation as a diffusion bridge process between LiDAR and radar latents, enabling the synthesis of full radar tensors from Doppler-lacking LiDAR inputs. Experiments verify that L2RLDB achieves high-fidelity 4D radar generation and significantly improves downstream detection through data augmentation.
Paperid: 3668,   Poster  
Authors: Shuwei Shao, Kejin Zhu, Shixing Ma, Xinzhe Du, Baochang Zhang, Zhe Min
Title: Depth Any Endoscopy: Towards Self-Supervised Generalizable Depth Estimation in Monocular Endoscopy
Abstract: Monocular depth estimation serves as a core technique in endoscopic applications such as 3D reconstruction and localization. However, most existing methods focus primarily on in-domain depth estimation, which limits their robustness and prevents them from delivering impressive cross-domain performance, due to variations in depth distributions, illumination conditions, and texture patterns. In this work, we propose Depth Any Endoscopy (DAE), a novel self-supervised framework for generalizable depth estimation in monocular endoscopy. Specifically, we develop a dual-level Mixture-of-Experts (MoE) adaptation paradigm that effectively tailors Vision Foundation Models to diverse endoscopic procedures, such as laparoscopy and colonoscopy, accounting for the challenges posed by varying environments. Internally, we integrate LoRA and Adapter modules within the MoE architecture, allowing the model to flexibly adapt to the characteristics of input data. Externally, a mixture of domain-specific experts provides customized guidance to enhance the training stability. In addition, we introduce a learnable gradient harmonization mechanism to dynamically balance the optimization between the depth and pose networks, along with a semantic distribution calibration module that strengthens the semantic consistency of depth predictions. Extensive experiments demonstrate that the proposed DAE achieves state-of-the-art performance in both zero-shot and in-domain depth estimation scenarios.
Paperid: 3669,   Poster  
Authors: Yuwen Pan, Yuan Wang, Shaohui Li, Zhi Li, Yu LIU, You He
Title: From Attraction to Equilibrium: Physics-Inspired Semantic Gravitons for Zero-Shot Anomaly Detection
Abstract: Zero-shot anomaly detection (ZSAD) aims to detect unseen anomalies without any abnormal supervision, which is crucial for open-world scenarios where anomalies are diverse and unpredictable. By expressing normal and abnormal concepts in natural language, recent vision–language models such as CLIP enable anomaly reasoning through shared visual–textual embeddings. However, existing approaches rely on coarse prompt fusion, resulting in unstable alignment and inaccurate localization under domain shifts. To overcome these challenges, we propose the Semantic Graviton Network (SGNet), a physics-inspired framework that models multimodal alignment as an adaptive potential field. We introduce semantic gravitons, learnable dynamic mediators that bridge visual and textual modalities by establishing localized semantic equilibria through attraction and equilibrium forces. Within this framework, a graviton interaction network alternately performs text-to-graviton and vision-to-graviton coupling, progressively refining multimodal correspondence and promoting structured semantic binding. Furthermore, an energy-based potential regularization, composed of attraction and equilibrium forces, constrains the evolution of these interactions, ensuring stability and interpretability in the learned representations. Extensive experiments on ten industrial and medical benchmarks demonstrate that SGNet achieves state-of-the-art zero-shot anomaly detection performance.
Paperid: 3670,   Poster  
Authors: Abdullah Azeem, Ruisheng Wang, Qingquan Li, Abubakar Siddique
Title: Prompt-Free Unknown Label Generation for Open World Detection in Remote Sensing
Abstract: Autonomous object detection in remote sensing requires systems that can discover new categories and assign them usable labels during deployment. Existing Open-World Object Detectors identify unknown objects but leave them unnamed until manual annotation. In contrast, Open-Vocabulary Detectors recognize unseen categories only with provided prompts at test time, lacking autonomous discovery or naming. This work presents HSGDet, a detector that achieves both discovery and semantic assignment at test time without external prompts. This method introduces DHGA that navigates a hierarchical semantic graph to perform scene-conditioned coarse-to-fine classification of detected objects. It leverages spatial co-occurrence patterns from surrounding scene context to produce classification confidence scores. High-scoring regions are identified as known objects, while low-scoring regions are flagged as unknown detections. Unknown regions pass to CR2T, which synthesizes text embeddings by fusing visual features, hierarchical parents, and scene context, enabling prompt-free labeling and vocabulary expansion. This approach enables prompt-free semantic labeling and supports autonomous vocabulary expansion without requiring external models. Results demonstrate that HSGDet outperforms state-of-the-art methods by a large margin of 6.6 points in Known mAP and 9.9 points in Unknown Recall. It also reduces Wilderness Impact by 36%, enabling scalable and autonomous aerial monitoring.
Paperid: 3671,   Poster  
Authors: Yating Liu, Zhaoshuai Qi, Yang Zou, Yongnan Yang, Shizhou Zhang, Yanning Zhang
Title: OrienPose: Orientation-Guided Novel View Synthesis for Single-Image Unseen Object Pose Estimation
Abstract: Estimating the 3D pose of unseen objects from a single image remains a fundamental yet challenging problem in computer vision, especially under a CAD model-free setting. Pioneering attempts address this issue by matching templates generated through Novel View Synthesis (NVS), which essentially aims to learn the geometric transformation from a reference to a target view. While promising, these methods can only approximate this transformation under pixel-level supervision, as the starting orientation remains undefined. In the absence of explicit geometric constraints to verify the correctness of the predicted transformation, existing methods often synthesize novel views with geometry-distorted structures or severely blurred local textures, leading to unreliable template matching and suboptimal pose estimation results. To this end, we propose OrienPose, a novel object pose estimation framework via orientation-aware NVS from a single image. Specifically, we introduce the Orientation-Aware Guidance, which explicitly injects object orientation cues into the reference latent embedding to enhance orientation awareness during viewpoint transformation. We also introduce an orientation consistency loss that supervises viewpoint transformation at the geometric level, establishing sufficient supervision for explicit and geometry-consistent transformation guidance beyond pixel-level similarity. This loss justifies estimating the reference orientation rather than using its ground-truth pose, thereby ensuring the alignment of coordinate domains between the injected and supervised priors. Extensive experiments demonstrate that OrienPose achieves state-of-the-art performance in single-view unseen object pose estimation and impressive robustness to image degradations.
Paperid: 3672,   Poster  
Authors: Michal Nazarczuk, Thomas Tanay, Arthur Moreau, Zhensong Zhang, Eduardo Pérez-Pellitero
Title: Charge: A Comprehensive Benchmark and Dataset for Dynamic Novel View Synthesis
Abstract: This paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-quality annotations, and diverse experimental setups, this dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.
Paperid: 3673,   Poster  
Authors: Shiwei Gan, Xiao Liu, Yafeng Yin, Nan Liu, Kuizhuang Liu, Desibieer Tuerdaken, Zhiwei Jiang, Lei Xie, Sanglu Lu, Hongkai Wen
Title: Learning Effective Sign Features without Text for Gloss-free Sign Language Translation
Abstract: Self-supervised learning (SSL) has achieved remarkable success across both NLP and CV domains. However, sign language translation (SLT) models still heavily rely on gloss annotations in gloss-based SLT or text annotations in gloss-free SLT (GFSLT) during pretraining, aiming to ensure that the backbone provides effective sign language (SL) features for the translation model. Such reliance restricts the scalability and generalization ability of the SLT model. One natural question arises: Can existing SSL methods be directly applied to the SL domain to train an effective sign feature extractor for downstream GFSLT tasks, eliminating the need for text annotations? In this paper, we propose a simple yet effective pretraining framework with two goals: (1) decoupling the pretraining process from gloss or text annotations, relying purely on sign frames; and (2) requiring only global frames during inference for simplicity. We show that directly applying existing SSL methods yields suboptimal performance, as SL features involve subtle motion patterns and discriminative cues that are often confined to local regions. To achieve these goals, we introduce SignDINO, a simple yet effective sign-aware DINO training strategy that learns effective and semantically meaningful representations from global frames without any textual supervision. Specifically, a student–teacher architecture is employed, where the teacher model receives the global sign frame, while the student model learns from masked local views that preserve only the hand and facial regions. Such a simple design encourages the model to infer global semantics from discriminative local cues, allowing the teacher model to extract SL-related features during inference solely based on global views. Extensive experiments on public SL datasets show that SignDINO achieves highly competitive performance on the GFSLT task without relying on extra cues or additional SL-related pretraining.
Paperid: 3674,   Poster  
Authors: Ting Zhou, Daoyuan Chen, Qirui Jiao, Bolin Ding, Yaliang Li, Ying Shen
Title: HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks
Abstract: Evaluating the nuanced human-centric video understanding capabilities of Multimodal Large Language Models (MLLMs) remains a great challenge, as existing benchmarks often overlook the intricacies of emotion, behavior, and cross-modal alignment. We introduce HumanVBench, a comprehensive video benchmark designed to rigorously probe these capabilities across 16 fine-grained tasks. A cornerstone of our work is a novel and scalable benchmark construction methodology, featuring two automated pipelines that synthesize high-quality video annotations and challenging multiple-choice questions with minimal human labor. By leveraging state-of-the-art models for annotation and systematically converting model-induced errors into plausible distractors, our framework provides a generalizable "machine" for creating nuanced evaluation suites. Our extensive evaluation of 27 leading MLLMs on HumanVBench reveals critical deficiencies, particularly in perceiving subtle emotions and aligning speech with visual cues, with even top proprietary models falling short of human performance. We open-source HumanVBench and our synthesis pipelines to catalyze the development of more socially intelligent and capable video MLLMs.
Paperid: 3675,   Poster  
Authors: Hanjing Lin, Jiahua Rao, Youhan Sun, Jiancong Xie, Yuedong Yang
Title: Deciphering Genotype-Phenotype Mechanisms from High-Content Profiling via Knowledge-Guided Multi-modal Graph Learning
Abstract: Understanding genotype–phenotype relationships is pivotal for advancing biomedical research, drug discovery, and precision medicine. With the rise of high-throughput cellular imaging, it is essential to tightly integrate high-content cellular morphology with structured biological knowledge to extract cellular-scale evidence for genotype-to-phenotype mapping. However, integrating high-dimensional, heterogeneous, and noisy phenotypes with structured knowledge remains challenging. Prior approaches typically treat phenotypes as node features, overlooking that phenotypes primarily convey cellular-scale relational signals about how perturbations reshape interactions. We present KERNEL, a knowledge-guided multimodal graph learning framework that integrates cellular imaging phenotypes into a unified knowledge graph to predict genotype-phenotype interactions, including GRN inference, drug-target interaction prediction, and subtype-specific subnetwork discovery. KERNEL dynamically augments task-relevant edges from noisy phenotypic signals, explicitly learns per-edge confidence and marginal utility, and uses knowledge gating to align graph topology with mechanistic pathways. Across large-scale imaging and single-cell datasets, KERNEL consistently outperforms state-of-the-art baselines, e.g., up to 38.1% AUPR improvement for GRN inference, while delivering more accurate and interpretable DTI and subtype subnetwork discovery, demonstrating robust mechanism learning from richer, harder-to-denoise phenotypes.
Paperid: 3676,   Poster  
Authors: Yiheng Yu, Sheng Liu, Yuan Feng, Zhelun Jin, Yining Jiang, Min Xu
Title: Focal–General Diffusion Model with Semantic Consistent Guidance for Sign Language Production
Abstract: Sign Language Production (SLP) aims to translate spoken language into sign sequences, where the main challenge lies in generating coherent and natural poses from discrete glosses (G2P). Existing G2P methods typically treat each pose as an indivisible unit, limiting their ability to capture fine-grained joint-level dependencies and thus degrading pose quality. To address this, we propose the Focal–General Diffusion Model (FGDM), characterized by a pioneering two-stage denoising framework that harmonizes local joint-level dependencies and global coherence. Specifically, in the Focal stage, a novel Adaptive Sign GCN (ASGCN) adaptively models each pose based on contextual correlations, skeletal topology, and semantic conditions, ensuring precise generation of local details. In the General stage, a Transformer-based module refines the entire pose sequence to enhance global coherence and naturalness. Moreover, we introduce a Semantic Consistent Guidance (SCG) mechanism that seamlessly integrates semantic supervision into diffusion training, enforcing tighter alignment between generated pose sequences and their intended gloss semantics. Extensive experiments on PHOENIX14T and USTC-CSL demonstrate that FGDM achieves SOTA performance. The source code will be released on GitHub.
Paperid: 3677,   Poster  
Authors: Teng Yan, Yihan Liu, Jiongxu Chen, Teng Wang, Jiaqi LI, Bingzhuo Zhong
Title: AR²-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos
Abstract: Long-term language-guided referring in fixed-view videos is challenging: the referent may be occluded or leave the scene for long intervals and later re-enter, while framewise referring pipelines drift as re-identification (ReID) becomes unreliable. AR²-4FV leverages background stability for long-term referring. An offline Anchor Bank is distilled from static background structures; at inference, the text query is aligned with this bank to produce an Anchor Map that serves as persistent semantic memory when the referent is absent. An anchor-based re-entry prior accelerates re-capture upon return, and a lightweight ReID-Gating mechanism maintains identity continuity using displacement cues in the anchor frame. The system predicts per-frame bounding boxes without assuming the target is visible in the first frame or explicitly modeling appearance variations. AR²-4FV achieves +10.3% Re-Capture Rate (RCR) improvement and –24.2% Re-Capture Latency (RCL) reduction over the best baseline, and ablation studies further confirm the benefits of the Anchor Map, re-entry prior, and ReID-Gating.
Paperid: 3678,   Poster  
Authors: Davide Allegro, Shiyao Li, Stefano Ghidoni, Vincent Lepetit
Title: Catch Me if You Can: Active Mapping of Moving 3D Objects
Abstract: Current 3D mapping pipelines generally assume static environments, which limits their ability to accurately capture and reconstruct moving objects. To address this limitation, we introduce the novel task of active mapping of moving objects, in which a mapping agent must plan its trajectory while compensating for the object's motion. Our approach, Paparazzo, provides a learning-free solution that robustly predicts the target's trajectory and identifies the most informative viewpoints from which to observe it, to plan its own path. We also contribute a comprehensive benchmark designed for this new task. Through extensive experiments, we show that Paparazzo significantly improves 3D reconstruction completeness and accuracy compared to several strong baselines, marking an important step toward dynamic scene understanding.
Paperid: 3679,   Poster  
Authors: Wenyuan Gao, Yutan Wu, Xuming He
Title: CryoKRAQEN: Kernel-Regularized Annealing for Quantized Embedding Networks in Cryo-EM Heterogeneous Reconstruction
Abstract: Heterogeneous reconstruction in cryo-electron microscopy (Cryo-EM) is fundamental for understanding macromolecular structural diversity, yet remains challenging due to extreme noise, continuous conformational changes, and ambiguous image-to-structure mappings. Existing neural approaches often rely on encoder–decoder pipelines or fixed codebooks, which can be computationally demanding or struggle with complex heterogeneity. We propose CryoKRAQEN, a decoder-only framework that integrates triplane implicit representations with kernel-guided latent assignment and quantized embeddings to improve stability and structural discrimination. The method avoids encoder dependencies and mitigates collapse during training, enabling accurate modeling of both conformational and compositional variations. Across diverse Cryo-EM benchmarks, CryoKRAQEN delivers competitive performance, robust reconstructions, and interpretable latent organization compared to state-of-the-art neural and classical methods.
Paperid: 3680,   Poster  
Authors: yushi Huang, Xingtong Ge, RUIHAO GONG, Chengtao Lv, Jun Zhang
Title: LinVideo: A Post-Training Framework towards $\mathcal{O}(n)$ Attention in Efficient Video Generation
Abstract: Video diffusion models (DMs) have enabled high-quality video synthesis, but their computation costs scale quadratically with sequence length due to the nature of self-attention. While linear attention offers a more efficient alternative, fully replacing quadratic attention demands costly pretraining. This is largely because linear attention lacks sufficient expressiveness and struggles with the complex spatiotemporal dynamics inherent to video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose a selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and even inefficiency of existing objectives in optimizing this challenging transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is highly efficient and recovers model performance. Extensive experiments show that LinVideo achieves a 1.43–1.71× speedup while preserving generation quality, and the 4-step distilled models further reduce latency by 15.9–20.9× with only a minor drop in visual quality.
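The linear-versus-quadratic cost contrast mentioned in this abstract can be illustrated with a minimal sketch of generic kernelized linear attention. It is not LinVideo's implementation: the `elu(x) + 1` feature map and the function names `phi`, `linear_attention`, and `quadratic_reference` are assumptions chosen for illustration. The linear form accumulates key-value statistics once and reuses them for every query, while the reference builds the full n-by-n weight matrix; both return the same result.

```python
import math

def phi(x):
    # Positive feature map elu(x) + 1 (an assumed choice, common in linear-attention work).
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(Q, K, V):
    """O(n) in sequence length: one pass over keys/values, one pass over queries."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # S = sum_j phi(k_j) v_j^T
    z = [0.0] * d                       # z = sum_j phi(k_j)
    for k, v in zip(K, V):
        fk = [phi(x) for x in k]
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        norm = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / norm for b in range(dv)])
    return out

def quadratic_reference(Q, K, V):
    """Same attention computed the O(n^2) way, with explicit pairwise weights."""
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        w = [sum(a * phi(b) for a, b in zip(fq, k)) for k in K]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out
```

The two functions agree because matrix products are associative: (phi(Q) phi(K)^T) V equals phi(Q) (phi(K)^T V), and only the second grouping avoids the n-by-n intermediate.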
Paperid: 3681,   Poster  
Authors: jianqiang xu, Gensheng Pei, 刘华峰 Liu, Yazhou Yao
Title: GSV2X: Geometry-Aware Uncertainty Modeling and Orthogonal Fusion for Robust Roadside Perception
Abstract: Reliable 3D perception from multi-view roadside sensors hinges on the robust fusion of camera and LiDAR data, a task complicated by geometric misalignments and sensor calibration errors. This paper presents GSV2X, a fusion framework that tackles these challenges through two core contributions. First, to achieve robustness against spatial uncertainty, we lift 2D image features into a unified Bird's-Eye-View (BEV) space by representing them as 3D Gaussian distributions. By incorporating learnable perturbations guided by camera geometry, our model explicitly accounts for potential calibration inaccuracies. Second, to maximize the synergy between modalities, we propose a new orthogonal fusion module. This module employs constrained attention to enforce orthogonality between camera and LiDAR features, effectively disentangling redundant information and promoting the learning of complementary representations. Extensive experiments on the challenging RCooper dataset demonstrate that GSV2X sets a new state-of-the-art in multi-view roadside perception and exhibits remarkable robustness in complex, real-world scenarios.
Paperid: 3682,   Poster  
Authors: Haoyue Liu, Jinghan Xu, Luxin Feng, Hanyu Zhou, Haozhi Zhao, Yi Chang, Luxin Yan
Title: NEC-Diff: Noise-Robust Event–RAW Complementary Diffusion for Seeing Motion in Extreme Darkness
Abstract: High-quality imaging of dynamic scenes in extremely low-light conditions is highly challenging. Photon scarcity induces severe noise and texture loss, causing significant image degradation. Event cameras, featuring a high dynamic range (120 dB) and high sensitivity to motion, serve as powerful complements to conventional cameras by offering crucial cues for preserving subtle textures. However, most existing approaches emphasize texture recovery from events, while paying little attention to image noise or the intrinsic noise of events themselves, which ultimately hinders accurate pixel reconstruction under photon-starved conditions. In this work, we propose NEC-Diff, a novel diffusion-based event–RAW hybrid imaging framework that extracts reliable information from heavily noisy signals to reconstruct fine scene structures. The framework is driven by two key insights: (1) combining the linear light-response property of RAW images with the brightness-change nature of events to establish a physics-driven constraint for robust dual-modal denoising; and (2) dynamically estimating the SNR of both modalities based on denoising results to guide adaptive feature fusion, thereby injecting reliable cues into the diffusion process for high-fidelity visual reconstruction. Furthermore, we construct the REAL (Raw and Event Acquired in Low-light) dataset which provides 47,800 pixel-aligned low-light RAW images, events, and high-quality references under 0.001–0.8 lux illumination. Extensive experiments demonstrate the superiority of NEC-Diff under extreme darkness.
Paperid: 3683,   Poster  
Authors: Yunlong Tang, Chao Huang, Susan Liang, Jing Bi, Yicheng Wang, Daiki Shimada, Chenliang Xu
Title: Asynchronous Temporal Modeling with Two-Agent Framework for Streaming Dense Video Captioning
Abstract: Streaming dense video captioning requires real-time processing of continuous visual input while determining precisely when and what to caption. Current approaches primarily focus on designing complex external memory mechanisms, failing to leverage Large Multimodal Models' (LMMs) inherent long-context capabilities. Moreover, existing methods employing threshold-based caption triggering face a severe Threshold-Gated Discrepancy (TGD) problem, a training-inference mismatch arising from data imbalance, where models predominantly predict silence tokens, requiring thresholds that vary drastically across videos with extremely narrow effective ranges. We introduce Takusen, an asynchronous temporal modeling two-agent framework comprising a Small Multimodal Model (SMM) as an Oracle agent and an LMM as a Listener agent. The Oracle agent processes sparse video inputs at an accelerated rate to detect event boundaries, while the Listener agent processes dense inputs to generate accurate captions when prompted by the Oracle's signals. This architecture eliminates threshold dependencies by fundamentally changing how silence/generation decisions are made, resolving the TGD problem. To enhance robustness against boundary prediction instabilities, we integrate uniformly distributed fixed decoding points with Oracle-predicted boundaries. Experiments on ActivityNet Captions and YouCook2 datasets demonstrate that Takusen achieves state-of-the-art performance with a simpler and more efficient design that balances temporal sensitivity with descriptive accuracy.
Paperid: 3684,   Poster  
Authors: Houji Wen, Jiangyong Yu, Dawei Yang, Jun Li
Title: CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model
Abstract: Segment Anything Models (SAMs) are extensively used in computer vision for universal image segmentation, but deploying them on resource-constrained devices is challenging due to their high computational and memory demands. Post-Training Quantization (PTQ) is a widely used technique for model compression and acceleration. However, existing PTQ methods fail to consider the cross-attention architecture in the SAM decoder. The resulting degradation primarily stems from the unique challenges posed by SAMs: (1) Attention dissipation, where the attention information in the decoder, which is crucial for representing segmentation masks, collapses into a diffuse and non-semantic form under low-bit quantization; and (2) Reconstruction oscillation, where bidirectional coupling within the two-way transformer introduces cross-branch error interference and destabilizes convergence. To tackle these issues, we propose CAR-SAM, a unified quantization framework tailored for SAMs. Firstly, to mitigate attention dissipation, we introduce a MatMul-Aware Compensation (MAC) mechanism that transfers activation-induced quantization errors from MatMul to preceding linear weights. Secondly, to mitigate oscillation in decoder optimization, we develop a Joint Cross-Attention Reconstruction (JCAR) strategy that jointly reconstructs coupled attention branches, suppressing oscillatory behavior and promoting stable convergence. Extensive experiments show that CAR-SAM robustly quantizes SAM models down to 4-bit precision, surpassing existing methods by 14.6% and 6.6% mAP on SAM-B and SAM-L respectively.
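For readers unfamiliar with low-bit quantization, the following is a minimal sketch of plain min-max uniform quantization at 4 bits, the kind of baseline rounding that PTQ methods build on. It does not implement MAC or JCAR; the function name `quantize_uniform` and the per-tensor min-max calibration are assumptions made for illustration.

```python
def quantize_uniform(values, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] and back (min-max calibration)."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    # Step size between adjacent quantization levels; guard against a constant input.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    dequantized = [lo + c * scale for c in codes]
    return codes, dequantized, scale
```

Each value lands on the nearest of 16 levels, so the round-trip error is at most half a step (`scale / 2`). With so few levels, small but meaningful differences between values can vanish, which is why low-bit settings are hard for sensitive quantities such as attention scores.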
Paperid: 3685,   Poster  
Authors: Runze Liu, Zeyue Wang, Fanghui Sun, Rui Liu, Yihan Yan, Shen Wang, Zhaoyang Zhang
Title: AdvFM: Lookahead Flow-Matching Velocity-Field Attacks for Imperceptible and Transferable Adversarial Examples
Abstract: Unrestricted adversarial attacks based on generative models typically operate either directly in image space or through diffusion-style denoising and re-noising, which limits transferability and robustness against defenses. We revisit this problem through the lens of flow matching and continuous-time velocity fields, and propose AdvFM, a velocity-field attack that injects adversarial signals into the flow-matching dynamics instead of the pixel space. Given a noisy state x_t, AdvFM perturbs the reconstruction at t=1 and converts this perturbation into a change of the velocity field, yielding a state update that amplifies the inner PGD step in the noisy space. We further introduce a lookahead variant that optimizes a two-point objective over the current and rolled-out reconstructions, reducing temporal mismatch along the ODE trajectory. From a theoretical perspective, we show that compared to diffusion-based attacks, AdvFM enjoys: (i) larger single-step increases in the black-box loss via step amplification, (ii) reduced gradient variance and stronger surrogate-target alignment due to Gaussian smoothing, enhancing its transferability, and (iii) perturbations that concentrate in robust-tangent directions, thereby aligning with robust gradients of adversarially trained models and surviving purification more effectively; the lookahead variant further lowers gradient noise for a two-point robust objective. Extensive experiments demonstrate that AdvFM achieves promising performance in both black-box transferability and a suite of adversarial training and purification defenses.
Paperid: 3686,   Poster  
Authors: Chengyang Liu, Zixuan Lin, Miaolin Han, Michael Ng, huibin Li
Title: Distilling Quasi-Conformal Mapping: A Generalizable and Efficient Solution for Wide-Angle Correction
Abstract: This paper introduces a novel framework for wide-angle correction by distilling the principles of quasi-conformal mapping into an efficient and generalizable deep neural network. Our methodology can be divided into two primary stages. In the first stage, the wide-angle distortion correction problem is treated as a quasi-conformal mapping from the distorted image to the target image. In particular, we minimize the Beltrami smoothness energy with constraints of both line structures and human body regions. The Beltrami coefficient is subsequently estimated using the Proximal Gradient Descent algorithm. This alternating optimization yields the final quasi-conformal mapping and the corresponding corrected image. In the second stage, the Quasi-conformal-mapping Distilled Wide-angle Correction Network (QDWC-Net) is proposed, which is built upon an encoder-decoder followed by a soft-argmin regression output head and associated loss functions, and is trained on these corrected images to predict the correction flow directly from a distorted input. Extensive quantitative and qualitative experiments demonstrate the superior effectiveness and efficiency of our distilled approach, which achieves state-of-the-art correction results, especially in mitigating distortion in both portrait and human body regions.
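Note on the soft-argmin regression output head mentioned in the abstract above: the following is a minimal sketch of the generic soft-argmin operation only, assuming the common definition as the expected index under softmax(-beta * cost). The function name, the `beta` temperature, and the 1D cost list are illustrative assumptions; the abstract does not specify how QDWC-Net parameterizes its head.

```python
import math


def soft_argmin(costs, beta=1.0):
    """Differentiable surrogate for argmin: expected index under softmax(-beta * cost).

    Assumption: `costs` is a 1D list of per-candidate costs; `beta` is a
    hypothetical temperature (larger values approach the hard argmin).
    """
    # Shift by the minimum cost so the exponentials cannot overflow.
    c_min = min(costs)
    weights = [math.exp(-beta * (c - c_min)) for c in costs]
    total = sum(weights)
    # Weighted average of candidate indices gives a sub-index, continuous estimate.
    return sum(i * w for i, w in enumerate(weights)) / total
```

Because the output is a weighted average of indices rather than a discrete choice, it can regress continuous values, which is presumably why such heads are used for flow-like outputs.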
Paperid: 3687,   Poster  
Authors: Ziliang Chen, Tianang Xiao, jusheng zhang, Yongsen Zheng, Yang Liu, Zhao-Rong Lai, Liang Lin
Title: A Causal Marriage between VLM and IRM from Understanding to Reasoning
Abstract: Vision-Language Models (VLMs) like CLIP exhibit extraordinary out-of-distribution (OOD) generalization, while the theoretical foundations underlying this robustness remain largely unexplored. This work establishes a connection between CLIP and Invariant Risk Minimization (IRM), the principled paradigm to overcome OOD problems, through token-level causal representation learning. Our key insight is that CLIP's contrastive objective, when optimally trained, recovers modality-invariant causal factors at the word-and-phrase granularity. By decomposing text prompts into class-specific tokens (causal factors) and class-agnostic context tokens (environmental factors), we prove that a vocabulary-constrained InfoNCE objective becomes formally equivalent to IRM's invariance criterion. Grounded in this equivalence, we propose a mid-training paradigm aiming to inject invariant learning signals into pre-trained CLIP without architectural modification, yielding CLIP-IRM with superior OOD performance. We further extend this causal alignment to multimodal reasoning by using CLIP-IRM's invariant alignment scores as process-level rewards in reinforcement learning, effectively transplanting IRM's guarantees to robust sequential decision-making in Multimodal Large Language Models. Extensive experiments validate our theoretical framework and present substantial improvements in both multimodal OOD understanding and reasoning tasks.
Paperid: 3688,   Poster  
Authors: Alessio Mazzucchelli, María Naranjo Almeida, Jorge Bustos Sanchez, Mariella Dimiccoli, Francesc Moreno-Noguer, Jordi Sanchez-Riera, Adrian Penate-Sanchez
Title: BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction
Abstract: Most Gaussian Splatting techniques that provide a 3D semantic representation of the scene don't optimize the underlying 3D geometry of the scene. This makes object-level editing or asset extraction challenging. Recent methods, like COBGS, Trace3D, and ObjectGS, acknowledge this limitation and propose approaches that modify the geometry of the scene to represent the underlying semantics. We go a step further and propose a novel solution that provides near-perfect boundaries in object extraction. We do so by introducing two new losses in the optimization that take care of (1) modifying the geometry of visible Gaussians to respect semantic boundaries, and (2) modifying the geometry of non-visible Gaussians that appear once the object is extracted. Our first loss propagates gradients directly through the rasterization to allow for seamless integration within the optimization of the Gaussian parameters. Our second loss also propagates gradients to the Gaussian parameters, but does so without passing through the rasterization. This allows it to modify the geometry of the scene even if little transmittance reaches a Gaussian (partially visible or non-visible). Exhaustive comparisons to 12 state-of-the-art methods over 4 datasets, using six metrics, demonstrate that our approach produces overall the best boundary segmentation to date.
Paperid: 3689,   Poster  
Authors: Zhenbin Wang, Lei Zhang, Lituan Wang, Zhenwei Zhang, Guangwu Qian, Yan Wang, Wei Huang
Title: Energy Waveify and Redistribution for Test-Time Adaptation: A Control System Perspective
Abstract: This work tackles a key challenge in test-time energy adaptation: prohibitive time overhead arising from recent state-of-the-art test-time adaptation (TTA) methods, which are built on energy models relying on iterative Monte Carlo or Langevin dynamics sampling with multiple stochastic updates per test instance to approximate energy gradients. We tackle the problem from an innovative control system perspective by i) describing the energy as a complex-valued wave, where the amplitude encodes energy uncertainty and the phase characterizes its evolution, and ii) maintaining a time-dependent wave equation that interprets TTA as a control system evolution process. By enforcing the control system law of probability current conservation, our method directs probability current away from high-energy (error-prone) regions toward low-energy (accurate) ones, achieving adaptive energy redistribution without additional stochastic sampling while preserving the overall normalization of the energy landscape. Experimentally, the proposed method significantly outperforms baseline methods across several public benchmark datasets, with adaptation time being only 1/3 to 1/7 of that required by the compared Top-1 to Top-3 baselines.
Paperid: 3690,   Poster  
Authors: jiang shao, Xinbo Zhao, Wenyin Tuo, XiaoChun Zou
Title: Prototypical Action Reasoning Facilitated by Vision-Language Alignment for Egocentric Action Anticipation
Abstract: Egocentric Action Anticipation aims to infer future actions from videos, which is crucial for embodied AI systems. However, its advancement is hindered by the inherent stochasticity of the future, which introduces significant prediction uncertainty. Prevailing methods typically adopt an end-to-end approach to model holistic spatiotemporal contexts, yet they often lack explicit semantic reasoning capabilities, making it difficult to handle open-ended future uncertainties. To address these challenges, we propose a Prototypical Action Reasoning Framework Facilitated by Vision-Language Alignment (PAR-VLA), which leverages the semantic alignment capability of vision-language models to learn disentangled visual prototypes for verbs and nouns. These prototypes serve as robust semantic anchors, transforming the unconstrained temporal prediction problem into a conditional forecasting task guided by well-defined semantic concepts. Our multi-stage framework first extracts visually-grounded and text-aligned prototype groups from a VLM, learning multiple prototypes per category to capture intra-class diversity. Subsequently, a novel Prototypical Action Reasoning-guided Verb-Noun Encoding branch dynamically retrieves the most relevant verb and noun concepts based on visual observations and explicitly models their interactions to guide temporal anticipation. Furthermore, we introduce Dual-Stream Symbiotic Predictive Decoders to more finely capture the interdependencies between verbs and nouns during the prediction process. Experimental results demonstrate that PAR achieves state-of-the-art performance and exhibits a strong capability in dealing with future uncertainty.
Paperid: 3691,   Poster  
Authors: Tze Ho Elden Tse, Jizong Peng, Angela Yao
Title: Learning Scene Coordinate Reconstruction from Unposed Images via Pose Graph Optimization
Abstract: Learning-based structure-from-motion methods such as ACE-Zero have demonstrated strong performance in estimating camera poses and scene coordinates from unordered image collections without requiring ground truth supervision. However, the lack of global and multi-view consistency constraints in ACE-Zero can lead to pose drift and misalignment, particularly in complex or ambiguous scenes. In this work, we propose a hybrid framework that integrates pose graph optimization (PGO) into ACE-Zero to refine camera poses and suppress incorrect refinements. We construct pose graphs directly from ACE-Zero outputs by extracting relative pose constraints from predicted scene coordinates. Furthermore, we introduce an uncertainty-aware optimization strategy by estimating confidence scores using geometric priors, including epipolar and optical flow consistencies across views. Our approach improves the robustness and accuracy of pose estimation, demonstrating that global geometric reasoning can effectively complement learning-based inference in structure-from-motion.
Paperid: 3692,   Poster  
Authors: GwangWook Park, Hyo-Jun Lee, Jong-Hyeon Baek, Hanul Kim, Yeong Jun Koh
Title: EG-3DVG: Expression and Geometry Aware Grounding Decoder for 3D Visual Grounding
Abstract: Despite recent progress in 3D visual grounding, existing methods still struggle with three core challenges: 1) cross-modal misalignment that prevents textual cues from being reliably delivered to visual representations, 2) intra-class confusion arising from insufficient understanding of fine-grained expression cues, and 3) geometric reasoning errors caused by inaccurate aggregation of spatially relevant visual features. We propose EG-3DVG, a unified framework that addresses these issues through an expression and geometry aware grounding decoder. The decoder integrates two complementary attention modules: position-guided expression cross-attention (PECA) for reliable text–vision alignment and geometry-aware masked attention (GMA) for selective aggregation of geometry-consistent visual cues. To further distinguish semantically similar instances, we introduce expression-aware contrastive learning (ECL), which strengthens the alignment between the target object token and expression-relevant words. Extensive experiments on ScanRefer and SR3D/NR3D demonstrate that EG-3DVG achieves state-of-the-art performance in both 3D bounding box localization and mask prediction, validating the effectiveness of our geometry- and expression-aware design.
Paperid: 3693,   Poster  
Authors: Joonmyung Choi, Sanghyeok Lee, Jongha Kim, Sehyung Kim, Dohwan Ko, Jihyung Kil, Hyunwoo J. Kim
Title: DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning
Abstract: Recent advances in vision–language models have shown strong performance across diverse multimodal tasks, including document question answering that leverages structured visual cues from text, tables, and figures. However, unlike natural images, document images contain large backgrounds and only sparse supporting evidence, leading to the waste of substantial computational resources, especially for long documents. We observe that existing token reduction methods for natural images and videos fall short in utilizing the structural sparsity unique to documents. To address this, we propose DOCPRUNE, a training-free document token pruning framework designed for efficient long document understanding. The proposed method preserves only the essential tokens for the task while removing unnecessary ones, such as background or question-irrelevant tokens. Moreover, it automatically selects the appropriate layers to initiate token pruning based on the model’s level of comprehension. Our experiments on the M3DocRAG benchmark show that DOCPRUNE improves throughput by 3.0× and 3.3× in the encoder and decoder, respectively, while boosting the F1 score by +1.0, achieving both higher accuracy and efficiency without any additional training.
Paperid: 3694,   Poster  
Authors: Ziyue Lin, Jiahe Hou, Xia Hongyu, Xinrui Xie, Feifei Wang, Yuyin Zhou, Wei Wang, Jiawei Liu, Liangqiong Qu
Title: Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation
Abstract: We propose Decoupled Residual Denoising Diffusion models (DRDD) for unified and data-efficient image-to-image (I2I) translation. While diffusion models have advanced I2I translation in terms of quality and diversity, we uncover a previously under-explored property in diffusion models. Crucially, beyond its conventional role of manifold lifting (i.e., moving data off low-dimensional manifolds), injecting Gaussian noise facilitates domain harmonization by implicitly aligning feature distributions across domains, a property particularly advantageous for unified I2I translation. However, existing diffusion models prematurely erode this harmonization effect, as noise and residuals are simultaneously removed in a single coupled diffusion process. To address this, DRDD decouples the diffusion process into two sequential and independent diffusion stages: (1) a stochastic noise diffusion for domain harmonization and manifold lifting, and (2) a deterministic residual diffusion that learns the core semantic mapping entirely within the fixed-noise domain. This decoupling preserves harmonization and manifold lifting effects throughout the transformation, substantially simplifying the learning of unified mappings across diverse tasks and domains. Notably, the noise diffusion stage is trained exclusively on abundant, unpaired target-domain images, greatly improving data efficiency. Comprehensive theoretical and empirical analysis demonstrates that DRDD is broadly compatible with mainstream diffusion models and consistently delivers robust, unified I2I translation, even under limited paired data. Code is released to promote further research.
Paperid: 3695,   Poster  
Authors: Shuohao Shi, Qiang Fang, Xin Xu
Title: VLM4RSDet: Collaborative Optimization with Vision-Language Model for Enhancing Remote Sensing Object Detection
Abstract: Closed-set object detection in remote sensing imagery has made significant progress, but achieving high detection accuracy remains challenging. Vision-Language Models (VLMs), which possess rich prior knowledge, offer a promising solution to this challenge. However, most existing VLMs are designed for open-vocabulary tasks and exhibit inherent limitations when directly applied to closed-set scenarios, such as notable accuracy degradation and high deployment costs. To address these issues, we propose VLM4RSDet, a novel collaborative training framework that leverages a vision-language model to enhance the performance of conventional closed-set remote sensing object detectors. Notably, during inference, VLM4RSDet only retains the standard object detection architecture, thus avoiding any additional deployment overhead. Furthermore, we introduce a Global–Local Cross-Attention (GLCA) module and a Learnable Hierarchical Prediction Strategy (LHPS) to further improve collaborative training performance. Extensive experiments on five benchmark datasets demonstrate the effectiveness and robustness of our approach. In particular, our method outperforms the state-of-the-art by 7.5% in mAP_0.5:0.95 on the VisDrone2019 dataset. Our code will be made publicly available.
Paperid: 3696,   Poster  
Authors: Linfei Li, Lin Zhang, Ying Shen
Title: RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
Abstract: Visual-language grounding aims to establish semantic correspondences between natural language and visual entities, enabling models to accurately identify and localize target objects based on textual instructions. Existing VLG approaches focus on coarse-grained, object-level localization, while traditional robotic grasping methods rely predominantly on geometric cues and lack language guidance, which limits their applicability in language-driven manipulation scenarios. To address these limitations, we propose the RealVLG framework, which integrates the RealVLG-11B dataset and the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks. The RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, over 800 object instances, 1.3 million segmentation, detection, and language annotations, and roughly 11 billion grasping examples. Building on this dataset, RealVLG-R1 employs Reinforcement Fine-tuning on pretrained large-scale vision-language models to predict bounding boxes, segmentation masks, grasp poses, and contact points in an end-to-end manner from natural language instructions. Experimental results demonstrate that RealVLG supports zero-shot perception and manipulation in real-world unseen environments, establishing a unified semantic-visual multimodal benchmark that provides a comprehensive data and evaluation platform for language-driven robotic perception and grasping policy learning. All data and code will be publicly released.
Paperid: 3697,   Poster  
Authors: Zijian Gao, Zicheng Sun, Xingxing Zhang, Kele Xu, Huaimin Wang
Title: Re-evaluating Continual VQA: Toward Fair and Robust Evaluation for Multimodal Continual Learning
Abstract: Continual Visual Question Answering (Continual VQA) poses unique challenges for multimodal continual learning, requiring models to incrementally acquire new knowledge while preserving visual–semantic grounding across tasks. However, existing benchmarks hinder fair and robust evaluation of such capabilities, as they allow models to exploit dataset biases rather than demonstrate genuine continual reasoning. We identify two structural flaws in current benchmark design. First, shared answer vocabularies across tasks encourage answer memorization, inflating performance and underestimating forgetting. Second, static answer priors within each task make the training and test answer distributions nearly identical, obscuring robustness under distribution shifts. To address these issues, we introduce UCo-VQA, an Unbiased benchmark suite that enforces token-level disjoint answer spaces across tasks and introduces intra-task train–test distribution shifts, enabling fairer assessment of forgetting and generalization in multimodal continual learning. We further provide a parameter-efficient baseline that mitigates forgetting and enhances grounding through question-only replay and dual-level distillation, offering a lightweight and memory-efficient framework for continual adaptation. Extensive experiments on UCo-VQA reveal that prior methods substantially overestimate performance under biased setups, while our approach achieves state-of-the-art results, improving robustness and retention by up to 4.18% and 2.21%, respectively.
Paperid: 3698,   Poster  
Authors: Qinfu Xu, Liyuan Pan, Yiwei Wei, Shaozu Yuan, Jiaqi Chen, Tianyu Liu
Title: EmoThinker: Advancing Visual-Acoustic Emotion Analysis via Structural Token Selection and Chain-of-Thought Reasoning
Abstract: Multimodal Emotion Analysis (MEA) is crucial for human-centric AI, yet current methods struggle with two core challenges: the sparse nature of emotional cues across modalities and their inherent temporal asynchrony. Existing approaches, which often rely on implicit fusion, consequently suffer from diluted salient features and entangled representations. To address these issues, we propose EmoThinker, a new framework that advances MEA through explicit, structured reasoning. Our method introduces a structural token selection mechanism to concentrate on pivotal facial regions while refining background context, enhancing visual saliency and efficiency. For audio, an audio evidence extractor aggregates critical paralinguistic features into compact, emotion-rich tokens. More importantly, we enable step-by-step reasoning by constructing a Chain-of-Emotion-Thought dataset, which provides fine-grained annotations for disentangling asynchronous cues and resolving inter-modal conflicts. By decoupling evidence acquisition from reasoning, EmoThinker achieves a more interpretable and robust emotion analysis. Extensive experiments on multiple benchmarks demonstrate that our framework achieves new state-of-the-art performance.
Paperid: 3699,   Poster  
Authors: Yifang Xu, Jiahao Cui, Zhihao Zhu, Hanlin Shang, Shan Luan, Mingwang Xu, Feipeng Cai, Neng Zhang, Yaoyi Li, Jia Cai, Siyu Zhu
Title: DFM-Drive: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving
Abstract: We introduce DFM-Drive, a vision–language–action (VLA) model that casts ego-trajectory planning as discrete flow matching over a structured token space. In contrast to autoregressive decoders, DFM-Drive performs fully parallel, bidirectional denoising, enabling coarse-to-fine refinement with a tunable compute–accuracy trade-off. Specifically, the approach combines a metric-aligned numerical tokenizer that preserves scalar geometry via triplet-margin learning, a geometry-aware flow objective, and a simulator-guided GRPO alignment that integrates safety, ego progress, and comfort rewards while retaining parallel generation. A multi-stage adaptation converts a pre-trained auto-regressive backbone (Janus-1.5B) from causal decoding to a non-causal flow model and strengthens road-scene competence through continued multimodal pretraining. Thanks to the inherent nature of consistency model training and parallel decoding inference, DFM-Drive achieves superior closed-loop performance against autoregressive and diffusion-based VLA baselines, with 1-step inference attaining 88.7 PDMS and 5-step inference reaching 90.3 PDMS on the NAVSIM v1 benchmark. These results establish discrete flow matching as a promising new paradigm for end-to-end autonomous driving.
Paperid: 3700,   Poster  
Authors: Tianyou Bai, Yan-Ming Zhang, Zixiang Zhang, Jibin Zhou, Fei Yin, Cheng-Lin Liu
Title: End-to-End Hyper-Relational Information Extraction for Engineering Diagrams via Dynamically Tokenized Relation Transformer
Abstract: Engineering diagrams are the core carriers of technical information in industrial contexts, where the pressing demand for their digitization from industrial sectors has driven great advancements in related research domains. However, existing research still suffers from three limitations. Firstly, the detection of symbols, lines, and texts typically involves multiple independent models, resulting in cumbersome workflows. In addition, high-resolution diagrams often impose an excessive computational cost on existing models. Moreover, parsing frameworks solely based on object detection can merely localize component positions, yet fail to capture the topological connection semantics and structured knowledge among components, thus offering limited convenience for industrial applications. To address these issues, we propose an end-to-end information extraction framework based on the Dynamically Tokenized Relation Transformer (DTRT), which can dynamically reduce received image tokens, filter redundant information, and efficiently extract structural knowledge to construct hyper-relational knowledge graphs. We evaluate our model on piping and instrumentation diagrams (P&IDs) and electrical diagrams (EDs): the former are widely used in chemical engineering enterprises, while the latter are employed to describe circuit systems. DTRT achieves an R@1000 accuracy of 94.84% on P&IDs and an R@200 accuracy of 92.52% on EDs with a significantly reduced computational cost.
Paperid: 3701,   Poster  
Authors: Ze Liu, Kai Zhang, Xianquan Wang, Shuochen Liu, Jiaxian Yan, Yupeng Han, Qi Liu
Title: From Pixel to Precision: Enhancing Handwritten Mathematical Expression Recognition with Image-Level Reward
Abstract: Handwritten mathematical expression recognition is hindered by a fundamental misalignment between the dual representations of LaTeX formulas: the symbolic text and the rendered visual image. This discrepancy means that textually distinct LaTeX sequences can produce visually identical outputs, while minor textual errors can cause catastrophic rendering failures. As a result, text-level reward mechanisms cannot perfectly assess the quality of model predictions, failing to effectively guide the model towards optimal performance during training. To overcome this limitation, we introduce the Image Matching Score (IMS), a lightweight yet effective reward based on the structural edit distance of column-wise image projections, which robustly quantifies the visual fidelity between rendered formulas. Leveraging IMS, we then propose Image-Matching driven Policy Optimization (IMPO), a training framework built upon Group Relative Policy Optimization (GRPO). This approach facilitates stable policy learning directly from our sequence-level visual reward, notably without the need for a separate value function network. Extensive experiments demonstrate that IMPO yields consistent performance gains across various backbone models on the challenging CROHME, HME100K, and M^2E datasets. Our model-agnostic framework establishes new state-of-the-art results, improving the Expression Recognition Rate by an average of 1.1% and up to 1.37% over strong prior methods. The code can be found in the supplementary materials.
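Note on why GRPO needs no separate value function network, as the abstract above states: a minimal sketch of the group-relative advantage commonly used in GRPO, assuming each sampled prediction for the same input receives one scalar sequence-level reward. The function name, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, and the example rewards are made up; this is not the IMPO implementation and does not compute IMS.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same input.

    Assumption: `rewards` holds one scalar reward per sampled output. The group
    mean acts as the baseline, so no learned value network is required.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled LaTeX predictions of one formula image.
advantages = group_relative_advantages([0.2, 0.9, 0.5, 0.4])
```

Samples that score above the group mean get positive advantages and are reinforced; those below are discouraged.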
Paperid: 3702,   Poster  
Authors: Linkang Xu, Gang Li, Yue Song, Xiangxin Ji
Title: Hilbert Curve-Based Attention Enabling Topology-Preserving Image Tensor Representation for Semantic Segmentation Network
Abstract: Drone-based building defect segmentation remains challenging due to complex surface textures and illumination variations. We propose TPSegformer, a topology-preserving segmentation framework that mitigates mis-segmentation in such scenarios. Its decoder incorporates a Hilbert curve–based topology-preserving mechanism to maintain spatial continuity and boundary precision during category layer computation. A lightweight multi-scale fusion module enhances semantic representation, while global context modeling strengthens holistic perception. Experiments on the building defect dataset show that TPSegformer outperforms existing segmentation methods, achieving 80.77% mIoU and 90.22% Acc. On the Dacl10k dataset, it maintains strong generalization, reaching 44.27% mIoU and 60.32% Acc across diverse materials and defect types.
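Note on the Hilbert curve referenced in the abstract above: a minimal sketch of the standard index-to-coordinate mapping for a Hilbert curve on an n x n grid, assuming n is a power of two. It illustrates only the generic locality property (consecutive indices are always spatially adjacent cells); the function name is hypothetical, and how TPSegformer uses the curve inside its decoder is not described in the abstract.

```python
def hilbert_d2xy(n, d):
    """Map a 1D Hilbert index d in [0, n*n) to (x, y) on an n x n grid.

    Assumption: n is a power of two. This is the classic iterative conversion,
    building the coordinate from the lowest-order quadrant upward.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the current sub-square so consecutive quadrants connect.
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# Flattening an 8 x 8 feature map in Hilbert order keeps neighbours adjacent.
order = [hilbert_d2xy(8, d) for d in range(64)]
```

Unlike row-major flattening, where the end of one row jumps to the start of the next, each step along this ordering moves to a directly neighbouring cell.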
Paperid: 3703,   Poster  
Authors: Zhou Tao, Shida Wang, YongXiang Hua, Haoyu Cao, Linli Xu
Title: DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Models
Abstract: Multimodal Large Language Models have achieved impressive performance on a variety of vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning remain limited. In this work, we introduce DiG (Differential Grounding), a novel proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs without prior knowledge of their number. To support scalable training, we develop an automated 3D rendering-based data generation pipeline that produces high-quality paired images with fully controllable discrepancies. To address the sparsity of difference signals, we further employ curriculum learning that progressively increases complexity from single to multiple differences, enabling stable optimization. Extensive experiments demonstrate that DiG significantly improves model performance across a variety of visual perception benchmarks and that the learned fine-grained perception skills transfer effectively to standard downstream tasks, including RefCOCO, RefCOCO+, RefCOCOg, and general multimodal perception benchmarks. Our results highlight differential grounding as a scalable and robust approach for advancing fine-grained visual reasoning in MLLMs.
Paperid: 3704,   Poster  
Authors: Qiang Qi, Xiao Wang, Zongyuan Du, Yu Zhang
Title: When Transformers Meet Mamba: A Hybrid Transformer-Mamba Network for Video Object Detection
Abstract: Video object detection has gained notable progress with the advent of transformers. While transformers excel at modeling long-range contextual dependencies, the quadratic complexity limits their efficiency in long-sequence processing. In contrast, Mamba offers greater efficiency in modeling long sequences but tends to exhibit relatively limited contextual learning capability compared with transformers, and its application to video object detection remains unexplored. To harness the complementary strengths of transformers and Mamba, we propose a hybrid Transformer-Mamba network for video object detection (TMambaDet), a pioneering framework in this domain that combines the long-range modeling power of transformers with the efficient long-sequence processing capability of Mamba. Our TMambaDet is characterized by three core components: 1) a spatial adaptive deformable transformer encoder to effectively model the long-range dependencies within each frame, enabling intra-frame feature aggregation that substantially improves the spatial feature representations of objects; 2) a temporal cascaded bidirectional Mamba encoder to efficiently capture the long-range dependencies across frames in video sequences with linear complexity, enabling inter-frame feature aggregation that effectively enhances the temporal feature representations of objects; 3) a Mamba entangled transformer decoder to fully explore the interactions between object queries and spatial-temporal features, enabling fine-grained query-feature alignment that effectively enriches the instance-level representations of object queries. We conduct experiments on the ImageNet VID and EPIC-KITCHENS-55 datasets, showing that TMambaDet achieves state-of-the-art results. Codes will be released.
Paperid: 3705,   Poster  
Authors: Guangpu Yang, Steffen Kieß, Hanxiang Luo, Xingyu Liu, Sven Simon
Title: Exact-GS: Mathematically Rigorous and Accurate 3D Gaussian Splatting for 3D X-ray Reconstruction
Abstract: We propose Exact-GS, a novel mathematically rigorous and accurate 3D Gaussian Splatting model designed to perform 3D X-ray computed tomography (CT) reconstruction and novel view synthesis. Recently, 3D Gaussian Splatting has achieved considerable progress in 3D representation. Unfortunately, due to the affine approximation of the projective transformation, previous 3DGS-based methods inevitably suffer from artifacts and projection inconsistencies. To address this problem, some ray tracing based methods perform integration along the ray across Gaussians. However, these methods are computationally inefficient in the forward and backward passes. We introduce a novel closed-form splatting solution for this problem with mathematically rigorous derivation. Our model is the first to achieve the same exact rendering quality as ray tracing based methods without any approximation under a splatting-based formulation, enabling fast CUDA-based hardware rasterization. Additionally, we present a precise Gaussian-tile intersection algorithm, enabling faster and more efficient rendering. We demonstrate the performance gains in reconstruction and novel view synthesis on different synthetic and real-world datasets. Our method also contributes to the visible light scene representation, where the density accumulation (X-ray attenuation coefficient) in our model can be replaced by the integral of the opacity at alpha blending.
Paperid: 3706,   Poster  
Authors: Yingdong Gu, Shaocheng Yan, Zhenjun Zhao, Yuan Kou, Jianxin Luo, Pengcheng Shi, Jiayuan Li
Title: ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting
Abstract: Visual localization is a core technology for augmented reality and autonomous navigation. Recent methods combine the efficient rendering of 3D Gaussian Splatting (3DGS) with feature-based localization. These methods rely on direct matching between 2D query features and the 3D Gaussian feature field, but this often results in mismatches due to an inherent bias in the learned Gaussian feature. We theoretically analyze the feature learning process in 3DGS, revealing that the widely adopted \alpha-blending optimization inherently introduces bias into 3D point features. This bias stems from the entanglement between individual Gaussians and their neighboring Gaussians, making the learned features unsuitable for precise matching tasks. Motivated by these findings, we propose ULF-Loc, an unbiased landmark feature framework that replaces biased feature optimization with geometry-weighted feature fusion. We further introduce keypoint-consensus landmark sampling to select reliable Gaussians and local geometric consistency verification to reject mismatches caused by rendering artifacts. On the Cambridge Landmarks dataset, ULF-Loc reduces the mean median translation error by 17% compared to the state-of-the-art, while achieving superior efficiency with only 1/10 the training time and 1/6 the GPU memory of STDLoc.
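A small numeric sketch may help show why supervising only the \alpha-blended feature can leave individual Gaussian features entangled with their neighbors. It uses the standard front-to-back compositing weights; it does not reproduce the paper's theoretical analysis or ULF-Loc's geometry-weighted fusion, and the function names and toy values are assumptions made for this illustration.

```python
def blend_weights(alphas):
    """Front-to-back compositing weights w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights, transmittance = [], 1.0
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha
    return weights

def render_feature(alphas, features):
    """Blend per-Gaussian feature vectors along one ray with the weights above."""
    weights = blend_weights(alphas)
    dim = len(features[0])
    return [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]

# Toy ray crossing two Gaussians (illustrative values only).
alphas = [0.5, 0.5]
features = [[2.0, 0.0], [0.0, 0.0]]
target = [1.0, 0.0]

weights = blend_weights(alphas)              # [0.5, 0.25]
rendered = render_feature(alphas, features)  # [1.0, 0.0]
```

Here the rendered feature equals the target even though neither Gaussian stores the target feature itself, so a loss on the rendered value alone does not pin down per-Gaussian features suitable for direct 2D-3D matching.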
Paperid: 3707,   Poster  
Authors: Haoyu Jiang, Xiaoliang Chen, Duoqian Miao, Xiaolin Qin, Xianyong Li, Yajun Du
Title: CICA: Coupling Confidence-Aware Pretraining with Confidence-Informed Attention for Robust Multimodal Sentiment Analysis
Abstract: Multimodal sentiment analysis requires integrating language, visual, and acoustic cues, yet these modalities are often noisy, incomplete, or contradictory, making fusion unreliable. Most existing methods assume uniformly trustworthy modalities and thus degrade when signals conflict. To address this, we propose CICA, a framework that couples Confidence-Aware Pretraining with Confidence-Informed Attention. In pretraining, each modality encoder learns to estimate the reliability of its own representation, producing both embeddings and confidence scores. These scores then guide a confidence-informed attention mechanism, which strengthens contributions from reliable modalities while suppressing noisy or conflicting ones, enabling adaptive fusion under varying signal conditions. CICA achieves state-of-the-art performance across four major benchmarks: MOSI, MOSEI, CH-SIMS, and CH-SIMSv2. It achieves MAE 0.630 and Corr 0.855 on MOSI, and MAE 0.489 and Corr 0.856 on MOSEI, significantly surpassing prior methods. Consistent improvements are also observed across Acc-7, Acc-2, and F1 metrics. Under noisy and missing-modality conditions, CICA maintains significantly more stable performance, indicating improved robustness and interpretability.
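One plausible way to let confidence scores guide attention is to add the log-confidence of each modality to its attention logit before the softmax. The sketch below assumes this particular form purely for illustration; the abstract does not specify CICA's actual formulation, and the function name, the scaled dot-product scoring, and the toy inputs are all hypothetical.

```python
import math

def confidence_informed_attention(query, keys, values, confidences):
    """Scaled dot-product attention over modalities with log-confidence added to each logit.
    Assumes every confidence is strictly positive (hypothetical convention)."""
    scale = math.sqrt(len(query))
    logits = [
        sum(q * k for q, k in zip(query, key)) / scale + math.log(conf)
        for key, conf in zip(keys, confidences)
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, fused

# Toy example: three modalities with identical keys, so only confidence differs.
query = [1.0, 0.0]
keys = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
confidences = [0.8, 0.1, 0.1]
weights, fused = confidence_informed_attention(query, keys, values, confidences)
```

With equal relevance logits the attention weights reduce to the normalized confidences, so the low-confidence modalities contribute little to the fused vector.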
Paperid: 3708,   Poster  
Authors: Haojuan Li, Tang Ruohan, Dongzhou Cheng, Zongpu Zhang, Jian Li, Jiaqi Wang
Title: SegMo: Co-Designing Content-Aware Sparsity and Locally-Cohesive Segment Parallelism for Efficient VLM Inference
Abstract: Video Large Language Models (VideoLLMs) face a fundamental performance bottleneck: the token explosion intrinsic to video inputs. The resulting O(N^2) prefill cost makes conventional Transformer inference prohibitively expensive at scale. Existing attempts fall into a hard accuracy–latency dilemma: naive sparsification risks losing essential temporal–spatial context, whereas naive parallelization introduces substantial communication and memory overhead. To overcome this impasse, we argue that algorithm–system co-design is not optional but necessary, jointly optimizing what to compute (sparsification) and how to compute it (parallelism). We introduce SegMo, a unified framework that instantiates this co-design principle and enables efficient, accurate VideoLLM inference at scale. SegMo is driven by the empirical insight that VideoLLM attention exhibits Local Cohesion. Our system implements this via two integrated components: (1) Content-Aware Sparsification (CAS): A lightweight, hierarchical algorithm that first employs Query Relevance for scene-level assessment, and then uses Temporal Redundancy for intra-scene static redundancy pruning, to generate a precise, non-uniform computation load, ensuring accuracy. (2) Locally-Cohesive Segment Parallelism (LSP): A novel paradigm that leverages attention locality to partition the video at scene boundaries, using a lightweight Global Context Injection mechanism to replace the massive communication and memory overheads of global attention. SegMo was validated across LVBench, LongVideoBench, and Video-MME. Our CAS module improved accuracy by up to 12.00%. When integrated with LSP, the full system (CAS + LSP) achieved a peak prefill acceleration of 3.55x, while still maintaining a significant accuracy gain of up to 8.31%.
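A back-of-the-envelope count illustrates why partitioning a long token sequence into segments reduces the O(N^2) prefill cost. The sketch only counts query-key pairs under idealized assumptions; it is not SegMo's implementation, it says nothing about the measured 3.55x speedup, and modeling Global Context Injection as a few extra tokens per segment (`num_global_tokens`) is a hypothetical simplification introduced here.

```python
def attention_pairs(num_tokens):
    """Query-key pairs evaluated by full (non-causal) self-attention over N tokens."""
    return num_tokens ** 2

def segmented_pairs(segment_lengths, num_global_tokens=0):
    """Pairs evaluated when attention is restricted to each segment,
    with an optional handful of shared context tokens visible to every segment."""
    return sum((n + num_global_tokens) ** 2 for n in segment_lengths)

# Illustrative sizes: 8 scenes of 1000 tokens each.
segments = [1000] * 8
full = attention_pairs(sum(segments))                            # 64,000,000 pairs
local_only = segmented_pairs(segments)                           # 8,000,000 pairs
with_context = segmented_pairs(segments, num_global_tokens=16)   # 8,258,048 pairs
```

For equal-length segments the pair count drops by roughly the number of segments, and a small number of shared context tokens adds little on top; real speedups also depend on memory traffic, communication, and kernel efficiency.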
Paperid: 3709,   Poster  
Authors: Sashuai zhou, Qiang Zhou, Ma Junpeng, Yue Cao, Ruofan Hu, Ziang Zhang, Xiaoda Yang, Zhibin Wang, Jun Song, YuCheng YuCheng, Bo Zheng, Zhou Zhao
Title: SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
Abstract: Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.
Paperid: 3710,   Poster  
Authors: Hyeonggon Ryu, Joon Chung, David Harwath
Title: Hear you are: Teaching LLMs Spatial Reasoning with Vision and Spatial Sound
Abstract: Many audiovisual learning methods have focused on aligning audio and visual information, either through semantic or temporal correspondence. However, most of these works have utilized monaural audio, which does not contain information about the spatial location of the sound source. In contrast, humans and other animals utilize binaural hearing to perceive this spatial information. Combining spatial sound and visual perception enables powerful high-level reasoning: for example, a person looking for their phone may hear the ringing sound coming from a backpack sitting on a table, and quickly infer that the missing phone is inside the backpack. In this paper, we investigate the problem of Audio-Visual Spatial Reasoning. We design a spatial audio-visual question answering dataset to cover scenarios where semantic correspondence between audio and visual signals is absent but spatial alignment exists, as well as cases with multiple audio-visual semantic correspondences that require spatial reasoning to disambiguate. We propose a model that learns spatial comprehension across the audio and vision modalities by connecting them with a large language model and experimentally demonstrate that spatial sound perception is an essential part of our task.
Paperid: 3711,   Poster  
Authors: Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu
Title: OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
Abstract: Video Temporal Grounding (VTG), the task of localizing video segments from natural language queries, faces significant challenges in open-world applications. These challenges stem from the limited scale and semantic diversity of existing datasets, which lead to a performance gap between common and rare concepts. To overcome these limitations, we introduce OmniVTG, a new large-scale dataset for open-world VTG, coupled with a Self-Correction Chain-of-Thought (CoT) training paradigm designed to enhance the grounding capabilities of Multimodal Large Language Models (MLLMs). Our OmniVTG is constructed via a novel Semantic Coverage Iterative Expansion pipeline, which first identifies gaps in the vocabulary of existing datasets and collects videos that are highly likely to contain these target concepts. For high-quality annotation, we leverage the insight that modern MLLMs excel at dense captioning more than direct grounding and design a caption-centric data engine to prompt MLLMs to generate dense, timestamped descriptions. Beyond the dataset, we observe that simple supervised finetuning (SFT) is insufficient, as a performance gap between rare and common concepts still persists. We find that MLLMs' video understanding ability significantly surpasses their direct grounding ability. Based on this, we propose a Self-Correction Chain-of-Thought (CoT) training paradigm. We train the MLLM to first predict, then use its understanding capabilities to reflect on and refine its own predictions. This capability is instilled via a three-stage pipeline of SFT, CoT finetuning, and reinforcement learning. Extensive experiments show our approach not only excels at open-world grounding in our OmniVTG dataset but also achieves state-of-the-art zero-shot performance on four existing VTG benchmarks.
Paperid: 3712,   Poster  
Authors: HUANG HUIYUAN, SANG MIN YOON
Title: SSM-Aware Token-Efficient VMamba via Adaptive Patch Pruning and Merging for Person Re-Identification
Abstract: Person re-identification (Re-ID) requires a balance between discriminative capability and computational efficiency for real-world deployment. However, even the Visual State Space Model (SSM), despite its linear complexity, suffers from redundant computation due to dense token processing. We propose SSM-aware Token-Efficient VMamba (TE-VMamba), which integrates adaptive patch pruning and merging modules to reduce redundant tokens while preserving identity-discriminative cues. The layer-adaptive pruning strategy removes low-importance tokens in shallow layers to enhance efficiency, whereas the depth-aware merging strategy consolidates semantically similar tokens in deeper layers to improve representation compactness. Learnable layer-wise thresholds dynamically balance accuracy and computational cost across the network. On the Market-1501 benchmark, TE-VMamba reduces FLOPs by over 60%, achieving significant computational savings while maintaining competitive accuracy. These results highlight the potential of structured token reduction in state-space models for efficient and powerful person re-identification.
Paperid: 3713,   Poster  
Authors: Haonan Jia, Shichao Dong, Xin Dong, Zenghui Sun, Jin Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang
Title: Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
Abstract: Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. Particularly, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning on Qwen2.5-VL-7B. The code will be released when the paper is accepted.
Paperid: 3714,   Poster  
Authors: Yifan Li, Haofeng Huang, Wenhan Yang, Jiaying Liu
Title: Towards Generalized Representations for Low-Light Understanding: When Signal Constancy Meets Semantic Enrichment
Abstract: Low-light degradation hampers machine understanding at night. Existing methods either overfit labeled data (paired supervision) or specific distributions (unpaired supervision), resulting in poor generalization under unseen degradations. In this paper, we propose UniPrior, a unified prior-based low-light adaptation framework that integrates the general semantic prior embedded in vision foundation models (VFMs) with illumination-invariant priors, to capture both stable and changing semantics under varied low-light degradation without any real low-light training data. In detail, the illumination-invariant prior is used as an auxiliary input, and a parallel decoder reconstructs it as a regularization target, enforcing representation consistency and reducing feature drift. Such signal constancy enables us to build a VFM-aligned semantic space via a contrastive training strategy guided by VFM self-correlation maps, enriching features with high-level cues, thereby improving adaptation to diverse low-light conditions. Beyond high-level features, we also give a joint consideration of such unified prior and low-level signal space through our machine-oriented enhancement scheme. We extend the signal prior to handle overexposure and inject VFM-guided semantic cues into the enhancement process via a CLIP-based loss. This coupling of semantic alignment and pixel correction enables sample-adaptive optimization to improve performance. Extensive experiments on multiple low-light tasks demonstrate our method’s superiority and practical utility.
Paperid: 3715,   Poster  
Authors: Zecheng Hao, Yifan Huang, Zijie Xu, Wenxuan Liu, Yuanhong Tang, Zhaofei Yu, Tiejun Huang
Title: Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning via Hybrid-Driven LIF Model
Abstract: Spiking Neural Networks (SNNs) are considered to have enormous potential in the future development of Artificial Intelligence due to their brain-inspired and energy-efficient properties. Compared to vanilla Spatial-Temporal Back-propagation (STBP) training methods, online training can effectively avoid the risk of GPU memory explosion. However, current online learning frameworks cannot tackle the gradient discrepancy problem between the forward and backward processes, merely aiming to optimize GPU memory, resulting in no performance advantages compared to STBP-based models in the inference stage. To address the aforementioned challenges, we propose the Hybrid-Driven Leaky Integrate-and-Fire (HD-LIF) model family for efficient online learning, which adopts different spiking calculation mechanisms in the upper and lower regions of the firing threshold. We theoretically show that our learning framework can effectively separate temporal gradients and address the misalignment problem of surrogate gradients, as well as achieve full-stage optimization of learning precision, memory footprint, and power consumption. Experimental results demonstrate that our scheme is able to achieve state-of-the-art performance on multiple evaluation metrics, breaking through the traditional paradigm of SNN online training and deployment.
Paperid: 3716,   Poster  
Authors: Jinwen Wang, Youfang Lin, Xiaobo Hu, Shuo Wang, Kai Lv
Title: Local Motion Matters: A Deconstruct–Recompose Paradigm for Reinforcement Learning Pre-training from Videos
Abstract: Pretraining on large-scale videos to improve reinforcement learning efficiency is promising yet remains challenging. Existing methods typically treat the agent as an indivisible entity, modeling motion patterns globally. Such global modeling is tightly coupled with the morphology, hindering transfer across domains. In contrast, despite the vast disparity in global motions, the local components exhibit similar motion patterns across different agents. Building on this insight, we propose a novel Deconstruct–Recompose Paradigm (DRP) for learning transferable local motion representations. Specifically, in the Deconstruct phase, we identify multiple local points and track their frame-wise motions, defining each as an Atomic Action. We introduce a Dual-Attention Encoder (DAE) to learn local motion representations from these Atomic Actions, capturing their spatiotemporal relationships. In the Recompose phase, we compose local motion representations with a learnable Motion Aggregation Token '[MAT]' via latent dynamics model learning. Additionally, an adapter bridges local motion and downstream action-specific dynamics to accelerate policy learning. Extensive experiments demonstrate that our method effectively transfers to diverse robotic control and manipulation tasks, significantly improving sample efficiency and performance.
Paperid: 3717,   Poster  
Authors: Yuchuan Li, Azadeh Motamedi, Hyock Ju Kwon, Chul Park, Il-Min Kim
Title: UNI-OOD: Unified Object- and Image-level Out-of-Distribution Detection via Cross-Context Attentive Vision-Language Modeling
Abstract: Out-of-distribution (OOD) detection is a key requirement for reliable deployment in open-world environments, where a model must recognize inputs that fall outside the semantic scope of known concepts. While recent advances in vision–language models (VLMs) have achieved strong results in image-level OOD detection, most methods still assume that each image contains a single dominant object. This assumption severely limits their applicability to real-world settings where scenes are naturally composed of multiple objects that each demand independent OOD assessment. Existing object-level approaches, including the current SOTA method RUNA, remain constrained by coarse global representations and insufficient modeling of contextual dependencies between objects and their backgrounds. We propose UNI-OOD, a unified framework that performs both object- and image-level OOD detection within a single vision–language model, without requiring prior knowledge of which task is being addressed at inference time. The key idea is to leverage cross-context attentive modeling that captures complementary visual and textual semantics. UNI-OOD learns to attend to fine-grained spatial details within each object, aligns visual and linguistic embeddings to strengthen semantic correspondence, and models interactions between target objects and their surrounding context. By jointly reasoning over object-centric and background cues, the framework disentangles informative visual evidence from spurious correlations and enables a consistent OOD scoring mechanism across different visual granularities. Extensive experiments on standard object- and image-level benchmarks demonstrate that UNI-OOD achieves substantial and consistent improvements over previous approaches, establishing new SOTA performance in both object-level and image-level OOD detection. Beyond empirical gains, this study provides the first holistic formulation of OOD detection that bridges the gap between object- and image-level detection within a single unified vision–language paradigm, establishing a general foundation for open-world applications.
Paperid: 3718,   Poster  
Authors: Yi Liu, Yi Wan, Lei Yu, Panwang Xia, Qiong Wu, Yingying Pei, Xuejun Huang, Junjian Zhang, Xiangyuan Cai, Hongwei Hu, Yongjun Zhang
Title: Beyond Tie Points: Satellite Image Block Adjustment based on Dense Feature Consistency
Abstract: Owing to the weak stereo geometry of satellite images, Planar Block Adjustment (PBA) is a predominant technique for correcting geometric distortions in satellite images, which treats elevation as a known constraint and primarily optimizes planar coordinates. Existing PBA methods mainly rely on explicit tie points, suffering from parallax caused by inaccurate elevation (e.g., near high buildings) and irreversible error accumulation, which severely degrades adjustment accuracy. In this paper, a "Beyond Tie Points" paradigm for satellite image adjustment is proposed. A pretrained feature extractor is employed to extract robust dense features and a parallax-aware confidence map from each image. A gridded coarse-to-fine optimization framework then directly solves for the adjustment parameters based on confidence-weighted feature consistency. Experiments conducted on multiview satellite image datasets covering Beijing, Guangzhou and San Jose demonstrate that the proposed method is significantly superior to traditional approaches in both accuracy and robustness, reducing the average error by up to 75.43% compared to traditional PBA.
Paperid: 3719,   Poster  
Authors: Wenwen He, Wenke Huang, Yiyang Fang, Wenjie Qu, Jiaheng Zhang, Mang Ye
Title: Batman: Benign Knowledge Alignment Through Malicious Null Space in Federated Backdoor Attack
Abstract: Federated Learning (FL), a distributed learning paradigm that enables local training on user-held data across decentralized devices, is vulnerable to backdoor attacks due to limited visibility into client updates. Exploiting this opacity, adversaries induce targeted misbehavior on trigger inputs without affecting overall performance, thereby compromising the trust and integrity of collaborative training in federated learning systems. Existing federated backdoor attacks mainly pursue benign knowledge alignment through trigger-surface design or representation guidance to evade defense mechanisms. However, trigger-surface attacks suffer from insufficient alignment, leaving malicious knowledge distinguishable from benign updates. In contrast, representation-guided attacks attempt to obscure the boundary between benign and malicious behaviors. Nevertheless, excessive incorporation of benign knowledge within a shared parameter space leads to over-alignment, ultimately degrading attack effectiveness. To overcome the shared-parameter-space dilemma in backdoor attacks, we propose Batman, a novel backdoor attack that aligns benign knowledge within the malicious null space, which effectively decouples the malicious space from the shared parameter space and enables benign alignment in an orthogonal direction of this space that does not interfere with attack effectiveness. To further enhance stealthiness, we combine both clean and global models to guide the alignment perturbation within this null space to evade detection. Experiments on four benchmark datasets demonstrate that Batman consistently achieves strong backdoor performance while remaining stealthy under various defenses.
Paperid: 3720,   Poster  
Authors: Yanming hui, Fanhua Shang, Hongying Liu, Ben Wang, Zhenwei Zhang, Liang Wan, Wei Feng, Tong Xue, Bingqin Lv
Title: VSRELL: A Simple Baseline for Video Super-Resolution and Enhancement in Low-Light environment
Abstract: We propose an integrated learning scheme for Video Super-Resolution and Enhancement in Low-Light environments, named VSRELL, which aims to recover Well-Illuminated High-Resolution (WIHR) sequences from their Low-Light Low-Resolution (LLLR) counterparts. Due to the complex coupling of multiple degradations, this joint task has received relatively little attention. Our approach jointly models illumination enhancement and spatial-temporal super-resolution to disentangle intertwined degradations. Specifically, we introduce an Illumination-Noise Co-Optimization (INCO) network that employs a dynamic window partitioning strategy to explicitly model physical priors of illumination variations and noise distributions within individual frames of a long-term sequence. This effectively suppresses cross-frame noise accumulation and illumination flickering, achieving simultaneous optimization of motion compensation and brightness correction. Additionally, an Illumination-Sensitive Feature Propagation (ISFP) mechanism is introduced, which utilizes a hierarchical illumination-sensing gating unit to adaptively modulate feature channel responses. By adjusting feature propagation intensity and using a memory feature attenuation strategy, it enhances the weighting of high-quality features, suppresses the propagation of accumulated errors, and strengthens transmission efficiency. The experiments show that VSRELL can explicitly strengthen the brightness continuity and texture fidelity of the restored output, maintaining temporal consistency across the video.
Paperid: 3721,   Poster  
Authors: Xinyuan Zhao, Hanlin Gu, Guibao Song, Gongxi Zhu, Yifei Zou, Lixin Fan, Yuxing Han
Title: PrivSynth: Alternating and Control-Based Optimization for Privacy and Utility in Synthetic Data
Abstract: As publicly available data dwindles, synthetic data generation (SDG) has become a practical solution for privacy-preserving data sharing. By training generative models on private data, SDG creates samples that retain task-relevant features while obfuscating sensitive content. However, recent work shows that synthetic data can still leak private information via membership inference and reconstruction attacks. Existing defenses often degrade downstream utility. To address the privacy-utility trade-off, we formulate SDG as a bi-objective optimization problem. Yet, intractable gradients and expensive subset evaluation pose major challenges. We address this via alternating optimization over the generative model and the data selection parameter, and further recast the selection step as a discrete-time optimal control problem, solved using Pontryagin’s Maximum Principle. We propose PrivSynth, a framework that quantifies multiple privacy risks and integrates them into the control objective. Theoretical analysis guarantees convergence, and experiments on benchmark and medical datasets show that PrivSynth achieves better utility and stronger privacy protection than state-of-the-art methods.
Paperid: 3722,   Poster  
Authors: Jaehoon Jeong, Yi Hu, Soopil Kim, Jongseong Jang, Soonyoung Lee, Sang Hyun Park
Title: IVAAN: Instance-level Vision-Language Alignment via Attribute-Guided Text Prompts Generation for Nuclei Analysis
Abstract: Nuclei instance segmentation and classification are fundamental but remain challenging in pathology due to severe class imbalance and organ- and stain-induced variability. While vision–language approaches can inject explicit semantic cues that reduce spurious contextual bias under imbalance, the absence of instance-level textual annotations has limited their utility for nucleus-level analysis. We introduce an instance-level vision–language framework that derives attribute-guided textual descriptions from ground-truth masks. We then align visual representations with these semantic text anchors via contrastive learning, coupling morphology with semantics at the instance level. To capture intra-class variations while maintaining organ-consistent class semantics, we learn multiple class-specific tokens that act as prototypes representing diverse submodes within a class, summarizing morphologically similar nuclei. Our approach improves both segmentation and classification without manual text labels, indicating that language-guided instance alignment combined with prototype-based semantic feedback yields more discriminative and generalizable nuclei representations.
Paperid: 3723,   Poster  
Authors: Sicheng Xu, Yu Deng, Shoukang Hu, Yichuan Wang, Yizhong Zhang, Zhan Chen, Jiaolong Yang, Baining Guo
Title: Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs
Abstract: Video diffusion models have significantly advanced portrait video generation, yet their high computational demands limit their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Designed meticulously for streaming scenarios, it features a causal video VAE for deep latent compression and an autoregressive latent denoising model. Our causal VAE integrates a variable number of reference images as guidance, allowing the network to focus on dynamic information rather than static appearance, thereby enhancing compression efficacy and reconstruction quality. Additionally, we extend the residual auto-encoding paradigm to improve spatial-temporal causality handling in our VAE. The generator is based on a Rectified Flow Transformer architecture and produces video latents in a blockwise auto-regressive manner. Our method enables the real-time generation of high-quality talking portrait videos, achieving speeds significantly faster than baseline models. Furthermore, comprehensive experiments demonstrate that it is on par with or even outperforms these large models in realism, vividness, and video quality.
Paperid: 3724,   Poster  
Authors: Xianyun Wang, Jiaxu Miao, Tian Xu, Siyuan Wang, Yuehao Li, Haoyang Hu, Jun Xiao, Yonghong Tian, Jun Yu
Title: PromptDepth: Efficient and Promptable Geometric 3D Vision Model for Embodied Intelligence
Abstract: Vision models for embodied intelligence require efficient 3D comprehension and interaction with objects within the scene. Existing 3D reconstruction models either overlook instance-level perception or rely on time-consuming offline reasoning, showing limited adaptability in real-time embodied scenarios. In this paper, we present PromptDepth, the first promptable vision model that features both geometric 3D understanding and instance-level interaction, designed specifically for embodied intelligence. PromptDepth is a feed-forward network that quickly yields panoptic, instanced, or tracked depth maps from two corresponding frames, enabling real-time inference on sequences from embodied agents. Specifically, following the minimal prediction problem, we design a promptable Dense Prediction Transformer, making it flexible to interact with unified dense prediction according to a specific prompt. Considering the substantial discrepancy between panoptic and instanced depth maps, we further introduce a novel Instanced Label Distribution Smoothing (ILDS) loss, followed by Gram Anchoring, to mitigate the inherent conflict between dense and discrete representations. Trained on synthetic data only, our model achieves state-of-the-art results in both depth estimation and interactive segmentation on public benchmarks. Extensive experiments demonstrate superior visual efficiency in embodied tasks compared to current foundation models. We believe that our efficient and flexible geometric 3D model offers a new foundation for vision tasks in embodied intelligence. The dataset and the code will be released.
Paperid: 3725,   Poster  
Authors: Yi He, Lei Yang, Bofan Chen, Shi-Lin Wang
Title: Enhancing the Security of Visual Speaker Authentication Based on Dynamic Lip-Print Analysis
Abstract: In recent years, face-based authentication methods have been gradually replacing traditional methods across various applications, offering enhanced security and user convenience. However, these methods are threatened by continuously evolving DeepFake techniques. In this paper, a novel Visual Speaker Authentication (VSA) approach based on dynamic lip-prints is proposed to improve system security against diverse attacks. The lip-prints are discriminative viseme segments that capture a user's localized speaking habits. By leveraging these dynamic lip-prints, this approach can expand the prompt set without requiring additional user recordings or model retraining, thereby strengthening resilience to replay attacks. Moreover, a Multi-Layer Dynamic-Enhanced Encoder is introduced to model fine-grained lip dynamics, addressing data scarcity challenges and ensuring robust feature extraction even in scenarios with short temporal spans and limited enrollment data. We have carried out extensive experiments on several datasets, and the results demonstrate the effectiveness of the proposed method in both security enhancement and prompt set scalability.
Paperid: 3726,   Poster  
Authors: MENGZHEN LIU, Enshen Zhou, Cheng Chi, Yi Han, Shanyu Rong, Liming Chen, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Title: SaPaVe: Towards Active Perception and Manipulation in Vision-Language Action Models for Robot
Abstract: Active perception and manipulation are crucial for embodied robots to interact with complex scenes. Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution. To this end, we propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Central to our approach is a decoupling of camera and manipulation actions, in contrast to a shared action space, and a bottom-up learning strategy: we first train semantic camera control on our proposed large-scale dataset, then jointly optimize both action types via hybrid data. To support this learning, we introduce ActiveViewPose-200K, comprising 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We further present ActiveManip-Bench, the first benchmark filling the gap to evaluate active manipulation. Extensive experiments in both simulation and real-world settings show that SaPaVe outperforms recent VLA models such as GR00T and π0, achieving up to 31.25% higher success rates in real-world tasks. Our results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation.
Paperid: 3727,   Poster  
Authors: Youngrok Jang, Hyesoo Kong, Kyunghwan An, Jae Huh, Gyeonghun KIM, Stanley Jungkyu Choi
Title: VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA
Abstract: Real-world documents combine text with tables, charts, photographs, and diagrams arranged in diverse layouts, yet existing research on multimodal large language models (MLLMs) for document QA predominantly produces text-only responses, underutilizing these visual elements. We introduce VinQA, a dataset designed for long-form answer generation where cited visual elements are explicitly interleaved with their supporting text and grounded in relevant document pages. To support this task, we study two encoding methods for feeding raw document page images into an MLLM, along with their visual-element citation mechanisms: (1) Page Encoding, which directly encodes full-page images with bounding boxes of visual elements and treats these boxed regions as citable units; and (2) Modality Encoding, which parses each page to extract text and crop visual elements, encodes them separately, and uses these cropped elements as citable units. In our experiments, we propose M-GroSE, a multimodal evaluation framework extending GroUSE to assess such answers along four dimensions: completeness, answer relevancy, faithfulness, and unanswerability. We additionally report Visual Source F1 to directly measure visual citation accuracy. Although proprietary frontier models still achieve the best overall scores on the VinQA test split, fine-tuning open Qwen2.5-VL models on the VinQA training split substantially improves their performance and markedly narrows this gap. Modality Encoding is initially more robust than Page Encoding for complex documents with long text, many visual elements, and diverse visual citation requirements. After training on VinQA, however, Page Encoding reaches a comparable performance level, showing that it can compete effectively even without the explicit parsing used in Modality Encoding. Finally, Visual G-Eval, an MLLM-based judge, confirms that fine-tuned models insert visual elements at semantically appropriate positions with faithful supporting text.
Paperid: 3728,   Poster  
Authors: Lixuan Chen, Zhongnan Liu, Jesse Hamilton, James Balter, Jeong Joon Park, Liyue Shen
Title: Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement
Abstract: Prospective reconstruction is crucial in many clinical applications such as MRI-guided radiotherapy, which demands accurate image reconstruction and fast motion estimation from currently acquired measurements. However, prospective reconstruction remains challenging due to ultra-sparse sampling and stringent latency requirements. In this work, we propose PDMR, a Prospective Dynamic 3D MRI Reconstruction framework with latent-space motion tracking. Our core idea is to learn an efficient and generalizable latent manifold of motion fields offline, enabling rapid online adaptation for prospective reconstruction. Specifically, we parameterize the deformation vector fields (DVFs) on a low-dimensional manifold, effectively reducing the search space for fast online adaptation, and employ a tri-plane representation to achieve geometry-aware and memory-efficient encoding of 3D motion. Experiments on both XCAT digital phantoms and in-house abdominal MRI datasets demonstrate that PDMR achieves high-fidelity and temporally consistent reconstruction across multiple prospective scenarios (Immediate and After-2min), outperforming state-of-the-art retrospective and online methods. Our results suggest a promising pathway toward ultra-fast, motion-aware prospective MRI reconstruction in clinical practice.
Paperid: 3729,   Poster  
Authors: Yingzhao Li, Yanjie Liu, lijun zhao
Title: ActivePolicy: Active Gaussian Reconstruction and Optimization Strategy Based on Global-Local Information Gain
Abstract: Active 3D Gaussian reconstruction achieves superior completeness and rendering quality by intelligently selecting viewpoints. However, existing methods suffer from two critical limitations: information gain metrics that prioritize geometric coverage while ignoring rendering quality, and overfitting to sparse view configurations that degrades novel view synthesis. We introduce ActivePolicy, a novel framework addressing both challenges through principled NBV selection and regularization. We propose GLGraph, a graph-theoretic strategy that unifies geometric consistency, rendering quality, and observation redundancy into a single stability criterion. To counteract overfitting, we introduce 4D-Reg, which identifies floaters through manifold discrepancies among three depth types (R-Depth, α-Depth, C-Depth) and suppresses them via adaptive dropout. Extensive experiments demonstrate state-of-the-art reconstruction completeness and rendering fidelity on standard benchmarks.
Paperid: 3730,   Poster  
Authors: Jiahao Chen, Zihui Zhang, Yafei Yang, Jinxi Li, Shenxing Wei, Zhixuan Sun, Bo Yang
Title: EVObject: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision
Abstract: We introduce EVObject for unsupervised 3D instance segmentation that bridges the geometric domain gap between synthetic pretraining data and real-world point clouds. Current methods suffer from structural discrepancies when transferring object priors from synthetic datasets (e.g., ShapeNet) to real scans (e.g., ScanNet), particularly due to morphological variations and occlusion artifacts. To address this, EVObject integrates two innovative modules: (1) an object discerning module that dynamically refines object candidates, enabling continuous adaptation of object priors to target domains; and (2) an object completion module that reconstructs partial geometries before discovering objects. We conduct extensive experiments on two real-world datasets and one synthetic dataset, demonstrating superior 3D object segmentation performance over all baselines while achieving state-of-the-art results.
Paperid: 3731,   Poster  
Authors: Yuehao Liu, Shanyan Guan, Weijia Zhang, Xuanming Shang, Yanhao Ge, Wei Li, Chao Ma
Title: Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models
Abstract: Continual learning in multimodal large language models (MLLMs) aims to sequentially acquire knowledge while mitigating catastrophic forgetting, yet existing methods face inherent limitations: architecture-based approaches incur additional computational overhead and often generalize poorly to new tasks; rehearsal-based methods rely on storing historical data, raising privacy and storage concerns; and conventional regularization-based strategies alone are insufficient to fully prevent parameter interference. We propose Octopus, a two-stage continual learning framework based on History-Free Gradient Orthogonalization (HiFGO), which enforces gradient-level orthogonality without historical task data. The two-stage finetuning strategy decouples task adaptation from regularization, achieving a principled balance between plasticity and stability. Experiments on UCIT (Guo et al., 2025) show that Octopus establishes state-of-the-art performance, surpassing prior SOTA by 2.14% and 6.82% in terms of Avg and Last, respectively.
Paperid: 3732,   Poster  
Authors: Yuquan Yang, hui zhang, Wenyu Lu, Ziyin Zhang, Chuanming Zhang, Xiaohua Xu
Title: Hybrid Robust Collaborative Perception with LiDAR-4D Radar Fusion under Adverse Weather Conditions
Abstract: Current collaborative perception systems have significantly improved 3D object detection performance. However, widely used LiDAR and camera systems often suffer performance degradation under adverse weather conditions. The weather-robust 4D radar provides a promising solution to address this challenge. Nevertheless, the effective fusion of sparse 4D radar measurements with degraded LiDAR data remains a significant challenge due to cross-modal corruption and information loss. In this work, we propose a novel hybrid robust collaborative perception framework (HRCP), designed to improve the collaborative perception performance under adverse weather conditions through LiDAR-4D radar fusion. Specifically, we introduce a hybrid collaboration strategy that considers their distinct physical properties and processes them differently during information transmission. Additionally, we propose a bidirectional cross-modal gating (BCMG) module that enables LiDAR and 4D radar to mutually validate feature reliability, ensuring consistent cross-modal representation, and an adaptive feature enhancement (AFE) module that enables comprehensive refinement of degraded and suppressed regions to mitigate information loss. Extensive experimental results demonstrate that our method outperforms previous state-of-the-art approaches under adverse weather conditions.
Paperid: 3733,   Poster  
Authors: Yichao Xu, Qiaowei Miao, Jinsheng Quan, Wei Yang, Zhihui Li, Yawei Luo
Title: LangField4D: Learning Identity-Adaptive and Spatio-Temporal Continuous 4D Language Fields for Dynamic Scenes
Abstract: Constructing a 4D language field that supports open-vocabulary queries is essential for semantic perception and interaction in dynamic environments. Existing 4D Gaussian-based approaches face two major challenges. First, the assumption of a static identity per Gaussian leads to semantic inconsistency, as motion fields warp Gaussians across object boundaries over time, causing oscillating identity assignments. Second, current methods typically model dynamic semantics as a set of discrete, predefined state prototypes, which fail to capture dynamic continuity and delineate fine-grained temporal boundaries. To address these issues, we propose LangField4D, a novel 4D Gaussian framework that jointly models spatio-temporal identity and semantics in a unified and continuous representation. We introduce an Identity-Adaptive Gaussian Grouping module that assigns each Gaussian a learnable adaptation feature to dynamically capture its object affiliation, ensuring consistent semantic tracking across time. Building upon this affiliation structure, we further design a Continuous Spatio-Temporal Semantic Learning mechanism based on a Tetraplane representation, which encodes both time-invariant and time-varying semantics within a continuous latent space. Extensive experiments on dynamic scene benchmarks demonstrate that we achieve state-of-the-art performance on both time-agnostic and time-sensitive open-vocabulary query tasks.
Paperid: 3734,   Poster  
Authors: Zhongze Wu, Xiu Su, Feng Yang, Dan Niu, Shan You, Yueyi Luo, Jun Long
Title: Unlearning without Forgetting: Securely Removing Targeted Concepts from Large-Scale Vision-Language Open-Vocabulary Detectors
Abstract: Open-vocabulary detectors (OvOD) inherit tightly coupled cross-modal knowledge from web-scale pretraining, creating privacy, copyright, and compliance risks. Existing machine unlearning methods face geometric entanglement interference in OvOD: forgetting updates inevitably distort preserved knowledge due to shared semantic factors in decomposable embeddings. We introduce SafeDetect, a geometrically constrained unlearning framework that constructs a null-space from preserved knowledge embeddings offline, then constrains parameter updates to this orthogonal complement, mathematically preventing interference with retained concepts. Forgetting is achieved through a one-step mean-flow objective that drives forgotten concepts toward a non-detectable state, while multimodal decoupling prevents cross-modal recovery. We establish UOD-Bench, the first unified benchmark for OvOD unlearning, featuring 14.7K images with 67.3K region-phrase pairs across three tasks. Extensive experiments across UOD-Bench and standard benchmarks with diverse architectures (e.g., GroundingDINO, LLM-Det) demonstrate that SafeDetect achieves superior forgetting efficacy (64.75% improvement over NPO) while maintaining stable retention performance and significantly better zero-shot generalization, with 1.5× faster convergence than iterative methods. Code and benchmark will be released.
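The null-space constraint mentioned in the SafeDetect abstract can be illustrated with a small linear-algebra sketch. This is not the authors' implementation: it assumes a single linear layer, a hypothetical matrix `preserved` whose rows stand in for retained-concept embeddings, and an SVD-based projector chosen here for illustration.

```python
import numpy as np

def null_space_projector(preserved, tol=1e-10):
    """Return P with preserved @ P == 0 up to numerical error.

    preserved: (n, d) matrix; each row is an embedding that must stay
    unaffected (a hypothetical stand-in for retained knowledge).
    """
    # Right singular vectors with non-negligible singular values span the row space.
    _, s, vt = np.linalg.svd(preserved, full_matrices=False)
    rank = int(np.sum(s > tol * s.max()))
    row_basis = vt[:rank]                       # shape (rank, d)
    d = preserved.shape[1]
    # Identity minus the row-space projector = projector onto the null space.
    return np.eye(d) - row_basis.T @ row_basis

def project_update(delta_w, projector):
    """Constrain a (d_out, d) weight update so preserved inputs see no change."""
    return delta_w @ projector

rng = np.random.default_rng(0)
preserved = rng.normal(size=(3, 8))   # 3 retained embeddings of dimension 8
delta_w = rng.normal(size=(4, 8))     # raw "forgetting" update for a 4x8 layer
P = null_space_projector(preserved)
safe_delta = project_update(delta_w, P)
# Largest change the projected update causes on any preserved embedding.
max_change = np.abs(safe_delta @ preserved.T).max()
```

In this toy setting the projected update leaves the layer's outputs on the preserved embeddings unchanged, while directions outside their span remain free for the forgetting objective.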
Paperid: 3735,   Poster  
Authors: Shih-Po Lee, Ehsan Elhamifar
Title: AXG-Reasoner: Error Detection and Explanation in Long Task Videos with Vision–Language Models
Abstract: Virtual task assistants must recognize and explain users’ mistakes to provide effective and corrective guidance. In this paper, we address the problem of error reasoning in long task videos, which is to detect and explain errors. Although recent Vision–Language Models (VLMs) demonstrate strong capabilities in visual question answering, they struggle to attend to the sparse spatiotemporal cues associated with errors in long task videos. We introduce an error reasoning framework, AXG-Reasoner, that leverages a frozen VLM in conjunction with a proposed Action eXecution Graph (AXG) and a temporal action segmentation (TAS) model, obtained and learned from normal (error-free) videos. To enable VLMs to attend to the sparse spatiotemporal cues associated with errors, we decompose each action segment of the video, obtained by TAS, into a sequence of fine-grained subactions by aligning it with the AXG. For each subaction segment, we query the VLM using a small number of keyframes and enhanced prompts to detect and explain errors, enabling efficient inference. To avoid costly manual subaction annotations, we develop a method to automatically construct AXG from training videos using foundation models. Extensive experiments on EgoPER and CaptainCook4D show that our method consistently improves over VLM baselines in error explanation by effectively identifying spatiotemporal cues and achieves state-of-the-art performance in error detection.
Paperid: 3736,   Poster  
Authors: Boyang Dai, Chaoqi Chen, Yizhou Yu
Title: Divide and Conquer: Object Co-occurrence Helps Mitigate Simplicity Bias in OOD Detection
Abstract: Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in-distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near-OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object-Centric OOD detection framework that learns to capture Object CO-occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co-occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co-occurrence patterns observed in ID training data, and finally performs OOD detection in a divide-and-conquer manner. By doing so, OCO can distinguish near-OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full-spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts.
Paperid: 3737,   Poster  
Authors: xuzhi wang, Xinran Wu, Song Wang, Lingdong Kong, Ziping Zhao
Title: ASFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments
Abstract: Indoor monocular semantic scene completion (MSSC) is notably more challenging than its outdoor counterpart due to complex spatial layouts and severe occlusions. While transformers are well suited for modeling global dependencies, their high memory cost and difficulty in reconstructing fine-grained details have limited their use in indoor MSSC. To address these limitations, we introduce ASFormer, a serialized transformer framework tailored for indoor MSSC. Our model features three key designs: (1) an Adaptive Serialized Transformer with learnable shifts that dynamically adjust receptive fields; (2) a Center-Relative Positional Encoding that captures spatial information richness; and (3) a Convolution-Modulated Layer Normalization that bridges heterogeneous representations between convolutional and transformer features. Extensive experiments on NYUv2 and Occ-ScanNet demonstrate that ASFormer achieves state-of-the-art performance. Code will be made publicly available.
Paperid: 3738,   Poster  
Authors: Junyang Ji, Qifan Liu, Wenming Yang, Zhihai He
Title: CausalLens: Sensitivity-Guided Multi-Head Causal Intervention for Hallucination Mitigation in Large Vision-Language Models
Abstract: Recent Large Vision-Language Models (LVLMs) have shown impressive capabilities in multimodal understanding and generation. Despite this progress, they remain prone to hallucination, where model outputs conflict with the visual input due to an over-reliance on textual priors. Existing inference-time mitigation approaches frequently depend on multi-pass or contrastive decoding, which increases latency and limits their applicability in real-time settings. To address this limitation, we propose CausalLens, a training-free and single-pass intervention that directly adjusts the decoder hidden states to strengthen visual grounding. By decomposing attention heads into visual, textual, and system prompt pathways, CausalLens identifies visually reliable heads using a sensitivity measure and selectively adjusts their mid-layer hidden-state contributions. A projection-aligned correction further stabilizes these adjusted states after multi-head fusion, ensuring that the enhanced visual information is preserved throughout decoding. Extensive experiments across multiple hallucination benchmarks and LVLM architectures demonstrate that CausalLens consistently improves visual fidelity while adding negligible computational overhead. The method requires no fine-tuning or architectural changes, making it well-suited for practical, latency-sensitive applications.
Paperid: 3739,   Poster  
Authors: Yuzheng Gao, Yuxing Long, Lei Kang, Yuchong Guo, Ziyan Yu, Shangqing Mao, Jiyao Zhang, Ruihai Wu, Dongjiang Li, Hui Shen, Hao Dong
Title: RealAppliance: Let High-fidelity Appliance Assets Controllable and Workable as Aligned Real Manuals
Abstract: Existing appliance assets suffer from poor rendering, incomplete mechanisms, and misalignment with manuals, leading to simulation-reality gaps that hinder appliance manipulation development. In this work, we introduce the RealAppliance dataset, comprising 100 high-fidelity appliances with complete physical and electronic mechanisms and program logic aligned with their manuals. Based on these assets, we propose the RealAppliance-Bench benchmark, which evaluates multimodal large language models and embodied manipulation planning models across key tasks in appliance manipulation planning: manual page retrieval, appliance part grounding, open-loop manipulation planning, and closed-loop planning adjustment. Our analysis of model performance on RealAppliance-Bench provides insights for advancing appliance manipulation research.
Paperid: 3740,   Poster  
Authors: YUNYI LIU, Yingshu Li, Tong Chen, Lingqiao Liu, Lei Wang, Luping Zhou
Title: SAT-RRG: LLM-Guided Self-Adaptive Training for Radiology Report Generation with Token-Level Push–Pull Optimization
Abstract: Radiology report generators often produce fluent text yet miss crucial details, leading to local semantic conflicts or flipped findings that require stronger penalties. Cross-entropy (CE) merely increases the probability of the ground-truth token y* without directly suppressing the model’s current wrong choice ŷ, and treats all positions uniformly, so corrections are not prioritized. We introduce a self-adaptive optimization framework that dynamically adjusts token-level gradients based on semantic discrepancy cues derived from a frozen LLM referee. The LLM itself is not the contribution; it merely provides weak supervision to trigger the adaptive learning process. Within this framework, (i) semantic conflicts between the predicted and reference reports are automatically localized and tagged with ``...`` (used only during training), and (ii) adaptive, stronger penalties are applied within these sparse but critical spans. Updates follow a push–pull scheme: error spans are pushed down, while non-error tokens are reinforced. The update strength is governed by two complementary signals: normalized entropy (for uncertainty calibration) and focal-style confidence (for handling over- and under-confident predictions). On MIMIC-CXR and IU-Xray, our framework consistently improves both language metrics (BLEU-4, ROUGE-L, CIDEr) and clinical metrics (RadGraph F1, CheXbert), and remains robust to noisy or imperfect error tags.
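The two signals named in the SAT-RRG abstract, normalized entropy and focal-style confidence, have common textbook forms that can be sketched as follows. The abstract does not give the exact formulas or how the signals are combined, so the definitions below (entropy divided by log of the vocabulary size, and a `(1 - p) ** gamma` factor with an assumed `gamma`) are illustrative assumptions rather than the paper's method.

```python
import numpy as np

def normalized_entropy(probs):
    """Per-token entropy divided by log(vocab size), giving a value in [0, 1]."""
    probs = np.clip(probs, 1e-12, 1.0)          # avoid log(0)
    entropy = -(probs * np.log(probs)).sum(axis=-1)
    return entropy / np.log(probs.shape[-1])

def focal_confidence(probs, targets, gamma=2.0):
    """Focal-style factor (1 - p_target) ** gamma; gamma is an assumed hyperparameter."""
    p_target = probs[np.arange(len(targets)), targets]
    return (1.0 - p_target) ** gamma

# Two toy token distributions over a 4-word vocabulary.
probs = np.array([[0.25, 0.25, 0.25, 0.25],    # maximally uncertain
                  [0.97, 0.01, 0.01, 0.01]])   # confident in token 0
targets = np.array([0, 0])
h = normalized_entropy(probs)
f = focal_confidence(probs, targets)
```

Here the uniform distribution receives the maximum normalized entropy and a larger focal factor than the confident one, which is the kind of per-token signal a weighting scheme could use to scale updates.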
Paperid: 3741,   Poster  
Authors: Yitian Chen, Shigeng Zhang, Xuan Liu, Mingming Lu, Kai Chen, Zhu Hongye, Xinning Chen
Title: Parameter-efficient Continual Learning for Enhancing Plasticity without Forgetting under Limited Model Capacity
Abstract: Avoiding catastrophic forgetting for previous tasks and maintaining model plasticity to support new tasks are two critical objectives of continual learning. However, existing methods usually neglect one of the two aspects and fail to support long task sequences with satisfactory performance, especially in resource-constrained scenarios in which the size of the model is limited. This work proposes GRAPA, a parameter-efficient continual learning method that balances the stability and plasticity of the model to handle long task sequences with diverse complexities. GRAPA enhances model plasticity without sacrificing stability through two novel designs. First, a gradient-guided parameter reuse strategy is proposed to make full use of frozen parameters while ensuring that no task interference is introduced. Second, a reinforcement-learning-based parameter allocation is designed to enable the model to adapt to the current task on top of reused parameters while preserving maximal model capacity for future tasks. Experiments on multiple task sequences composed of various datasets demonstrate that GRAPA lifts mean task accuracy by up to 7.67%, with up to 14.92% gains on subsequent complex tasks, reflecting GRAPA’s superior plasticity.
Paperid: 3742,   Poster  
Authors: Qiuhai Yan, Kang Chen, Zhengjie Lu, Tingting Wang, Faming Fang, Guixu Zhang
Title: Adaptive Anisotropic Gaussian Splatting for Multi-contrast MRI Arbitrary-Scale Super-Resolution with Anatomy Guidance
Abstract: Implicit neural representation (INR) based methods learn a continuous mapping from a low-resolution (LR) target magnetic resonance (MR) image and a high-resolution (HR) reference image to achieve arbitrary-scale super-resolution (SR). However, their inherent spectral bias favors learning low-frequency (LF) components, often failing to capture the sharp transitions at anatomical boundaries and resulting in the loss of high-frequency (HF) details. Inspired by 3D Gaussian splatting, we propose GaussM²ASR (Gaussian Multi-contrast MRI Arbitrary-scale Super-Resolution), which converts the challenging task of HF anatomical reconstruction into a smoother parameter optimization problem by learning the parameters of anisotropic 2D Gaussian kernels. To handle inter-contrast discrepancies, we introduce an anatomy-guided pipeline comprising three core modules: a Structure Prior Modulation Fusion (SPMF) module for feature enhancement; an Anatomy-Guided Dual-Domain Cross Attention (AG-DDCA) module for joint spatial-frequency modeling; and an Anatomy-Guided Gaussian Parametrizer (AGGP) that leverages gradient-based sparse attention to concentrate Gaussian centers on critical anatomical structures. Extensive experiments on multiple datasets demonstrate that GaussM²ASR surpasses state-of-the-art methods in recovering fine anatomical details. The source code will be made publicly available upon acceptance.
Paperid: 3743,   Poster  
Authors: Maitreya Patel, Jingtao Li, Weiming Zhuang, Yezhou Yang, Lingjuan Lyu
Title: VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
Abstract: We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32–256 tokens, achieving a state-of-the-art efficiency and performance trade-off. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator with out-of-the-box support for arbitrary resolutions while requiring significantly fewer compute resources. Notably, VibeToken-Gen synthesizes 1024×1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed-resolution AR models such as LlamaGen, whose inference FLOPs grow quadratically with resolution (≈11T FLOPs at 1024×1024), VibeToken-Gen maintains a constant 179G FLOPs (63.4× more efficient) independent of resolution. We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.
Paperid: 3744,   Poster  
Authors: Tengfei Ma, Weiran Pan, Wei Wei
Title: TANGO: Text-Anchored Guided Optimization for Robust Fine-tuning Vision-Language Models under Label Noise
Abstract: Fine-tuning large-scale Vision-Language Models (VLMs) is crucial for specialized tasks, but their performance is often undermined by the label noise prevalent in real-world datasets. Traditional approaches to learning with noisy labels typically rely on a self-referential loop, using a model's own predictions to correct errors. While recent VLM-specific methods have begun to leverage cross-modal information to aid in noise detection, we explore an alternative direction: using the text modality not just to identify noise, but to establish a source of ground truth that is fully independent of the training data's potentially corrupt labels. To this end, we propose Text-ANchored Guided Optimization (TANGO), a framework centered on "semantic anchors": a set of pure, immutable reference points generated from diverse text descriptions. Building upon these anchors, our approach reframes two key aspects of learning with noisy labels. First, we replace the conventional linear classifier with a parameter-free Text-Anchored Classifier, making predictions a direct, weighted consensus of the clean anchors. Second, we introduce an Anchor-Guided Refinement mechanism that validates each sample's given label against this external ground truth, providing a more robust signal for sample selection and label correction. Extensive experiments demonstrate that this approach achieves competitive and often state-of-the-art performance. Our code will be publicly available.
Paperid: 3745,   Poster  
Authors: Xinxing Yu, Ajian Liu, Sunyuan Qiang, Hui Ma, Liying Yang, Yuzhong Wang, Zhi Rao, Yanyan Liang
Title: PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning
Abstract: Scene-level point cloud self-supervised learning (PC-SSL) has demonstrated potential in enhancing the generalization capability of 3D vision models. Despite the advances achieved in the field through existing methods, the sample-independent modelling paradigm still poses significant limitations in terms of maintaining consistent semantic representations across different scenes. This challenge hinders the construction of a unified and transferable semantic space. To address this issue, we propose a PC-SSL framework based on cross-sample semantic propagation (CSP), in which samples within a batch are serialized into a continuous input and processed by a state-space model to enable semantic state propagation. This mechanism explicitly models the dynamic dependencies across samples in the state space, allowing the network to establish cross-sample semantic consistency in the latent space, and thereby achieve global semantic alignment. Since serialization-based pretraining requires batch-level input organization, we further introduce an asymmetric semantic preservation distillation (SPD) during finetuning to achieve structural alignment of semantic transfer and eliminate inconsistencies caused by batch dependency. The proposed SPD ensures stable transfer of pretrained semantics through a heterogeneous input mechanism and a semantic feature alignment constraint. This enables the model to maintain structured semantic consistency and robustness under single-scene testing conditions. Extensive experiments on multiple benchmark datasets demonstrate that our method consistently outperforms state-of-the-art methods in both performance and semantic consistency.
Paperid: 3746,   Poster  
Authors: Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan
Title: Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
Abstract: The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, such as household objects, animals, and human subjects, limiting their applicability to complex, in-the-wild video scenarios. Yet many applications, such as furniture assembly and cooking, require step-by-step fine-grained spatio-temporal understanding of the video, which is not sufficiently evaluated in current benchmarks. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in leveraging temporal information from videos, tracking parts, and understanding spatial interactions such as physical contact.
Paperid: 3747,   Poster  
Authors: Sriram Narayanan, Mani Ramanagopal, Srinivasa G. Narasimhan
Title: Dual Band Video Thermography: Separating Time-Varying Reflection and Emission Near Ambient Conditions
Abstract: Longwave infrared radiation captured by a thermal camera includes (a) emission from an object governed by its temperature and emissivity, and (b) reflected radiation from the surrounding environment. Separating these components is a long-standing challenge in thermography. Even when using multiple bands, the problem is under-determined without priors on emissivity. This difficulty is amplified in near ambient conditions, where emitted and reflected signals are of comparable magnitude. We present a dual-band video thermography framework that reduces this ambiguity by combining two complementary ideas at a per-pixel level: (i) spectral cues (ratio of emissivity between bands is unknown but fixed), and (ii) temporal cues (object radiation changes smoothly while background radiation changes rapidly). We derive an image formation model and an algorithm to jointly estimate the object's emissivity at each band, and the time-varying object and background temperatures. Experiments with calibrated and uncalibrated emissivities in everyday scenes (e.g., coffee pot heating up, palm print on mirrors dissipating, reflections of moving people), demonstrate robust separation and recovery of temperature fields. We will release code and data upon acceptance.
Paperid: 3748,   Poster  
Authors: Jinzhao Li, Yinuo Chen, Dongxu Piao, Panwang Pan, Yifan Yu, Dong Wang, Honglei Yan, Liang Yue, Shaofei Wang, Yixin Chen, Siyuan Huang, Miao Liu
Title: EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy
Abstract: Humans constantly reason about 3D proximity, the relations between their body and surrounding objects, to guide perception and action in daily life. Whether multimodal large language models (MLLMs) can perform such embodied 3D reasoning remains unclear. To this end, we introduce EgoProx, a benchmark for egocentric 3D proximity reasoning. We organize our tasks along a cognitive chain, covering intention, exploration, exploitation, and chain-of-actions reasoning. We also design an agent-based data engine that produces diverse and consistent QA pairs at scale. We benchmark prevailing MLLMs on EgoProx and conduct additional analyses with dataset-specific and task-specific instruction tuning. We observe large cross-domain gains, indicating that current MLLMs contain some spatial knowledge; however, they still struggle to effectively leverage it for spatial reasoning VQA.
Paperid: 3749,   Poster  
Authors: Yang Xiao, Weiming Liu, Jun Dan, Tengyue Xu, Fan Wang, Hua Yu, Junhao Dong, Jiao Liu, Shunjie Dong, Lianyong Qi
Title: Bypassing the Transport Plan: Dynamic Reweighting for Out-of-Distribution Detection with Optimal Transport
Abstract: Semi-supervised learning (SSL) has achieved remarkable progress by leveraging both limited labeled data and abundant unlabeled data. However, unlabeled datasets often contain out-of-distribution (OOD) samples from unknown classes, which can lead to performance degradation in open-set SSL scenarios. Current OOD detection methods are constrained by the absence of labeled OOD samples. Although optimal transport (OT) has proven effective at providing pseudo OOD scores for supervised learning, it still faces two main challenges, i.e., the unavoidable problem of finding the optimal transport plan and the unreliable OOD score caused by dense solutions. To overcome these limitations, we propose a novel open-set OOD detection model named DREW, which leverages a Dynamic REWeighting approach for OT-based OOD detection. Specifically, we start by formulating OOD detection as a semi-unbalanced optimal transport (SemiUOT) problem. The proposed DREW model can dynamically transform SemiUOT into the classical OT formula and then directly obtain the pseudo OOD score from the new source distribution weights. Contrary to existing OT-based methods, DREW provides theoretically grounded and more accurate pseudo OOD scores, while avoiding the direct computation of the transport plan. Empirical results demonstrate the superiority of DREW in terms of both accuracy and efficiency. Extensive analytical experiments are conducted to elucidate the properties of each component.
Paperid: 3750,   Poster  
Authors: Yunpeng Fang, Yimu Sun, Jingxing Guo, Huisi Wu, Jing Qin
Title: Semi-supervised Echocardiography Video Segmentation via Anchor Semantic Awareness and Continuous Pseudo-label Reforging
Abstract: Automatic and accurate echocardiography video segmentation is essential for efficient and repeatable measurements of key clinical functional indicators for diagnosis of cardiovascular diseases. However, it is an extremely challenging task to obtain high-quality segmentation results throughout the cardiac cycle owing to (1) the inherent speckle noise in echocardiography videos, (2) the complex dynamic motions of cardiac structures, and (3) the scarcity of annotated data. To comprehensively address these challenges, we propose a novel semi-supervised model, which can achieve accurate and real-time echocardiography video segmentation with very limited annotations. The proposed model has two core innovations. First, we propose a new anchor semantic awareness (ASA) module composed of an anchor recalibration (ARC) scheme and a temporal semantic fusion (TSF) algorithm. The former refines ambiguous feature regions by aligning them with learnable anchors, and the latter propagates structural semantic prototypes across frames to enhance boundary delineation and temporal consistency. Second, based on ASA, we develop a continuous pseudo-label reforging (CPR) module that gradually integrates high-quality pseudo-labels through lightweight channel-wise attention and reforges pseudo-labels to provide more robust supervision. We extensively evaluated our method on two benchmark datasets, CAMUS and EchoNet-Dynamic; experimental results show that our model outperforms SOTA methods in segmentation accuracy while maintaining real-time performance. Code will be publicly available upon publication.
Paperid: 3751,   Poster  
Authors: Qing Zhang, Xuesong li, Jing Zhang
Title: Probing and Bridging Geometry–Interaction Cues for Affordance Reasoning in Vision Foundation Models
Abstract: What does it mean for a visual system to truly understand affordance? We argue that this understanding hinges on two complementary capacities: geometric perception, which identifies the structural parts of objects that enable interaction, and interaction perception, which models how an agent's actions engage with those parts. To test this hypothesis, we conduct a systematic probing of Visual Foundation Models (VFMs). We find that models like DINO inherently encode part-level geometric structures, while generative models like Flux contain rich, verb-conditioned spatial attention maps that serve as implicit interaction priors. Crucially, we demonstrate that these two dimensions are not merely correlated but are composable elements of affordance. By simply fusing DINO's geometric prototypes with Flux's interaction maps in a training-free, zero-shot manner, we achieve affordance estimation competitive with weakly-supervised methods. This final fusion experiment confirms that geometric and interaction perception are the fundamental building blocks of affordance understanding in VFMs, providing a mechanistic account of how perception grounds action.
Paperid: 3752,   Poster  
Authors: Jianting CHEN, Dianzhi Yu, Irwin King
Title: Smart Replay: Adaptive Scheduling of Memory Rehearsal for Computational Resource-Aware Incremental Learning
Abstract: Incremental learning (IL) arises from the need to continuously update models under limited data and computational resources. Most existing IL studies focus on data-scarce settings. They often develop complex methods that rely on heavy computation, while overlooking the computational resource constraints common in real-world scenarios. This motivates us to formalize the problem of Computational Resource-Aware Incremental Learning, which explicitly considers the computational budget during model training. To tackle this problem, we propose Smart Replay, an efficient memory rehearsal algorithm that adaptively allocates resources by scheduling the replay ratio across mini-batches. We cast replay-ratio optimization into an optimal control formulation that jointly minimizes new-task and memory losses. We further propose a heuristic Q-function to guide ratio adjustments, adaptively balancing short-term efficiency and long-term stability. Finally, we develop a practical algorithm that periodically updates the replay ratio during training. Experiments on multiple benchmarks validate that Smart Replay consistently outperforms fixed-replay baselines, achieving higher accuracy and lower forgetting under the same computational budget.
Paperid: 3753,   Poster  
Authors: Li Xu, YingFu Zhang, Kepeng Xu, Gang He, Yunsong Li
Title: Towards Unified Human Perception and Machine Understanding: Token Flow Guided Compression Framework
Abstract: With the rapid rise of Large Vision Language Models (LVLMs) for image understanding, the objective of image compression is gradually shifting from human visual perception to machine-oriented semantic understanding. However, conventional learned compression techniques are optimized for pixel-level fidelity and typically operate at fixed or rigid bitrate points, which is misaligned with the need for semantic consistency and flexible bitrate control. This gap becomes critical in ultra-low-bitrate regimes, where latent representations often ignore semantic relevance and struggle to disentangle meaningful content from redundant visual details as the bitrate varies. To address these challenges, we develop a token-based flexible compression framework, Token Flow Guided Compression (TFGC), which unifies human- and machine-oriented objectives. TFGC supports variable bitrate control in ultra-low-bitrate regimes and enables LVLMs to directly process compressed tokens without image reconstruction. Specifically, we explore the token flow phenomenon in 1D token sequences and exploit it to design token flow propagation, which predicts missing tokens by propagating contextual information from unmasked tokens. Moreover, token semantic guidance aligns compressed representations with the LVLM semantic space, while a progressive semantic alignment training strategy further bridges the gap between perceptual reconstruction and semantic reasoning. Experiments show that our framework achieves state-of-the-art LVLM understanding at comparable bitrates while maintaining satisfactory perceptual quality.
Paperid: 3754,   Poster  
Authors: Yuzhuang Yang, Xiaolin Tian, Qigong Sun
Title: Dynamic Label Noise Suppression with Optimal Teacher Pool for Facial Expression Recognition
Abstract: Due to the inherent ambiguity of facial expressions and subjectivity in dataset labeling, learning with noisy labels remains a critical challenge in facial expression recognition (FER). The supervisory mechanism of teacher-student networks offers a promising approach for noisy-labeled FER. However, this approach is prone to noise accumulation and gradual coupling between the teacher and student parameters during training. We propose an Optimal Teacher Pool-driven dynamic label Noise Suppression framework for facial expression recognition (OTP-NS). Specifically, we construct an optimal teacher pool architecture that dynamically maintains multiple best teacher models while fusing their predictions, thereby mitigating noise accumulation and coupling of teacher-student parameters via update mechanisms. Furthermore, we develop two sample-level noise suppression components: (1) Similarity-Aware Label Smoothing (SALS), which, unlike the static smoothing strength of traditional label smoothing, automatically modulates the smoothing strength for the teacher model based on prediction-label similarity, achieving fine-grained noise suppression; and (2) Confidence-Weighted Logits (CWL), which adaptively adjusts the classification loss of the student model based on sample-to-centroid confidence metrics, alleviating the detrimental effects of noisy samples on model training. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art approaches across various noise levels, validating the effectiveness of our proposed framework in learning robust representations from noisy data.
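Illustrative sketch (not from the paper): the abstract says SALS modulates the label-smoothing strength by prediction-label similarity but gives no formula. The snippet below assumes one plausible rule, where similarity is the probability the teacher assigns to the given label and the smoothing strength grows linearly as that similarity drops. The function name, the `max_eps` parameter, and the linear rule are all hypothetical.

```python
def similarity_aware_smoothing(probs, label, max_eps=0.3):
    """Build a smoothed target whose strength depends on prediction-label similarity.

    Assumption (not from the paper): similarity = probs[label], and
    eps = max_eps * (1 - similarity), so likely-noisy labels are smoothed more.
    """
    num_classes = len(probs)
    similarity = probs[label]
    eps = max_eps * (1.0 - similarity)
    # Spread eps uniformly over the other classes; keep 1 - eps on the label.
    target = [eps / (num_classes - 1)] * num_classes
    target[label] = 1.0 - eps
    return target


# A confident, label-consistent prediction leaves the one-hot target untouched;
# a prediction that contradicts the label receives the maximum smoothing.
agree = similarity_aware_smoothing([0.0, 1.0, 0.0], label=1)
disagree = similarity_aware_smoothing([1.0, 0.0, 0.0], label=1)
```

Under this assumed rule, static label smoothing is the special case where the similarity term is ignored and `eps` is constant.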
Paperid: 3755,   Poster  
Authors: Huakeng Ding, Yaowen Chen, Kun Zhou, Hongzhi Wu
Title: Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance
Abstract: We present a differentiable framework to adaptively compute 4D illumination conditions with respect to an object, for efficient, high-quality simultaneous acquisition of its shape and reflectance, with a unified spatial-angular structured light and a single camera. Using a simple histogram-based pixel-level probability model for depth and reflectance, we differentiably link the next illumination condition(s) with a loss that encourages the reduction in depth uncertainty. As new structured illumination is cast, corresponding image measurements are used to update the uncertainty at each pixel. Finally, a fine-tuning-based approach reconstructs the depth map and reflectance parameter maps, by minimizing the differences between all physical measurements and their simulated counterparts. The effectiveness of our framework is demonstrated on physical objects with wide variations in shape and appearance. Our depth results compare favorably with state-of-the-art techniques, while our reflectance results are comparable when validated against photographs.
Paperid: 3756,   Poster  
Authors: Lizhi Xiong, Jun Li, Ziqiang Li, Weiwei Jiang, Zhangjie Fu
Title: No Way To Steal My Face: Proactive Defense Against Identity-Preserving Personalized Generation
Abstract: Recent advances in diffusion models have enabled high-fidelity, identity-preserving image generation for personalized applications such as digital avatars and virtual try-on systems. However, their reliance on sensitive facial reference images raises growing privacy concerns. Existing defense mechanisms are primarily designed for training-based personalization and struggle to generalize to emerging training-free approaches, due to fundamental differences in their identity integration paradigms. To bridge this gap, we propose IDGuardian, the first generalizable and model-agnostic identity protection framework capable of defending against both training-based and training-free personalization methods. IDGuardian abstracts the personalization process into two critical stages: identity extraction and identity injection. It then introduces crafted adversarial perturbations to simultaneously disrupt both stages. Specifically, it degrades the identity features extracted by external encoders and establishes an adversarial conceptual bridge that misdirects the generative trajectory away from the target identity. Extensive experiments show that IDGuardian effectively protects identity across various personalization pipelines and model architectures, while remaining robust to post-processing, adaptive attacks, and cross-dataset generalization.
Paperid: 3757,   Poster  
Authors: Yajun Liu
Title: Divide, Conquer, and Aggregate: Asymmetric Experts for Class-Imbalanced Semi-Supervised Medical Image Segmentation
Abstract: Semi-supervised medical image segmentation (SSMIS) aims to alleviate annotation scarcity, but general methods, often developed on few-class datasets, suffer performance degradation in class-imbalanced multi-organ scenarios. Existing class-imbalanced SSMIS methods also struggle, as their single-decoder architecture is forced to handle vastly different scales with shared parameters. This process is easily dominated by majority classes, fundamentally limiting tail-class segmentation capability. To address this, we propose a "Divide, Conquer, and Aggregate" (DCA) framework, featuring a unified encoder, three expert decoders, and an aggregation decoder. First, we Divide by applying a Logarithmic Gap Analysis to statically partition foreground classes into stable Head, Medium, and Tail sets, which aligns with anatomical priors. Then, we Conquer by training the three architecturally asymmetric experts independently using a label-split strategy. This fundamentally alleviates the burden on a single decoder. The experts' predictions on unlabeled data are fused via logit stitching to generate high-quality pseudo-labels. Finally, we Aggregate using an aggregation decoder with a Dynamic Feature Aggregation Module (DFAM), which dynamically fuses priors from all three experts to achieve unbiased predictions and fully leverage unlabeled data. Experiments demonstrate that our DCA framework significantly outperforms state-of-the-art general and class-imbalanced SSMIS methods.
Paperid: 3758,   Poster  
Authors: Zirui Xu, Xianhang Chu, Li jiahao, Xu Yang, Cheng Deng
Title: Cross-Architecture Adaptation: Cloud-Edge Continual Test-Time Adaptation with Dynamic Sampling and Heterogeneous Distillation
Abstract: Cloud-Edge Continual Test-Time Adaptation (CTTA), in which edge devices process real-time data and the cloud offers strong computing power, is a critical paradigm for models that adapt to dynamic data distributions in real-world scenarios. However, most existing frameworks assume architectural homogeneity between cloud and edge CNNs, which poses a significant performance bottleneck, particularly given the rapid emergence of Transformer-based models. Current methods fail to bridge this architectural gap, resulting in significant deficiencies in adaptation accuracy and practical applicability. To address this, we propose a novel Cross-Architecture Adaptation (CAA) framework for heterogeneous Cloud-Edge CTTA that enables effective adaptation to shifting data distributions. Specifically, CAA deploys a large Transformer-based teacher model on the cloud for robust feature extraction and prediction, and a lightweight CNN-based student model on edge devices to fit resource constraints. Based on such cloud-edge models, a synergistic edge-to-cloud communication strategy, Multi-criteria Dynamic Cross-domain Sampling, ensures only the most informative, class-balanced samples are uploaded, minimizing communication costs while guaranteeing stable, unbiased adaptation. Moreover, a Multi-level Adaptive Heterogeneous Distillation module is proposed to facilitate effective knowledge transfer across the architecturally disparate models and improve the learning efficiency of the edge model. Experiments on several benchmarks demonstrate that CAA achieves state-of-the-art performance with low edge resource consumption and minimal edge-to-cloud communication overhead.
Paperid: 3759,   Poster  
Authors: Kepan Nan, Wangbo Zhao, Penghao Zhou, Jun Li, Zhenheng Yang, Jian Yang, Ying Tai
Title: ARCache: Mitigating Error Accumulation for Caching-based Acceleration in Autoregressive Video Diffusion Models
Abstract: Caching-based acceleration methods have recently driven significant progress in efficient video generation with diffusion models. However, we identify a critical limitation when directly applying these acceleration techniques to autoregressive video diffusion models, which generate long videos by sequentially synthesizing segments conditioned on historical context. In such settings, any approximation errors introduced by acceleration tend to propagate and accumulate over time, resulting in progressive degradation of video quality. To address this challenge, we propose ARCache, the first training-free caching-based acceleration framework specifically designed for autoregressive video diffusion models. ARCache improves both the timing and quality of caching through two key components. First, History-Guided Cache (HGC) leverages historical information to adaptively schedule caching for each segment, enabling more accurate and efficient cache utilization. Second, Enhanced Residual Correction (ERC) adaptively approximates model residuals and refines the residual trajectory for subsequent segments, effectively mitigating error accumulation while simultaneously reducing computational overhead. Extensive experiments on Framepack-F1, SkyReels-V2, and the autoregressive world model Matrix-Game demonstrate that ARCache achieves state-of-the-art acceleration and visual fidelity.
Paperid: 3760,   Poster  
Authors: Komal Komal, Mukul Gupta, Saumya Singh, SANTOSH VIPPARTHI, Chakradhar Reddy Chandupatla, Subrahmanyam Murala
Title: QuCNet: Quantum Deep Learning Driven Multi-Circuit Network for Remote Sensing Image Classification
Abstract: We present QuCNet, a hybrid quantum-classical network for efficient remote sensing image classification. QuCNet integrates a lightweight convolutional encoder with sixteen parallel four-qubit trainable quantum circuits (TQCs) trained under a Hybrid Cyclic Weight-Sharing (HCWS) strategy. This design enhances expressibility while keeping the parameter count extremely low (~87K, 85× smaller than prior hybrid models). Guided by expressibility analysis, the proposed quantum configuration maintains stable gradients and mitigates barren plateaus on near-term quantum devices. Extensive experiments across seven remote sensing benchmarks (AID, AIDER, UC Merced, NWPU-45, EuroSAT, IIITDMJ Smoke, and USTC SmokeRS) demonstrate that QuCNet consistently improves accuracy and generalization over classical CNN baselines. Furthermore, hardware-only inference on IBM Quantum processors (ibm_torino, ibm_fez) confirms robustness under realistic noise and connectivity constraints. These results suggest a practical path toward scalable, hardware-feasible quantum deep learning for geospatial applications.
Paperid: 3761,   Poster  
Authors: XUEGE HOU, Wenshuo Li, Yali Li, Han Shu, Yuan Wang, Xinghao Chen, Shengjin Wang
Title: VES-RFT: Rewarding Visual Evidence Sensitivity to Mitigate Hallucinations in Large Vision–Language Models
Abstract: Vision-Language Models (VLMs) often over-rely on linguistic priors even when images are provided, leading to object hallucinations. We revisit object-wise hallucination from the perspective of how visual evidence shapes the model’s uncertainty. For each input, we measure decision uncertainty with and without the image, and define a Visual Evidence Sensitivity (VES) signal as the image-attributable change in entropy. Building on this signal, we introduce Visual Evidence Sensitivity Reinforcement Fine-Tuning (VES-RFT), a training-time reinforcement fine-tuning method that explicitly rewards reliance on correct visual evidence. We pair this continuous, annotation-free signal with a verifiable reward that enforces factual object correctness by automatically checking generated object mentions against the image, yielding a computable objective without human annotations. We optimize the dual objective using critic-free GRPO with KL regularization, requiring only parallel image and no-image passes during training while preserving single-pass inference. Across multiple VLM families and benchmarks, VES-RFT consistently suppresses hallucinations and improves robustness under ambiguity without degrading general language ability. Specifically, on LLaVA-7B, VES-RFT reduces CHAIR_S and CHAIR_I on MS-COCO by 12.8 and 1.8, respectively, and increases POPE accuracy by 4.92%. Extensive experiments indicate that turning uncertainty into a learnable reward, paired with verifiable correctness signals, provides a scalable mechanism for training-time hallucination mitigation and stronger visual grounding.
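Illustrative sketch (not from the paper): the abstract defines the VES signal as the image-attributable change in entropy between a pass with the image and a pass without it. One way to read that definition is as a difference of Shannon entropies over the model's answer distribution. The sign convention (entropy without the image minus entropy with it) and the function names are assumptions made for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def visual_evidence_sensitivity(p_with_image, p_without_image):
    """Entropy drop attributable to the image.

    Assumed convention: positive values mean the image made the model
    more certain; zero means the image did not change its uncertainty.
    """
    return entropy(p_without_image) - entropy(p_with_image)


# Without the image the model is undecided between two answers; with the
# image it commits to one, so the signal equals the full ln(2) nats.
ves = visual_evidence_sensitivity([1.0, 0.0], [0.5, 0.5])
```

A value near zero under this reading would suggest the answer is driven by language priors rather than by the image.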
Paperid: 3762,   Poster  
Authors: Junjian Li, Hulin Kuang, Jin Liu, Hailin Yue, Mengshen He, Jianxin Wang
Title: Universal-to-Specific: Dynamic Knowledge-Guided Multiple Instance Learning for Few-Shot Whole Slide Image Classification
Abstract: Multiple Instance Learning (MIL) has emerged as the dominant paradigm for the analysis of gigapixel-scale Whole Slide Images (WSIs). However, recent methods leveraging guidance from Vision-Language Models often rely on static and universal pathological descriptions. This one-size-fits-all strategy fails to account for the vast morphological heterogeneity within individual WSIs, as its uniform guidance is not tailored to slide-specific visual evidence. To address this, we propose DyKo, a Dynamic Knowledge-guided MIL framework that adapts universal knowledge to slide-specific evidence for few-shot WSI classification. The core of DyKo is the WSI-Adaptive Knowledge Instantiation module (WAKI). WAKI begins by identifying key visual prototypes within a specific WSI's histology. These slide-specific prototypes then serve as queries to retrieve relevant concepts from a pathology knowledge base. This retrieved knowledge is then used to synthesize unique, knowledge-instantiated features for each instance, effectively instantiating tailored guidance at the patch level. To ensure fidelity and prevent semantic drift, we introduce a Structural Consistency loss that enforces alignment between knowledge-instantiated and visual features. Comprehensive experiments on four public real-world cancer datasets demonstrate that DyKo achieves superior performance over state-of-the-art methods in few-shot pathology diagnosis. Code will be made publicly available upon paper acceptance.
Paperid: 3763,   Poster  
Authors: Zhuwei Wen, Zimin Xia, He Chen, Linwei Yue, Xianwei Zheng
Title: Regulating Rather than Constraining: Adaptive Guidance for Complex Spectral Reconstruction in Pansharpening
Abstract: In remote sensing pansharpening, spectrally mixed regions, where the spectral interactions among adjacent land covers lead to highly inconsistent reconstruction patterns, remain the most challenging areas. Due to the complex spatial distribution and heterogeneous spectral characteristics of ground objects, existing methods relying on rigid architectures and physical constraints struggle to learn generalized reconstruction patterns from limited spectral mixing samples, resulting in unstable generalization. To address this limitation, we propose an architecture-agnostic regularization-guided mechanism that adaptively directs the model to focus on learning reliable reconstruction priors for challenging regions. Specifically, we introduce a simple data-level transformation, MixShuffle, which performs random convex combinations across spatial positions and spectral channels to generate training data with richer spatial structures and stronger spectral mixing. In parallel, we propose a hierarchical attention weighting mechanism, a loss-level gradient reallocation strategy at the sample, channel, and pixel levels, enabling the model to emphasize structurally complex regions. Extensive experiments on multiple benchmark datasets (WV3, GF2, QB) and across various network architectures demonstrate the strong generality and effectiveness of the proposed strategies, achieving state-of-the-art performance when integrated into DANet.
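Illustrative sketch (not from the paper): the abstract describes MixShuffle only as random convex combinations across spatial positions and spectral channels. The snippet below is one hypothetical reading, in which each pixel is blended with a randomly chosen other pixel and each channel with a randomly chosen other channel. The function name, the two mixing weights, and the permutation-based pairing are assumptions; the actual transformation may differ.

```python
import random


def mix_shuffle(pixels, lam_spatial, lam_spectral, rng):
    """Blend an image with spatially and spectrally shuffled copies of itself.

    pixels: list of pixels, each a list of per-channel values.
    lam_*:  convex weights in [0, 1]; 1.0 keeps the original unchanged.
    Hypothetical reading of MixShuffle, not the authors' implementation.
    """
    n, c = len(pixels), len(pixels[0])
    pos_perm = list(range(n))
    rng.shuffle(pos_perm)  # random partner position for every pixel
    ch_perm = list(range(c))
    rng.shuffle(ch_perm)   # random partner channel for every channel

    # Convex combination across spatial positions.
    spatial = [
        [lam_spatial * pixels[i][k] + (1.0 - lam_spatial) * pixels[pos_perm[i]][k]
         for k in range(c)]
        for i in range(n)
    ]
    # Convex combination across spectral channels.
    return [
        [lam_spectral * px[k] + (1.0 - lam_spectral) * px[ch_perm[k]]
         for k in range(c)]
        for px in spatial
    ]


rng = random.Random(0)
image = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.6], [0.3, 0.3, 0.3], [0.8, 0.7, 0.0]]
mixed = mix_shuffle(image, lam_spatial=0.6, lam_spectral=0.7, rng=rng)
```

Because every output value is a convex combination of input values, the mixed sample stays inside the original value range, which is one reason such transformations can be applied without re-normalizing the data.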
Paperid: 3764,   Poster  
Authors: Guangyan Chen, Qi Shao, Te Cui, Zichen Zhou, Weixin Mao, Luojie Yang, Meiling Wang, Yi Yang, Hua Chen, Yufeng Yue
Title: Learning a Unified Latent Action Space from Videos with Action-centric Cycle Consistency
Abstract: Video data provides a rich source beyond expensive action-labeled data for advancing robot learning. Recent approaches have demonstrated promising potential in leveraging video data by learning latent actions for policy training. The latent action tokenizer encodes latent actions between successive video frames, and the tokenizer is trained to reconstruct future frames using current frames and the encoded latent actions. However, the unique pairing of successive frames permits future frame reconstruction with little understanding of transition dynamics, hindering the learning of semantically consistent latent actions. Moreover, the tokenizer typically allocates distinct latent action subsets to individual embodiments to accommodate heterogeneous morphologies, constraining knowledge transfer. To overcome such limitations, we propose the action-centric cycle consistency, aiming to establish a unified latent action space. Our method samples latent actions from the latent action space and decodes them with video frames to generate diverse subsequent frames, then enforces cycle consistency by predicting the sampled actions from both original and generated frames. Our concise method creates a challenging task that learns corresponding latent actions from current frames and diverse generated future frames, compelling the tokenizer to develop semantically consistent action representations. Additionally, sampled latent actions can be applied to video frames from distinct embodiments, facilitating the alignment of latent actions across embodiments. Experiments demonstrate that our approach achieves a 20.1% improvement over OpenVLA on the LIBERO benchmark and increases the average length from 3.27 to 3.93 on the CALVIN benchmark. In real-world experiments, our method maintains strong performance with a 44% improvement.
Paperid: 3765,   Poster  
Authors: KAI LI, Jiafeng Li, Lianghua He, Ying Wen
Title: DGS: Dual Gradient and Semantic-Shift Guided Low-Rank Adaptation for Class Incremental Learning
Abstract: In Class-Incremental Learning (CIL), parameter-efficient fine-tuning applied to Pre-trained Models (PTMs) remains vulnerable to catastrophic forgetting as the models adapt to new tasks. The prevalent strategy to mitigate catastrophic forgetting is to constrain gradients within the orthogonal subspaces of past tasks, but such rigid gradient constraints hinder plasticity. In this paper, we propose a novel CIL framework, Dual Gradient and Semantic-Shift Guided Low-Rank Adaptation (DGS), that balances stability and plasticity via gradient fusion and maintains representation consistency through classifier and patch-token alignment. Specifically, our method introduces the Dual Gradient update strategy that first derives a base subspace projection from the PTMs and then fuses task-specific LoRA gradients with their aligned counterparts through interpolated combination. This design promotes knowledge retention without sacrificing task-specific expressiveness. Furthermore, we employ a Classifier Alignment mechanism with Semantic shift estimation, which is based on the calibrated prototype statistics, to mitigate classifier shift, and introduce a novel Patch-level Alignment loss to preserve feature consistency across tasks. Extensive experiments on six standard benchmarks demonstrate that our approach consistently outperforms existing CIL methods, highlighting its effectiveness and generalization capability in continual learning scenarios.
Paperid: 3766,   Poster  
Authors: Wei Li, Jizhihui Liu, Li Yixing, Junwen Tong, Rui Shao, Liqiang Nie
Title: ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
Abstract: Current Vision-Language-Action (VLA) models primarily focus on mapping 2D observations to actions but exhibit notable limitations in spatiotemporal perception and reasoning: 1) spatial representations often rely on additional sensors, introducing substantial computational overhead; 2) visual reasoning is typically limited to future-frame prediction, lacking alignment with the instruction-grounded scene and thus compromising spatiotemporal consistency. To address these challenges, we propose ConsisVLA-4D, a unified and efficient framework that enhances spatiotemporal consistency in 3D-Perception and 4D-Reasoning. Specifically, we design: 1) CV-Aligner, which ensures Cross-View object semantic consistency via filtering instruction-relevant regions and aligning object identities across multiple viewpoints; 2) CO-Fuser, which guarantees Cross-Object spatial geometric consistency by eliminating spatial relation ambiguities between objects across views using compact latent representations. Building upon these, we introduce 3) CS-Thinker to achieve Cross-Scene spatiotemporal consistency as actions unfold. It learns implicit knowledge of local dynamics from object-semantic tokens of CV-Aligner and global depth from geometric tokens of CO-Fuser, thereby enhancing efficient visual reasoning under scene variations. Extensive experiments demonstrate that, benefiting from its efficient spatiotemporal consistency design, ConsisVLA-4D achieves 21.6% and 41.5% performance improvements, along with 2.3× and 2.4× inference speedups compared to OpenVLA on the LIBERO benchmark and real-world platforms, respectively.
Paperid: 3767,   Poster  
Authors: Haoming Yang, Ke Ma, ligonf zhang, Xiaojun Jia, Yingfei Sun, Qianqian Xu, Qingming Huang
Title: Hidden Dangers of Compositional Generation: Diagnosing Semantic Safety Failures in Text-to-Image Models
Abstract: Text-to-Image (T2I) models have achieved significant progress in generating high-quality images, with compositional visual generation emerging as an important capability that enables them to synthesize coherent, natural scenes from multiple discrete concepts. However, this powerful compositionality, while enhancing creativity, also introduces new safety risks: combinations of different concepts can produce high-risk images without explicitly expressing harmful content. Motivated by this, we propose CoRA (Composable Reassembly Attack): an attack method that preserves the original semantics while bypassing safety filters. Unlike traditional compositional generation approaches that rely on modifying the sampling process, CoRA operates solely in the text space under a black-box setting, iteratively rewriting and guiding prompts through interactive steps. Specifically, CoRA decomposes a potentially harmful intent into a set of fine-grained, superficially benign but semantically complete visual elements, and then uses iterative selection and reassembly to guide the target T2I model to recombine these elements without triggering safety checks, thereby recovering the original malicious semantics. Experimental results show that CoRA significantly improves attack success rates (ASR) across several mainstream open-source and commercial T2I models, producing higher-risk outputs while maintaining semantic consistency. Warning: This paper contains model-generated content that may be considered offensive or disturbing.
Paperid: 3768,   Poster  
Authors: Hao Zheng, Hu Wang, Tiantian Zheng, Prajjwal Bhattarai, Tuka Alhanai
Title: Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning
Abstract: Dual-hand action segmentation, densely predicting actions for both hands from untrimmed videos, is essential for understanding complex bimanual activities. However, it poses several unique challenges: complex inter-hand dependencies, visual asymmetry between hands, representation conflicts where the dominant hand monopolizes gradients, and semantic ambiguity in fine-grained actions. We propose Polyphony, a three-stage method to address these challenges through: (1) an Alternating Dual-Hand Vision Transformer that alternates training between left- and right-hand mini-batches to ensure balanced gradient contributions from both hands while sharing a spatio-temporal encoder; (2) Semantic Feature Conditioning that aligns visual features with structured, compositional action descriptions to enhance discrimination of semantically similar actions; and (3) Diffusion-Based Segmentation with cross-hand feature fusion for inter-hand coordination and adaptive loss weighting for balancing performance. Polyphony achieves state-of-the-art results on both dual-hand datasets (HA-ViD, ATTACH) with improvements of up to 16.8 points, and on the single-stream Breakfast dataset (82.5%), outperforming the prior best method that uses a 12× larger backbone. Notably, our unified model with a single shared backbone surpasses baselines requiring separate per-hand models. Code is at https://github.com/polyphony-cvpr/polyphony.
Paperid: 3769,   Poster  
Authors: Gokul Srinath Seetha Ram, Rashmi Elavazhagan
Title: The Drift Kernel: Why Diffusion Models Change Even When Told Not To
Abstract: Even when told to “do nothing,” modern diffusion models subtly alter their output relative to the input they are supposed to preserve. We call this effect NoOp Drift. We introduce the Drift Kernel $K_M(\sigma)=\mathbb{E}[\|I' - I_0\|_2^2 \mid \sigma]$, the expected perceptual deviation induced when running a diffusion model at noise strength $\sigma$ under a null instruction. Using 120,000 baseline samples (30,000 per model across SD15, SD21, SDXL, and InstructPix2Pix) and 9,600 ablation samples (four strengths, null vs. strict copy prompts), we show that variance-driven diffusion models follow a quadratic form $K_M(\sigma)\approx k_M\sigma^2 + c_M$, with aggregate $R^2=0.97$. We derive this scaling from first principles via a Taylor expansion of the decoder, yielding $k_M=\mathrm{Tr}(J_D J_D^\top)$, which depends only on the decoder Jacobian, not on prompts. To validate mechanistic structure, we construct synthetic decoders that reproduce the two regimes seen in practice: quadratic variance-driven drift and flat, high-variance edit-driven drift. We show that prompt wording has negligible effect (<17% coefficient difference), proving that drift is structural, not prompt-induced. We release NoOp-Bench, a benchmark with 10,000 inputs and full code for reproducible kernel estimation. Additional proofs, ablations, LPIPS/CLIP metrics, and extended visualizations appear in the Supplementary Material.
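Editor's illustration (not from the paper): the quadratic form $K_M(\sigma)\approx k_M\sigma^2 + c_M$ is linear in $\sigma^2$, so the two coefficients can be estimated by ordinary least squares. The minimal sketch below assumes hypothetical, synthetic drift measurements; the function name `fit_drift_kernel` and the sample values are illustrative and do not correspond to the authors' released code or data.

```python
def fit_drift_kernel(sigmas, drifts):
    """Least-squares fit of drift ~ k * sigma**2 + c. Returns (k, c, r2)."""
    xs = [s * s for s in sigmas]  # regress on sigma^2 so the model is linear in (k, c)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(drifts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, drifts))
    k = sxy / sxx                 # slope: the quadratic coefficient k_M
    c = mean_y - k * mean_x       # intercept: the offset c_M
    ss_res = sum((y - (k * x + c)) ** 2 for x, y in zip(xs, drifts))
    ss_tot = sum((y - mean_y) ** 2 for y in drifts)
    r2 = 1.0 - ss_res / ss_tot    # coefficient of determination
    return k, c, r2


# Hypothetical noise strengths and noiseless synthetic drift values (k=2.0, c=0.1).
sigmas = [0.2, 0.4, 0.6, 0.8]
drifts = [2.0 * s * s + 0.1 for s in sigmas]
k, c, r2 = fit_drift_kernel(sigmas, drifts)
```

On real measurements the fit would not be exact, and $R^2$ would indicate how well the quadratic form describes a given model.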
Paperid: 3770,   Poster  
Authors: Lingyun Dai, Zehao Chen, Yan Liu, Shi Gu, Peng Lin, De Ma, Huajin Tang, Qian Zheng, Gang Pan
Title: Dynamic-Static Decomposition for Novel View Synthesis of Dynamic Scenes with Spiking Neurons
Abstract: Novel view synthesis for dynamic scenes remains challenging due to complex motion variations. Recent methods represent dynamic and static regions with separate Gaussians to improve efficiency and accuracy, but inaccurate assignment of static and dynamic Gaussian primitives still limits performance. We identify two key issues, namely inaccurate mask priors and improper tag representations, which lead to boundary artifacts, loss of fine-grained motion details, and overfitting on input views, resulting in degraded side-view synthesis. To address these problems, we propose a spatio-temporally fine-grained mask field and a discontinuous dynamic–static tagging field to achieve accurate assignment of dynamic and static Gaussian primitives, enabling high-quality novel view synthesis, especially in fine-grained motions, motion boundary regions, and side viewpoints. Experiments show that our method achieves state-of-the-art rendering quality and real-time performance.
Paperid: 3771,   Poster  
Authors: Zhuolin He, Jing Li, Guanghao Li, Xiaolei Chen, Jiacheng Tang, Siyang Zhang, ZhouNanJin ZhouNanJin, Feipeng Cai, Bin Li, Jian Pu, Jia Cai, Xiangyang Xue
Title: DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving
Abstract: Dynamic scene reconstruction in autonomous driving remains a fundamental challenge due to significant temporal variations, moving objects, and complex scene dynamics. Existing feed-forward 3D models have demonstrated strong performance in static reconstruction but still struggle to capture dynamic motion. To address these limitations, we propose DynamicVGGT, a unified feed-forward framework that extends VGGT from static 3D perception to dynamic 4D reconstruction. Our goal is to model point motion within feed-forward 3D models in a dynamic and temporally coherent manner. To this end, we jointly predict the current and future point maps within a shared reference coordinate system, allowing the model to implicitly learn dynamic point representations through temporal correspondence. To efficiently capture temporal dependencies, we introduce a Motion-aware Temporal Attention (MTA) module that learns motion continuity. Furthermore, we design a Dynamic 3D Gaussian Splatting Head (DGSH) that explicitly models point motion by predicting Gaussian velocities using learnable motion tokens under scene flow supervision. It refines dynamic geometry through continuous 3D Gaussian optimization. Extensive experiments on autonomous driving datasets demonstrate that DynamicVGGT significantly outperforms existing methods in reconstruction accuracy and temporal consistency, achieving robust feed-forward 4D dynamic scene reconstruction under complex driving scenarios.
Paperid: 3772,   Poster  
Authors: Hiromichi Kamata, Samuel Munro, Fuminori Homma
Title: B$^3$-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta–Bernoulli Bayesian Updates
Abstract: Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production. However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use. We propose B$^3$-Seg (Beta–Bernoulli Bayesian Segmentation for 3DGS), a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under camera-free and training-free conditions. Our approach reformulates segmentation as sequential Beta–Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG). This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which yields a greedy $(1-1/e)$ approximation to the optimal view sampling policy. Experiments on multiple datasets show that B$^3$-Seg achieves results competitive with high-cost supervised methods while completing end-to-end segmentation within a few seconds. The results demonstrate that B$^3$-Seg enables practical, interactive 3DGS segmentation with provable information efficiency.
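Editor's illustration (not from the paper): a sequential Beta–Bernoulli update keeps two pseudo-counts per unit (here, hypothetically, one belief per Gaussian) and adds the observed foreground and background evidence from each new view. The sketch below shows only the standard conjugate update; the class name `BetaBelief`, the uniform Beta(1, 1) prior, and the counts are assumptions, and the paper's analytic EIG view selection is not reproduced.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief over the probability that a unit is foreground."""
    alpha: float = 1.0  # assumed uniform prior: one pseudo-count of foreground
    beta: float = 1.0   # assumed uniform prior: one pseudo-count of background

    def update(self, foreground_hits: float, background_hits: float) -> None:
        # Conjugate update: Bernoulli evidence simply adds to the pseudo-counts.
        self.alpha += foreground_hits
        self.beta += background_hits

    @property
    def mean(self) -> float:
        # Posterior mean of the foreground probability.
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        # Posterior variance; it shrinks as evidence accumulates.
        total = self.alpha + self.beta
        return self.alpha * self.beta / (total * total * (total + 1.0))


# Hypothetical evidence from one view: 3 foreground votes, 1 background vote.
belief = BetaBelief()
prior_variance = belief.variance
belief.update(foreground_hits=3, background_hits=1)
```

Because each update is a pair of additions, the belief can be refreshed after every rendered view without retraining, which is consistent with the training-free setting the abstract describes.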
Paperid: 3773,   Poster  
Authors: Jiyuan Liu, jia lin, Xiaofei Zhou, Runmin Cong, Deyang Liu, Zhi Liu
Title: M⁴-SAM: Multi-modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection
Abstract: The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to RGB-D video salient object detection (RGB-D VSOD) encounters three challenges: the limited spatial modeling of linear LoRA, insufficient use of SAM's multi-scale features, and the dependence of initialization on explicit prompts. To address these issues, we present Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M⁴-SAM), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. Firstly, we inject Modality-Aware MoE-LoRA, which employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning, into SAM2's encoder. Secondly, we deploy Gated Multi-Level Feature Fusion, which hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism, to balance spatial details and semantic context. Finally, to conduct zero-shot VSOD without manual prompts, we utilize a Pseudo-Guided Initialization, where a coarse mask is regarded as a pseudo prior and used to bootstrap the memory bank. Extensive experiments demonstrate that M⁴-SAM achieves state-of-the-art performance across all evaluation metrics on three public RGB-D VSOD datasets.
Paperid: 3774,   Poster  
Authors: Rong Xu, Runqi Wang, Yingjun Zhang, Tao Tao, Xiaomeng Li, Liping Jing
Title: TLMA: Mitigating the Impact of Weakly Labeled Information for Video Anomaly Detection
Abstract: Weakly Supervised Video Anomaly Detection (WSVAD) aims to localize abnormal segments using only video-level labels during training. Although the paradigm significantly reduces annotation costs, the coarse-grained labels fail to precisely describe the full videos, resulting in the introduction of substantial Weakly Labeled Information (WLI) during training. The presence of WLI makes it difficult for the model to accurately learn the boundary between normal and abnormal behaviors, leading to misclassifications and compromising the precision of anomaly localization. To tackle the challenges posed by WLI, we propose a triplet learning strategy that selects hard segments from normal videos as anchors. By combining contrastive learning with a Multiple Instance Learning (MIL) strategy, we increase the projection distance between abnormal segments and anchor samples to reduce the interference of WLI in anomaly detection. Moreover, considering that anomalies typically occur in dynamic foreground regions, we further design a motion-aware feature enhancement module that extracts dynamic areas within each video segment to emphasize the representation of critical features. This not only improves the accuracy of anchors in triplets, but also enhances the discriminative power of instance features in MIL. Extensive experiments on UCF-Crime, XD-Violence, and MSAD datasets demonstrate the effectiveness of our approach.
Paperid: 3775,   Poster  
Authors: Yang Liu, Jiajin Zhang, Yaojun Hu, Bingguang Hao, Xin Cao, Yingda Xia, Danyang Tu, Shi Gu, Ling Zhang
Title: Rounded or Streamlined Head? Bridging Concept Bottleneck Models and Attribute-Described Object Parts
Abstract: A faithful decision-making process requires models to ground human-understandable concepts both spatially (where they appear in the image) and causally (how they influence the prediction). Recent advances in Vision–Language Models (VLMs) enable concept-level alignment and have inspired Concept Bottleneck Models (CBMs), which explain predictions by mapping image representations to human-understandable concepts, allowing users to trace decisions through explicit semantic reasoning. However, existing CBMs suffer from two key inconsistencies. First, semantic inconsistency: VLMs often fail to localize fine-grained part–attribute concepts, producing noisy or incomplete masks. Second, object inconsistency: object-agnostic concepts such as "head: streamlined front profile" may describe multiple categories (e.g., fish or human); without enforcing object identity, non-targeted regions can introduce spurious evidence that corrupts the bottleneck representation. To address these challenges, we propose a new Object-Aware Concept Bottleneck Model (OA-CBM) that jointly enforces semantic- and object-level consistency. Specifically, (1) we redefine concepts as part–attribute pairs to enhance VLM robustness at the semantic level, and (2) introduce class-agnostic object clustering to suppress irrelevant visual evidence. We further annotate two grounding datasets with part–attribute descriptions and conduct extensive experiments. Results demonstrate that OA-CBM produces more faithful and robust explanations while maintaining competitive predictive performance.
Paperid: 3776,   Poster  
Authors: Shuang Hao, Pengfei Ren, Haifeng Sun, Ting Pan, Qi Qi, Lei Zhang, Cong Liu, Jianxin Liao, Jingyu Wang
Title: A Temporal and Content Co-Awareness Latent Diffusion for Controllable Hand Image Generation
Abstract: Controllable hand image generation aims to synthesize geometrically accurate images with consistent appearance. Recently, diffusion models have achieved remarkable success in image generation and have been applied to hand image synthesis. Through input-level fusion or feature-level modulation, existing methods inject control signals with fixed strength across all denoising timesteps. However, this static modulation ignores the progressive characteristic of the denoising process. In this paper, we reveal that the modulation process of control signals depends on the denoising state and the complexity of the conditions. Due to their distinct semantic distributions and information densities, it remains challenging to achieve effective interaction of these heterogeneous representations. To address this, we propose a Temporal and Content Co-Awareness Latent Diffusion method that designs a temporal- and content-driven modulation mechanism for controllable hand image generation. To achieve temporal and content co-awareness among the heterogeneous representations, we propose a query-based interaction mechanism designed to mitigate information redundancy and align semantic distributions. Leveraging this cross-domain interaction, the model infers the control information required at the current denoising state and dynamically adjusts pose and appearance injection strengths. To obtain a stable appearance representation from multi-pose images of the same identity, we design the Pose-Invariant Appearance Encoder that captures both global appearance consistency and local texture details. Furthermore, we employ a feature orthogonal decomposition to mitigate pose leakage into appearance subspaces. Both quantitative and qualitative experimental results demonstrate the superiority of our method over the state of the art.
Paperid: 3777,   Poster  
Authors: wu chenyang, Lina Lei, Fan Li, Chun-Le Guo, Dehong Kong, Xinran Qin, Zhixin Wang, Ming-Ming Cheng, Chongyi Li
Title: YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal
Abstract: Recent advances in Diffusion Transformer (DiT)-based video generation technologies have shown impressive results for video object removal. However, these methods still suffer from substantial inference latency. For instance, although MiniMax Remover achieves state-of-the-art visual quality, it operates at only around 10 FPS, primarily due to dense computations over the entire spatiotemporal token space, even when only a small masked region actually requires processing. In this paper, we present YOSE (You Only Select Essential Tokens), an efficient fine-tuning framework. YOSE introduces two key components: Batch Variable-length Indexing (BVI) and a Diffusion Process Simulator (DiffSim) module. BVI is a differentiable dynamic indexing operator that adaptively selects essential tokens based on mask information, enabling variable-length token processing across samples. DiffSim provides a diffusion process approximation mechanism for unmasked tokens, which simulates the influence of unmasked regions within DiT self-attention to maintain semantic consistency for masked tokens. With these designs, YOSE achieves mask-aware acceleration, where the inference time scales approximately linearly with the size of the masked region, in contrast to full-token diffusion methods whose computation remains constant regardless of the mask size. Extensive experiments demonstrate that YOSE achieves up to 2.5× speedup in 70% of cases while maintaining visual quality comparable to the baseline. The code will be made publicly available.
Paperid: 3778,   Poster  
Authors: Yijian Tian, Mingtao Ou, Pan Zijian, Xinglong Ji
Title: SDGS: Spatial Difference Guided Gaussian Splatting for Simultaneous Localization and 3D Reconstruction
Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit 3D representation, enabling photorealistic and real-time novel view synthesis. However, most 3DGS pipelines still assume precomputed camera poses and offline optimization, which introduces latency and makes them brittle in fast-motion, real-world scenarios. Existing online 3DGS systems mostly fall into two camps: (1) hybrid systems that rely on a separate traditional SLAM system for camera poses and optimize Gaussians decoupled from tracking, increasing system complexity; and (2) purely Gaussian-based systems that estimate poses from dense photometric errors, requiring repeated rendering of a large number of Gaussians and thus incurring high computational cost. Moreover, current online methods are often sensitive to motion blur and high dynamic range scenes, limiting their applicability in practice. We address these limitations with a sparse, edge-guided online 3DGS framework. Our method represents the scene as an edge-aligned sparse Gaussian map and estimates 6-DoF camera poses by aligning rendered 3D edges with observed 2D edges using a distance-transform-based objective, yielding roughly 2× faster per-iteration pose optimization than existing Gaussian-based systems while recovering clear scene geometry. We further leverage a dual-channel hybrid pixel vision sensor that outputs blur-free, high-frame-rate spatial-difference edge signals alongside RGB images, and use these signals both for robust edge-based tracking and for a mutual supervision scheme that mitigates motion blur in dense 3D reconstruction. Our system maintains stable tracking and high-fidelity geometry under extremely high-speed motion, where existing RGB-only methods fail, while remaining compatible with standard RGB cameras and achieving competitive tracking accuracy.
Paperid: 3779,   Poster  
Authors: Xinyi Chen, Hang Dong, Baowei Jiang, Shenkun Xu, Youqi Guan, Kanle Shi, Kun Gai, Haichuan Song
Title: $\alpha$Matte4K & $\mu$Matting: Dataset and Model for Ultra-Micro Precision Alpha Video Matting
Abstract: High-resolution human video matting aims to predict accurate alpha mattes for semi-transparent regions while ensuring temporal consistency across frames. Despite notable progress, existing research remains limited by the insufficient quality of datasets, including (1) inaccurate alpha fractional values resulting from imperfect annotation, and (2) visual inconsistencies arising from arbitrary foreground-background compositions that lack natural coherence. In this paper, we introduce $\alpha$Matte4K, a large-scale 4K-resolution human video matting dataset, which achieves accurate annotations and physical consistency through physically based rendering (PBR). From the model perspective, constrained by computational costs, current methods often up-sample alpha outputs to meet target resolutions, which unavoidably diminishes precision. To overcome this critical limitation, we introduce $\mu$Matting, an innovative resolution-agnostic two-stage matting framework for video matting: (1) coarse matte localization using a portrait-aware masked autoencoder; (2) refinement of critical regions via sparse 3D convolution, augmented by a temporal modulator that injects global spatio-temporal cues for enhanced consistency and contextual awareness. Extensive experiments show that $\alpha$Matte4K boosts baseline performance, while $\mu$Matting surpasses state-of-the-art methods in accuracy and spatio-temporal consistency, driving applications in real-world scenarios.
Paperid: 3780,   Poster  
Authors: Tengfei Liu, Yijian Fan, Boyue Wang, Yongli Hu, Mingjie Li, Jinghua Li, Junbin Gao, Xiaojun Chang, Zhihui Li, Baocai Yin
Title: BiOTPrompt: Bidirectional Optimal Transport Guided Prompting for Disease Evolution-aware Radiology Report Generation
Abstract: Radiology report generation (RRG) aims to automatically describe medical images via free-text reports. In clinical practice, comparing current and prior chest X-rays is essential for assessing disease progression, motivating the development of longitudinal RRG methods. However, most existing approaches struggle to capture fine-grained temporal changes, as they often rely on unidirectional alignments or static reasoning pipelines, overlooking the bidirectional and asymmetric nature of disease evolution. To tackle these challenges, we propose BiOTPrompt, a novel framework for disease evolution-aware radiology report generation, which introduces a Bidirectional Optimal Transport (BiOT) mechanism to explicitly model progression dynamics between historical and current chest X-rays. By analyzing the asymmetry between bidirectional transport plans, BiOTPrompt can identify newly emerged and resolved regions, which are then used to construct dynamic prompts that guide large language models (LLMs) in generating clinically relevant diagnostic reports. Furthermore, we incorporate a vision-language consistency constraint to ensure alignment between visual evidence and textual descriptions, mitigating hallucinations and enhancing factual correctness. Extensive experiments on the Longitudinal-MIMIC dataset demonstrate that BiOTPrompt achieves state-of-the-art performance in both language metrics and clinical relevance, setting a new standard for longitudinal radiology report generation.
Paperid: 3781,   Poster  
Authors: Yiheng Dong, Yi Lin, Shilong Huang, Xiyan Yang, Xin Yang
Title: TIM: Temporal Decoupling with Iterative Mutual-Refinement Model for Longitudinal Radiology Report Generation
Abstract: Automatic radiology report generation (RRG) aims to translate medical images into diagnostic text, reducing radiologists' workload and standardizing clinical documentation. Nonetheless, existing approaches mainly focus on single-timepoint analysis and fail to capture temporal disease evolution across longitudinal examinations. While recent longitudinal RRG (LRRG) approaches incorporate historical data, they often combine images from different time points within a single representation space, leading to blurred semantics and inconsistent temporal reasoning. In this work, we propose a Temporal Decoupling with Iterative Mutual-Refinement Model (TIM), a two-stage framework that explicitly decouples spatial pathology from temporal progression and iteratively refines reports through mutual feedback. Stage I performs temporal-decoupled representation learning, separating temporal evolution patterns from disease-specific features and generating radiology reports for both prior and current studies. Stage II introduces a mutual report refinement mechanism that identifies diagnostic inconsistencies within prior reports and iteratively rectifies both prior and current reports through error-sensitive feedback. Experiments on the Longitudinal-MIMIC dataset demonstrate that TIM surpasses existing single-image and longitudinal baselines, achieving new state-of-the-art performance across both language and clinical metrics. Code is available in the supplementary materials.
Paperid: 3782,   Poster  
Authors: Yujie Wei, Chenglong Ma, Jianxiong Gao, Chenhui Wang, Shiwei Zhang, Biao Gong, Shuai Tan, Hangjie Yuan, Hongming Shan
Title: Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction
Abstract: Reconstructing dynamic visual experiences as videos from functional magnetic resonance imaging (fMRI) is pivotal for advancing the understanding of neural processes. However, current fMRI-to-video reconstruction methods are hindered by a semantic gap between noisy fMRI signals and the rich content of videos, stemming from a reliance on incomplete semantic embeddings that neither capture video-specific cues (e.g., actions) nor integrate prior knowledge. To this end, we draw inspiration from the dual-pathway processing mechanism in the human brain and introduce CineNeuron, a novel hierarchical framework for semantically enhanced video reconstruction from fMRI signals with two synergistic stages. First, a bottom-up semantic enrichment stage maps fMRI signals to a rich embedding space that comprehensively captures textual semantics, image contents, action concepts, and object categories. Second, a top-down memory integration stage utilizes the proposed Mixture-of-Memories method to dynamically select relevant "memories" from previously seen data and fuse them with the fMRI embedding to refine the video reconstruction. Extensive experimental results on two fMRI-to-video benchmarks demonstrate that CineNeuron surpasses state-of-the-art methods across various metrics. Code and models will be made publicly available.
Paperid: 3783,   Poster  
Authors: Zhuoyang Zhang, Shang Yang, Qinghao Hu, Luke Huang, James Hou, Yufei Sun, Yao Lu, Song Han
Title: Plan, Imagine, then Act: Steering Your VLA with Efficient Visually Grounded Planning
Abstract: Vision-Language-Action (VLA) models convert abstract language instructions into concrete, executable actions, a task that is especially challenging in open-world environments. We present Visually Grounded Planning, a general and efficient high-level planner that guides a VLA step-by-step using imagined future observations and subtask descriptions. With an imagined future observation, the VLA can focus on visuomotor inference rather than high-level semantic reasoning, leading to improved accuracy and generalization. Our planner comprises a highly efficient foresight image-generation module that predicts a high-quality 640×480 future observation from the current visual input and language instruction within only 0.33 s on an H100 GPU, together with a vision–language component that reasons over the task and produces subtask descriptions for both the generator and the VLA. Importantly, state-of-the-art VLAs can integrate our planner seamlessly by simply augmenting their visual inputs, without any architectural modification. The foresight generator is pretrained on approximately 10 million multi-task, cross-embodiment samples, enabling it to learn robust embodied dynamics and achieve strong real-world generalization. We evaluate our framework on a benchmark consisting of 11 diverse, multi-step real-world tasks. It achieves an average success rate of 87.4%, demonstrating a +40.9% absolute improvement over the $\pi_0$ baseline (46.5%) and a +30.3% absolute improvement over $\pi_0$ augmented with textual subtask guidance (57.1%).
Paperid: 3784,   Poster  
Authors: Muhammad Naseer Subhani
Title: ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
Abstract: Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting point-supervised framework that adapts SAM to RSIs using only sparse point annotations. Our method employs a Refine–Requery–Reinforce loop, where coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Without relying on full-mask supervision, our approach progressively enhances SAM's segmentation quality and domain robustness through self-guided prompt adaptation. We evaluate our proposed method on three RSI benchmark datasets, including WHU, HRSID, and NWPU VHR-10, showing that our method consistently surpasses pretrained SAM and recent point-supervised segmentation methods. Our results demonstrate that self-prompting and semantic alignment provide an efficient path towards scalable, point-level adaptation of foundation segmentation models for remote sensing applications.
Paperid: 3785,   Poster  
Authors: Hyeonseong Kim, Hyun-Kurl Jang, Kuk-Jin Yoon
Title: Test-Time Training for LiDAR Semantic Segmentation under Corruption via Geometric Inlier Discrimination
Abstract: LiDAR semantic segmentation must remain robust under various sensor and environmental corruptions to be reliable in safety-critical applications. Existing test-time adaptation methods, including approaches based on pseudo-labels and normalization statistics, have shown promising results but can still struggle under severe distribution shifts. To complement these approaches, we propose a geometry-aware test-time training framework that leverages an auxiliary self-supervised objective. Our method is based on geometric inlier discrimination (GeoID), which injects synthetic off-manifold points into the input and trains the model to distinguish geometry-consistent inliers from synthetically displaced outliers, enabling adaptation on unlabeled test data. To further stabilize this process under real corruptions, we introduce bidirectional unreliable point filtering (BiUPF), which uses inlier scores from the source-trained model to filter out unreliable regions on both original and synthetic points, focusing updates on high-confidence samples. Experiments on two large-scale corruption benchmarks, SemanticKITTI-C and nuScenes-C, show that our method consistently outperforms strong test-time adaptation baselines and improves robustness across diverse LiDAR corruptions. Code will be released.
Paperid: 3786,   Poster  
Authors: Weiheng Lu, An Yu, Jian Li, Zhenfei Zhang, Felix X. Ye, Ming-Ching Chang
Title: FAVE: A Structured Benchmark for Fine-Grained Audio-Visual Temporal Evaluation in Multimodal LLMs
Abstract: Audio-visual large language models (AVLLMs) have made significant strides in understanding visual and auditory content. However, their ability to capture fine-grained temporal relationships between audio and visual streams remains insufficiently evaluated. To address this, we introduce FAVE (Fine-grained Audio-Visual Temporal Evaluation), a comprehensive benchmark targeting three core dimensions of temporal perception: cross-modal temporal alignment (FAVE-align), event temporal relationship (FAVE-low), and detailed moment captioning (FAVE-high). To construct FAVE, we propose a scalable annotation pipeline that integrates shot boundary detection, automated captioning, and GPT-assisted refinement to produce temporally grounded, high-quality data. Extensive experiments on twelve state-of-the-art multimodal LLMs, both open-source and closed-source, reveal key limitations in multimodal integration, temporal relationship and timestamp localization, especially for joint audio-visual tasks. These findings highlight the need for better temporal modeling to improve AVLLMs' understanding of real-world video content. FAVE serves as a rigorous testbed for advancing temporally aware multimodal systems, and will be publicly released upon acceptance.
Paperid: 3787,   Poster  
Authors: Leyuan Xing, huanjia zhang, Dongyu Pan, Hai Wu, Qiming Xia, Kezheng Xiong, Wen Li, Chenglu Wen, Cheng Wang
Title: TACO: Task-Aware Contrastive Learning for Joint LiDAR Localization and 3D Object Detection
Abstract: Reliable navigation and decision-making of autonomous vehicles require both accurate localization and object detection. Traditionally, these two tasks are handled separately, leading to redundant computation and limited cross-task knowledge transfer. This paper proposes TACO, the first Task-Aware COntrastive learning framework, which performs joint LiDAR localization and 3D object detection within a single, unified network. TACO leverages contrastive learning to explicitly decouple and align static geographic features for localization and object-centric features for detection. This bidirectional mutual supervision not only enhances localization robustness in dynamic environments by filtering dynamic noise but also boosts detection accuracy via effective spatial context. Additionally, we propose OxfoLD, the first dataset that provides multi-traversal LiDAR localization ground truth with rich 3D object annotations, thereby supporting task validation across various times and weather conditions. Experimental results demonstrate that TACO achieves state-of-the-art localization accuracy while maintaining competitive detection performance. The code and dataset will be released.
Paperid: 3788,   Poster  
Authors: Akihisa Watanabe, Qing Yu, Edgar Simo-Serra, Kent Fujiwara
Title: ProjFlow: Projection Sampling with Flow Matching for Zero‑Shot Exact Spatial Motion Control
Abstract: Generating human motion with precise spatial control is a challenging problem. Existing approaches often require task-specific training or slow optimization, and enforcing hard constraints frequently disrupts motion naturalness. Building on the observation that many animation tasks can be formulated as a linear inverse problem, we introduce ProjFlow, a training-free sampler that achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. Our key advance is a novel kinematics-aware metric that encodes skeletal topology. This metric allows the sampler to enforce hard constraints by distributing corrections coherently across the entire skeleton, avoiding the unnatural artifacts of naive projection. Furthermore, for sparse inputs, such as filling in long gaps between a few keyframes, we introduce a time-varying formulation using pseudo-observations that fade during sampling. Extensive experiments on two representative applications, motion inpainting and 2D-to-3D lifting, demonstrate that ProjFlow achieves exact constraint satisfaction and matches or improves realism over zero-shot baselines, while remaining competitive with training-based controllers.
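As a reading aid, the idea of enforcing a linear constraint exactly while spreading the correction according to a metric can be written as a standard weighted projection. This is a generic textbook form, not the formulation from the paper: the symbols $x$ (current motion sample), $A$ and $y$ (linear constraint $Ax = y$), and $M$ (a symmetric positive-definite matrix standing in for the kinematics-aware metric) are all assumptions introduced here for illustration, and $A M^{-1} A^\top$ is assumed invertible.

```latex
% Weighted projection of x onto the constraint set {x' : A x' = y}
% (illustrative; M is a hypothetical SPD metric, not the paper's definition)
\begin{aligned}
x^\star &= \arg\min_{x'} \; (x' - x)^\top M \,(x' - x)
          \quad \text{s.t.} \quad A x' = y, \\
x^\star &= x - M^{-1} A^\top \left( A M^{-1} A^\top \right)^{-1} (A x - y).
\end{aligned}
```

With $M = I$ this reduces to the naive Euclidean projection, which changes only the directly constrained coordinates when $A$ selects individual joints. A non-diagonal $M$ spreads the same correction over coupled coordinates, which is the behavior the abstract attributes to its metric.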
Paperid: 3789,   Poster  
Authors: Jerry Jiang, Haowen Sun, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, Kurt Keutzer, Wenzhao Zheng
Title: Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
Abstract: Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.
Paperid: 3790,   Poster  
Authors: Huanjing Yue, Dawei Li, Shaoxiong Tu, Jingyu Yang
Title: $\text{F}^2\text{HDR}$: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling
Abstract: Reconstructing High Dynamic Range (HDR) videos from sequences of alternating-exposure Low Dynamic Range (LDR) frames remains highly challenging, especially under dynamic scenes where cross-exposure inconsistencies and complex motion make inter-frame alignment difficult, leading to ghosting and detail loss. Existing methods often suffer from inaccurate alignment, suboptimal feature aggregation, and degraded reconstruction quality in motion-dominated regions. To address these challenges, we propose $\text{F}^2\text{HDR}$, a two-stage HDR video reconstruction framework that robustly perceives inter-frame motion and restores fine details in complex dynamic scenarios. The proposed framework integrates a flow adapter that adapts generic optical flow for robust cross-exposure alignment, a motion mask that identifies salient motion regions to guide ghosting and noise suppression, and a motion-aware refinement network that aggregates complementary information for coherent detail reconstruction. Extensive experiments demonstrate that $\text{F}^2\text{HDR}$ achieves state-of-the-art performance on real-world HDR video benchmarks, producing ghost-free and high-fidelity results under large motion and exposure variations. Code will be released upon acceptance.
Paperid: 3791,   Poster  
Authors: Lucas Iijima, Yihao Luo, Roberto Sesia, Amit Kaura, Jamil Mayet, Choon Hwai Yap
Title: EchoPOSE: 6D Pose Estimation of Sparse Echocardiograms for Left-Ventricular 3D Shape Reconstruction
Abstract: 3D echocardiography provides superior cardiac quantification to traditional 2D echocardiography, which suffers from geometric idealizations and imaging plane misalignment. However, despite its advantages, clinical adoption of 3D echo remains limited due to logistical and visualization challenges. We propose a novel framework that reconstructs the 3D shape of the left ventricle (LV) throughout the cardiac cycle from sparse 2D echocardiographic views routinely acquired in clinical practice, without the need for external hardware or manual tracking. Our method integrates EchoPOSE, a new deep network that automatically estimates the 6D pose (position and orientation) of LV segmentations, with a graph-harmonic algorithm for 3D shape reconstruction. EchoPOSE employs a transformer-based architecture that combines local image features with global multi-view context, and introduces a geometry-aware loss to ensure spatial consistency across intersecting imaging planes. Trained and evaluated on large-scale synthetic data derived from 3D echocardiography and validated on prospectively acquired clinical echocardiograms, EchoPOSE achieves 3.78 mm and $8.65^\circ$ pose errors, yielding 87.5 Dice reconstruction accuracy, 1.44% ejection fraction error, and 3.03% volume error, outperforming alternative deep learning techniques and classical clinical approaches. Notably, the framework remains robust under suboptimal imaging alignment, suggesting that EchoPOSE can reduce the sonography skills required for transducer positioning and allow minimally trained clinicians to perform echo scans.
Paperid: 3792,   Poster  
Authors: Yihang Duan, Shuo Huang, Lizhang Lizhang, Meiling Wang, Li Zhang
Title: IEBGL: An Interpretability-Enhanced Brain Graph Learning Framework with LLM-Instructed Topology and Literature-Augmented Semantics
Abstract: Resting-state functional MRI (rs-fMRI) provides rich information for modeling brain connectivity in disease diagnosis. However, most existing brain graph learning methods rely solely on imaging data, leading to limited biological interpretability and poor integration of external medical knowledge. To address these challenges, we propose an Interpretability-Enhanced Brain Graph Learning (IEBGL) framework that anchors brain network modeling in large-scale medical knowledge. Our framework introduces two complementary modules: LLM-Instructed Topological Reconstruction (LITR) and Literature-Augmented Semantic Aggregation (LASA). LITR employs large language model (LLM) reasoning to refine brain connectivity and construct topological structure. LASA augments node representations by aggregating semantic information from biomedical literature, ensuring the model’s interpretability and relevance to clinical disease knowledge. Finally, the framework is trained with the Graph Bi-directional Mamba Network (GBMN) for disease diagnosis. Extensive experiments on the REST-meta-MDD and ABIDE datasets, together with 35,133 depression-related and 32,617 autism-related publications, demonstrate that IEBGL outperforms state-of-the-art methods in classification performance. Further analyses show that the LITR module reveals biologically meaningful alterations in brain connectivity, while the LASA module establishes interpretable associations between these regions and disease-related biomedical literature. Together, these mechanisms help IEBGL explain abnormal brain connections and their links to disease-related knowledge.
Paperid: 3793,   Poster  
Authors: YANG CHU, Xiaomeng Yang, Keli Deng, Yuntao Qian
Title: HierUQ: Hierarchical Uncertainty Quantification with Adaptive Granularity Reconciliation for Degraded Image Classification
Abstract: Hierarchical classification (HC) on degraded images presents challenges due to feature corruption, unreliable confidence estimation, and fine-grained misclassification. Existing methods often struggle to balance semantic consistency and adaptive decision paths under low-quality visual conditions. To address this, we propose HierUQ, a unified framework that integrates uncertainty quantification with adaptive granularity reconciliation. A Vision Transformer backbone extracts global features, which are fused with semantic embeddings via bilinear and semantic-guided cross-attentions. We develop a principled Hierarchical Uncertainty Quantification (HUQ) strategy based on label smoothing and proper scoring rules. When confidence is insufficient, a Confidence-Aware Path Adjustment (CAPA) mechanism adaptively rolls back predictions to higher-level nodes, mitigating overclassification and error propagation, stabilizing the learning trajectory, overcoming degradation-induced interference, and enhancing fine-grained classification accuracy. To enhance learning, we introduce a self-paced joint optimization (MLJO) over multi-level objectives with dynamic loss weighting. Experiments on degraded remote sensing and natural image benchmarks show that HierUQ achieves state-of-the-art performance with strong robustness and adaptability.
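The rollback behavior described for CAPA can be pictured with a minimal sketch. This is not the authors' implementation: the function name `rollback_prediction`, the fixed `threshold`, and the representation of a prediction as a coarse-to-fine list of `(label, confidence)` pairs are all assumptions made here for illustration.

```python
from typing import List, Optional, Tuple


def rollback_prediction(
    path: List[Tuple[str, float]], threshold: float = 0.7
) -> Optional[str]:
    """Return the finest label whose whole ancestor chain is confident.

    `path` lists (label, confidence) pairs from the coarsest level to the
    finest. Traversal stops at the first level whose confidence falls below
    `threshold`, so the prediction "rolls back" to the parent node.
    Returns None if even the coarsest level is not confident enough.
    """
    accepted = None
    for label, confidence in path:
        if confidence < threshold:
            break  # stop descending; keep the last confident ancestor
        accepted = label
    return accepted


# Hypothetical three-level path for a degraded image.
example_path = [("vehicle", 0.95), ("aircraft", 0.85), ("fighter jet", 0.40)]
print(rollback_prediction(example_path))  # aircraft
```

In this toy version the threshold is a constant. The abstract describes the adjustment as adaptive, so a real system would presumably derive the decision from its uncertainty estimates rather than a fixed cutoff.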
Paperid: 3794,   Poster  
Authors: Yongchao Xu, Jiawei Liu, Junfeng Wang, Sen Tao, Na Jiang, Zheng-Jun Zha
Title: Learning to Diversify and Focus: A Reinforcement Framework for Open-Vocabulary HOI Detection
Abstract: Open-Vocabulary Human–Object Interaction (OV-HOI) detection aims to recognize novel HOI categories beyond the training set. Existing OV-HOI detection approaches typically leverage CLIP to extract global visual representations and perform cross-attention between learnable queries and global features to localize human–object pairs. However, such one-stage paradigms tend to overfit seen interactions, limiting their generalization to unseen categories, while the coarse spatial awareness of CLIP also hinders the localization of fine-grained interaction cues. To address these issues, we propose a novel Semantic-Diversified and Interaction-Focused framework (SD-IF), which integrates reinforcement-guided adaptive optimization to jointly enhance semantic generalization and spatial discrimination. Specifically, we introduce a Semantic Diversification (SD) module that applies reinforcement-driven stochastic semantic perturbations and dual-level semantic exploration, expanding the semantic coverage of queries while maintaining visual coherence and effectively encouraging exploration beyond the seen semantic clusters. Furthermore, we design an Interaction Focusing (IF) module that formulates an actor–critic optimization scheme to adaptively refine attention distributions based on detection features and interaction representations, guided by a hybrid reward combining spatial focusing and semantic consistency. This cooperative learning paradigm enables the model to capture discriminative interaction cues and achieve spatially interpretable reasoning. Extensive experiments on two widely used benchmarks demonstrate that SD-IF achieves state-of-the-art performance, significantly surpassing existing OV-HOI detection methods.
Paperid: 3795,   Poster  
Authors: Huabin Wang, Xinyu Chen, Yuan Zhou, Fei Liu
Title: Cross-domain Dual-stream Feature Disentanglement for Brain Disorder Prediction with Sparsely Labeled PET
Abstract: Positron Emission Tomography (PET) can be used for the early diagnosis of various brain disorders. However, the annotation of PET scans requires the involvement of specialized nuclear medicine experts, making accurately annotated PET data extremely scarce. MRI-based cross-modal domain adaptation methods can improve the brain disorder classification accuracy with sparsely labeled PET data. However, existing methods fail to balance the core requirements of domain discrepancy elimination and modality-specific discriminative information retention in cross-modal tasks. Forced alignment often undermines the core pathological discriminative features of both modalities, making it difficult to meet the collaborative optimization demands of cross-modal brain disorder classification. To address this, we propose a Dual-Stream feature Disentanglement and Alignment (DSDA) framework designed for collaborative optimization of cross-modal domain adaptation and brain disorder classification. This framework first dynamically evaluates and explicitly decouples the critical brain regions relevant to the classification task from the non-critical regions that preserve brain structural integrity. It then applies differential processing to the two types of brain regions: topology-weighted feature alignment for non-critical regions and high-confidence feature fusion for critical regions. This differential processing ensures that the model effectively aligns features while preserving key discriminative information. Extensive experimental results on various datasets (e.g., ADNI, AIBL, and PPMI) demonstrate the effectiveness of DSDA, which achieves state-of-the-art performance.
Paperid: 3796,   Poster  
Authors: Xindong Mao, Hang Li, Yuchen Wu, Jiahe Li, Xiao Bai, Jin Zheng
Title: CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization
Abstract: Scene Coordinate Regression (SCR) has emerged as a memory-efficient paradigm for visual localization. While SCR has demonstrated performance comparable to classic feature matching based approaches in small-scale scenes, it has consistently underperformed in large-scale environments. Large-scale localization is hampered by two challenges: sparse co-visibility and local appearance ambiguity. In this work, we propose CoLoR, a novel training framework tailored for large-scale SCR. First, we explicitly and efficiently partition scene points into multi-view and single-view sets and introduce a two-stage bootstrapping paradigm to provide complete and strong supervision for all points. Second, we propose a multi-granularity retrieval feature, which unifies the conventional global and local features as retrieval-oriented representations at the image and pixel levels, respectively, to enforce feature consistency. Our method achieves state-of-the-art performance on multiple challenging large-scale datasets and significantly narrows the accuracy gap with classical feature matching based approaches while retaining a compact map size.
Paperid: 3797,   Poster  
Authors: Siwei Han, Haonian Ji, Siyang Xin, Juanquan Shi, Shi Qiu, Xinyu Ye, Peng Xia, Jiaqi Liu, Zhaorun Chen, Yiyang Zhou, Linjie Li, Lijuan Wang, Huaxiu Yao
Title: Paper2Figure: A Multi-Agent Collaborative System for Figure Generation Towards Academic Research Paper
Abstract: Automatically generating clear and accurate figures for research papers remains challenging, as it requires semantic understanding, precise structure, and visual aesthetics. Existing approaches struggle to balance fidelity and quality: large language model (LLM) code-based methods (e.g., SVG, Mermaid) are structured but inflexible, while image-generation models (e.g., GPT-Image-1, Nano Banana) produce hard-to-edit and often inaccurate figures. We present Paper2Figure, a dual multi-agent system with an interactive web platform for paper-to-figure generation. Generation Agents convert text into our designed FigScript language, encoding figure semantics, styles and layout. The web system renders the FigScript into an initial image, which Refinement Agents iteratively analyze to locate issues and revise the FigScript for improved logic, alignment, aesthetics and text accuracy. Crucially, users can further refine results through an intuitive web interface, ensuring full control over the final output. To evaluate Paper2Figure, we introduce Paper2Figure Bench, a benchmark comprising 100 academic figures with paired descriptions. Experiments demonstrate that Paper2Figure markedly improves accuracy by 12%, beauty by 13.5%, and completeness by 17.0% over state-of-the-art baselines in fully automatic generation without human adjustment. By combining automated generation with interactive editing, Paper2Figure bridges the gap between AI assistance and researcher control, offering a practical solution for high-quality academic figure creation.
Paperid: 3798,   Poster  
Authors: Shuhan Miao, Biru Cao, Junling Zhuang
Title: Nestwork: Conditional 3D Furnished House Layout Generation through Latent Heterogeneous Graph Diffusion
Abstract: This paper introduces Nestwork, a unified latent-diffusion framework for conditional 3D furnished house layout generation using a heterogeneous graph of rooms and furniture. Designing reasonable and controllable 3D layouts that reflect the underlying semantic structure of a house is a key challenge in AI-assisted architectural design. Existing graph-based methods either produce unfurnished multi-room layouts or generate furnished scenes one room at a time, preventing joint reasoning over room structure and furniture placement. Nestwork represents an entire house as a heterogeneous graph with typed room and furniture nodes and multiple spatial relations. A single unconditional autoencoder based on a heterogeneous graph attention network embeds this graph into a compact latent space, and a low-rank relational field compensates for missing geometric edge information at test time. A diffusion denoiser is trained once using random masking, enabling the same model to operate under different conditioning strengths, from topology-only to fully annotated graphs. Multi-level conditioning combines masked node-level attention with graph-level embeddings to support flexible user control, including layouts specified through natural-language descriptions. Experiments on the 3D-FRONT dataset show that Nestwork achieves high fidelity, structural consistency, and diversity. Controlled ablations further validate the contributions of each component.
Paperid: 3799,   Poster  
Authors: Haoyan WU, Yahao Liu, Yinjie Lei, Lixin Duan, Wen Li
Title: Dynamic Logits Adjustment and Exploration for Test-Time Adaptation in Vision Language Models
Abstract: Existing Test-Time Adaptation (TTA) methods for Vision-Language Models (VLMs), focusing on designing efficient adaptation parameters (e.g., prompts or residual prototypes), predominantly rely on high-confidence samples obtained via entropy-based filtering. However, this prevailing paradigm implicitly inherits the VLM’s class-wise prediction biases and leads to insufficient coverage of the test distribution, rendering the adaptation process biased and insufficiently exploratory. To overcome these limitations, we propose Dynamic Logits Adjustment and Exploration (DLAE), a novel framework that integrates Dynamic Logit Adjustment (DLA) with a Consistency-Guided Exploratory Cache (CGEC). DLA dynamically recalibrates model logits based on test prediction statistics, thereby mitigating class-wise prediction inconsistencies. Different from traditional cache mechanisms, our CGEC actively identifies additional samples near decision boundaries whose predicted labels are sensitive to the logit adjustment, thereby exploring beyond only high-confidence samples. By enforcing semantic and temporal consistency, the cache preserves the reliability of selected samples while enabling cautious yet effective exploration of low-confidence regions, ultimately yielding stable and reliable adaptation. Extensive experiments across multiple vision-language benchmarks demonstrate that our approach consistently surpasses state-of-the-art TTA methods, showing superior stability, adaptability, and generalization.
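To make "recalibrating logits from test prediction statistics" concrete, here is a minimal sketch using the generic prior-subtraction form of logit adjustment. It is an illustrative stand-in, not the DLA rule from the paper: the class name `RunningLogitAdjuster`, the temperature `tau`, the add-one smoothing of counts, and the choice to update counts from the adjusted prediction are all assumptions.

```python
import math
from typing import List


class RunningLogitAdjuster:
    """Subtracts tau * log(running predicted-class frequency) from logits."""

    def __init__(self, num_classes: int, tau: float = 1.0) -> None:
        # Start every class at count 1 so no frequency is ever zero.
        self.counts = [1.0] * num_classes
        self.tau = tau

    def adjust(self, logits: List[float]) -> List[float]:
        """Return logits penalized for classes that were predicted often."""
        total = sum(self.counts)
        return [
            z - self.tau * math.log(c / total)
            for z, c in zip(logits, self.counts)
        ]

    def predict_and_update(self, logits: List[float]) -> int:
        """Predict from adjusted logits, then record the prediction."""
        adjusted = self.adjust(logits)
        pred = max(range(len(adjusted)), key=adjusted.__getitem__)
        self.counts[pred] += 1.0
        return pred


# A model with a slight, constant bias toward class 0.
adjuster = RunningLogitAdjuster(num_classes=2)
preds = [adjuster.predict_and_update([1.0, 0.9]) for _ in range(4)]
print(preds)  # [0, 1, 0, 1]
```

Without the adjustment every sample here would be assigned to class 0. With it, samples whose label flips under the adjustment are exactly the near-boundary cases that the abstract says the exploratory cache looks for.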
Paperid: 3800,   Poster  
Authors: Anusha Achaya, Hitesh Sapkota, Qi Yu, Xumin Liu
Title: The Road Less Seen: Segment Exploration for Weakly Supervised Video Anomaly Detection
Abstract: Weakly supervised learning provides a cost-effective framework for video anomaly detection by using video-level supervision instead of relying on the costly fine-grained segment-level labels. Although contemporary methods have shown promising results on challenging real-world surveillance videos, most of them are evaluated using the Area Under the Receiver Operating Characteristic Curve (AUROC). Our work reveals that a high AUROC could result in a very low recall given a meaningful False Positive Rate (FPR) threshold. Thus, these models suffer from limited practical value, especially in high-stakes domains (e.g., public safety and medical diagnosis), where missing the true anomalies is highly costly. This surprising phenomenon is rooted in the interplay of weak supervision and the highly imbalanced distribution between normal and abnormal segments. To tackle this key challenge of building practical video anomaly detection systems, we propose a novel dual exploration strategy that combines temporal clustering with uncertainty-based segment exploration. Temporal clustering selects diverse segments based on both semantic and temporal similarity, while uncertainty-based sampling targets low-scoring segments with high model uncertainty. This ensures the model learns from a wide range of patterns, both diverse and ambiguous, resulting in more informed and robust decision-making and a reduction in false negatives. Meanwhile, we recommend two practical metrics to replace the commonly used AUROC score for more effective evaluation. Experiments conducted on challenging real-world videos demonstrate that dual exploration outperforms competitive baselines on these metrics, which justifies its improved practical value in real-world settings.
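The claim that a high AUROC can coexist with low recall at a strict FPR budget is easy to reproduce on toy data. The sketch below is only an illustration of that phenomenon: the helper names `auroc` and `recall_at_fpr`, the synthetic scores, and the 1% FPR budget are assumptions, and recall at a fixed FPR is not necessarily one of the two metrics the paper recommends.

```python
from typing import List


def auroc(scores: List[float], labels: List[int]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at_fpr(scores: List[float], labels: List[int], max_fpr: float) -> float:
    """Best recall over thresholds whose false positive rate is <= max_fpr."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0  # a threshold above every score gives FPR 0 and recall 0
    for t in sorted(set(scores)):
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr <= max_fpr:
            best = max(best, sum(p >= t for p in pos) / len(pos))
    return best


# Imbalanced toy data: 100 normal segments, 10 anomalous segments.
# Two normal segments receive a very high score; most anomalies score mid-range.
neg_scores = [0.1] * 98 + [0.95] * 2
pos_scores = [0.99] + [0.5] * 9
scores = neg_scores + pos_scores
labels = [0] * 100 + [1] * 10

print(auroc(scores, labels))                # 0.982
print(recall_at_fpr(scores, labels, 0.01))  # 0.1
```

Only two confidently wrong normal segments are enough to push the usable threshold above nine of the ten anomalies, yet the ranking-based AUROC barely moves because those two segments contribute few misordered pairs.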
Paperid: 3801,   Poster  
Authors: Shilong Li, Xiurui Xie, Qiugang Zhan, Luochao Wang, Yong Deng, Guisong Liu
Title: Sparsely Timing the Change: A Spiking Temporal Framework for Remote Sensing Interpretation
Abstract: The temporal evolution patterns of surface spatial structures constitute a central concern within the field of intelligent remote sensing interpretation. However, constrained by the availability of only two temporal phases, modeling sparse spatiotemporal change processes to effectively interpret surface alterations remains a core challenge in intelligent remote sensing analysis. To address this, this paper proposes SpikeAdapter, a lightweight enhancement framework. This framework comprises Geo-Spike Interpolation (GSI-P), a spiking neural network (SNN) feature extractor, and the spatio-temporal fusion module STSpikeFuse. Inspired by the brain’s perceptual response to new and fading stimuli, the core GSI-P module transforms bi-temporal radiometric differences into sparse spike sequences with time-to-first-spike characteristics. We then use an SNN feature extractor to capture dynamic variations of land-surface targets. The STSpikeFuse module employs a learnable temporal decay mechanism to adaptively fuse the SNN features with the semantic representations, which are generated by a traditional artificial neural network (ANN) backbone. Extensive experiments on change detection datasets demonstrate that SpikeAdapter effectively enhances temporal awareness and interpretability.
Paperid: 3802,   Poster  
Authors: Jiawei Zhao, Minjie Du, Zihan Qin, Zhuoran Wang, LizheXie LizheXie, Yining HU
Title: VCP-Attack: Visual-Contrastive Projection for Transferable Black-Box Targeted Attacks on Large Vision-Language Models
Abstract: Large vision-language models (LVLMs) have achieved impressive performance across a variety of multimodal tasks, yet remain vulnerable to targeted adversarial attacks, particularly in black-box settings. In this paper, we propose VCP-Attack, a transferable targeted attack framework that combines structured contrastive supervision with subspace-guided perturbation optimization. Specifically, we employ a dynamic PCA-based projection to constrain perturbations within semantically meaningful low-dimensional subspaces, and design a multi-sample contrastive loss to align adversarial features with target semantics while pushing them away from the source semantics. Extensive experiments on seven open-source and three proprietary LVLMs, including GPT-4o, Claude, and Gemini, show that VCP-Attack achieves state-of-the-art performance in black-box targeted attacks. Under a fixed perturbation budget ($\epsilon = 16/255$), our method achieves an average attack success rate (ASR) of 94.2% on open-source models and 83.1% on proprietary models, surpassing the strongest baselines by 23.3% and 16.8%, respectively. Notably, VCP-Attack achieves a 95.6% ASR on GPT-4o. Comprehensive ablation studies and visualizations further validate the effectiveness of the dynamic subspace projection and semantic contrastive supervision. While evaluated on image captioning, our approach is model-agnostic and exhibits strong potential for broader applications to black-box adversarial settings in vision-language tasks.
Paperid: 3803,   Poster  
Authors: Yuyuan Liu, Yiping Ji, Anjie Le, Jiayuan Zhu, Jiazhen Pan, Can Peng, Jiajun Deng, Fengbei Liu, Junde Wu
Title: From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
Abstract: Fine-tuning Large Vision-Language Models with reinforcement learning has emerged as a promising approach to enhance their capability in object-level grounding. However, existing methods, mainly based on GRPO, assign rewards at the response level. Such sparse rewards yield minimal learning signal when all candidate responses fail in challenging scenarios. In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. It begins with a sampled initial response and generates a set of revised candidates to explore improved grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate’s improvement over the initial attempt and converts it into informative shaping signals. These signals are used to both refine the reward and modulate the advantage, amplifying the influence of high-quality revisions. Our method achieves consistent gains across referring and reasoning segmentation, REC, and counting benchmarks compared with prior GRPO-based models. Code will be released.
Paperid: 3804,   Poster  
Authors: Jiale Shi, Jiarui Hu, Zesong Yang, Kaixuan Luan, Hujun Bao, Zhaopeng Cui
Title: GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance
Abstract: We introduce GaussianZoom, a generative zoom-in 3D reconstruction system with an iterative progressive framework that combines geometry-consistent scene modeling and multi-scale semantic reasoning to enable high-fidelity extreme zoom-in rendering from low-resolution inputs. To achieve this, we develop a novel multi-view consistent super-resolution module with depth-based feature warping and VLM-driven detail synthesis, ensuring accurate multi-view correspondence while enriching fine-scale appearance beyond the observed resolution. To support zooming across large magnification ranges, we further introduce a new expandable continuous Level-of-Detail hierarchy that dynamically modulates Gaussian visibility for smooth, alias-free cross-scale rendering. Experiments on Mip-NeRF360 and Tanks&Temples demonstrate that GaussianZoom achieves superior perceptual quality, multi-view consistency, and robustness under extreme magnification, establishing a strong baseline for generative zoom-in 3D scene reconstruction.
Paperid: 3805,   Poster  
Authors: Jianming Lv, Chengjun Wang, Liang Depin, Qianli Ma, Wei Chen, Xueqi Cheng
Title: MemFlow: A Lightweight Forward Memorizing Framework for Quick Domain Adaptive Feature Mapping
Abstract: Deploying pretrained visual models in real-world environments often suffers from significant performance degradation due to the diversity of testing scenarios. Continuous adaptation of learning models on edge devices via unlabeled data collected from the target domain is highly effective for boosting generalization capability. However, gradient-backpropagation-based optimization of the massive parameters in deep neural networks is vastly more time-consuming than forward inference, rendering online learning infeasible on low-power edge devices. To address this critical challenge, we propose a lightweight gradient-free forward-memorizing framework, namely MemFlow, which leverages a frozen backbone and enables efficient fine-tuning of the mapping between features and predictions. Specifically, MemFlow employs randomly connected neurons to memorize feature-label associations; within the network, spiking signals are propagated, and predictions are generated by associating neuron-stored memories according to their confidence levels. More notably, MemFlow supports reinforced memorization of feature mappings using unlabeled data, thereby enabling rapid adaptation to new domains. Extensive experiments on four real-world cross-domain datasets demonstrate that MemFlow achieves performance improvements of up to 10% while consuming less than 1% of the computational time required by traditional domain adaptation methods.
Paperid: 3806,   Poster  
Authors: Gengluo Li, Chengquan Zhang, Yupu Liang, Huawen Shen, Yaping Zhang, Pengyuan Lyu, Weinong Wang, Xingyu Wan, Gangyan Zeng, Han Hu, Can Ma, Yu Zhou
Title: MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition–Perception–Reasoning Guided Text-Image Machine Translation
Abstract: End-to-end text-image machine translation (TIMT), which directly translates textual content in images across languages, is crucial for real-world multilingual scene understanding. Despite advances in vision–language large models (VLLMs), robustness across diverse visual scenes and low-resource languages remains underexplored due to limited evaluation resources. We present MMTIT-Bench, a human-verified multilingual and multi-scenario benchmark with 1,400 images spanning fourteen non-English and non-Chinese languages and diverse settings such as documents, scenes, and web images, enabling rigorous assessment of end-to-end TIMT. Beyond benchmarking, we study how reasoning-oriented data design improves translation. Although recent VLLMs have begun to incorporate long Chain-of-Thought (CoT) reasoning, effective thinking paradigms for TIMT are still immature: existing designs either cascade parsing and translation in a sequential manner or focus on language-only reasoning, overlooking the visual cognition central to VLLMs. We propose Cognition–Perception–Reasoning for Translation (CPR-Trans), a data paradigm that integrates scene cognition, text perception, and translation reasoning within a unified reasoning process. Using a VLLM-driven data generation pipeline, CPR-Trans provides structured, interpretable supervision that aligns perception with reasoning. Experiments on 3B and 7B models show consistent gains in accuracy and interpretability. We will release MMTIT-Bench upon acceptance to promote multilingual and multi-scenario TIMT research.
Paperid: 3807,   Poster  
Authors: Yuanjun Tan, Aoran Xiao, Liqian Deng, Zhigang Tu
Title: DarkAct: A RGB-Thermal Dataset and Fusion Framework for Multimodal Low-Light Action Recognition
Abstract: Human action recognition (HAR) in low-light environments remains challenging due to degraded visibility, illumination variance, and loss of appearance cues. We introduce DarkAct, a large-scale and high-quality RGB–thermal video dataset purpose-built for multimodal action recognition under low illumination. DarkAct contains 12,778 paired RGB–thermal videos covering 27 human actions across diverse viewpoints and scenes, offering a novel and comprehensive benchmark for understanding human actions in dark environments. We conduct extensive experiments on DarkAct, systematically benchmarking unimodal HAR models, multimodal fusion frameworks, and vision-language foundation models. Their limited performance on DarkAct underscores the urgent need for more robust perception systems under adverse illumination. To address this, we propose DarkAct-Net, an RGB–thermal fusion framework that enhances human-centric representation and achieves adaptive cross-modal fusion, enabling robust and fine-grained action recognition across diverse lighting conditions. The dataset and code will be publicly released.
Paperid: 3808,   Poster  
Authors: Kailing Li, Tianwen Qian, Lijin Yang, Yuqian Fu, Jingyu Gong, Xiaoling Wang, Liang He
Title: Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation
Abstract: Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic–geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero-shot settings. To bridge this gap, we propose a Hierarchical Semantic–Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi-channel top-down map organized into three levels: (1) a geometric level that records navigable regions and obstacles, (2) a semantic level that represents objects and their relations, and (3) a decision level that supports high-level task reasoning and goal selection. During navigation, the VLM acts as a high-level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low-level, collision-free movements between waypoints are executed by a classical path-planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate progress forgetting and hallucination in long-horizon navigation. Extensive experiments on the R2R-CE and RxR-CE benchmarks demonstrate that our zero-shot framework achieves state-of-the-art performance and even outperforms several supervised methods.
Paperid: 3809,   Poster  
Authors: Sandro Papais, Lezhou Feng, Charles Cossette, Lingting Ge
Title: SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
Abstract: Vision Transformers (ViTs) enable strong multi-view 3D detection but are limited by high inference latency from dense token and query processing across multiple views and large 3D regions. Existing sparsity methods, designed mainly for 2D vision, prune or merge image tokens but do not extend to full-model sparsity or address 3D object queries. We introduce SToRe3D, a relevance-aligned sparsity framework that jointly selects 2D image tokens and 3D object queries while storing filtered features for reactivation. Mutual 2D–3D relevance heads allocate compute to driving-critical content and preserve other embeddings. Evaluated on nuScenes and our new nuScenes-Relevance benchmark, SToRe3D achieves up to 3x faster inference with marginal accuracy loss, establishing real-time large-scale ViT-based 3D detection while maintaining accuracy on planning-critical agents.
Paperid: 3810,   Poster  
Authors: Yuqiao He, Xiaoyan Liu, Jianxu Mao, Yaonan Wang, Hui Zhang, Lizhu Liu, Yurong Chen, Wenbin He
Title: SGDE: Self-supervised Geometry Degradation Estimation Framework for Coded Aperture Compressive Spectral Imaging
Abstract: Coded Aperture Snapshot Spectral Imaging (CASSI) has emerged as a prominent technique for efficient hyperspectral imaging. However, the strong coupling between physical encoding and computational decoding makes CASSI highly sensitive to minor hardware misalignments, which can significantly degrade reconstruction quality. Existing methods either assume ideal imaging conditions or rely on offline calibration, making them vulnerable to dynamic perturbations, such as thermal expansion and mechanical vibration, that cause mask shifts. To address these limitations, we propose a Self-Supervised Geometry Degradation Estimation (SGDE) framework that explicitly models mask misalignments as an affine transformation and embeds it into the imaging model. SGDE jointly estimates affine parameters and reconstructs the hyperspectral image in a self-supervised manner, eliminating the need for calibration targets or device-specific training data. Furthermore, we introduce a multi-kernel estimation strategy to enhance calibration robustness under large perturbations. Extensive experiments on both simulated and real-world datasets demonstrate that SGDE achieves superior robustness against geometric degradations. Moreover, the estimated affine parameters can be directly integrated into existing reconstruction algorithms, enabling plug-and-play calibration for practical CASSI systems.
Paperid: 3811,   Poster  
Authors: Yisu Zhang, Chenjie Cao, Tengfei Wang, Xuhui Zuo, Junta Wu, Jianke Zhu, Chunchao Guo
Title: WorldStereo: Bridging Controllable Video Generation and Scene Reconstruction via 3D Geometric Memories
Abstract: Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank. These components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, the flexible control branch-based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training. Extensive experiments across both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks (whether starting from perspective or panoramic images) with high-fidelity 3D results. Models will be released.
Paperid: 3812,   Poster  
Authors: Jiaqi Tan, Xu Zheng, Yang Liu
Title: Towards Robust Multi-Modal Semantic Segmentation with Teacher-Student Framework and Hybrid Prototype Distillation
Abstract: Multimodal semantic segmentation (MMSS) faces significant challenges in real-world applications due to incomplete, degraded, or missing sensor data. To address this, we propose RobustSeg, an efficient teacher-student framework that enhances model robustness under missing-modality conditions while maintaining strong performance in full-modality scenarios. RobustSeg adopts a feedback-based self-distillation paradigm consisting of two complementary stages. Firstly, we introduce Hybrid Prototype Distillation (HPD), which enables more reliable knowledge transfer of both cross-modal and modality-specific aspects. Concretely, combined with dominant-modality selection, HPD performs cross-modal semantic distillation with high-level semantic prototypes to reduce modality bias. Meanwhile, HPD conducts intra-class feature variation distillation for modality-specific structural details. Secondly, to enable the teacher model to gradually produce more balanced and robust modality representations, we make the student model provide feedback from the non-dominant modality to the teacher, benefiting the entire distillation process. Experiments on three datasets demonstrate that our method achieves state-of-the-art robustness (e.g., +2.40% missing-modality performance on DeLiVER) while causing almost no degradation in full-modality performance (only -0.1% mIoU). Moreover, evaluations using different backbones (AnySeg and CMNeXt) further validate the generalization ability of RobustSeg.
Paperid: 3813,   Poster  
Authors: Jianbin Zhao, Chaoran Feng, Miao Yu, Yingtao Li, Zhenyu Tang, Wangbo Yu, Yian Zhao, Xiaomin Li, Li Yuan, Yonghong Tian
Title: Style-GRPO: Semantic-Aware Preference Optimization for Image Style Transfer Guided by Reward Modeling
Abstract: Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we build SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation. All models and datasets will be released.
Paperid: 3814,   Poster  
Authors: Junbo Zhang, hangsu hangsu, Zhaofan Li, Hang Dong, Chao Sun
Title: Bootstrap Your Own AV-Proxies: Adaptive Contrastive and Prototype Learning for Audio-Visual Segmentation
Abstract: Audio-visual segmentation (AVS) aims to accurately segment sounding objects in video frames by leveraging audio-visual correspondence cues. However, it remains challenging due to the intrinsic semantic incompleteness within a single modality and the semantic gap between audio and visual representations. Traditional feature-fusion-based decoding approaches struggle to suppress fusion noise effectively, while recent methods that incorporate data-dependent priors often increase the complexity of modeling audio-visual correlations, leading to poor cross-domain generalization. To address these issues, we propose a novel adaptive contrastive and prototype learning framework, BYOAVP, for AVS. Specifically, we design a Self-Supervised Audio Enhancement (SSAE) module that introduces contrastive learning to adaptively align audio representations with gradient-blocked visual embeddings, thus narrowing the semantic gap between modalities. Furthermore, a Dynamic Prototype Constraint (DPC) module is developed to refine pixel-wise category perception via momentum-based prototype updating, while enhancing the localization of sounding regions through cross-modal interaction. Extensive experiments demonstrate that our method achieves state-of-the-art performance across two AVS benchmarks and six sub-tasks, exhibiting strong robustness and generalization ability.
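Note: the abstract above mentions momentum-based prototype updating without giving the rule. The following is a minimal sketch of a standard exponential-moving-average prototype update, assuming per-class mean features and a fixed momentum; the function name `update_prototypes` and its parameters are hypothetical and are not taken from the paper.

```python
def update_prototypes(prototypes, features, labels, momentum=0.9):
    """Return new class prototypes after one EMA step (illustrative only).

    prototypes: dict mapping class id -> feature vector (list of floats)
    features:   list of feature vectors observed in the current batch
    labels:     class id for each feature vector
    """
    sums, counts = {}, {}
    for feat, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(feat))
        for k, value in enumerate(feat):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1

    updated = {c: list(p) for c, p in prototypes.items()}
    for label, acc in sums.items():
        batch_mean = [value / counts[label] for value in acc]
        if label not in updated:
            # First time this class is seen: initialise with the batch mean.
            updated[label] = batch_mean
        else:
            # EMA: keep `momentum` of the old prototype, blend in the batch mean.
            updated[label] = [
                momentum * old + (1.0 - momentum) * new
                for old, new in zip(updated[label], batch_mean)
            ]
    return updated
```

A high momentum makes prototypes change slowly, which is the usual reason for choosing this kind of update when per-batch features are noisy.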
Paperid: 3815,   Poster  
Authors: Haoxiang Hu, Yaokun Li, Zeyuan Huang, Cangjun Gao, Qiang He, Qingkun Li, Xiaoming Deng, Cuixia Ma, Yu-Kun Lai, Yong-Jin Liu, Hongan Wang
Title: Diagram2Structure: Unlocking LLMs' Diagram Comprehension through DiagramDiff, an Offline Diagram Structuring Framework
Abstract: Diagrams are widely used in daily life. However, offline diagrams typically exist in the form of images, lacking structured data representation, which significantly limits their reusability and editability. Current research mainly focuses on supporting basic query tasks for online diagrams and does not meet the semantic understanding and interaction requirements for complex offline diagrams. Although large language models (LLMs) possess powerful reasoning and knowledge integration capabilities, their performance in processing offline diagrams is unsatisfactory due to the inability to accurately understand the structure and content of offline diagrams. To address these issues, we propose DiagramDiff, a framework consisting of a high-precision diagram reconstruction model and an instance-level diagram element recognition model. The framework converts offline diagrams into standardized data structures, enabling LLMs to transition from being unable to understand offline diagrams to becoming intelligent assistants capable of performing tasks such as semantic reasoning, logical validation, and efficient diagram editing. We have constructed a dataset containing diagrams and their corresponding question-answering (Q&A) and editing tasks. Experiments demonstrate that DiagramDiff achieves state-of-the-art performance in diagram reconstruction and recognition tasks, significantly enhancing LLMs' understanding and interaction capabilities with offline diagrams.
Paperid: 3816,   Poster  
Authors: Xiaoyu Kong, Ketong Ren, Dongyu She, Weiming Dong, Miao Wang
Title: Image Guides Images: Consistent Video Amodal Completion with Rectified In-Context Exemplar Guidance
Abstract: Video amodal completion (VAC) aims to mimic the human brain's ability to implicitly perceive the complete appearance of partially occluded objects, thereby facilitating recognition and understanding. Existing VAC methods fine-tune video generation models on custom datasets, yet these datasets often have unrealistic distributions and small scales due to the challenges of collecting real amodal data, which limits their performance and generalization. To address this, we utilize pretrained image inpainting models for VAC and introduce in-context (IC) learning to enhance inter-frame consistency. However, despite the satisfactory performance of DiT-based IC learning in generation tasks, task-agnostic global information often utilizes irrelevant scene information, resulting in completion failures when applied to the amodal completion task. Additionally, IC learning faces a cold-start problem with the exemplar construction. To this end, we propose a consistent video amodal completion method with rectified in-context exemplar guidance. Specifically, we introduce rectified exemplar-guided completion, which adjusts the attention weights of the exemplar image relative to the target images for consistent completion, and adopt a dual-frame calibrated exemplar rectification to tackle the cold-start issue. Quantitative and qualitative experiments demonstrate that our method outperforms state-of-the-art methods, especially in terms of generalization and robustness on uncommon data and under severe occlusion.
Paperid: 3817,   Poster  
Authors: Mostofa Uddin Uddin, HM Shadman Tabib, Thanh-Huy Nguyen, Kashish Gandhi, Min Xu
Title: Unsupervised Multi-Scale Segmentation of 3D Subcellular World with Stable Diffusion Foundation Model
Abstract: We introduce an unsupervised approach for segmenting multi-scale subcellular objects in 3D volumetric cryo-electron tomography (cryo-ET) images. To this end, we address key challenges such as lack of annotated data, large data volumes, high heterogeneity of subcellular shapes and sizes, and high inter-domain variability of cellular cryo-ET images across different experiments and contexts. Our method requires users to only select a small number of slabs from a few representative tomograms in the dataset. The core of our method is extracting features for the corresponding slabs, leveraging a Stable Diffusion foundation model pretrained on mostly natural images. The feature extraction is followed by a novel heuristic-based feature aggregation strategy, and adaptive thresholding to segment the aggregated features. The resulting masks are refined with pretrained CellPose to split composite regions, and then utilized as pseudo-ground truth for training supervised deep learning models. We validated our unsupervised foundation-model based pipeline on publicly available cryo-ET benchmark datasets, demonstrating performance that closely approximates expert human annotations. This fully automated, data-driven framework enables the mining of multi-scale subcellular patterns, paving the way for accelerated biological discoveries from large-scale cellular cryo-ET datasets.
Paperid: 3818,   Poster  
Authors: Yiwei Wei, Zhengliang Guo, Shaozu Yuan, Chengyin Hu, Zhiyang Jia, Jiujiang Guo, Meng Chen, Peiying Wang, Longbiao Wang
Title: Tackling Model Bias via Game-theoretic Multi-agent Collaboration Framework for Hateful Meme Classification
Abstract: Hateful meme classification aims to identify memes containing hateful content and has become increasingly important in the era of social media dominance. Large multimodal models (LMMs) have significantly enhanced the understanding of multimodal content, advancing this field. However, cognitive biases in LMMs can impede effective collaboration among models. To address this issue, we introduce GECO, a Game-theoretic multi-agEnt Collaboration framewOrk that organizes multiple LMMs into interacting agents and employs game-theoretic principles to guide them toward an optimal cooperative equilibrium. GECO integrates a mixed bonus scheme, incorporating both individual accuracy and cross-model agreement to promote convergence toward a consistent cooperative solution. In addition, we implement efficient policy learning and introduce a penalty coefficient to optimize the framework effectively and ensure training stability. Extensive experiments on five publicly available benchmarks demonstrate that our framework achieves new state-of-the-art performance.
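Note: the abstract above describes a mixed bonus that combines individual accuracy with cross-model agreement, but does not give the formula. The sketch below shows one possible reading, assuming a convex combination weighted by a hypothetical parameter `alpha` and agreement measured as the fraction of other agents giving the same prediction. The function `mixed_bonus` is illustrative and is not the authors' definition; the penalty coefficient mentioned in the abstract is not modelled.

```python
def mixed_bonus(predictions, label, alpha=0.5):
    """Per-agent bonus = alpha * accuracy + (1 - alpha) * agreement (illustrative).

    predictions: one predicted class per agent
    label:       ground-truth class for this meme
    alpha:       assumed trade-off between accuracy and agreement
    """
    bonuses = []
    for i, pred in enumerate(predictions):
        accuracy = 1.0 if pred == label else 0.0
        others = [p for j, p in enumerate(predictions) if j != i]
        # Fraction of the other agents that made the same prediction.
        agreement = sum(1 for p in others if p == pred) / len(others) if others else 0.0
        bonuses.append(alpha * accuracy + (1.0 - alpha) * agreement)
    return bonuses
```

Under this reading, an agent that is correct but isolated still earns the accuracy term, while agents that agree on a wrong answer earn only the agreement term.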
Paperid: 3819,   Poster  
Authors: He Zhu, Xiaotong Huang, Zihan Liu, Weikai Lin, Xiaohong Liu, Zhezhi He, Jingwen Leng, Minyi Guo, Yu Feng
Title: SeeLe: A Unified Acceleration Framework for Real-Time Gaussian Splatting on Mobile Devices
Abstract: 3D Gaussian Splatting (3DGS) has become a crucial rendering technique for many real-time applications. However, the limited hardware resources on today's mobile platforms hinder these applications, as they struggle to achieve real-time performance. In this paper, we propose SeeLe, a general framework designed to accelerate the 3DGS pipeline for resource-constrained mobile devices. Specifically, we propose two GPU-oriented techniques: hybrid preprocessing and contribution-aware rasterization. Hybrid preprocessing alleviates the GPU compute and memory pressure by reducing the number of irrelevant Gaussians during rendering. The key is to combine our view-dependent scene representation with online filtering. Meanwhile, contribution-aware rasterization improves the GPU utilization at the rasterization stage by prioritizing Gaussians with high contributions while reducing computations for those with low contributions. Both techniques can be seamlessly integrated into existing 3DGS pipelines with minimal fine-tuning. Collectively, our framework achieves up to 6.3× speedup and 39.1% model reduction while delivering superior rendering quality compared to existing methods. Our code will be released upon publication.
Paperid: 3820,   Poster  
Authors: Haotian Dong, Ye Li, Rongwei Lu, Chen Tang, Shu-Tao Xia, Zhi Wang
Title: VVS: Accelerating Speculative Decoding for Visual Autoregressive Model via Partial Verification Skipping
Abstract: Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has been proven effective for accelerating visual AR models, its "draft one step, then verify one step" paradigm prevents a direct reduction of the forward passes, thus restricting acceleration potential. Motivated by visual token interchangeability, we explore, for the first time, verification skipping in the SD process of visual AR model generation to explicitly cut the number of target model forward passes, thereby reducing inference latency. Based on an analysis of the drafting stage’s characteristics, we observe that verification redundancy and stale feature reusability are key factors in retaining generation quality and speedup for verification-free steps. Inspired by these two observations, we propose a novel SD framework, VVS, that accelerates visual AR models via partial verification skipping and integrates three complementary modules: (1) a verification-free token selector with dynamic truncation, (2) token-level feature caching and reuse, and (3) fine-grained skipped step scheduling. Consequently, VVS reduces the number of target model forward passes by a factor of 2.8× relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed–quality trade-off over conventional SD frameworks and revealing strong potential to reshape the SD paradigm. Our code will be publicly available upon acceptance of this paper.
Paperid: 3821,   Poster  
Authors: Xin Niu, Manqi Zhao, Dongsheng Jiang, Yingying Wu, Bing Su
Title: ReAttnCLIP: Training-Free Open-Vocabulary Remote Sensing Image Segmentation via Re-defined Attention in CLIP
Abstract: Remote sensing image segmentation is critical for a range of applications, including natural disaster monitoring and precision agriculture. Open-vocabulary segmentation enhances flexibility by removing fixed category constraints, enabling more fine-grained and adaptive scene understanding. Unlike CLIP’s original pretraining objective, which emphasizes global image-text alignment, segmentation tasks require accurate and discriminative patch-level representations to support precise pixel-wise predictions. As a result, the quality of attention maps, particularly those generated in the final transformer layers, plays a pivotal role in guiding inter-region interactions. However, current methods generate suboptimal representations when capturing the complex spatial hierarchies in remote sensing. We address this gap by optimizing CLIP's 197×197 attention matrix through three key modifications: (1) substituting the 196×196 patch-to-patch submatrix with intermediate-layer feature similarities to preserve spatial structures; (2) prioritizing intermediate-layer attention for global-to-local (class-to-patch) token alignment to reduce classification interference; (3) disabling the [CLS] token's self-attention to mitigate bias. Extensive experiments on eight remote sensing benchmarks and two building/road extraction datasets demonstrate that our method achieves state-of-the-art performance among existing training-free approaches.
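Note: the three modifications listed in the abstract above can be pictured on a small attention matrix. The sketch below is one possible interpretation, not the authors' implementation: it assumes token 0 is [CLS], uses softmax-normalised cosine similarity for the patch-to-patch block, copies the class-to-patch row from an intermediate layer, zeroes the [CLS] self-attention entry, and renormalises each row. The function `redefine_attention`, the `temperature` parameter, and the renormalisation step are all assumptions.

```python
import math


def _softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def redefine_attention(final_attn, mid_attn, mid_patch_feats, temperature=0.1):
    """Build a modified (N+1)x(N+1) attention matrix; index 0 is [CLS].

    final_attn:      final-layer attention, list of rows
    mid_attn:        intermediate-layer attention, same shape
    mid_patch_feats: N intermediate-layer patch feature vectors
    """
    n = len(mid_patch_feats)
    out = [list(row) for row in final_attn]

    # (1) Patch-to-patch block from intermediate-layer feature similarity.
    for i in range(n):
        sims = [
            _cosine(mid_patch_feats[i], mid_patch_feats[j]) / temperature
            for j in range(n)
        ]
        out[i + 1][1:] = _softmax(sims)

    # (2) Class-to-patch row taken from the intermediate layer.
    out[0][1:] = list(mid_attn[0][1:])

    # (3) Disable [CLS] self-attention.
    out[0][0] = 0.0

    # Assumed post-processing: renormalise so every row sums to 1.
    for row in out:
        total = sum(row)
        if total > 0:
            for k in range(len(row)):
                row[k] /= total
    return out
```

For CLIP ViT-B/16 at 224×224 input, N would be 196, giving the 197×197 matrix mentioned in the abstract; the sketch works for any N.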
Paperid: 3822,   Poster  
Authors: Shihao Hou, Chikai Shang, Zhiheng Yang, Jiacheng Yang, Xinyi Shang, Junlong Gao, Yiqun Zhang, Yang Lu
Title: Fine-Tuning Impairs the Balancedness of Foundation Models in Long-tailed Personalized Federated Learning
Abstract: Personalized federated learning (PFL) with foundation models has emerged as a promising paradigm enabling clients to adapt to heterogeneous data distributions. However, real-world scenarios often face the co-occurrence of non-IID data and long-tailed class distributions, presenting unique challenges that remain underexplored in PFL. In this paper, we investigate this long-tailed personalized federated learning and observe that current methods suffer from two limitations: (i) Fine-tuning degrades performance below zero-shot baselines due to the erosion of inherent class balance in foundation models; (ii) Conventional personalization techniques further transfer this bias to local models through parameter or feature-level fusion. To address these challenges, we propose Federated Learning via Gradient Purification and Residual Learning (FedPuReL), which preserves balanced knowledge in the global model while enabling unbiased personalization. Specifically, we purify local gradients using zero-shot predictions to maintain a class-balanced global model, and model personalization as residual corrections atop the frozen global model. Extensive experiments demonstrate that FedPuReL consistently outperforms state-of-the-art methods, achieving superior performance on both global and personalized models across diverse long-tailed scenarios. The code is available in the supplementary materials.
Paperid: 3823,   Poster  
Authors: Guohui Zhang, Fuming Sun, Yu Zhao, Yuqiu Kong, Jing Sun, Ganggang Huang
Title: Seeing Both Sides: Towards Bidirectional Semantic Alignment for Open-Vocabulary Camouflaged Object Segmentation
Abstract: Open-Vocabulary Camouflaged Object Segmentation (OVCOS) aims to precisely segment camouflaged objects from unseen categories under textual guidance. However, existing methods often employ a unidirectional interaction strategy, where textual prompts guide the matching of visual features. Such a design neglects the bidirectional interaction between visual and language modalities, making the model vulnerable to the semantic gap between image-level textual semantics and pixel-level segmentation cues, which in turn leads to severe semantic confusion in complex camouflaged scenarios. To address this challenge, we propose BaCLIP, a novel bidirectional semantic alignment framework for OVCOS. At its core lies the Mutual Refinement and Enhancement Module (MREM), which establishes bidirectional cross-attention between visual and textual features, enabling mutual semantic calibration to resolve ambiguity and strengthen cross-modal alignment. Moreover, we introduce an Adaptive Prompt that transforms refined textual embeddings into semantic-aware prompts for the Segment Anything Model (SAM), enabling direct textual guidance and improving mask precision. Experimental results on the OVCamo benchmark demonstrate that BaCLIP consistently achieves state-of-the-art performance with a compact architecture, effectively mitigating semantic confusion and advancing the understanding of cross-modal camouflage perception.
Paperid: 3824,   Poster  
Authors: Tao Xie, Tao An, Feng Liu, Jin Wensheng, Zhengyu Li, Lijun Zhao, Ruifeng Li
Title: H$^{2}$A$^{2}$: Homogeneity-Aware and Heterogeneity-Aware Feature Perception for Unified Indoor 3D Object Detection
Abstract: In this work, we observe that for indoor 3D object detection, fundamental geometric cues induce homogeneous spatial responses across scenes, whereas scene-specific structure yields heterogeneous signatures. However, existing detectors lack effective mechanisms to jointly extract and exploit such dual properties, which imposes inherent limitations on detection performance. Guided by this insight, we propose H$^{2}$A$^{2}$, a homogeneity-aware and heterogeneity-aware feature perception network for unified indoor 3D object detection under cross-scene training paradigms. Technically, we introduce a structural-feature-aware kernel selection (SF-KS) method, which encompasses three core components: (i) task-aware linear modulation, a channel-wise affine transformation that strengthens scene-structural feature representation; (ii) a kernel weight selection strategy that integrates an offset validity prior to suppress non-informative cross-scene transfer while utilizing a structural consistency posterior to capture scene-homogeneous cues; and (iii) task-aware channel gating that suppresses scene-irrelevant feature responses. Overall, SF-KS enables the precise optimization of homogeneous features while specializing in scene-specific heterogeneous ones. In addition, to stabilize cross-scene optimization, we further introduce a norm-based gradient homogenization (NGH) algorithm, which normalizes and dynamically reweights per-task gradient norms to mitigate conflicts and promote consistent updates. Extensive experiments on diverse indoor benchmarks show that H$^{2}$A$^{2}$ delivers consistent gains over strong baselines and improves cross-scene generalization.
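Note: the abstract above says NGH normalizes and reweights per-task gradient norms, without stating the rule. The sketch below shows one simple way such a scheme could work, assuming each task gradient is rescaled to the mean norm across tasks and then combined with optional per-task weights. The function `homogenize_gradients` and its `weights` argument are hypothetical; how the paper chooses its dynamic weights is not specified here.

```python
import math


def homogenize_gradients(task_grads, weights=None, eps=1e-12):
    """Rescale each task gradient to a shared norm, then sum (illustrative).

    task_grads: list of flat gradient vectors, one per task
    weights:    optional per-task weights (assumed to be supplied externally)
    """
    norms = [math.sqrt(sum(g * g for g in grad)) for grad in task_grads]
    target = sum(norms) / len(norms)  # shared norm: mean over tasks
    if weights is None:
        weights = [1.0] * len(task_grads)

    combined = [0.0] * len(task_grads[0])
    for grad, norm, weight in zip(task_grads, norms, weights):
        scale = weight * target / (norm + eps)
        for k, g in enumerate(grad):
            combined[k] += scale * g
    return combined
```

Rescaling to a common norm keeps a task with very large gradients from dominating the update direction, which is the conflict this kind of scheme aims to reduce.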
Paperid: 3825,   Poster  
Authors: Yidan Wang, Zongheng Wang, Hong-Jie Xing, Chun-Guo Li, Xiaoxiao Liu
Title: Multi-Metric Representation Learning Strategy Based on Clustering for Fine-Grained Multimodal Sentiment Analysis
Abstract: Multimodal sentiment analysis (MSA) aims to identify human emotions through multimodal data. Despite considerable advances in MSA, we find that emotional class centers often overlap when integrating data from different modalities into the same representation space. In this paper, we propose a novel Multi-Metric Representation learning strategy based on clustering (MMRest) to alleviate this issue through flexible multi-metric representation learning, enabling the model to learn fine-grained sentiments. Specifically, we first design a module termed Multi-metric Multimodal learning on Clusters (MMC), which minimizes distances within similar sentiment pairs while maximizing distances between dissimilar ones, aiming to learn a global metric and local metrics in each cluster from multimodal data. Afterwards, we develop a Projection and Decision-Level Fusion (PDLF) module with two parts. One part utilizes the optimal global and local metrics to obtain a projection value. The other part combines the projection value with an intermediate score, obtained through the fusion of unimodal and multimodal representations, to produce the final sentiment prediction score. Extensive experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate that our method is significantly superior to state-of-the-art methods on various evaluation indicators and parameter count, by effectively learning fine-grained emotional boundaries. The code will be made open-source if the paper is accepted.
Paperid: 3826,   Poster  
Authors: Huafeng Chen, Chenguang Zhu, Yueming Lyu, Caifeng Shan
Title: Beyond Weak Supervision: MLLMs-Guided Graded Knowledge Distillation for Unsupervised Camouflaged Object Detection
Abstract: Most Camouflaged Object Detection (COD) methods rely on costly pixel-level annotations. Recent studies have adopted unsupervised COD (UCOD) to eliminate labeling costs, but still suffer from two issues: 1) insufficient supervision, leading to reliance on self-supervised backbone DINO and reduced model flexibility; and 2) ineffective use of pseudo-labels, which widens the performance gap with supervised methods and limits real-world applicability. In this paper, we propose a novel teacher-student framework for UCOD to address these two issues. To tackle the lack of supervision, we build a powerful teacher model by integrating Multimodal Large Language Models (MLLMs) and the Segment Anything Model (SAM) to generate high-quality pseudo-labels. However, the teacher model faces two challenges: 1) suboptimal performance of MLLMs in COD, and 2) cascading errors. To address these challenges, we first propose a Camouflaged-Aware Chain-of-Thought (CA-CoT) for MLLMs. CA-CoT guides MLLMs through step-by-step reasoning to simulate human perceptual processes, thereby enhancing their performance in COD. Subsequently, we design a Graded Mask Evaluator (GME) to mitigate cascading errors, which evaluates and grades the quality of masks generated by SAM, and then filters out the low-quality masks to provide more reliable supervision. To better leverage pseudo-labels, we propose Graded Knowledge Distillation (GKD), which adaptively enhances distillation at both image and pixel levels based on pseudo-label quality. Extensive experiments show that our method outperforms existing UCOD approaches by a large margin and achieves performance comparable to weakly supervised methods. Notably, our method also achieves good performance under zero-shot settings.
Paperid: 3827,   Poster  
Authors: Stephen Price, Danielle Cote, Elke Rundensteiner
Title: PromptMoE: A Segmentation Refinement Framework Leveraging Mixture of Experts for Improved Prompting
Abstract: High-quality segmentations are critical in vision tasks where boundary accuracy is important (e.g., medical diagnostics and quality control). Recently, promptable vision models have emerged as effective backbones for segmentation refinement frameworks. However, their performance hinges not only on prompt quality; they must also overcome noisy input masks and semantically ambiguous outputs from promptable models. Existing prompt-based refiners rely on fixed prompt rules, making them brittle to changing failure modes and new tasks or domains. We propose PromptMoE, a model-agnostic MoE-driven prompting refiner effective in segmentation refinement across tasks and domains. PromptMoE features three collaborative modules to refine an initial mask: our MoE-based Image-Informed Prompting framework (IIP) takes an image and coarse mask and produces a set of expert score maps to guide prompt generation, the Dynamic Expert Selector (DES) activates only the most relevant experts and fuses their maps to avoid dense evaluation and signal dilution, and the Prompt-Placement Explorer (PPE) explores the fused guidance map to place high-confidence spatially diverse point prompts. Across five benchmark datasets (BIG, VOC, DAVIS585, ECSSD, MSRA-B), PromptMoE achieves statistically significant gains over SOTA methods CascadePSP, SegRefiner, and SAMRefiner on semantic, instance, and salient tasks, with mean improvements of +6.24 IoU / +8.99 BIoU.
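Editor's sketch (not from the paper): the gains above are reported in IoU, the standard intersection-over-union between a predicted mask and the ground truth. The snippet below shows how that score is computed on toy binary masks. The helper name `iou` and the masks are illustrative assumptions, and the boundary variant (BIoU) is not shown.

```python
def iou(pred, target):
    """Intersection-over-union of two equally sized binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0


# Toy ground truth, a coarse input mask, and a hypothetical refined mask.
gt = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
coarse = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
refined = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]

coarse_iou = iou(coarse, gt)
refined_iou = iou(refined, gt)
```

A refinement stage is judged by how much `refined_iou` exceeds `coarse_iou` when averaged over a dataset.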
Paperid: 3828,   Poster  
Authors: Sungyong Park, Sooyoung Choi, Hyunseo Koh, Youngjae Choi, Heewon Kim
Title: CLP: A Real-World Dataset of Contaminated Lens Protectors for Robust Semantic Segmentation
Abstract: The reliability of autonomous systems in real-world environments is mainly dependent on the robustness of their visual perception. Although recent studies have advanced the handling of visual degradations, physical contaminants that adhere to the camera lens, such as mud, water droplets, and condensation, remain largely underexplored. To this end, we introduce the CLP (Contaminated Lens Protector) dataset, a real-world benchmark designed to evaluate perception performance under realistic lens-protector contamination. The CLP dataset offers degraded images across multiple types of contamination and various lens-to-protector distances, along with dense semantic segmentation masks and aligned restoration targets. This dataset enables robust segmentation and restoration studies in conditions that closely match those encountered by real-world autonomous systems. Experiments analyze strategies to improve perception under contamination with limited data, highlighting the importance of domain generalization, foundation models, data scale, and joint restoration-segmentation pipelines.
Paperid: 3829,   Poster  
Authors: Idan Yankelev, Edita Grolman, Yarin Levi, Amit Giloni, Omer Hofman, Toshiya Shimizu, Yuval Elovici, Asaf Shabtai
Title: AntiStyler: Defending Object Detection Models Against Adversarial Patch Attacks Using Style Removal
Abstract: Adversarial patch attacks pose a significant threat to the reliability of object detection (OD) models, particularly in real-time security applications. Although several defenses have been proposed, they often suffer from two limitations: 1) reduced performance on benign images, and 2) impractical processing time for real-time OD applications. In this paper, we present AntiStyler, a novel and rapid defense against adversarial patches. Given an input image, AntiStyler identifies and masks pixels that exhibit a "random" style associated with adversarial attacks and uses a series of spatial filters to enhance the mask and remove unwanted noise, efficiently masking adversarial patches. AntiStyler features model-, patch-, and attack-agnostic capabilities and does not require any training, making it a fully agnostic zero-shot defense against adversarial patch attacks. Our evaluation on the COCO, INRIA, Superstore, and APRICOT datasets, with both digital and physical attacks, demonstrates AntiStyler's state-of-the-art robustness (improving adversarial performance by 8-15 mAP%) without compromising the original performance on benign images. Additionally, unlike most existing defenses, AntiStyler can process 10-12 frames per second (FPS), making it efficient and relevant for real-time OD applications.
Paperid: 3830,   Poster  
Authors: Peng Ren, Cheng Jiang, Chuande Yang, Fuming Sun, Tian Bai
Title: Training-Free Open-Vocabulary Camouflaged Object Segmentation via Fine-Grained Object Binding and Adaptive Hybrid Prompt
Abstract: Vision-language models (e.g., CLIP) facilitate the development of open-vocabulary camouflaged object segmentation (OVCOS), but existing methods still rely on mask annotations for fully-supervised training. In contrast, the training-free paradigm can rapidly process unseen data, representing a highly promising solution. However, in camouflage scenarios, existing training-free methods utilize sparse textual prompts and ignore the category similarity between visual patches, leading to inadequate object binding capability. To alleviate these issues, we propose a fine-grained object binding and adaptive hybrid prompt framework for training-free OVCOS. The framework first employs multimodal large language models (MLLMs) to explicitly model fine-grained textual descriptions of camouflaged objects and background. Building on this, we construct a semantic probe to decouple object and background features and explicitly model category similarity between visual patches via semantic consistency ranking, thereby achieving accurate object binding. Subsequently, we propose an entropy-guided text embedding adjustment strategy to adjust textual embeddings, aiming to further enhance fine-grained object binding. Finally, we utilize an adaptive hybrid prompt generation strategy to generate hybrid prompts, assisting SAM in accurately segmenting camouflaged objects. Experimental results on the OVCamo benchmark demonstrate that our method achieves excellent performance, significantly surpassing the advanced training-free ResCLIP.
Paperid: 3831,   Poster  
Authors: Qiang Hu, Jiajie Wei, Zhenyu Yi, Zhifen Yan, Yingjie Guo, Hongkuan Shi, Ge-Peng Ji, Qiang Li, Zhiwei Wang
Title: SAMIX: Reinforcing SAM2 with Semantic Adapter and Reference Selecting Policy for Mix-Supervised Segmentation
Abstract: Mix-supervised image segmentation aims to effectively leverage heterogeneous annotations. Recent prompt-based advances utilize foundation models such as Segment Anything Model (SAM) to generate pseudo-masks by treating weak labels as spatial prompts. However, these methods rely heavily on sparse spatial priors, leading to suboptimal performance in ambiguous regions and overlooking the potential of unlabeled data due to the absence of promptable cues. In this paper, we propose SAMIX, a novel framework that adapts SAM2 into a semantic-aware pseudo-label generator SA-SAM2 by incorporating a lightweight semantic adapter. Beyond being guided by sparse spatial prompts, SA-SAM2 facilitates dense contextual prompts provided by valuable image–mask reference pairs with shared semantics. This design allows SAMIX to produce high-quality pseudo-masks even for ambiguous objects with sparse or no annotations. Another core component of SAMIX is the Selecting Policy Network (SPNet), which auto-regressively retrieves relevant and complementary reference samples for each query image. Unlike rule-based selections, SPNet is trained via reinforcement learning to actively explore reference combinations that maximize pseudo-label quality. Guided by customized and verifiable rewards associated with mask quality, the selection is steered toward semantically informative and diverse contexts. We conduct extensive experiments on two general datasets (PASCAL VOC 2012 and Cityscapes) and two challenging specific datasets with ambiguous boundaries (camouflaged object detection and image polyp segmentation). Across diverse mix-supervision settings, SAMIX consistently achieves state-of-the-art performance, effectively leveraging both weakly labeled and unlabeled data. Codes will be released upon publication.
Paperid: 3832,   Poster  
Authors: Hezhen Hu, Wangbo Zhao, Lanqing Guo, Hanwen Jiang, Jonathan Liu, Zhiwen Fan, Kai Wang, Zhangyang Wang, Georgios Pavlakos
Title: HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image
Abstract: In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-quality 3D human data. To address this, we build a scalable data generation pipeline that follows two strategies. The first one is to leverage existing rigged assets and animate them with extensive poses from daily life. The second strategy is to utilize existing multi-camera captures of humans and employ fitting to generate more diverse views for training. These two strategies enable us to scale up to 100k assets, significantly enhancing both the quantity and the diversity of data for robust model training. In terms of the architecture, HumanNOVA adopts a feed-forward, token-conditioned avatar modeling framework that allows fast inference in less than one second and requires no test-time optimization. Given an input image and an estimated simplified human mesh (SMPL) without detailed geometry or appearance, the model first encodes both inputs into compact token representations. These tokens then act as conditioning signals and are fused through cross-attention to construct a triplane-based 3D avatar representation. Extensive experiments on multiple benchmarks demonstrate the superiority of our approach, both quantitatively and qualitatively, as well as its robustness under diverse input image conditions.
Paperid: 3833,   Poster  
Authors: Yanran Zhang, Wenzhao Zheng, Yifei Li, Bingyao Yu, Yu Zheng, Lei Chen, Jiwen Lu, Jie Zhou
Title: UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
Abstract: In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved distinct architectural paradigms: the former predominantly relies on generative networks, while the latter favors discriminative frameworks. A recent trend in both domains is the use of adversarial information to enhance performance, revealing potential for synergy. However, the significant architectural divergence between them presents considerable challenges. Departing from previous approaches, we propose UniGenDet: a Unified generative-discriminative framework for co-evolutionary image Generation and generated image Detection. To bridge the task gap, we design a symbiotic multimodal self-attention mechanism and a unified fine-tuning algorithm. This synergy allows the generation task to improve the interpretability of authenticity identification, while authenticity criteria guide the creation of higher-fidelity images. Furthermore, we introduce a detector-informed generative alignment mechanism to facilitate seamless information exchange. Extensive experiments on multiple datasets demonstrate that our method achieves state-of-the-art performance.
Paperid: 3834,   Poster  
Authors: Ke Fan, Jiangning Zhang, Ran Yi, Jingyu Gong, Yabiao Wang, yating wang, Xin Tan, Chengjie Wang, Lizhuang Ma
Title: Open the Motion Door: Atomic Motion Decomposition and Recomposition for Open-Vocabulary Motion Generation
Abstract: Text-to-motion generation is a fundamental task in computer vision, aiming to synthesize 3D human motion sequences from natural language descriptions. However, due to the limited scale and diversity of existing datasets, models trained to directly map raw text to motion often struggle to generalize to out-of-domain textual inputs. We observe that although high-level motion semantics vary widely, many motions share a common set of underlying atomic motions, that is, simple, reusable body-part movements. Building on this insight, we introduce an Atomic Motion Decomposition and Recomposition framework for open-vocabulary text-to-motion generation. Our approach consists of two key components: a Textual Decomposition module that parses out-of-domain descriptions into atomic motion units, and an Atomic Recomposition module that integrates these units to produce the final motion sequence. Our model achieves competitive performance on the in-domain HumanML3D dataset, and extensive experiments on two out-of-domain datasets (IDEA400 and Mixamo) demonstrate that our method substantially outperforms state-of-the-art approaches in open-vocabulary motion generation.
Paperid: 3835,   Poster  
Authors: Zixuan Duan, Zeyu Zhang, Fengyuan Lu, Shaofeng Zhang, Wenbin Li, Qi Fan, Yang Gao
Title: SAME: Sparse and Anchored Model Editing for Heterogeneous Incremental Learning under Limited Data
Abstract: Existing Incremental Learning (IL) methods are primarily evaluated under either a single-domain class-incremental setting, or a multi-domain task-incremental setting with known task identifiers. However, these assumptions often fail to hold in real-world applications. To bridge this gap, we introduce Heterogeneous Incremental Learning (HIL), a new setting for evaluating IL methods under realistic and challenging conditions, where task boundaries are ambiguous or unknown, class distributions shift dynamically across environments, and training data is limited. Model editing is inherently well-suited for this challenging HIL, as it allows for the efficient integration of new knowledge while preserving model capabilities. Thus, we propose a novel Sparse and Anchored Model Editing (SAME) for addressing HIL. Specifically, SAME sparsely and selectively updates task-relevant model parameters to extract compact, task-specific key–value knowledge pairs from limited data. Using these task knowledge pairs, the model performs knowledge injection for new tasks under double-anchor constraints. The knowledge anchor aligns the updated and original model features, while the parameter anchor constrains parameter magnitudes, ensuring stable and consistent knowledge injection. Our method can efficiently solve HIL using only a few labeled examples, without introducing additional model parameters. Extensive experiments on 11 diverse visual-language datasets across 22 sequential tasks show that our method outperforms existing continual learning approaches by 6.8% in average accuracy, while retaining 95.8% of the oracle model performance, demonstrating strong stability and cross-domain generalization.
Paperid: 3836,   Poster  
Authors: Thanh-Dat Truong, Huu-Thien Tran, Jackson Cothren, Bhiksha Raj, Khoa Luu
Title: $\phi$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models
Abstract: Fairness in Continual Learning for Large Multimodal Models (LMMs) is an emerging yet underexplored challenge, particularly in the presence of imbalanced data distributions that can lead to biased model updates and suboptimal performance across tasks. While recent continual learning studies have made progress in addressing catastrophic forgetting, the fairness problem caused by imbalanced data remains largely underexplored. This paper presents a novel Fairness Direct Preference Optimization (FaiDPO or $\phi$-DPO) framework for continual learning in LMMs. In particular, we first propose a new continual learning paradigm based on Direct Preference Optimization (DPO) to mitigate catastrophic forgetting by aligning learning with pairwise preference signals. Then, we identify the limitations of conventional DPO in imbalanced data and present a new $\phi$-DPO loss that explicitly addresses distributional biases. We provide a comprehensive theoretical analysis demonstrating that our approach addresses both forgetting and data imbalance. Additionally, to enable $\phi$-DPO-based continual learning, we construct pairwise preference annotations for existing benchmarks in the context of continual learning. Extensive experiments and ablation studies show the proposed $\phi$-DPO achieves state-of-the-art performance across multiple benchmarks, outperforming prior continual learning methods for LMMs.
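Editor's sketch (not from the paper): for orientation, the snippet below computes the conventional DPO loss for a single preference pair, which is the baseline the abstract says it modifies. It is not the $\phi$-DPO loss, whose form the abstract does not give. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Conventional DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * [(logp_c - ref_c) - (logp_r - ref_r)]))
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals softplus(-m); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)

# Policy raises the chosen response and lowers the rejected one: loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Because this loss is averaged uniformly over pairs, frequent classes or tasks contribute most of the gradient, which is the kind of imbalance the abstract targets.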
Paperid: 3837,   Poster  
Authors: Hongyi Cai, HONGYI CAI, MingKang Dong, Muxin Pu, Moayad Aloqaily, jie li, Xinfeng Li, Jialie Shen, Meikang Qiu, Qingsong Wen
Title: AutoDebias: An Automated Framework for Detecting and Mitigating Backdoor Biases in Text-to-Image Models
Abstract: Text-to-Image (T2I) models generate high-quality images but are vulnerable to malicious backdoor attacks that inject harmful biases (e.g., trigger-activated gender or racial stereotypes). Existing debiasing methods, often designed for natural statistical biases, struggle with these deliberate and subtle injected attacks. We propose AutoDebias, a framework that automatically identifies and mitigates these malicious biases in T2I models without prior knowledge of the specific attack vectors. Specifically, AutoDebias leverages vision-language models to detect trigger-activated visual patterns and constructs neutralization guides by generating counter-prompts. These guides drive a CLIP-guided training process that breaks the harmful associations while preserving the original model's image quality and diversity. Unlike methods designed for natural bias, AutoDebias effectively addresses subtle, injected stereotypes and multiple interacting attacks. We evaluate the framework on a new benchmark covering 17 distinct backdoor attack scenarios, including challenging cases where multiple backdoors co-exist. AutoDebias detects malicious patterns with 91.6% accuracy and reduces the backdoor success rate from 90% to negligible levels, while preserving the visual fidelity of the original model.
Paperid: 3838,   Poster  
Authors: Mengting Xu, Shi Gu, Peng Lin, De Ma, Huajin Tang, Qian Zheng, Gang Pan
Title: On the Role of Temporal Granularity in the Robustness of Spiking Neural Networks
Abstract: As the third generation of neural networks, Spiking Neural Networks (SNNs) have demonstrated remarkable potential across diverse applications owing to their unique temporal dynamics. In recent years, analyzing the robustness of SNNs from a temporal perspective has become an emerging research focus. However, most existing works examine only the overall temporal behavior of SNNs, typically applying adversarial attacks that rely on time-averaged gradients. In this study, we revisit SNN robustness through the lens of temporal granularity, emphasizing the distinct behaviors that occur at individual time steps. We first introduce a Temporal Granularity Attack (TG-Attack), which selectively perturbs gradients at specific time steps. This approach enables a finer-grained evaluation of SNN robustness across time and demonstrates higher attack success rates than traditional gradient-averaging methods. Furthermore, we theoretically show that the robustness of SNNs at a given time step is determined by the Hessian of the input–output gradient at that step, which we define as Temporal Sensitivity (TS). By calculating the Temporal Sensitivity Value (TSV) for each time step, robustness can be effectively estimated without generating adversarial examples. Finally, we propose a Temporal Granularity Regularization (TG-Reg) term that constrains the TSV across all time steps, thereby improving the model’s overall robustness. Experimental evaluations confirm that our framework consistently outperforms existing state-of-the-art methods.
Paperid: 3839,   Poster  
Authors: Xutao Sun, Jiarui Li, 刘俊文 刘俊文, Yonggong Ren
Title: GeoSemba: Reconstructing State Space Model for Cross Paradigm Representation in Medical Image Segmentation
Abstract: Recently, the Vision Mamba architecture has emerged as a promising paradigm for medical image segmentation. However, representation discrepancies often arise between anatomical structures and their associated tissue types, while crucial diagnostic cues tend to be spatially entangled, constraining the performance of Mamba architectures in this domain. To address these limitations, we propose GeoSemba, a novel Mamba-based segmentation framework that unifies geometric–semantic and spatial–channel representations. Specifically, we reformulate Mamba’s state-space equations with two key components, a Semantic-guided State Refiner (SSR) and a Cross-dimensional Affinity Refiner (CAR). SSR reconstructs information flow within an abstract semantic space to forge a synergistic representation between anatomical textures and geometric contours. Concurrently, CAR adaptively models spatial–channel affinities to capture the intrinsic tissue heterogeneity common in medical imaging. By jointly integrating SSR and CAR in a complementary manner, GeoSemba only requires a single scan to effectively achieve cross-dimensional consistency and cross-level interaction. Extensive experiments on public datasets spanning six medical imaging modalities demonstrate that GeoSemba consistently delivers superior segmentation accuracy while maintaining high computational efficiency.
Paperid: 3840,   Poster  
Authors: Jiahao Li, Shiqi Yin, Zhenxiang Lian, jingtao guo
Title: Diff-SemiER: Transparency-Aware Adaptive Fusion Diffusion Model with Generative Prior for Semi-Transparent Eyeglasses Removal
Abstract: Existing eyeglasses removal methods primarily focus on opaque or fully transparent lenses. However, when dealing with semi-transparent sunglasses, these methods often corrupt the visible facial details beneath the lenses, thereby degrading the performance of downstream vision tasks. To address this issue, we propose Diff-SemiER, a novel diffusion-based framework for semi-transparent eyeglasses removal that leverages generative priors and transparency-aware adaptive fusion. The proposed framework fully utilizes the visible eye-region information beneath the lenses while retaining sufficient generative flexibility, thereby striking a balance between generation and restoration within semi-transparent regions. Specifically, Diff-SemiER comprises two diffusion branches: the Generative Prior Diffusion Model (GPDM) generates high-quality eyeglass-free facial images via image inpainting, which provides global semantic guidance for highly occluded scenarios. The Transparency-Aware Adaptive Fusion Diffusion Model (TAFDM) employs a Soft Mask-Aware Adaptive Fusion (SMAF) mechanism to adaptively merge generative and restorative features across multiple scales, enabling dynamic trade-offs between generative capability and fine-detail preservation under varying occlusion levels. Furthermore, we design a transmittance-based data synthesis method to construct a large-scale, high-quality dataset of faces with semi-transparent eyeglasses for model training and evaluation. Extensive experimental results demonstrate that Diff-SemiER significantly outperforms state-of-the-art methods in both synthetic and real-world scenarios.
Paperid: 3841,   Poster  
Authors: Yexing Xu, Wei Feng, Shen Zhang, Haohan Wang, Yuxin Qin, Yaoyu Li, Ao Ma, Yuhao Luo, Lu Wang, Xudong Ren, Haoran Wang, Run Ling, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Longguang Wang, Yulan Guo
Title: Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
Abstract: Generating realistic and user-preferred advertisements is a key challenge in e-commerce. Existing approaches utilize multiple independent models driven by click-through-rate (CTR) to controllably create attractive image or text advertisements. However, their pipelines lack cross-modal perception and rely on CTR that only reflects average preferences. Therefore, we explore jointly generating personalized image-text advertisements from historical click behaviors. We first design a Unified Advertisement Generative model (Uni-AdGen) that employs a single autoregressive framework to produce both advertising images and texts. By incorporating a foreground perception module and instruction tuning, Uni-AdGen enhances the realism of the generated content. To further personalize advertisements, we equip Uni-AdGen with a coarse-to-fine preference understanding module that effectively captures user interests from noisy multimodal historical behaviors to drive personalized generation. Additionally, we construct the first large-scale Personalized Advertising image-text dataset (PAd1M) and introduce a Product Background Similarity (PBS) metric to facilitate training and evaluation. Extensive experiments show that our method outperforms baselines in general and personalized advertisement generation. The dataset and codes will be released after acceptance.
Paperid: 3842,   Poster  
Authors: Xing Xi, Yu Qiu, Ronghua Luo, Peixian Chen, peilin tong
Title: Let VLMs Grade Their Own Thoughts: A Self-Quantification Approach to Reasoning-Aware Reward Modeling
Abstract: Chain-of-Thought (CoT) is a key technique for enhancing the reasoning capabilities of Vision Language Models (VLMs). Existing methods often employ Reinforcement Learning (RL) with external constraints to align the model's reasoning process with human cognitive patterns. However, we argue that the model's intrinsic reasoning paths may differ from human cognition, and that forcing such alignment can constrain the model's potential and even degrade its performance. To address this, we propose leveraging the model's intrinsic self-evaluation to guide its optimization. We hypothesize that a model's self-generated confidence scores are effective indicators of its reasoning quality. Based on this evaluation metric, we design two novel reward functions: (1) Sequential Confidence Rigorous Evaluation (SCRE) for challenging problems that demand strict logical reasoning, and (2) Intra-group Score Re-ranking (IGSR) for general-purpose, open-ended scenarios. We name our method Video-RAISE (Reasoning Alignment through Intrinsic Self-Evaluation). Comprehensive experiments on six video understanding benchmarks demonstrate that Video-RAISE achieves state-of-the-art (SOTA) performance, significantly outperforming previous methods and even proprietary models, e.g., GPT-4o and Gemini 1.5 Pro. For instance, on the VideoMMMU benchmark, our Video-RAISE achieves a new SOTA accuracy of 52.8%, outperforming the previous best model by a significant 3.0%. In addition, our method achieves a reasoning path consistency of 90%, which is double that of the Qwen2.5-VL-Instruct and even surpasses the performance of supervised fine-tuning. Code will be released publicly.
Paperid: 3843,   Poster  
Authors: Aobo Li, Jinjian Wu, Yongxu Liu, Jupo Ma, Weisheng Dong
Title: Rethinking Knowledge Transfer in Image Quality Assessment: A Perceptual Preference Structure Alignment Perspective
Abstract: As imaging scenarios diversify rapidly, Image Quality Assessment (IQA) faces a key challenge: how to effectively transfer perceptual knowledge from existing annotated datasets to ensure reliable quality prediction in new scenarios. However, current IQA models struggle to generalize. Direct transfer often leads to severe performance degradation, while multi-dataset joint training rarely yields stable gains and can even harm target performance. We identify the root cause as inconsistent perceptual preference structures across datasets, where models trained on different sources rely on distinct perceptual cues, leading to mismatched conditional distributions P(Y|X) that fundamentally limit transferability. To address this, we propose Perceptual Preference Representation (PPR), which quantifies dataset-specific perceptual preference structures by analyzing correlations between visual features and quality scores. PPR enables training-free assessment of cross-dataset perceptual preference consistency, offering a systematic and interpretable way to analyze transferability. Building on this, we develop Preference-Structure-Aligned Transfer (PreSTA), which iteratively selects samples whose perceptual preferences align with the target domain. Across both cross- and within-domain scenarios, PreSTA achieves superior transfer performance with only a small fraction of data. In the targeted joint transfer setting, PreSTA consistently attains better performance with only a limited portion of the combined data. These results demonstrate that aligning perceptual preference structures, rather than simply increasing dataset size, is the key to effective knowledge transfer in IQA.
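Editor's sketch (not from the paper): the abstract describes PPR only as analyzing correlations between visual features and quality scores. One simple reading, assumed here purely for illustration, is to take the Pearson correlation of each feature dimension with the scores as a per-dataset vector, then compare two datasets by cosine similarity. The names `pearson`, `preference_vector`, and `cosine`, and the toy data, are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def preference_vector(features, scores):
    """One correlation per feature dimension; features is a list of per-image rows."""
    return [pearson(list(col), scores) for col in zip(*features)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


# Same toy features, but the two datasets rate quality in opposite directions.
features = [[1, 0], [2, 1], [3, 0], [4, 1]]
pref_a = preference_vector(features, [1, 2, 3, 4])
pref_b = preference_vector(features, [4, 3, 2, 1])
consistency = cosine(pref_a, pref_b)
```

Under this reading, a consistency near 1 suggests the two datasets reward the same cues, while a value near -1 signals the kind of mismatch the abstract blames for failed transfer.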
Paperid: 3844,   Poster  
Authors: Junrong Lian, Weijian Deng, Pengxu Wei, Yaqin Chen, Qixiang Ye, Liang Lin
Title: When Local Rules Create Global Order: Self-Organized Representation Learning for Latent Diffusion Models
Abstract: This work studies how latent space structure impacts the performance of Latent Diffusion Models (LDMs). We show that effective generation requires a latent space that is simultaneously locally smooth, enabling stable and reliable reconstruction, and globally dispersive, allowing the model to draw diverse and meaningful samples without collapsing into narrow regions of the latent space. However, existing approaches often emphasize smoothness, which may lead to concentrated latent regions and limited exploration of the broader space. To address these limitations, we propose Self-Organized Representation Learning (SORL), a bottom-up training paradigm inspired by self-organization in complex systems, where global structure emerges naturally from simple local interactions. The critical latent properties of smoothness and maximal dispersity are not explicitly imposed. Instead, SORL promotes these properties through two complementary local mechanisms: local attraction, which encourages coherent reconstructions among nearby latent codes, and local repulsion, which prevents latent codes from collapsing into dense clusters. Through their interaction, SORL induces a latent manifold that maintains both local smoothness and global dispersity, leading to improved reconstruction and generation.
Paperid: 3845,   Poster  
Authors: Xiwen Wang, Shichao Zhang, Ruowei Wang, mao li, Chenyu Zhou, Ji-Zhe Zhou, Qijun Zhao, Hailun Zhang
Title: Dehallu3D: Hallucination-Mitigated 3D Generation from a Single Image via Cyclic View Consistency Refinement
Abstract: Large 3D reconstruction models have revolutionized the 3D content generation field, enabling broad applications in virtual reality and gaming. Like other large models, large 3D reconstruction models also suffer from hallucinations, introducing structural outliers (e.g., odd holes or protrusions) that deviate from the input data. However, unlike in other large models, hallucinations in large 3D reconstruction models remain severely underexplored, leading to malformed 3D-printed objects or insufficient immersion in virtual scenes. Such hallucinations mainly arise because existing methods reconstruct 3D content from sparsely generated multi-view images, which suffer from large viewpoint gaps and discontinuities. To mitigate hallucinations by eliminating the outliers, we propose Dehallu3D for 3D mesh generation. Our key idea is to design a balanced multi-view continuity constraint to enforce smooth transitions across dense intermediate viewpoints, while avoiding over-smoothing that could erase sharp geometric features. To this end, Dehallu3D employs a plug-and-play optimization module with two key constraints: (i) adjacent consistency to ensure geometric continuity across views, and (ii) adaptive smoothness to retain fine details. We further propose the Outlier Risk Measure (ORM) metric to quantify geometric fidelity in 3D generation from the perspective of outliers. Extensive experiments show that Dehallu3D achieves high-fidelity 3D generation by effectively preserving structural details while removing hallucinated outliers. The code will be made fully available.
Paperid: 3846,   Poster  
Authors: Seongmin Kim, Byung Cheol Song
Title: Towards Robust Vision Transformers: Path Dependency Analysis and a Simple Two-Stage Adversarial Training
Abstract: The Vision Transformer (ViT) has surpassed Convolutional Neural Networks (CNNs) in performance, becoming the de facto architecture in modern computer vision. However, despite its superior representational capacity, research on the adversarial robustness of ViTs remains limited, with most studies still biased toward CNN-based models. This work aims to address this architectural bias and conduct an in-depth analysis of the interaction between ViTs and adversarial training (AT). We first show that ViTs can identify semantic components of objects through their class attention maps, indicating that adversarially trained ViTs inherently encode strong semantic priors. Next, using the proposed Gradient Path Masking (GPM) analysis, we examine the internal information flow of ViTs and verify that the residual path serves as a major bottleneck that provides advantageous information to adversaries. Furthermore, our inter-patch relation analysis reveals that adversarially trained ViTs tend to rely more on global than local relationships in early layers, a novel observation suggesting a potential incompatibility between ViTs and hybrid architectures that inject CNN-style inductive biases. Building upon these findings, we design a simple yet effective two-stage AT scheme to mitigate this structural incompatibility, achieving simultaneous improvements in robustness and generalization across various ViT variants and training methods. The proposed method is compatible with a wide range of AT frameworks and models.
Paperid: 3847,   Poster  
Authors: Peibo Song, Xiaotian Xue, Jinshuo Zhang, zihao wang, Jinhua liu, Shujun Fu, Fangxun Bao, Si Yong Yeo
Title: Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities
Abstract: Multimodal MRI offers complementary information for brain tumor segmentation, but clinical scans often lack one or more modalities, which degrades segmentation performance. In this paper, we propose UniME (Uni-Encoder Meets Multi-Encoders), a two-stage heterogeneous method for brain tumor segmentation with missing modalities that reconciles the trade-offs among fine-grained structure capture, cross-modal complementarity modeling, and effective exploitation of available modalities. The key idea is to decouple representation learning from segmentation via a two-stage heterogeneous architecture. Stage 1 pretrains a single ViT Uni-Encoder with masked image modeling to establish a unified representation robust to missing modalities. Stage 2 adds modality-specific CNN Multi-Encoders to extract high-resolution, multi-scale, fine-grained features. We fuse these features with the global representation to produce precise segmentations. Experiments on BraTS 2023 and BraTS 2024 show that UniME outperforms previous methods for brain tumor segmentation with missing modalities. Code will be released.
Paperid: 3848,   Poster  
Authors: Chenxi Zhao, Chen Zhu, Xiaokun Feng, Aiming Hao, Jiashu Zhu, Jiachen Lei, Jiahong Wu, Xiangxiang Chu, Jufeng Yang
Title: Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
Abstract: Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to the limited class labels, text conditions pose greater challenges to the model’s understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, because MeanFlow generation uses an extremely limited number of refinement steps (e.g., a single step), the text feature representations are required to possess sufficiently high discriminability. This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Based on this insight, we propose a novel auxiliary loss design to learn the discriminative text representation space, achieving an effective adaptation of MeanFlow to text-to-image generation for the first time. Furthermore, we validate our approach on the widely used diffusion model, demonstrating significant generation performance improvements. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation.
Paperid: 3849,   Poster  
Authors: Charantej Reddy Pochimireddy, Subhasmita Sahoo, Apoorva Verma, Palavalli Shyam, Swapnil Malviya, Sarvesh Sarvesh, Raj Narayana Gadde
Title: Efficient Real-Time Raw-to-Raw Denoising for Extreme Low-Light Ultra HD Video on Mobile Devices
Abstract: Recent advancements in deep neural networks (DNNs) have significantly improved the visual quality of camera captures under low-light (<10lx) conditions, yet visual quality in extreme low-light (<1lx) remains inadequate. Existing DNN models are computationally intensive and suffer from long processing times, making them impractical for real-time enhancement of high-resolution video. Consequently, Ultra HD (UHD) videos (4K/8K) captured in extreme low-light environments exhibit elevated noise and diminished detail. Developing DNN-based solutions for UHD video enhancement faces challenges including paired dataset creation, temporal consistency, and efficient deployment under strict latency (<33ms) and power constraints (<250mA for 30fps video). We present a comprehensive methodology for developing a real-time raw-to-raw denoising solution for UHD video in extreme low-light, designed for seamless integration into existing ISP pipelines. Unlike ISP-replacement approaches, our solution enhances commercial camera stacks across sensor platforms. Our framework comprises: (1) a diverse dataset creation methodology; (2) a low-complexity model architecture optimized for mobile compute elements; (3) efficient training and post-training optimizations (reparameterization, restructuring, quantization) to meet latency constraints while ensuring high-quality output. The result is a power-efficient real-time raw-to-raw video denoiser that improves extreme low-light video quality while preserving downstream ISP behavior.
Paperid: 3850,   Poster  
Authors: Wenyu Sun, Hufei Li, Ruijin Jin, Xiangheng Kong, Yuning Jiang
Title: Rosetta Stone For Unified MLLMs: A unified tokenizer to decipher understanding and generation
Abstract: State-of-the-art unified tokenizers predominantly adopt pixel reconstruction and feature alignment as pretext tasks, but they leave key areas such as architecture, supervised objectives, and task interaction largely unexplored, potentially resulting in limited performance. We systematically investigate the critical factors of a unified visual tokenizer and propose a novel framework that strengthens synergy between understanding and generation in various aspects. Our initial analysis focuses on the properties of frontier vision models, confirming an inherent conflict in contrastive-learning-style models for unifying generation and understanding, and demonstrating distinct convergence behavior of codebooks. To address the above bottleneck, we hierarchically decouple the conflicting proxy tasks, enriching the diversity of semantic feature supervision to enhance the semantic and low-level capabilities. Subsequently, we introduce an attention-prioritized mapping strategy, which guides fine-grained generation with a powerful semantic prior. Our method achieves an rFID of 0.33 and a zero-shot accuracy of 80.9% on ImageNet at 256×256 resolution, surpassing VILA-U by 7.6% and outperforming the continuous embedding of SigLIP. When applied to discrete unified MLLMs, our 7B model exceeds TokenFlow-13B by 3.1% in understanding and achieves SOTA performance on GenAI-Bench and MJHQ-30K.
Paperid: 3851,   Poster  
Authors: Till Beemelmanns, Alexey Nekrasov, Stefan Vilceanu, Jonas Steinhaus, Timo Woopen, Bastian Leibe, Lutz Eckstein
Title: Query2Uncertainty: Robust Uncertainty Quantification and Calibration for 3D Object Detection under Distribution Shift
Abstract: Reliable uncertainty estimation for 3D object detection is critical for deploying safe autonomous systems, yet modern detectors remain poorly calibrated, especially under distribution shifts. Although post-hoc calibration methods address this issue and provide improved calibration for in-distribution tests, they fail to adapt in distribution-shifted scenarios. In this work, we address this issue and introduce a density-aware calibration method that couples post-hoc calibrators with the feature density of latent object queries from DETR-style 3D object detectors. These queries form a compact, location- and class-aware feature, ideal for density estimation, allowing our approach to adjust model confidences in distribution-shift scenarios. By fitting a density estimator on these query features, our approach jointly recalibrates both classification and bounding box regression uncertainties. On both a multi-view camera and a LiDAR-based detector, our approach consistently outperforms standard post-hoc methods in both in-distribution and distribution-shifted scenarios. Our code will be made publicly available.
Paperid: 3852,   Poster  
Authors: Meng Wang, Changqun Xia, Yuze Wang, Junyi Wang, Wantong Duan, Xinxiong Xie, Yue Qi
Title: Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction
Abstract: Recently, 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction, enabling efficient and high-fidelity novel view synthesis. However, seamless integration of both aerial and street view images to model urban scenes remains a significant challenge for 3DGS. This joint setting suffers from extreme view coverage disparity, complex multi-scale details, and imbalanced viewpoint distributions. In this work, we present Urban-GS, a novel framework built upon Gaussian Splatting for the compact unified reconstruction and high-fidelity rendering of urban scenes from both aerial and street views. Specifically, we first develop an Aerial-Street Joint Adaptive Densification method to resolve the densification conflicts arising from large view coverage disparity. We then introduce a Contribution-based Anchor Pruning strategy to effectively mitigate the storage overhead from capturing multi-scale scene details. Furthermore, we propose a Global-to-Local Optimization strategy to refine the reconstruction of under-optimized regions resulting from imbalanced view distributions. Experiments across diverse urban scene datasets demonstrate that Urban-GS significantly outperforms the state-of-the-art method in novel-view rendering quality, while simultaneously reducing storage overhead by an average of 41%.
Paperid: 3853,   Poster  
Authors: Yan Li, Changyao TIAN, Renqiu Xia, Ning Liao, Weiwei Guo, Hongsheng Li, Jifeng Dai, Hao Li, Xue Yang
Title: AdapTok: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space
Abstract: We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a blockwise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
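The budgeted allocation step described above can be illustrated with a small sketch. The abstract specifies integer linear programming; the version below swaps in dynamic programming over token totals, which solves the same one-choice-per-block problem exactly for small inputs. The function name, the candidate token counts, and the score values are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def allocate_tokens(scores: List[Dict[int, float]], budget: int) -> List[int]:
    """Pick one token count per block to maximise the total predicted score.

    scores[i] maps a candidate token count for block i to a predicted
    reconstruction score (hypothetical values; higher is better).
    The chosen counts sum to at most `budget`.
    """
    # best[b] = (total score, chosen counts) for allocations using exactly b tokens
    best = {0: (0.0, [])}
    for options in scores:
        nxt = {}
        for used, (total, picks) in best.items():
            for count, score in options.items():
                b = used + count
                if b > budget:
                    continue  # this choice would exceed the overall budget
                cand = (total + score, picks + [count])
                if b not in nxt or cand[0] > nxt[b][0]:
                    nxt[b] = cand
        best = nxt
    if not best:
        raise ValueError("budget too small for any allocation")
    # Return the choices with the highest total score across all token totals
    return max(best.values(), key=lambda t: t[0])[1]


# Two blocks, each with predicted scores for 1 or 2 tokens (made-up numbers)
example_scores = [{1: 0.5, 2: 0.9}, {1: 0.8, 2: 0.85}]
print(allocate_tokens(example_scores, budget=3))
```

With a budget of 3, the extra token goes to the first block, where it buys the larger predicted gain; raising the budget to 4 lets both blocks take 2 tokens.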
Paperid: 3854,   Poster  
Authors: Jingze Wu, Quan Zhang, Hongfei Suo, Zeqiang Cai, Hongbo Chen
Title: Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
Abstract: Although reinforcement learning (RL) has significantly advanced reasoning capabilities in large multimodal language models (MLLMs), its efficacy remains limited for lightweight models essential for edge deployments. To address this issue, we leverage causal analysis and experiment to reveal the underlying phenomenon of perceptual bias, demonstrating that RL-based fine-tuning compels lightweight models to preferentially adopt perceptual shortcuts induced by data biases, rather than developing genuine reasoning abilities. Motivated by this insight, we propose VideoThinker, a causal-inspired framework that cultivates robust reasoning in lightweight models through a two-stage debiasing process. First, the Bias Aware Training stage forges a dedicated "bias model" to embody these shortcut behaviors. Then, the Causal Debiasing Policy Optimization (CDPO) algorithm fine-tunes the primary model, employing an innovative repulsive objective to actively push it away from the bias model's flawed logic while simultaneously pulling it toward correct, generalizable solutions. Our model, VideoThinker-R1, establishes a new state-of-the-art in video reasoning efficiency. For same-scale comparison, requiring no Supervised Fine-Tuning (SFT) and using only 1 of the training data for RL, it surpasses VideoRFT-3B with a 3.2% average gain on widely-used benchmarks and a 7% lead on VideoMME. For cross-scale comparison, it outperforms the larger Video-UTR-7B model on multiple benchmarks, including a 2.1% gain on MVBench and a 3.8% gain on TempCompass. Code will be released publicly.
Paperid: 3855,   Poster  
Authors: Yakun Chang, Zhaojun Huang, Siqi Yang, Yeliduosi Xiaokaiti, Shikui Wei, Yao Zhao, Tiejun Huang, Boxin Shi
Title: HFR and HDR Video from Multi-Attenuated Spikes Using a Rapidly Rotating SpokeND Filter
Abstract: Capturing scenes with both high dynamic range (HDR) and high-speed motion remains challenging for conventional cameras. Existing alternating-exposure approaches exacerbate temporal resolution loss, making them unsuitable for high-speed scenes. Consequently, current solutions typically compromise either spatial resolution through fixed spatial-varying attenuation levels or employ multi-sensor configurations to maintain temporal resolution. In this paper, we leverage an ultra-high-speed spike camera to enable spatial and temporal attenuation of incident light, thereby reconstructing high frame rate (HFR) and HDR video with a single sensor. We achieve this by placing a rapidly rotating spoke-pattern neutral density (SpokeND) filter in front of the spike camera, enabling each pixel to periodically capture multi-attenuated spikes while maintaining full spatial resolution. Building on these multi-attenuated spikes, we propose ReST-Net, which comprises the ReGain and ReFine modules. The ReGain module reconstructs spatially consistent frames by learning to recover relative gain from the multi-attenuated spikes, and the ReFine module removes temporal fluctuations to produce temporally consistent HDR videos. Extensive experiments on synthetic and real-world data demonstrate that our method can reconstruct HDR video at up to 2000 FPS.
Paperid: 3856,   Poster  
Authors: Alex Hoi Hang Chan, Neha Singhal, Onur Kocahan, Andrea Meltzer, Saverio Lubrano, Miya Warrington, Michael Griesser, Fumihiro Kano, Hemal Naik
Title: CHIRP dataset: towards long-term, individual-level, behavioural monitoring of bird populations in the wild
Abstract: Long-term behavioural monitoring of individual animals is crucial for studying behavioural changes that occur over different time scales, especially for conservation and evolutionary biology. Computer vision methods have proven to benefit biodiversity monitoring, but automated behaviour monitoring in wild populations remains challenging. This stems from the lack of datasets that cover a range of computer vision tasks necessary to extract biologically meaningful measurements of individual animals. Here, we introduce such a dataset (CHIRP) with a new method (CORVID) for individual re-identification of wild birds. The CHIRP (Combining beHaviour, Individual Re-identification and Postures) dataset is curated from a long-term population of wild Siberian jays studied in Swedish Lapland, supporting re-identification (re-id), action recognition, 2D keypoint estimation, object detection, and instance segmentation. In addition to traditional task-specific benchmarking, we introduce application-specific benchmarking with biologically relevant metrics (feeding rates, co-occurrence rates) to evaluate the performance of models in real-world use cases. Finally, we present CORVID (COlouR-based Video re-ID), a novel pipeline for individual identification of birds based on the segmentation and classification of coloured leg rings, a widespread approach for visual identification of individual birds. CORVID offers a probability-based id tracking method by matching the detected combination of colour rings with a database. We use application-specific benchmarking to show that CORVID outperforms state-of-the-art re-id methods. We hope this work offers the community a blueprint for curating real-world datasets from ethically approved biological studies to bridge the gap between computer vision research and biological applications.
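As a rough illustration of matching a detected colour-ring combination against a database, the sketch below scores each known individual by the product of per-ring colour probabilities and normalises the result. The abstract does not say how CORVID combines probabilities, so the independence assumption, the function name, and the example data are all hypothetical.

```python
from typing import Dict, List, Tuple


def match_rings(
    ring_probs: List[Dict[str, float]],
    database: Dict[str, Tuple[str, ...]],
) -> Dict[str, float]:
    """Return a normalised score per individual in the database.

    ring_probs[i] maps colour names to classifier probabilities for ring i.
    database maps an individual's id to its known ring colours, in ring order.
    Rings are treated as independent (an assumption made for this sketch).
    """
    raw = {}
    for bird, colours in database.items():
        p = 1.0
        for probs, colour in zip(ring_probs, colours):
            p *= probs.get(colour, 0.0)  # unseen colour contributes zero
        raw[bird] = p
    total = sum(raw.values())
    if total == 0.0:
        return {bird: 0.0 for bird in raw}  # no database entry is compatible
    return {bird: p / total for bird, p in raw.items()}


# Made-up classifier output for two rings and a two-bird database
detected = [{"red": 0.9, "blue": 0.1}, {"white": 0.8, "green": 0.2}]
known = {"bird_A": ("red", "white"), "bird_B": ("blue", "green")}
posterior = match_rings(detected, known)
print(max(posterior, key=posterior.get))
```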
Paperid: 3857,   Poster  
Authors: Liu Chuang, Yichao Cao, Xiu Su, Haogang Zhu
Title: Learning Distribution-wise Foundation Prior Consistency and Instance-wise Style Calibration for Medical Image Generalization
Abstract: Test-time adaptation (TTA) has emerged as a promising solution to address real-world domain shifts in medical image segmentation. Most current approaches adapt by updating or regularizing a pre-trained source model. However, they face two major issues: (i) the source models on which they rely are prone to overfitting under domain shift; (ii) in dynamic continual testing scenarios, error accumulation and class forgetting are further exacerbated. To overcome these limitations, we propose TanGo, a novel framework that combines Training to adapt with Foundation Guidance and Continual Style Calibration. During training, TanGo learns generalization priors from vision foundation models (VFMs) through distribution-level consistency learning. We incorporate stable low-frequency representations from a frozen encoder of VFMs as priors to guide the source model, constraining its output feature distribution to yield a more generalizable feature space. At test time, we introduce an instance-wise style calibration method that employs a learnable data decorator to transform dynamic test images back toward source-like distributions. Subsequently, a set of source-anchored constraints is applied to preserve semantic integrity in the transformed test images and align their distributions more closely with the enhanced source space. Extensive experiments on multiple medical image segmentation tasks demonstrate that TanGo achieves state-of-the-art performance. All code will be made publicly available.
Paperid: 3858,   Poster  
Authors: Zihang Chen, Zhu Liu, Changbo Yan, Jinyuan Liu, Risheng Liu
Title: HiDRA: Hierarchical Degradation Representation and Adaptation with Generative Priors for Enhancing Infrared Vision
Abstract: Thermal infrared (TIR) imaging enables robust perception in adverse conditions. However, it often suffers from complex degradations (e.g., fixed-pattern noise and low resolution) due to sensor limitations and environmental dynamics. Existing methods, whether traditional or learning-based, easily fail under composite and varying degradation. Pre-trained generative models showcase powerful capabilities for alleviating degradations but lack effective tools to adapt visible generative priors to TIR-specific characteristics. To overcome these challenges, we propose a Hierarchical Degradation Representation and Adaptation (HiDRA) framework to decompose the enhancement procedure into degradation representation estimation and generative model fine-tuning. The degradation representation estimation aims to disentangle TIR degradation patterns, which then guide the parameter adaptation for thermal image enhancement. Additionally, we introduce a hierarchical adaptation solution that aggregates learning across varying degradation levels, further improving robustness under various scenarios. Experiments across diverse degradation types and degrees demonstrate the robustness of our approach and further validate its effectiveness on downstream tasks.
Paperid: 3859,   Poster  
Authors: Bowen Yuan, Sisi You, Bing-Kun Bao
Title: Predict Before You Explore: Predictive Planning with Specialized Memory for Embodied Question Answering
Abstract: Embodied Question Answering (EQA) requires agents to navigate 3D environments, accumulate visual evidence, and reason over partial observations to answer questions. However, current agents struggle to maintain coherent, long-horizon behavior: planning remains reactive, causing inconsistent actions, while monolithic memories entangle all observations, hindering retrieval of the sparse but crucial evidence. We address these issues by reframing EQA through the lens of predictive processing, in which coherent behavior emerges from a prediction–correction loop grounded in stable priors. Guided by this perspective, we propose Predict Before You Explore (Pred-EQA), an architecture that integrates predictive planning with specialized memory. A high-level planner predicts where question-relevant evidence is likely to appear and generates a compact set of actionable exploration branches encoding long-horizon intent. A low-level executor then reduces uncertainty within these branches, revising predictions when they fail. A dual-memory system complements this process by separating slowly evolving structural priors from compact, question-relevant visual evidence, enabling consistent planning and efficient evidence accumulation. Through this prediction-guided exploration, Pred-EQA achieves coherent trajectories under partial observability. Experiments on OpenEQA and Express-Bench show that Pred-EQA achieves state-of-the-art results in both accuracy and exploration efficiency, demonstrating the benefits of prediction-driven embodied reasoning.
Paperid: 3860,   Poster  
Authors: Tao Wu, Chuhao Zhou, Guangyu Zhao, Haozhi Cao, Yewen Pu, Jianfei Yang
Title: When Robots Should Say "I Don’t Know": Benchmarking Abstention in Embodied Question Answering
Abstract: Embodied Question Answering (EQA) requires an agent to interpret language, perceive its environment, and navigate within 3D scenes to produce responses. Existing EQA benchmarks assume that every question must be answered, but embodied agents should know when they do not have sufficient information to answer. In this work, we focus on a minimal requirement for EQA agents, abstention: knowing when to withhold an answer. From an initial study of 500 human queries, we find that 32.4% contain missing or underspecified context. Drawing on this initial study and cognitive theories of human communication errors, we derive five representative categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. We augment OpenEQA by having annotators transform well-posed questions into ambiguous variants outlined by these categories. The resulting dataset, AbstainEQA, comprises 1,636 annotated abstention cases paired with 1,636 original OpenEQA instances for balanced evaluation. Evaluating on AbstainEQA, we find that even the best frontier model only attains 42.79% abstention recall, while humans achieve 91.17%. We also find that scaling, prompting, and reasoning only yield marginal gains, and that fine-tuned models overfit to textual cues. Together, these results position abstention as a fundamental prerequisite for reliable interaction in embodied settings and as a necessary basis for effective clarification.
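For readers unfamiliar with the metric, abstention recall can be read as ordinary recall with "abstain" as the positive class. The sketch below assumes that standard definition, since the abstract does not spell out the exact evaluation protocol; the function name and the toy labels are illustrative only.

```python
from typing import Sequence


def abstention_recall(should_abstain: Sequence[bool], did_abstain: Sequence[bool]) -> float:
    """Fraction of abstention-required cases in which the agent actually abstained."""
    if len(should_abstain) != len(did_abstain):
        raise ValueError("label and prediction lists must have the same length")
    # Keep only the agent's decisions on cases where abstaining was correct
    decisions = [d for s, d in zip(should_abstain, did_abstain) if s]
    if not decisions:
        raise ValueError("no abstention-required cases to evaluate")
    return sum(decisions) / len(decisions)


# Toy balanced set: four cases need abstention, two are answerable
labels = [True, True, True, True, False, False]
preds = [True, False, True, False, True, False]
print(abstention_recall(labels, preds))
```

Answerable questions do not enter the recall computation, which is why a balanced set with the original OpenEQA instances is useful for also catching agents that abstain too often.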
Paperid: 3861,   Poster  
Authors: Arda Senocak, Sooyoung Park, Tae-Hyun Oh, Joon Chung
Title: How Far Can We Go With Synthetic Data for Audio-Visual Sound Source Localization?
Abstract: We present the first scalable framework for training sound source localization (SSL) models using synthetic data from text-to-X models. Although SSL has made notable progress, existing models remain constrained by limited-scale, uncurated real-world datasets that often suffer from semantic misalignment. Furthermore, the introduction of new SSL tasks and benchmarks has increased the need for more generalizable models. To address these challenges, we leverage synthetic data to create synthetic clones of the VGGSound dataset, enabling both fully synthetic and hybrid real–synthetic training. We demonstrate that synthetic data can effectively replace, refine, and scale real training datasets. Extensive experiments across multiple benchmarks show that synthetic data not only matches real data in performance but also enables significant improvements when combined with real samples. Our findings provide the first systematic evidence that synthetic data can serve as a scalable and effective approach for advancing SSL models.
Paperid: 3862,   Poster  
Authors: Runhao Mao, Hanshi Wang, Yixiang Yang, Qianli Ma, Jingmeng Zhou, Zhipeng Zhang
Title: The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models
Abstract: The integration of Vision-Language Models (VLMs) into autonomous driving promises to solve long-tail scenarios, but this paradigm faces the critical and unaddressed challenge of catastrophic forgetting. The very fine-tuning process used to adapt these models to driving-specific data simultaneously erodes their invaluable pre-trained world knowledge, creating a self-defeating paradox that undermines the core reason for their use. This paper provides the first systematic investigation into this phenomenon. We introduce a new large-scale dataset of 180K scenes, which enables the first-ever benchmark specifically designed to quantify catastrophic forgetting in autonomous driving. Our analysis reveals that existing methods suffer from significant knowledge degradation. To address this, we propose the Drive Expert Adapter (DEA), a novel framework that circumvents this trade-off by shifting adaptation from the weight space to the prompt space. DEA dynamically routes inference through different knowledge experts based on scene-specific cues, enhancing driving-task performance without corrupting the model's foundational parameters. Extensive experiments demonstrate that our approach not only achieves state-of-the-art results on driving tasks but also effectively mitigates catastrophic forgetting, preserving the essential generalization capabilities that make VLMs a transformative force for autonomous systems. Dataset and code will be released.
Paperid: 3863,   Poster  
Authors: Zhicheng Yang, Yichen Liu, Chang Ge, Xiaopeng Jiang
Title: Easy2Hard: From Partially to Fully Unmatched Modalities as Negative Samples in Contrastive Learning
Abstract: Contrastive learning is widely used for generating multimodal data representations by aligning embeddings of different modalities of the same data samples. This alignment is achieved through a loss function that treats matched and unmatched modality pairs as positive and negative samples within a data batch. However, when extending contrastive learning to scenarios involving more than two modalities, existing approaches either rely solely on fully unmatched modalities as negative samples, or fail to distinguish between partially and fully unmatched modalities, thereby overlooking the fine-grained contrasting relationships. To address this limitation, we propose Easy2Hard, a novel framework that explicitly separates partially and fully unmatched modalities. To learn from negative samples at improved granularity, Easy2Hard further introduces a sigmoid weighting curriculum that smoothly transitions the learning process from easy (partially unmatched) to hard (fully unmatched) contrasts. Comprehensive evaluations on five multimodal datasets across diverse domains demonstrate the superiority of our approach.
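A sigmoid weighting curriculum of the kind described above might look like the following sketch. The abstract gives neither the schedule nor how the weights enter the loss, so the function names, the steepness parameter, the midpoint placement, and the convex blend of the two negative terms are assumptions made purely for illustration.

```python
import math


def hard_weight(step: int, total_steps: int, steepness: float = 10.0) -> float:
    """Hypothetical weight on hard (fully unmatched) negatives.

    Rises smoothly from near 0 to near 1 along a sigmoid centred at the
    midpoint of training.
    """
    progress = step / max(total_steps, 1)
    return 1.0 / (1.0 + math.exp(-steepness * (progress - 0.5)))


def mixed_negative_term(easy_term: float, hard_term: float, step: int, total_steps: int) -> float:
    """Blend the easy and hard negative terms (assumed convex combination)."""
    w = hard_weight(step, total_steps)
    return (1.0 - w) * easy_term + w * hard_term


# Weight on hard negatives at the start, middle, and end of a 100-step run
for s in (0, 50, 100):
    print(s, round(hard_weight(s, 100), 4))
```

Early in training the blend is dominated by the easy (partially unmatched) term; the steepness parameter controls how abruptly the emphasis shifts to the hard term.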
Paperid: 3864,   Poster  
Authors: Xinsheng Wang, Zhidong Yang, Xiaohua Wan, Renmin Han, Shuai Tang, Hao Dong, Fa Zhang, Bin Hu
Title: A supervised multi-task framework for joint cryo-ET restoration enabled by generative physical simulation
Abstract: Cryo-electron tomography (cryo-ET) enables in-situ visualization of cellular structures at near-native state, yet its practical utility is often hampered by an extremely low signal-to-noise ratio (SNR) and severe missing wedge artifacts resulting from dose limitations and restricted tilt angles. While several computational methods have been proposed for reconstructing high-quality tomograms, their performance is still limited by the absence of accurate noise modeling and reliable ground truth data. To address this challenge, we propose cryoDeRec, a multi-task learning framework that jointly addresses denoising and missing wedge reconstruction in a fully supervised manner. The main contribution of cryoDeRec is a dual-objective training strategy that incorporates both synthetically corrupted tomograms and raw noisy tomograms, enabling simultaneous restoration of structural fidelity and reconstruction of missing information. The model is trained on a physically synthetic dataset generated by a novel imaging simulation pipeline that incorporates authentic noise distributions and isotropic structural priors. We evaluate cryoDeRec on four realistic cryo-ET datasets and two simulated datasets with extremely low SNR, all reconstructed using Weighted Back Projection (WBP). Extensive experimental results demonstrate that our method achieves high-quality restoration directly from raw tomograms without any pre-processing, outperforming existing state-of-the-art methods. Our findings show that training on a comprehensive simulated dataset, which captures realistic noise and structural diversity, enables models to generalize effectively to real cryo-ET tomograms. The code and datasets will be available upon acceptance.
Paperid: 3865,   Poster  
Authors: Raphael Maser, Siddhartha Gairola, Sukrut Rao, Bernt Schiele
Title: Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers
Abstract: Foundational vision models have become the de facto standard for many vision tasks due to their strong performance. However, they are notoriously opaque and remain hard to interpret. We present ALOE (ALign Once to Explain), a one-time, label-free, feature-alignment-based approach that efficiently converts foundational vision models into inherently interpretable B-cos variants. Once aligned, the B-cos backbone is used as a drop-in replacement across several downstream tasks, amortizing the cost of interpretability. ALOE is robust across pre-training paradigms (supervised, self-supervised, vision–language) and is 100–1000× more data-efficient than training from scratch. On classification, it outperforms fully-supervised B-cos models (e.g., +6.6 p.p. top-1 on ImageNet for ViT-B/16) and retains strong linear probing, k-NN, and zero-shot transfer performance competitive with foundational backbones (DINOv3, SigLIP2) across diverse downstream datasets, while yielding well-localized and highly human-interpretable explanations by design. Code and models will be released.
Paperid: 3866,   Poster  
Authors: Yanling TIAN, Shanshan Zhang, Di Chen, Jian Yang
Title: FSLoRA: Harmonizing Detection and Re-Identification via Freq-Spatial Low-Rank Adapter for One-Stage Person Search
Abstract: Person search, which aims to detect and re-identify individuals in unconstrained scenes, faces an inherent conflict in one-stage models: pedestrian detection focuses on shared human features, while person re-identification requires identity-specific representations. Existing approaches, such as feature decoupling and loss re-weighting, primarily address this issue in later network stages but fail to resolve early-stage feature entanglement. To overcome this limitation, we propose FSLoRA, a Freq-Spatial Low-Rank Adapter that progressively decouples task-specific features at the backbone level. FSLoRA consists of a Spatial-Level Module (SLM), which employs LoRA and a mixture-of-experts to dynamically activate task-relevant spatial features, and a Frequency-Level Module (FLM), which transforms features into the frequency domain to selectively enhance task-relevant frequency components while suppressing task-irrelevant noise. By integrating both spatial and frequency-based adaptations, FSLoRA reduces feature interference, enabling more effective joint optimization. Extensive experiments on CUHK-SYSU, PRW, and Posetrack21 demonstrate that FSLoRA not only achieves state-of-the-art performance but also serves as a plug-and-play module adaptable to various person search frameworks, offering a unified and generalizable solution for one-stage person search.
Paperid: 3867,   Poster  
Authors: Yutao Tang, Cheng Zhao, Gaurav Mittal, Rohith Kukkala, Rama Chellappa, Cheng Peng, Mei Chen
Title: Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision–Language Understanding
Abstract: Recent advances in 3D vision–language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.
Paperid: 3868,   Poster  
Authors: Alireza Kheirandish, Jihoon Hong, Sara Fridovich-Keil
Title: KLIP: Localized Distribution Shift Detection via KL-Divergence with Diffusion Priors in Inverse Problems
Abstract: Diffusion models have shown promising performance as data-driven priors for computational imaging, as well as some capacity to detect out-of-distribution (OOD) images. However, existing approaches to OOD detection often require some knowledge of the shifted distribution, fail to detect subtle or localized distribution shifts, and operate on full images, rather than the indirect measurements available in inverse problems. We propose an OOD detection metric based on the Kullback-Leibler divergence between the diffusion prior and the posterior distribution that (i) does not require any calibration data or knowledge of the shifted distribution, and (ii) can detect whole images as OOD as well as localize OOD patches within an image. Experimentally, we show that this metric can detect subtle yet semantically meaningful distribution shifts, such as the shift from healthy liver CT scans to those with tumors, and generalizes across different types of diffusion models, datasets, and inverse problems.
Paperid: 3869,   Poster  
Authors: Chenhan Jiang, Yu Chen, Qingwen Zhang, Jifei Song, Songcen Xu, Dit-Yan Yeung, Jiankang Deng
Title: FVGen: Scaling 3D Scene Datasets with Certainty-Aware Free-View Generation from Scene Geometry Reconstruction
Abstract: The development of generalizable Novel View Synthesis (NVS) models is critically limited by the scarcity of large-scale training data with diverse and accurate camera trajectories. While real-world captures are photorealistic, they are typically sparse and discrete. Conversely, synthetic data scales but suffers from a domain gap and often lacks realistic semantics. We introduce FVGen, a novel framework that leverages the power of scene reconstruction to transform limited real-world image sequences into a scalable source of high-quality training data. Our key insight is that an imperfect reconstructed scene serves as a rich geometric proxy, but naively sampling from it amplifies artifacts. To this end, we propose a certainty-aware free-view sampling strategy that identifies novel viewpoints which are both semantically meaningful and minimally affected by reconstruction errors. We demonstrate FVGen's effectiveness by scaling up the training of feedforward NVS models, achieving a significant improvement of 2.6 dB on challenging out-of-distribution benchmarks. Furthermore, we show that the generated data can actively enhance per-scene 3D Gaussian Splatting optimization, leading to consistent improvements across multiple datasets. Our work provides a practical and powerful data generation engine to overcome a fundamental bottleneck in 3D vision.
Paperid: 3870,   Poster  
Authors: Mengling Xu, Sisi You, Li Yaning, Bing-Kun Bao
Title: ProcessMaker: A Generalized Process Visualization Framework with Adaptive Sequence Steps on Diffusion Transformers
Abstract: Procedural sequence generation aims to create intermediate images through multi-step processes, with applications in industrial design, educational tutorials, and creative content inspiration. However, existing methods often focus on a specific domain or initialize several expert networks for different domains, and thus face three challenges: first, poor generalization to unseen domains; second, parameter redundancy due to multiple expert networks; third, difficulty in adaptively determining the number of generation steps for different processes. To address these challenges, we propose ProcessMaker, a novel framework that harnesses the inherent generalization capabilities of Diffusion Transformers (DiTs) for procedural sequence generation. Concretely, we introduce three key innovations: (1) Self-supervised Representation Alignment to explore the generalization ability for unseen processes. (2) Sparse Masks for different domains without additional expert networks. (3) A sliding window strategy, which dynamically accommodates the generation steps based on the process complexity. Extensive experiments validate that our ProcessMaker achieves procedural sequence generation with generalization ability and adaptive steps, while using only 7.3% trainable parameters compared with the state-of-the-art method.
Paperid: 3871,   Poster  
Authors: Hua Hu, Zikang Zhou, Qian Zhou, Zihao WEN, Junjie Hu, Xinhong Chen, Zhengmin JIANG, Yung-Hui Li, Jianping Wang
Title: Perceiving the Near, Reasoning the Distant: Coherent Long-Horizon Trajectory Prediction for Autonomous Driving
Abstract: Reliable long-horizon trajectory prediction requires both high positional accuracy and physically plausible temporal motion consistency. However, existing methods suffer from two fundamental limitations. First, they overlook the inherent difference in prediction logic: near-future trajectories are primarily governed by historical dynamics, whereas distant-future behaviors are driven by high-level semantic context. Yet, most methods employ a unified decoding pathway that blurs the temporal distinction. Second, although the near future is relatively easier to predict, existing methods lack mechanisms for coherent trajectory propagation across time horizons, often resulting in kinematically implausible predictions with inconsistent heading evolution and degraded long-horizon performance. To address these challenges, we propose NDPNet, a dual-stage architecture that decouples near- and distant-horizon modeling into specialized pathways, with a dedicated transition module ensuring smooth temporal bridging. Furthermore, we introduce a novel motion-aware coherence loss that explicitly embeds kinematic priors to enforce trajectory consistency. Extensive experiments show that NDPNet achieves SOTA performance on Argoverse 2 and WOMD. Notably, on WOMD, it ranks 1st in both minFDE_6 and minADE_6 across all standard horizons (3s, 5s, 8s) without ensemble learning or NMS post-processing, and is the first to achieve sub-1.75 minFDE_6 for 8s prediction, surpassing prior methods by a large margin. The code will be released subsequently.
Paperid: 3872,   Poster  
Authors: Rongge Mao, Chengqi Dong, S Kevin Zhou
Title: LazyVAR: Accelerating Visual Autoregressive Models via Scale-wise Token Pruning and Parallel Group Decoding
Abstract: Visual Autoregressive (VAR) modeling introduces a new paradigm for image generation by extending autoregressive mechanisms from next-token prediction to next-scale prediction, achieving remarkable performance. However, as the number of tokens increases rapidly with scale, processing full token maps at high resolution becomes computationally expensive. In addition, the inherently sequential nature of autoregressive modeling prevents parallel inference across scales, which further increases latency. To address these challenges, we propose LazyVAR, a training-free and plug-and-play acceleration method for VAR models. Our key observation is that the similarity of aggregated latent features between adjacent scales progressively increases with the scale index, reaching particularly high values at larger scales. We treat this similarity as a Scale-Wise Update Index, which serves as the pruning criterion. Consequently, more tokens can be pruned at larger scales to improve efficiency. Furthermore, we propose Parallel Group Decoding, which leverages this high similarity at larger scales to decode tokens from different scales in parallel, further accelerating inference. Experimental results show that the proposed LazyVAR achieves up to a 2.94× speedup over FlashAttention-accelerated VAR models with negligible performance loss, allowing the Infinity-2B text-to-image model to generate 1024×1024 resolution images within 0.5 seconds on a single RTX 4090 GPU. Our code will be publicly available.
Paperid: 3873,   Poster  
Authors: Han Jiang, Haoyu Tang, Xiaoxuan Mu, Chen Li, Jihua Zhu
Title: Memory Matters: Boosting Training-Free Zero-Shot Temporal Action Localization with a Learnable Lookup Table
Abstract: Zero-Shot Temporal Action Localization (ZS-TAL) aims to classify and localize actions in untrimmed videos that are unseen during training. Existing training-based ZS-TAL methods typically rely on fine-tuning models on large-scale annotated training data. This can be impractical in real-world applications and can damage generalization. As a result, Training-Free ZS-TAL has gained attention: it directly leverages Vision-Language Models (VLMs) to enable action localization without any additional training. However, current techniques perform test-time adaptation independently on each video, neglecting the potential benefit of accumulating knowledge from historical test videos. To address this, we propose a learnable lookup table (LLT) framework. During testing, we continuously update the lookup table by incorporating high-confidence, diverse lookup candidates to construct action-positive lookup items. Additionally, we introduce a learnable residual module to adapt the corresponding lookup item to the current video context features. Finally, we employ refined activation scores to select accurate video frames and further adjust the text prototypes. This simple yet effective text-visual collaboration enables training-free ZS-TAL to harness historical videos. Extensive experiments show our method significantly outperforms state-of-the-art zero-shot VLM baselines, validating the effectiveness of our framework.
Paperid: 3874,   Poster  
Authors: Gasser Elazab, Frank Neuhaus, Tilman Koß, Malte Splietker, Aditya Date, Michael Unterreiner, Maximilian Jansen, Olaf Hellwich
Title: CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography
Abstract: Autonomous driving must operate reliably across diverse surfaces to enable safe mobility. However, most driving datasets are captured on well-paved flat roads. Moreover, recent driving datasets primarily provide sparse LiDAR ground truth for images, which is insufficient for assessing fine-grained geometry in depth estimation and completion. To address these gaps, we introduce CARD, a multi-modal driving dataset that delivers quasi-dense 3D ground truth across continuous sequences rich in speed bumps, potholes, irregular surfaces and off-road segments. Our sensor suite includes synchronized global-shutter stereo cameras, front and rear LiDARs, 6-DoF poses from LiDAR-inertial odometry, per-wheel motion traces, and full calibration. Notably, our multi-LiDAR fusion yields ~500K valid depth pixels per frame, about 6.5x more than KITTI Depth Completion and 10x more on average than other public driving datasets. The dataset spans ~110 km and 4.7 hours across Germany and Italy. In addition, CARD provides 2D bounding boxes targeting road-topography irregularities, enabling accurate benchmarking for both geometry and perception tasks. Furthermore, we introduce a standardized evaluation protocol for road surface irregularities and a stereo-guided depth completion variant that achieves leading performance on CARD. Moreover, we benchmark state-of-the-art depth estimation models to establish strong baselines. We host CARD on Hugging Face with an open source SDK and standardized splits to enable public leaderboards and reproducible evaluation.
Paperid: 3875,   Poster  
Authors: Abhishek Kumar Sinha, Nitant Dube, Soma Biswas
Title: Quantized Residuals to Continuous Prompts for Few-Shot Class Incremental Learning in Vision-Language Models
Abstract: Few-shot Class-Incremental Learning (FSCIL) requires learning new classes from very limited data while preventing catastrophic forgetting. Existing methods rely mainly on visual features and are prone to overfitting, while recent vision–language models (VLMs) offer better transferability but suppress fine-grained information due to contrastive feature decorrelation. Moreover, current FSCIL approaches often use static or fully optimizable prompts, making them either rigid or susceptible to semantic drift in incremental sessions. We introduce QR-Prompt, a residual-driven framework that leverages the visual–textual feature residual of VLMs to recover discriminative fine-grained cues missing from the contrastive space. To ensure stability, we propose Discriminative Subspace Quantization (DSQ), which builds a discrete memory of residual subspaces. To enable plasticity, a Hierarchical Prompt Encoder (HPE) and Prompt Composer (PC) transform these discrete codes into continuous, class-adaptive prompts for novel classes. We derive bounds relating DSQ codebook size to generalization and classification margin, and achieve consistent improvements over state-of-the-art FSCIL methods on CUB200, CIFAR100, and miniImageNet. Our results show that residual-based quantization combined with hierarchical prompt composition yields stable and expressive VLM adaptation for FSCIL.
Paperid: 3876,   Poster  
Authors: Chuanmao Fan, Chenxi Zhao, Ye Duan
Title: SMVRT: Implicit Human 3D Modeling Using Sparse Multi-view Volumetric Reconstruction with Transformer Fusion
Abstract: Recently, the community has witnessed significant progress in human modeling from a single view or multi-views, which often involves "guessing" the occluded parts using either generative models or template fitting. In this work, we address these challenges by exploring optimal fusion strategies from sparse views only. We propose an end-to-end implicit 3D reconstruction framework using a sparse multi-view setup. Specifically, we achieve this by exploring fusion blocks at three stages of the network. First, 2D feature encoders operate locally and globally to produce enhanced features. Second, a 3D feature grid is formed by attentional fusion of warped multi-view and multi-level 2D features, followed by 3D regularization of the feature grid to aggregate spatially coherent multi-view features. Third, attentional 2D-3D feature aggregation associated with each query point generates an enhanced latent embedding, which is fed into an implicit field decoder for robust occupancy prediction. Evaluation on the THUman 2.1 and MultiGarment datasets demonstrates that our system significantly outperforms state-of-the-art methods both qualitatively and quantitatively.
Paperid: 3877,   Poster  
Authors: Tommie Kerssies, Gabriele Berton, Ju He, Qihang Yu, Wufei Ma, Daan de Geus, Gijs Dubbelman, Liang-Chieh Chen
Title: A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
Abstract: Anticipating diverse future states is a central challenge in video world modeling. A key limitation lies in the computational cost of generating multiple plausible futures with existing world models. Recent work demonstrates that predicting the future in the latent space of a vision foundation model (VFM), rather than in raw pixel space, greatly improves efficiency. Despite this progress, efficient VFM-based world models are still predominantly discriminative, producing predictions that implicitly average over many possible futures. To explicitly and efficiently model diverse plausible futures, we introduce DeltaWorld, the first VFM-based world model which shifts from deterministic prediction to the ability to generate multiple plausible futures in a single forward pass. At the core of DeltaWorld is DeltaTok, a tokenizer that encodes feature differences between consecutive frames into a single compact “delta” token, effectively reducing redundancy among temporally adjacent feature maps. By representing futures as delta tokens, DeltaWorld efficiently generates multiple diverse predictions in parallel. Experiments on dense forecasting tasks demonstrate that DeltaWorld is capable of predicting futures that more closely align with real-world outcomes, while being orders of magnitude more efficient than existing generative world models. Code will be made publicly available.
Paperid: 3878,   Poster  
Authors: Jisoo Kim, Heeseok Oh
Title: NEAF: Natural Image Editing with Attention Fusion for Generalizable Tuning-Free Text-Guided Image Editing
Abstract: Diffusion-based text-to-image (T2I) models have enabled remarkable generative capabilities, yet precise text-based image editing that preserves the original’s structural and perceptual fidelity remains non-trivial. Existing approaches either rely on retraining with large bespoke datasets, incurring significant computational and curation costs, or adopt lightweight fine-tuning strategies that still require optimization and often fail in fine-grained or semantically complex edits. We propose NEAF (Natural image Editing with Attention Fusion), a novel zero-shot, universal tuning-free framework for arbitrary T2I models, obviating the need for dataset curation or retraining. NEAF introduces a lightweight, learnable XA-Conductor module that dynamically identifies salient cross-attention contributions pertinent to the edit. This module optimizes a weight vector to orchestrate an adaptive fusion of cross-attention maps derived from the source, edited, and reconstruction branches. This triadic-feedback optimization strategy ensures the precise instantiation of user directives while rigorously preserving the fidelity of quiescent regions. Extensive experiments validate NEAF as a flexible and general framework that consistently surpasses existing methods across diverse editing tasks, demonstrating particular dominance in complex, non-rigid editing scenarios where other approaches falter.
Paperid: 3879,   Poster  
Authors: Jianing Qian, Qinhe Peng, Emmanuel Panov, Leonor Fermoselle, Dinesh Jayaraman, Bernadette Bucher, Tarik Kelestemur
Title: Scaling Spatial and Temporal Context for Robotic Imitation Learning Policies With Scene Graphs
Abstract: Imitation learning enables robots to learn how to execute tasks via observation. However, real-world environments like homes and offices are often severely partially observed due to their large spatial scales. In addition, many tasks involve executing a series of subtasks requiring autonomous robots to reason over extended time horizons. To address these challenges, we propose using scene graphs as an explicit and structured memory mechanism in imitation learning. By maintaining a dynamic scene graph that captures object-centric relationships and their evolution over time, our method allows the agent to retain relevant historical context during task execution to efficiently reason over incrementally accrued scene information. Our experiments on simulated mobile manipulation and real-world tabletop manipulation demonstrate that our approach substantially improves policy performance, particularly in settings that demand long-term reasoning and robust generalization under partial observability.
Paperid: 3880,   Poster  
Authors: Yaxin Zhao, Yang Wang, Wenya Guo, Sihan Xu, Xiangrui Cai, Xi Lin, Ying Zhang, Xiaojie Yuan
Title: Learning from Noisy Supervision: A Denoising–Debiasing Framework for Weakly Supervised Video Anomaly Detection
Abstract: Weakly supervised video anomaly detection (WSVAD) aims to localize frame-level anomalies using only video-level labels. This task is typically formulated within a multiple instance learning (MIL) paradigm, where each video is treated as a bag of snippets, achieving robust performance without requiring additional information. However, existing methods often struggle with noisy supervision signals. Normal snippets within abnormal bags are frequently misclassified as anomalies due to inaccurate anomaly scores. These misclassified instances act as noisy samples, introducing false supervision that hinders the learning of true anomaly patterns. In this work, we introduce D^2MIL, a Denoising–Debiasing framework within the Multiple Instance Learning paradigm designed to suppress noise and improve anomaly discrimination. Our approach integrates two key components: (1) Denoising Module: We introduce a dynamic drop rate to adaptively filter out suspected noisy samples during training, based on the observation that noisy samples incur higher training losses. (2) Debiasing Module: We leverage a vision-language model to re-evaluate the discarded samples. This recovers potentially valuable abnormal instances that were mistakenly removed, as they are similar to noisy samples but difficult for the model to recognize. D^2MIL is a general-purpose denoising strategy that can be integrated into any MIL-based method. Our extensive experiments on three benchmark datasets (ShanghaiTech, UCF-Crime, and MSAD) demonstrate that D^2MIL is compatible with diverse MIL frameworks and consistently enhances their performance.
Paperid: 3881,   Poster  
Authors: Feiyu Wang, Jiayuan Yang, Zhiyuan Zhao, Da Zhang, Bingyu Li, Peng Liu, Junyu Gao
Title: IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator–Critic Framework
Abstract: Scalable Vector Graphics (SVG) are central to digital design due to their inherent scalability and editability. Despite significant advancements in content generation enabled by Visual Language Models (VLMs), existing text-to-SVG generation methods are limited by a core challenge: the autoregressive training process does not incorporate visual perception of the final rendered image, which fundamentally constrains generation quality. To address this limitation, we propose an Introspective SVG Generation Framework (IntroSVG). At its core, the framework instantiates a unified VLM that operates in a closed loop, assuming dual roles of both generator and critic. Specifically, through Supervised Fine-Tuning (SFT), the model learns to draft SVGs and to provide feedback on their rendered outputs; moreover, we systematically convert early-stage failures into high-quality error-correction training data, thereby enhancing model robustness. Subsequently, we leverage a high-capacity teacher VLM to construct a preference dataset and further align the generator's policy through Direct Preference Optimization (DPO). During inference, the optimized generator and critic operate collaboratively in an iterative "generate-review-refine" cycle, starting from imperfect intermediate drafts to autonomously improve output quality. Experimental results demonstrate that our method achieves state-of-the-art performance across several key evaluation metrics, generating SVGs with more complex structures, stronger semantic alignment, and greater editability. These results corroborate the effectiveness of incorporating explicit visual feedback into the generation loop.
Paperid: 3882,   Poster  
Authors: Vishal Pramanik, Maisha Maliha, Susmit Jha, Alvaro Velasquez, Olivera Kotevska, Sumit Jha
Title: Selective Amnesia using Contrastive Subnet Erasure for Class Level Unlearning in Vision Models
Abstract: We study concept-level forgetting in pretrained vision models: removing an entire semantic category so the system no longer recognizes that object in unseen images and contexts, rather than merely forgetting specific training examples. Prior work either applies blunt global projections or fine-tunes parameters, which can introduce collateral damage to unrelated features, add compute, and become unstable as forgetting strength increases. We introduce Contrastive Subnet Erasure (CSE), a training-free, encoder-centric edit that targets a compact set of channels most responsible for the class and attenuates them in a calibrated manner. The modification is algebraically folded into the subsequent layer, yielding no inference-time overhead and leaving task heads unchanged. To evaluate whether forgetting generalizes beyond the data used to specify the class, we introduce a cross-dataset protocol in which the class is defined on a source dataset and performance is measured on a disjoint target dataset drawn from a different distribution with no shared images. This setup tests whether the model still fails to recognize the object when it looks different or appears in new scenes, and it avoids overfitting to patterns of the source dataset. Across CIFAR-10, CIFAR-100, and ImageNet under this protocol, CSE achieves stronger forgetting of the target class while better preserving non-target utility than existing baselines in both single-class and multi-class settings. Overall, CSE provides a simple, stable, and deployment-ready mechanism for class-level unlearning in vision.
Paperid: 3883,   Poster  
Authors: Wubin Shi, Shaoyan Gai, Feipeng Da
Title: PoseGaussian: 6D Pose Estimation for Unseen Objects via Sparse-View Object-Level 3D Gaussian Splatting
Abstract: 6D pose estimation is a key technology in computer vision and robotic manipulation. However, many methods remain heavily dependent on CAD models that are difficult to obtain. Object-level 3D reconstruction provides an alternative route, and 3D Gaussian Splatting (3DGS) shows convincing potential owing to its training and rendering efficiency. Nevertheless, under sparse reference views, 3DGS is prone to floating artifacts and appearance overfitting, which weakens the stability of pose estimation. We present PoseGaussian, a method for sparse-view 6D pose estimation of unseen objects that builds on improved 3DGS. First, we use sparse RGB-D views to inject a depth structure prior into the 3DGS initialization for stable structure, and we adopt adaptive density control, view-warping augmentation, and joint photometric–depth supervision to reduce floaters and appearance overfitting under sparse reference views. Next, in the pose estimation stage, we apply a two-stage learning-guided ICP initializer that exploits geometric features to obtain a stable initial pose. Finally, we introduce a 3DGS-based iterative pose refiner that aligns rendered and query images in both appearance and geometry, further improving pose estimation accuracy. Experiments on LINEMOD, GenMOP, and our real-world datasets show that PoseGaussian achieves significant improvements over baseline methods under model-free and sparse-view settings, demonstrating strong generalization to unseen objects and robustness to view sparsity.
Paperid: 3884,   Poster  
Authors: Jiahao Zhang, Joseph Liu, Young-Yoon Lee, Seonghyeon Moon, Victor Zordan, Guy Tevet, Karen Liu, Stephen Gould, Oren Jacob, Haomiao Jiang, Mubbasir Kapadia, Yizhak Ben-Shabat
Title: RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation
Abstract: Success in generative modeling across language, image, and video demonstrates that large, well-curated datasets are the key driver for building capable models. 3D human motion, however, has lagged behind, constrained by an unsatisfying choice between small, high-fidelity motion capture datasets and large-scale in-the-wild collections dominated by static or low-quality sequences. We introduce RoMo, a rich, large-scale, carefully curated dataset of in-the-wild human motions that resolves these tradeoffs. To ensure quality, we introduce a taxonomy-aware filtering pipeline that aggressively removes static and artifact-prone sequences. Every sequence is annotated with detailed captions and organized by a novel three-level semantic taxonomy. This hierarchical structure provides the first benchmark for fine-grained, per-category evaluation, revealing model strengths and weaknesses obscured by global metrics. We demonstrate that models trained on RoMo achieve state-of-the-art fidelity and diversity while gaining a superior understanding of complex, subtle text prompts. Finally, we release the Motion Toolbox to standardize metrics, data conversion, and visualization, establishing a foundation for reproducible and interpretable motion generation research.
Paperid: 3885,   Poster  
Authors: Yuanlin Wang, Ruiqin Xiong, Jiyu Xie, Zhenkun Zhu, Zhaofei Yu, Xiaopeng Fan, Tiejun Huang
Title: Spk2VidNet: A Hierarchical Recurrent Architecture for High-Fidelity Video Reconstruction from Long Spike-Camera Streams
Abstract: A spike camera is a neuromorphic vision sensor with ultra-high temporal resolution, capable of capturing fast-moving scenes by firing a stream of binary spikes. However, its relatively low spatial resolution limits the acquisition of fine-grained visual details, motivating research on spike camera super resolution (SCSR). Existing SCSR methods typically operate on fixed-length spike sequences, where the accessible information is confined to a local temporal neighborhood. Moreover, spike fluctuations hinder intensity information extraction. Both factors affect the performance of SCSR. To address these issues, we propose a hierarchical recurrent network named Spk2VidNet to reconstruct high-fidelity, high-resolution image sequences from low-resolution spike data. To mitigate fluctuations, Spk2VidNet progressively exploits temporal correlations within the spike stream to enhance feature representation by hierarchically enlarging temporal receptive fields. Within the recurrent phase, we introduce an alignment module that leverages the motion consistency among multiple frames to jointly estimate and mutually refine inter-frame motions, achieving more accurate temporal alignment. In addition, we propose a fusion module to adaptively integrate neighboring aligned features based on multi-scale similarity for robust feature aggregation. We further propose a segment-wise training with state transfer strategy to efficiently model long-term dependencies with limited GPU memory, thereby leveraging rich subpixel cues for improved super resolution. Experiments on synthetic and real-captured spike data demonstrate that Spk2VidNet achieves state-of-the-art performance.
Paperid: 3886,   Poster  
Authors: Jeongbin Hong, Dooseop Choi, Taeg-Hyun An, KYOUNG AN AN, Kyoung-Wook Min
Title: CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird’s-Eye-View Semantic Segmentation
Abstract: Transforming image features from perspective view (PV) space to bird's-eye-view (BEV) space remains challenging in autonomous driving due to depth ambiguity and occlusion. Although several view transformation (VT) paradigms have been proposed, the challenge remains. In this paper, we propose a new regularization framework, dubbed CycleBEV, that enhances existing VT models for BEV semantic segmentation. Inspired by cycle consistency, widely used in image distribution modeling, we devise an inverse view transformation (IVT) network that maps BEV segmentation maps back to PV segmentation maps and use it to regularize VT networks during training through cycle consistency losses, enabling them to capture richer semantic and geometric information from input PV images. To further exploit the capacity of the IVT network, we introduce two novel ideas that extend cycle consistency into geometric and representation spaces. We evaluate CycleBEV on four representative VT models covering three major paradigms using the large-scale nuScenes dataset. Experimental results show consistent improvements, with gains of up to 0.74, 4.86, and 3.74 mIoU for drivable area, vehicle, and pedestrian classes, respectively, without increasing inference complexity, since the IVT network is used only during training.
Paperid: 3887,   Poster  
Authors: Hyunha Hwang, Xuan Nguyen, Hyuk-Jae Lee
Title: LS-ViT: Least-Squares Hessian Based Block Reconstruction for Low-Bit Post-Training Quantization of Vision Transformers
Abstract: Vision Transformers (ViTs) have achieved state-of-the-art results across various vision tasks. To enable practical deployment of ViTs on modern hardware systems, post-training quantization (PTQ) has been actively studied in recent years. In particular, Hessian-based block reconstruction approaches have demonstrated promising results in quantizing ViT models to ultra-low bitwidths (e.g., 4-bit). However, finding a representative approximate Hessian, a fundamental step in recent approaches such as APHQ-ViT and FIMA-Q, remains underexplored in terms of the quantization-induced error and estimation cost. To address these shortcomings, we first reveal that the sample independence assumption used in recent works, which ignores the covariance term, can lead to a significant approximation error, especially below four bits. Inspired by least-squares regression, we propose LS-ViT, a block reconstruction framework that effectively estimates a representative Hessian by explicitly minimizing this approximation error across all samples. Extensive experiments with various ViT models across different vision tasks demonstrate that LS-ViT achieves new state-of-the-art performance. In addition, LS-ViT reduces quantization time compared to prior work, enabling a practical, plug-and-play, quantization-aware deployment for ViTs. The code will be made available.
Paperid: 3888,   Poster  
Authors: Khushboo Mishra, Varun Trivedi, Tanima Dutta
Title: Towards Uncertainty-aware Unsupervised Domain Adaptation for Videos and Time-Series with Causal Optimal Transport
Abstract: Unsupervised domain adaptation (UDA) for videos and 1D time-series data faces significant challenges due to domain shifts in terms of both temporal dynamics and feature distributions. Existing UDA approaches for time-series data often address temporal alignment and uncertainty mitigation as separate objectives, leading to unstable training, noisy pseudo-labels, and incomplete feature transfer. This disjoint treatment fails to capture inter-channel causal dependencies and also overlooks the impact of prediction uncertainty on adaptation quality. This limits the transferability of learned representations and results in suboptimal adaptation. To address the aforementioned limitations, we propose a novel UDA framework, named Causally-Regularized Optimal Transport (in short Causal-OT), that preserves domain-invariant causal mechanisms by embedding causal graph regularization into a robust OT alignment process. First, we estimate inter-channel causal graphs in both source and target domains and learn a transport plan that not only aligns feature distributions but also improves interpretability and minimizes the discrepancy between causal structures of the Granger graphs. However, pseudo-labeling may still be prone to error propagation, allowing incorrect target predictions during self-training and degrading model stability and transfer quality across domains. To mitigate this, we further introduce a causality-aware pseudo-labeling strategy that selects high-confidence target samples based on both entropy and structural consistency with the causal graph of the source domain. This enhances robustness against pseudo-label noise. Extensive experiments on six time-series benchmarks show a 4.5% gain in accuracy and a 3.8% improvement in F1-score. Experiments on four benchmark video datasets show a 2.5% gain in accuracy.
Paperid: 3889,   Poster  
Authors: Yinuo Wang, Yanbo Fan, Xuan Wang, Boyao Zhou, Yu Guo, Yujun Shen, Fei Wang
Title: MimicTalker: A Multimodal Interactive and Memory-Enhanced Framework for Real-Time Dyadic 3D Head Generation
Abstract: Dyadic interactive head generation aims to synthesize realistic head motions that respond both verbally and nonverbally to an interlocutor in real-time conversation. Existing works often focus on offline scenarios and struggle with a shallow understanding of the multimodal conversational context while also lacking long-term coherence. To address these limitations, we propose MimicTalker, a novel method for producing real-time, contextually aware, and long-term consistent interactive head motions. To this end, we propose a Multimodal Interactive Context Extraction (MICE) module to capture both instantaneous and long-term multimodal interactive information from the interlocutor. To enhance in-depth conversational understanding, we propose a Semantic-enhanced Dynamic Interaction (SDI) module to integrate the intentions and topics of the conversation, which are automatically extracted through an LLM-based analyzer. Further, we propose a semantic-guided Motion Style Memory (MSM) mechanism, enabling long-term motion consistency throughout the conversation. We conduct experiments on both short conversational segments (25 seconds) and extended dialogues (6 minutes), and the results demonstrate that our method significantly outperforms existing approaches.
Paperid: 3890,   Poster  
Authors: Qin Wang, Abigail Morrison, Hanno Scharr, Kai Krajsek
Title: TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction
Abstract: Learning temporal transformations, that is, how visual objects evolve across frames, is a fundamental challenge in video representation learning. Frame-to-frame dynamics involve complex, non-linear, and non-local changes that go far beyond conventional spatial augmentations. We propose TimeBridge, a self-supervised method that combines the joint embedding for video representation with learning temporal transformations by reconstructing in-between frames from only the start and end frames. This formulation encourages the model to infer the temporal evolution bridging the two endpoints, rather than merely encoding static frame representations. Unlike joint-embedding methods that lack explicit transformation modelling or future-prediction objectives that rely on unconstrained extrapolation, TimeBridge learns concrete frame-to-frame dynamics by promoting temporal consistency. We realise this through cross-concatenated class tokens and lightweight decoders, which recombine features from the start and end frames to reconstruct intermediates. TimeBridge achieves new state-of-the-art performance on multiple dense video prediction benchmarks, including 73.5 J&F on DAVIS 2017 video object segmentation and 47.5 mIoU on VIP part propagation.
Paperid: 3891,   Poster  
Authors: Zijia Dai, Nico Messikommer, Rong Zou, Nikola Zubic, Davide Scaramuzza, Laurent Kneip
Title: FastEventDGS: Deformable Gaussian Splatting for Fast Dynamic Scenes from a Single Event Camera
Abstract: The demand for dynamic 3D assets in AR/VR has recently popularized Deformable Gaussian Splatting. However, traditional RGB cameras are limited in their ability to reconstruct high-speed scenes due to motion blur and low temporal resolution. While event cameras offer a promising alternative, reconstructing a complete scene from their sparse and noisy output is a significant challenge. Existing event-based methods rely on an auxiliary sensor, such as a frame camera, thereby inducing tedious hardware and calibration challenges. We introduce FastEventDGS, a novel Deformable Gaussian Splatting-based framework that leverages a single event camera for high-fidelity 4D reconstruction. Our method utilizes a continuous camera trajectory parametrization and integrates two event generation models to provide both photometric and geometric constraints. We further propose a local patch event motion loss to constrain object motion, effectively mitigating overfitting. To ensure robust reconstruction, we employ an off-the-shelf model for depth correction and apply noise regularization terms in the final stage. We demonstrate robust results on both new synthetic and real-world datasets, highlighting our framework's ability to provide a simplified, event-only solution for high-fidelity 4D reconstruction in dynamic scenes.
Paperid: 3892,   Poster  
Authors: JIAXUN GUO, Wentao Fan, Manar Amayri, Nizar Bouguila
Title: 4D Local Modeling Toward Dynamic Global Perception for Ambiguity-free Rotation-Invariant Point Cloud Analysis
Abstract: Rotation invariance remains a core challenge in point cloud analysis, where existing methods often struggle with structural ambiguities and insufficient global context. Most rotation-invariant (RI) representations are derived from local coordinate systems, which inherently suffer from point-pair ambiguities and fail to capture discriminative features in symmetric or repetitive structures, while discarding informative global pose cues. To overcome these limitations, we propose Ga4DPF, a novel framework that offers a robust, global-aware RI representation by converting rotation-equivariant geometric representations into invariant ones, while concurrently integrating global pose awareness. Specifically, Ga4DPF introduces a learnable steerable transform that equivariantly lifts point clouds into 4D space, facilitating robust local feature construction and mitigating point-pair ambiguities. Concurrently, we model a dynamic global pose reference using the Bingham distribution, which adaptively estimates a consistent global rotation and enhances global feature discriminability. Extensive experiments on multiple benchmark datasets demonstrate that Ga4DPF achieves state-of-the-art performance with high computational efficiency, offering a new paradigm for rotation-invariant point cloud analysis.
Paperid: 3893,   Poster  
Authors: Yulong Liu, Hua Xu, Yiyang Cai, Chunyang Jiang, Sirui Han, Yike Guo
Title: Modeling the Brain’s Grammar: ROI-Guided fMRI Pretraining for Transferable and Interpretable Vision Decoding
Abstract: Recent advances in fMRI pretraining have significantly improved visual decoding accuracy by leveraging cross-subject neuroimaging datasets. A prevailing strategy aligns individual fMRI signals into a shared feature space using subject-specific adapters, followed by a shared decoder. However, this unstructured feature space overlooks the redundancy and functional correlations among voxels and fails to incorporate the brain’s intrinsic functional architecture centered on regions of interest (ROIs). To address these limitations, we propose ROITok, an ROI-guided fMRI pretraining framework. Our method introduces Sparse ROI Context Fusion to learn ROI-level visual representations and captures functional synergy between ROIs from cross-subject data. Inspired by Matryoshka Representation Learning (MRL), we design an embedding compression scheme that prioritizes the most informative visual components first, with later tokens adding progressively finer but still useful details. ROITok achieves strong transfer learning performance on the NSD and GOD datasets and shows strong resilience against high-level additive noise, while offering better interpretability and enabling new applications. It allows for quantitative assessment of each brain region’s contribution to decoding tasks. Our analysis shows that ROI-based pretraining can automatically learn the brain’s visual hierarchy. Different ROIs can provide complementary contexts for decoding tasks; combining them improves decoding robustness.
Paperid: 3894,   Poster  
Authors: Yizheng Gong, Siyue Yu, Waleed Al-Nuaimy, Jimin Xiao
Title: Learning from Itself: Mining Internal Knowledge from Vision Language Models for Continual Learning
Abstract: Vision-language models like CLIP excel at zero-shot recognition but struggle with continual learning due to two critical issues: (1) a severe distribution gap between pretraining captions and post-training class names, and (2) a performance mismatch between vision-only and dual-encoder approaches: vision-only methods achieve 20% higher accuracy on fine-grained tasks while CLIP dominates on natural images. We propose Learning from Itself (LfI), which mines CLIP's internal knowledge to address both challenges. First, we generate pseudo-captions by optimizing learnable tokens to minimize CLIP's contrastive loss, creating auxiliary training signals that bridge the pretraining-finetuning distribution gap without external models. Second, we introduce adaptive mutual distillation that dynamically weights knowledge transfer between CLIP's text encoder and a temporary vision classifier based on their instantaneous performance, so that stronger branches teach more and weaker ones learn more. At inference, only the original CLIP architecture is used, having absorbed discriminative knowledge from both branches. LfI achieves state-of-the-art results across multiple continual learning benchmarks, demonstrating that CLIP can effectively teach itself to continually learn new tasks.
Paperid: 3895,   Poster  
Authors: Ahyoung Oh, Wonseok Shin, Songkuk Kim
Title: Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
Abstract: Sparse Autoencoders (SAEs) have demonstrated significant success in interpreting Large Language Models (LLMs) by decomposing dense representations into sparse, semantic components. However, their potential for analyzing Vision Transformers (ViTs) remains largely underexplored. In this work, we present the first application of SAEs to the ViT [CLS] token for out-of-distribution (OOD) detection, addressing the limitation of existing methods that rely on entangled feature representations. We propose a novel framework utilizing a Top-k SAE to disentangle the dense [CLS] features into a structured latent space. Through this analysis, we reveal that in-distribution (ID) data exhibits consistent, class-specific activation patterns, which we formalize as Class Activation Profiles (CAPs). Our study uncovers a key structural invariant: while ID samples preserve a stable pattern within CAPs, OOD samples systematically disrupt this structure. Leveraging this insight, we introduce a scoring function based on the divergence of core energy profiles to quantify the deviation from ideal activation profiles. Our method establishes new state-of-the-art results on the FPR95 metric—critical for safety-sensitive applications—across multiple benchmarks, while also achieving competitive AUROC. Overall, our findings demonstrate that the sparse, disentangled features revealed by SAEs can serve as a powerful, interpretable tool for robust OOD detection in vision models.
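Illustrative sketch (not the authors' implementation): the Top-k SAE mentioned in the abstract above can be pictured as an encoder that keeps only the k largest latent pre-activations per input and zeroes the rest. The function names, the weight shapes, and the ReLU applied to the kept values are assumptions made for this example; the abstract does not specify the architecture, the Class Activation Profiles, or the scoring function, so none of those are reproduced here.

```python
import numpy as np

def topk_sae_encode(x, W_enc, b_enc, k):
    """Hypothetical Top-k encoder: keep the k largest pre-activations per row."""
    pre = x @ W_enc + b_enc                      # (n_samples, n_latents)
    # Column indices of the k largest entries in each row (order not guaranteed).
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    z = np.zeros_like(pre)
    # Assumption: a ReLU on the kept values, so codes are non-negative.
    z[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    return z

def topk_sae_decode(z, W_dec, b_dec):
    """Linear decoder mapping sparse codes back to the dense feature space."""
    return z @ W_dec + b_dec

# Toy usage with random weights standing in for a trained SAE.
rng = np.random.default_rng(0)
cls_features = rng.normal(size=(4, 8))           # stand-in for ViT [CLS] features
W_enc, b_enc = rng.normal(size=(8, 32)), np.zeros(32)
W_dec, b_dec = rng.normal(size=(32, 8)), np.zeros(8)
codes = topk_sae_encode(cls_features, W_enc, b_enc, k=5)
recon = topk_sae_decode(codes, W_dec, b_dec)
```

Each row of `codes` has at most k non-zero entries, which is the sparsity property the abstract relies on when it speaks of a structured latent space.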
Paperid: 3896,   Poster  
Authors: Shenghui Huang, Menghao Hu, Longkun Zou, Hongyu Chi, Zekai Li, Feng Gao, Fan Yang, Qingyao Wu, Ke Chen
Title: UAVCB: A Complex-Background RGB–T Dataset and Local Frequency Bridge Network for UAV Detection
Abstract: Detecting Unmanned Aerial Vehicles (UAVs) in low-altitude environments is essential for perception and defense systems but remains highly challenging due to complex backgrounds, camouflage, and multimodal interference. In real-world scenarios, UAVs are frequently visually blended with surrounding structures such as buildings, vegetation, and power lines, resulting in low contrast, weak boundaries, and strong confusion with cluttered background textures. Existing UAV detection datasets, though diverse, are not specifically designed to capture these camouflage and complex-background challenges, which limits progress toward robust real-world perception. To fill this gap, we construct UAV-CB, a new RGB–T UAV detection dataset deliberately curated to emphasize complex low-altitude backgrounds and camouflage characteristics. Furthermore, we propose the Local Frequency Bridge Network (LFBNet), which models features in localized frequency space to bridge both the frequency–spatial fusion gap and the cross-modality discrepancy gap in RGB–T fusion. Extensive experiments on UAV-CB and public benchmarks demonstrate that LFBNet achieves state-of-the-art detection performance and strong robustness under camouflaged and cluttered conditions, offering a frequency-aware perspective on multimodal UAV perception in real-world applications. The UAV-CB dataset will be publicly released to support future research.
Paperid: 3897,   Poster  
Authors: Yiming Hao, Mutian Xu, Chongjie Ye, Jie Qin, Shunlin Lu, Yipeng Qin, Xiaoguang Han
Title: LoFA: Learning to Predict Personalized Prior for Fast Adaptation of Visual Generative Models
Abstract: Personalizing visual generative models to meet specific user needs has gained increasing attention, yet current methods like Low-Rank Adaptation (LoRA) remain impractical due to their demand for task-specific data and lengthy optimization. While a few hypernetwork-based approaches attempt to predict adaptation weights directly, they struggle to map fine-grained user prompts to complex LoRA distributions, limiting their practical applicability. To bridge this gap, we propose LoFA, a general framework that efficiently predicts personalized priors for fast model adaptation. We first identify a key property of LoRA: structured distribution patterns emerge in the relative changes between LoRA and base model parameters. Building on this, we design a two-stage hypernetwork: first predicting sparse response maps that capture key adaptation regions, then using these to guide final LoRA weight prediction. Extensive experiments demonstrate that our method consistently predicts high-quality personalized priors within seconds, across multiple tasks and user prompts, even outperforming conventional LoRA that requires hours of processing.
Paperid: 3898,   Poster  
Authors: Fan Yang, Yuanzhi Zhao, Haimei Zhao, Yudong Zhao, Haikun Xu
Title: Mask to Align, Weight to Disambiguate: Reliable Unsupervised Cross-Modal Hashing with Masked-Weight Contrast
Abstract: In unsupervised cross-modal hashing, real-world data often exhibit partial alignment and semantic mismatch: dominant modalities tend to overrule fusion, fine-grained complementary cues are overlooked, and mini-batch “negative samples” are contaminated by semantically related items, yielding frequent false negatives. Treating all pairs equally in contrastive learning thus makes training noise-prone and ill-suited to partially aligned data. To mitigate these issues, we present Unsupervised Weighted Masked Contrastive Hashing (UWMCH), whose core is: (i) random masked fusion deliberately suppresses part of modality evidence during feature interaction, forcing the model to learn complementary semantics under diverse “partial interactions,” avoiding reliance on a single modality and explicitly exposing hard cases; (ii) pairwise weighting no longer treats masked and unmasked pairs as equivalent but adaptively assigns a weight to each cross-modal pair by combining instance-level semantic consistency with a K-means induced cluster-consensus prior, injecting the weight into the contrastive objective to suppress suspected false negatives and amplify more informative masked positives. To stabilize the global structure, we further introduce two constraints: Cluster-Centroid Agreement (CCA) forms global semantic anchors at the prototype level in synergy with UWMCH; Semantic Structure Regularization (SSR) builds higher-order semantic structure and aligns it with cross-modal similarity, maintaining intra-modal compactness and inter-modal separability under masking. Extensive benchmark experiments show that UWMCH achieves better retrieval accuracy and convergence stability across multiple datasets. The code will be released.
Paperid: 3899,   Poster  
Authors: Wei Tao, Yang Dai, Jincai Huang, Qing Tao
Title: The Power of Decaying Steps: Enhancing Attack Stability and Transferability for Sign-based Optimizers
Abstract: Crafting adversarial examples can be formulated as an optimization problem. While sign-based optimizers such as I-FGSM and MI-FGSM have become the de facto standard for the induced optimization problems, there still exist several unsolved problems in theoretical grounding and practical reliability, especially non-convergence and instability, which inevitably affect their transferability. Contrary to expectation, we observe that the attack success rate may degrade sharply when more iterations are conducted. In this paper, we address these issues from an optimization perspective. By reformulating the sign-based optimizer as a specific coordinate-wise gradient descent, we argue that one cause for non-convergence and instability is their non-decaying step-size scheduling. Based upon this viewpoint, we propose a series of new attack algorithms that enforce Monotonically Decreasing Coordinate-wise Step-sizes (MDCS) within sign-based optimizers. We further provide theoretical guarantees proving that MDCS-MI attains an optimal convergence rate of O(1/\sqrt{T}), where T is the number of iterations. Extensive experiments on image classification and cross-modal retrieval tasks demonstrate that our approach not only significantly improves transferability but also enhances attack stability compared to state-of-the-art sign-based methods.
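Illustrative sketch (not the MDCS algorithms from the paper): the role of a decaying step size in a sign-based update can be seen on a toy one-dimensional quadratic. With a constant step, sign descent ends up hopping back and forth around the minimizer by a fixed amount; with a step that shrinks like 1/sqrt(t+1), the iterate settles near it. The objective, the step schedules, and the function names are assumptions chosen for this example, and the projection onto a perturbation budget used by I-FGSM-style attacks is omitted.

```python
import numpy as np

def sign_descent(grad_fn, x0, steps, step_size_fn):
    """Iterate x <- x - a_t * sign(grad(x)) with a caller-supplied schedule a_t."""
    x = np.asarray(x0, dtype=float).copy()
    for t in range(steps):
        x = x - step_size_fn(t) * np.sign(grad_fn(x))
    return x

# Toy objective 0.5 * (x - 0.25)^2 with minimizer at 0.25 (hypothetical).
target = 0.25
grad = lambda x: x - target

# Constant step: the iterate keeps alternating between two points around the minimizer.
x_const = sign_descent(grad, np.array([0.0]), 400, lambda t: 0.1)

# Decaying step: the overshoot shrinks together with the step size.
x_decay = sign_descent(grad, np.array([0.0]), 400, lambda t: 0.1 / np.sqrt(t + 1))

err_const = abs(float(x_const[0]) - target)
err_decay = abs(float(x_decay[0]) - target)
```

In this toy setting `err_decay` is much smaller than `err_const`, which mirrors the abstract's argument that a non-decaying schedule is one source of non-convergence.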
Paperid: 3900,   Poster  
Authors: Jingqiao Xiu, Can Wang, Dong Xu
Title: MLLMSplat: A 2D MLLM-Powered Framework for 3D Gaussian Splatting Understanding, Generation, and Editing
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a mainstream representation for 3D scenes, drawing increasing research attention to its understanding, generation, and editing. However, existing studies remain limited to low-level perception, low-quality generation, and low-efficiency editing, lagging far behind their image counterparts in the era of Multimodal Large Language Models (MLLMs). To bridge this gap, we propose MLLMSplat, a novel framework that adapts 2D MLLMs to achieve high-level understanding, high-quality generation, and high-efficiency editing of 3DGS scenes. Specifically, our comprehensive framework consists of three core designs: (1) a 3DGS tokenizer that can be seamlessly integrated into existing MLLMs in a training-free manner; (2) a 3DGS de-tokenizer that non-intrusively extends the 2D latent diffusion model in MLLMs using a dual positional encoding space, while augmenting it with a jointly trained and sampled 3DGS decoder; and (3) a surrogate task that enhances feedforward editing capabilities. Extensive experiments demonstrate that MLLMSplat delivers state-of-the-art performance across 3DGS understanding, generation, and editing.
Paperid: 3901,   Poster  
Authors: Kaichen Zhang, Keming Wu, Zuhao Yang, Bo Li, Kairui Hu, Bin Wang, Xingxuan Li, Lidong Bing
Title: OpenMMReasoner: Pushing the Frontiers in Multimodal Reasoning with an Open and General Recipe
Abstract: Recent advancements in reasoning language models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual and video reasoning, the lack of transparent and reproducible data curation and training pipelines remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874k-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74k-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves a 9.5% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research.
Paperid: 3902,   Poster  
Authors: Shuo Zhang, Chenqi Li, Tingting Zhu
Title: Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition
Abstract: Long-tailed recognition poses a significant challenge for deep learning. The two-stage decoupling paradigm, which separates representation learning from classifier retraining, offers a promising solution. During the classifier retraining stage, adaptive norm rescaling is a popular technique. It adjusts the per-class weight norms via parameter regularization, which inevitably introduces hyperparameters. However, many studies report that long-tailed recognition is sensitive to these hyperparameters, as their setup significantly impacts performance. In this paper, we first provide a class-conditional distribution perspective to support norm rescaling methods. Furthermore, we propose a simple but effective approach called Self-Adaptive Monotonic Normalization (SAMN). SAMN avoids the need for parameter regularization. It directly enforces monotonicity on per-class weight norms using the Pool Adjacent Violators Algorithm, making the method hyperparameter-friendly. SAMN is a universal strategy that integrates seamlessly with other methods for enhanced performance. Experiments on benchmark datasets demonstrate that our method significantly boosts long-tailed recognition performance, often achieving state-of-the-art results.
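Illustrative sketch (not the SAMN implementation): the Pool Adjacent Violators Algorithm named in the abstract above computes the least-squares monotone fit of a sequence by repeatedly averaging neighbouring values that break the required order. The example below applies it to per-class classifier weight norms. The class ordering, the choice of a non-increasing direction, and the helper names are assumptions made for illustration; the abstract does not state how SAMN orders classes or uses the fitted norms.

```python
import numpy as np

def pava_non_increasing(values):
    """Least-squares projection of `values` onto non-increasing sequences."""
    # Negate so that the standard non-decreasing formulation applies.
    y = -np.asarray(values, dtype=float)
    blocks = []                                  # each block is [sum, count]
    for v in y:
        blocks.append([v, 1])
        # Pool adjacent blocks while their means violate non-decreasing order.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    if not blocks:
        return np.zeros(0)
    fitted = np.concatenate([np.full(c, s / c) for s, c in blocks])
    return -fitted

def rescale_rows_to_monotone_norms(W):
    """Hypothetical helper: rescale each class weight row to its fitted norm.

    Assumes rows of W are ordered from most to least frequent class.
    """
    norms = np.linalg.norm(W, axis=1)
    target = pava_non_increasing(norms)
    scale = target / np.maximum(norms, 1e-12)    # guard against zero rows
    return W * scale[:, None]

# Toy classifier weights whose row norms are 3.0, 1.0, 2.0, 0.5.
W = np.array([[3.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.5]])
fitted_norms = pava_non_increasing(np.linalg.norm(W, axis=1))
W_rescaled = rescale_rows_to_monotone_norms(W)
```

The pooling step has no tunable parameters, which is consistent with the abstract's description of the approach as hyperparameter-friendly.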
Paperid: 3903,   Poster  
Authors: Yixian Wang, HaoLin Yu, Jiadong Tang, Yu Gao, Xihan Wang, Yufeng Yue, Yi Yang
Title: FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinkage for Large-scale LoD 3D Gaussian Splatting
Abstract: 3D Gaussian Splatting has revolutionized neural rendering with real-time performance. However, scaling this approach to large scenes using Level-of-Detail methods faces critical challenges: inefficient serial traversal consuming over 60% of rendering time, and redundant Gaussian-tile pairs that incur unnecessary processing overhead. To address these limitations, we propose FilterGS, featuring a parallel filtering mechanism with two complementary filters that enable efficient selection without tree traversal, coupled with a scene-adaptive Gaussian shrinkage strategy that minimizes redundancy through opacity-based scaling. Extensive experiments demonstrate that FilterGS achieves state-of-the-art rendering speeds while maintaining competitive visual quality across multiple large-scale datasets.
Paperid: 3904,   Poster  
Authors: Xiaojun Deng, Tianchi Liao, Zhiyuan Liu, Chuan Chen, Zibin Zheng
Title: Guiding Diffusion Models with Fine-Grained Conditions and Semantics-Preserving Sampling for One-Shot Federated Learning
Abstract: One-shot Federated Learning (OSFL) has emerged as a promising paradigm to mitigate the high communication overhead of traditional federated learning. However, its effectiveness is often hampered by data heterogeneity across clients. While recent methods leverage pre-trained diffusion models to generate data for OSFL, they often struggle with some practical limitations, including a lack of semantic fidelity in capturing the fine-grained characteristics of local data, and insufficient diversity in the generated data, which collectively degrade the performance of the global model. To address these challenges, we propose Espresso, a novel framework that enhances both the fidelity and diversity of synthetic data in OSFL. Espresso consists of two main components: (1) Fine-Grained Condition Learning, which learns fine-grained conditional embeddings to improve semantic fidelity and diversity by modeling intra-category patterns, and (2) Semantics-Preserving Sampling, which diversifies the generated data by modeling the initial latent noise distribution and applying a self-reflection sampling strategy. Extensive experiments on benchmark datasets demonstrate that Espresso can improve the semantic fidelity and diversity of the synthetic data, leading to an enhancement in the performance of the global model compared to state-of-the-art OSFL methods under data heterogeneity.
Paperid: 3905,   Poster  
Authors: Xuan Huang, Mochu Xiang, Zhelun Shen, Jinbo Wu, Chenming Wu, Chen Zhao, Kaisiyuan Wang, Hang Zhou, Shanshan Liu, Haocheng Feng, Wei He, Jingdong Wang
Title: GenHOI: Towards Object-Consistent Hand–Object Interaction with Temporally Balanced and Spatially Selective Object Injection
Abstract: Hand–Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing competitors.
Paperid: 3906,   Poster  
Authors: Mingjie Ma, yichao ma, Zhong Yang, Guohui Li
Title: Seeing What Matters: A Training-Free Self-Guided Framework for Multimodal Detail Perception and Reasoning
Abstract: Multimodal large language models (MLLMs) have achieved remarkable success on diverse visual-language tasks. However, fixed-resolution models face challenges in perceiving fine-grained visual details, particularly due to distracted attention and blurry vision. To address these issues, we propose SLoFo, a training-free and self-guided inference framework that mimics the human "Scan-Locate-Focus" process. SLoFo first adopts a dual-branch mechanism to identify critical image regions: the Semantic branch constructs a gradient-based semantic relevance map, and the Structure branch estimates visual token uniqueness, offering complementary and robust evidence. By combining both branches, SLoFo perceives and explicitly crops critical regions. During inference, given the additional cropped sub-image, SLoFo applies a progressive visual token pruning strategy to improve attention focus on key areas while reducing computational overhead. Experiments on detail-sensitive and general-purpose benchmarks show that SLoFo consistently improves accuracy (+4.79% on TextVQA, +2.62% on GQA) and robustness (+4.60% on POPE-MSCOCO adversarial) without training or external modules.
Paperid: 3907,   Poster  
Authors: Bei Chen, Gaolei Li, Jun Wu, Jianhua Li
Title: DualMirage: Hunting Stealthy Multimodal LLM Agents via CAPTCHAs with Contour and Adversarial Illusions
Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has given rise to sophisticated autonomous agents capable of performing complex, human-like tasks across the web. However, this also introduces significant security risks, particularly from stealthy MLLM agents that can evade conventional detection mechanisms by mimicking human behavior. In this paper, we propose DualMirage, a novel CAPTCHA framework that proactively counters and identifies stealthy agents by exploiting fundamental disparities between human and machine perception. DualMirage employs a dual-pronged strategy: (1) Contour Illusions, which utilize cognitive principles to generate illusory contours that humans perceive effortlessly yet pose interpretation challenges for MLLMs; and (2) Adversarial Illusions, which embed human-imperceptible perturbations optimized to mislead the visual encoders of target MLLMs and thereby elicit characteristic, identifiable model responses. Evaluations on five state-of-the-art MLLMs demonstrate that DualMirage achieves an average 95.8% human success rate while blocking MLLM agents (up to 100% agent blocking rate), outperforming existing CAPTCHAs. Furthermore, DualMirage induces models to expose identities actively, achieving 58.8% white-box and 21.9% black-box attack success rates, proving effective against stealthy multimodal agents.
Paperid: 3908,   Poster  
Authors: Dingchuan Yu, Jiatong Li, Jingwen Zhou, Zhengyue Zhuge, Yueting Chen, Qi Li
Title: OMoBlur: An Object Motion Blur Dataset and Benchmark for Real-World Local Motion Deblurring
Abstract: Object motion blur in static scenes is spatially heterogeneous, differing from conventional deblurring problems yet frequently occurring in real handheld capture scenarios. Existing datasets either rely on costly beam-splitting capture with residual misalignment or employ synthetic blur that fails to model the continuous photon-integration process during exposure. To overcome these limitations, we introduce OMoBlur, a physically grounded dataset that emulates realistic exposure integration via programmable sensor control, ensuring close alignment between synthetic and real blur distributions. OMoBlur provides 20,000 blur–sharp–mask pairs covering diverse object motion types. Leveraging this dataset, we further propose OMDNet, an object-motion-aware deblurring network that integrates a Motion–Appearance Extract Block, a Flow-Guided Gate Predictor, and an Adaptive Gated Fusion mechanism. This design enables the network to selectively restore blurred regions while preserving static backgrounds, without requiring pixel-accurate mask annotations. Extensive experiments demonstrate that OMoBlur’s physically faithful data collection and large-scale diversity significantly enhance the network’s generalization to real-world motion blur, establishing OMoBlur and OMDNet as a robust benchmark and practical solution for local motion deblurring. The dataset and code will be publicly released.
Paperid: 3909,   Poster  
Authors: Yipene Bassole, Sungwoo Kim, Jiwoo Jung, Yunsick Sung
Title: HybridDriveVLA: Vision-Language-Action model with Visual CoT reasoning and ToT Evaluation for Autonomous Driving
Abstract: Vision-Language-Action (VLA) models are emerging as an important technology in autonomous driving, recognized for their sophisticated reasoning and interpretability. However, traditional VLA models often rely on image-to-text Chain-of-Thought (CoT) reasoning, which converts sequential visual scenes into textual symbols and thereby under-utilizes the spatial context in visual information. Existing autonomous driving systems using VLA models predict only a single sequence of waypoints as a trajectory, given a command and multiple aspects. In contrast, we suggest evaluating each candidate sequence of waypoints to reveal the importance of the corresponding aspect. We introduce HybridDriveVLA, a VLA model that integrates visual Chain-of-Thought (V-CoT) reasoning and a proposed Tree-of-Thought (ToT)-inspired waypoint evaluation (ToT-evaluation). V-CoT reasoning anticipates future scenes, which serve as goals for ToT-evaluation. The ToT-evaluation generates waypoints and scores them on each of the safety, progress, and comfort aspects. The waypoint sequence with the highest cumulative score across the three aspects is taken as optimal. To the best of our knowledge, we are the first to propose a unified method integrating both a CoT and a ToT approach in a VLA model. Experimental results demonstrate that HybridDriveVLA achieves strong performance on comfort, progress, and safety metrics, with an average collision rate of 0.17% on the nuScenes benchmark, outperforming traditional VLA models.
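The selection rule described in the abstract above (score each candidate on safety, progress, and comfort, then keep the one with the highest cumulative score) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_trajectory`, the dictionary layout, and the unweighted sum are assumptions made for illustration, and the abstract does not say how the per-aspect scores are produced or weighted.

```python
# Hypothetical sketch: pick the candidate waypoint sequence whose summed
# safety, progress, and comfort scores are highest. The unweighted sum and
# the data layout are assumptions, not details taken from the paper.
ASPECTS = ("safety", "progress", "comfort")


def cumulative_score(candidate):
    """Sum the three per-aspect scores of one candidate."""
    return sum(candidate[aspect] for aspect in ASPECTS)


def select_trajectory(candidates):
    """Return the candidate with the highest cumulative score."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=cumulative_score)


candidates = [
    {"name": "keep_lane", "safety": 0.9, "progress": 0.5, "comfort": 0.8},
    {"name": "overtake", "safety": 0.6, "progress": 0.9, "comfort": 0.5},
    {"name": "brake", "safety": 0.95, "progress": 0.1, "comfort": 0.6},
]
best = select_trajectory(candidates)
```

With these made-up scores the lane-keeping candidate wins because its total is highest, even though it is not the best on any single aspect except comfort.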
Paperid: 3910,   Poster  
Authors: Yingqi Fan, Junlong Tong, Anhao Zhao, Xiaoyu Shen
Title: From Global Alignment to Local Semantics: Understanding Visual Representations Structures in Multimodal LLMs
Abstract: Multimodal LLMs (MLLMs) convert images into visual tokens for language-model processing, yet how these tokens encode semantics remains unclear. In this paper, we identify a consistent token structure across models: visual tokens cluster into sink, dead, and alive groups, with only the alive tokens (≈60%) carrying meaningful information. Sink and dead tokens can be removed without hurting performance. Using a patch-compression benchmark and our probing tool EmbedLens, we show that alive tokens already encode fine-grained cues (objects, colors, OCR) before entering the LLM. Internal visual computation (visual attention and FFNs) is largely redundant and offers limited benefit for most tasks. This redundancy also extends to the model's depth: our analysis shows that alive tokens align best with mid-layer LLM representations, while shallow layers contribute little. These findings provide a unified view of visual semantics in MLLMs and motivate architectures that use fewer visual tokens, reduced visual computation, and mid-layer injection for better efficiency and interpretability.
Paperid: 3911,   Poster  
Authors: Jueqing Lu, Yuanyuan Qi, Xiaohao Yang, Shuaicheng Niu, Fucai Ke, Shujie Zhou, Wei Tan, Jionghao Lin, Wray Buntine, Hamid Rezatofighi, Lan Du
Title: DPL: Decoupled Prototype Learning for Enhancing Robustness of Vision–Language Transformers to Missing Modalities
Abstract: The performance of Vision–Language Transformers drops sharply when an input modality (e.g., image) is missing, because the model is forced to make predictions using incomplete information. Existing missing-aware prompt methods help reduce this degradation, but they still rely on conventional prediction heads (e.g., a fully connected layer) that compute class scores in the same way regardless of which modality is present or absent. We introduce Decoupled Prototype Learning (DPL), a new prediction head architecture that explicitly adjusts its decision process to the observed input modalities. For each class, DPL selects a set of prototypes specific to the current missing-modality case (image-missing, text-missing, or mixed-missing). Each prototype is then decomposed into image-specific and text-specific components, enabling the head to make decisions that depend on the information actually present. This adaptive design allows DPL to handle inputs with missing modalities more effectively while remaining fully compatible with existing prompt-based frameworks. Extensive experiments on MM-IMDb, UPMC Food-101, and Hateful Memes demonstrate that DPL outperforms state-of-the-art approaches on these widely used multimodal image–text datasets across various missing-modality cases.
Paperid: 3912,   Poster  
Authors: Yuan Wang, LI XIANG, Yali Li, XUEGE HOU, Shengjin Wang
Title: HSI-GPT2: A Dual-Granularity Large Motion Reasoning Model with Diffusion Refinement for Human–Scene Interaction
Abstract: Unified interpretation and synthesis of human behaviors within 3D environments is vital for advancing spatial intelligence and humanoid robotics. Despite recent advancements (e.g., HSI-GPT), two fundamental capabilities expected of a unified model, understanding and generation, still lag behind specialist models. This is primarily due to 1) a single-granularity codebook that overemphasizes low-frequency motion details while neglecting motion semantics, 2) the limited decoding capacity of the motion detokenizer, which restricts the fidelity of human–scene interactions, and 3) sole reliance on supervised fine-tuning (SFT), which fails to capture high-level motion semantics and logical reasoning with an end-to-end mapping. To address these issues, we develop HSI-GPT2, a reasoning-enhanced, dual-granularity motion-representational large Scene-Motion-Language model, powered by reinforcement learning (RL) with Chain-of-Thought (CoT) reasoning. First, HSI-GPT2 introduces a Dual-granularity Motion Tokenizer, DMoTok, which jointly preserves both fine-grained motion details and text-aligned motion semantics for various HSI-related tasks. Further, a motion diffusion decoder functions as a motion detokenizer, translating deep semantics and detailed features of LLMs into physically grounded human motions. Finally, we curate a Motion Chain-of-Thought (MoCoT) data engine and extend a Group Relative Policy Optimization (GRPO) paradigm to execute long-horizon and compositionally rich commands. Results on standard HSI benchmarks confirm the clear superiority of HSI-GPT2 in enhancing interaction quality, semantic alignment, behavioral diversity, and generalization to unseen 3D scenes.
Paperid: 3913,   Poster  
Authors: Baoteng Li, Xianghao Zang, Xinran Wang, Xiangyu Na, Zhixiang He, Hao Sun, Chi Zhang, Zhongjiang He, Tianwei Cao, Kongming Liang, Zhanyu Ma
Title: Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation
Abstract: Text-to-Image (T2I) generation technology has achieved remarkable progress in recent years. Concurrently, reinforcement learning methods, particularly those based on Group Relative Policy Optimization (GRPO), have attracted widespread attention and have been successfully applied to T2I tasks. However, the uniform sampling strategy commonly adopted during training often ignores the match between sample difficulty and the model’s current learning capability, leading to low training efficiency. We argue that the key to unleashing the model’s potential lies in continuously providing "high-value samples" that match its evolving competence. To this end, we propose Curriculum Group Policy Optimization (CGPO), an adaptive curriculum training framework. During training, each prompt is used to generate a group of images, and a reward model assigns a reward to each image. We use the variance of these rewards as a proxy indicator: higher variance implies that the model's understanding of the prompt is still unstable, indicating stronger learnability and thus higher value. CGPO adaptively constructs the curriculum by dynamically identifying and selecting high-value samples for training based on reward variance. Additionally, to address data imbalance in multi-category datasets, we design a category calibration method based on proportional fairness optimization, which balances training difficulty across categories. Experiments on GenEval, T2I-CompBench++, and DPG Bench demonstrate that our framework effectively improves generation performance.
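The reward-variance proxy in the abstract above can be sketched as follows. This is a simplified illustration under stated assumptions, not the CGPO algorithm itself: the function name `select_high_value_prompts`, the fixed `keep_ratio`, and the use of population variance are all hypothetical choices, and the category calibration step is omitted.

```python
# Hypothetical sketch of variance-based sample selection: each prompt has a
# group of rewards (one per generated image); prompts whose rewards vary the
# most are treated as the most learnable and kept for training.
import math
from statistics import pvariance


def select_high_value_prompts(group_rewards, keep_ratio=0.5):
    """Rank prompts by reward variance and keep the top fraction.

    group_rewards: dict mapping prompt -> list of per-image rewards.
    keep_ratio: assumed hyperparameter, fraction of prompts to keep.
    Returns (selected_prompts, variance_by_prompt).
    """
    variances = {p: pvariance(r) for p, r in group_rewards.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * keep_ratio))
    return ranked[:keep], variances


rewards = {
    "a red cube": [0.5, 0.5, 0.5, 0.5],             # stable: zero variance
    "two cats left of a dog": [0.0, 1.0, 0.0, 1.0],  # unstable: high variance
    "a blue bird": [0.4, 0.6, 0.4, 0.6],             # mildly unstable
}
selected, variances = select_high_value_prompts(rewards, keep_ratio=0.5)
```

In this toy example the prompt the model already handles consistently is dropped, while the two prompts with disagreeing rewards are kept.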
Paperid: 3914,   Poster  
Authors: Jiuyang Dong, Jiahan Li, Junjun Jiang, Yongbing Zhang
Title: FBTA: Enabling Single-GPU End-to-End Gigapixel WSI Classification with Feature Bridging and Translation Alignment
Abstract: Whole-slide images (WSIs) in computational pathology contain billions of pixels, making end-to-end training of feature extractors and multi-instance learning (MIL) networks infeasible on a single commodity GPU. Existing methods often freeze the feature extractor and train MIL networks on the resulting frozen features, which introduces a semantic gap that limits downstream performance. To address this issue, we propose FBTA, a Feature Bridging and Translation Alignment framework for WSI classification. FBTA is the first end-to-end MIL framework trainable on a single 24 GB GPU, leveraging three complementary feature-bag views: end-to-end features enable joint optimization, frozen features stabilize training, and translated features support practical inference. Experiments on diverse datasets, including TCGA-NSCLC (Shot20/50/100) and TCGA-STAD, demonstrate the effectiveness and generality of FBTA, which consistently improves performance across three MIL architectures and two extractors. For example, with ResNet-50 as the extractor, FBTA improves the accuracy of the classic ABMIL by 13.1% and 15.8% on the NSCLC-Shot50 and TCGA-STAD datasets, respectively, and further enhances the state-of-the-art MambaMIL by 4.1% and 9.2% on the same datasets. Moreover, FBTA yields additional gains for MIL models that incorporate self-supervised pretraining strategies and data augmentation techniques. These results suggest FBTA is a feasible and scalable framework for end-to-end MIL on gigapixel WSIs. The code will be available.
Paperid: 3915,   Poster  
Authors: Shengzhe Chen, Hao Yan
Title: D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation
Abstract: Convexity is a fundamental geometric prior that underlies many natural and man-made structures, yet remains challenging to impose effectively in end-to-end trainable segmentation networks. We revisit convexity from a functional perspective and propose a unified, threshold-free convexity prior based on quasi-concavity of the network output mask function u. Instead of constraining a single binary segmentation, we require all super-level sets of u to be convex, transforming global shape constraints into local, differentiable inequalities on u and its derivatives. From this principle we derive zeroth-, first-, and second-order characterizations, yielding respectively a local midpoint convexification operator, a gradient-based condition linked to supporting hyperplanes, and a sufficient second-order inequality expressed by a quadratic form on the tangent plane. The first- and second-order formulations produce a compact convolutional loss that can be densely applied across the image without thresholding. Our quasi-concavity losses integrate seamlessly with modern segmentation networks via the proposed convex gradient projection module (CGPM). They consistently enforce convexity and improve shape regularity across multiple datasets, outperforming networks tailored for retinal segmentation and surpassing prior shape-aware methods. Remarkably, our analysis unifies a wide spectrum of previous convex shape models, from discrete 1–0–1 line constraints and graph-cuts convexity formulations to curvature or signed-distance Laplacian-based level-set priors, under one continuous, differentiable framework.
Paperid: 3916,   Poster  
Authors: Jiaying Ying, Heming Du, Kaihao Zhang, Sean Tweedy, Xin Yu
Title: ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss
Abstract: Single-image human mesh recovery provides a compact 3D, person-centric representation that supports analysis, animation, AR and VR, rehabilitation, and human–computer interaction. However, prevailing systems impose an intact-limb prior and degrade on people with limb loss, because fixed-topology models cannot represent residual limbs. In this work, we present ResiHMR, a residual-limb aware framework for single-image 3D human modeling. ResiHMR adopts residual-limb keypoints and introduces two components: (i) a topology-adaptive Residual Anchor-Factor Optimization module that constrains estimation to the observed kinematic subgraph of anatomically valid structures, and (ii) a geometry-based Residual-Limb Reconstruction module that estimates residual-limb boundaries and convex limb-termination geometry. Together, these modules introduce topology-aware optimization and explicit termination geometry as tools for human mesh recovery under non-standard limb anatomy. Unlike joint-removal methods in a fixed topology, ResiHMR explicitly reconstructs residual-limb surfaces and aligns optimization with limb-loss topology, which better matches prosthetic biomechanics and real-world use. To the best of our knowledge, this is the first single-image HMR system that explicitly reconstructs residual-limb surfaces and performs topology-adaptive optimization for individuals with limb loss. On a curated dataset of real-world images with limb loss, compared with SMPLify-X, ResiHMR reduces intact-joint 2D MPJPE from 41.32 to 37.40, increases mIoU from 0.662 to 0.703, and improves anatomical plausibility in expert ratings.
Paperid: 3917,   Poster  
Authors: Yunlong Zhao, Xiaoheng Deng, Yichao Cao, Yi Chen, Xiangjian He, Shan You, Shuo Yang, Lei Fan, Fei Wang, Xiu Su
Title: Localizing, Structuring, and Rendering: Bridging 3D and 2D Vision-Language-Action Models for Robotic Manipulation
Abstract: Robotic manipulation in complex 3D environments requires unifying spatial reasoning with intuitive visual perception, a capability that current Vision-Language-Action paradigms address separately. While 3D VLAs excel in geometric and physical reasoning, they lack intuitive, image-level understanding and dense visual semantics; conversely, 2D VLAs (even with depth images) provide rich visual intuition and semantic continuity but miss explicit global spatial grounding. We introduce DiffRender-VLA, a differentiable rendering–based framework that bridges 3D and 2D Vision-Language-Action models through gradient-consistent visual mediation. It generates differentiable images by localizing the next end-effector target with a world-aligned cube marker, differentiably structuring surrounding geometry whose color encodes spatial relations to the marker, and rendering adaptive viewpoints optimized to reveal the target–environment spatial relationships. These differentiable images serve as visual bridges, embedding spatial semantics while allowing gradients from 2D VLAs to backpropagate into 3D representations, thereby coupling geometric reasoning with visual perception. This closed differentiable loop unifies reasoning and perception, substantially improving performance under occlusion, clutter, and complex spatial manipulation tasks, with an average improvement of +12.1% over state-of-the-art methods.
Paperid: 3918,   Poster  
Authors: Xijie Xiang, Lin Zhu, Wei Zhang, Yonghong Tian
Title: Event Structural Valley: A Unified Theoretical and Practical Framework for Event Camera Autofocus
Abstract: Autofocus in dynamic environments remains challenging for conventional frame-based sensors, which often fail under fast motion, low light, or high dynamic range conditions. Event cameras, with microsecond temporal resolution and asynchronous brightness detection, offer a promising alternative. However, typical event-based autofocus methods assume that the sharpest focus corresponds to the maximum event rate. In this paper, we reveal a counterintuitive yet consistent phenomenon: the true focus actually corresponds to a local minimum in the event-rate curve. We theoretically derive this behavior from the physics of event generation and show that as defocus blur increases, the event rate first rises and then declines, forming a dual-peak-valley structure across focal distances. Based on this insight, we propose an Event Structural Valley-based Autofocus (ESVA) framework that identifies the valley between two dominant peaks as the true focal position. ESVA integrates structural smoothing, consistency filtering, and a dual-peak constraint to robustly recover the valley under noise and motion disturbances. Extensive experiments on multiple synthetic and real datasets demonstrate that ESVA achieves sub-millisecond focusing accuracy and consistently outperforms existing event-only autofocus methods without any image reconstruction or supervision.
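The dual-peak-valley idea in the abstract above can be illustrated with a minimal sketch: smooth the event-rate curve over focal positions, locate the two highest local maxima, and return the lowest point between them. This is only a stand-in for ESVA under stated assumptions. The moving-average smoother, the function names `moving_average` and `find_focus_valley`, and the `window` parameter are hypothetical, and the consistency filtering mentioned in the abstract is not modeled.

```python
# Hypothetical sketch: find the valley between the two dominant peaks of an
# event-rate curve sampled at successive focal positions.


def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the borders."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def find_focus_valley(event_rates, window=3):
    """Return the index of the minimum between the two highest peaks,
    or None when fewer than two interior peaks exist."""
    smooth = moving_average(event_rates, window)
    peaks = [
        i for i in range(1, len(smooth) - 1)
        if smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]
    ]
    if len(peaks) < 2:
        return None  # dual-peak constraint not satisfied
    top_two = sorted(peaks, key=lambda i: smooth[i], reverse=True)[:2]
    left, right = sorted(top_two)
    return min(range(left, right + 1), key=lambda i: smooth[i])


# Synthetic curve: two peaks (indices 3 and 9) with a valley at index 6.
rates = [1, 3, 6, 9, 6, 3, 2, 3, 6, 9, 6, 3, 1]
focus_index = find_focus_valley(rates)
```

A method that simply maximizes the event rate would pick index 3 or 9 on this synthetic curve, whereas the valley rule picks index 6, which is the behavior the abstract argues for.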
Paperid: 3919,   Poster  
Authors: Jialei Zhan, Li Liu, Jiehua Zhang, Yuhang Xie, Yongxiang Liu, Jiangming Chen, Ming-Ming Cheng
Title: Rotation Invariant and Symmetry Aware Pixel Difference Network for Remote Sensing Object Detection
Abstract: Recent advancements in remote sensing object detection have predominantly focused on oriented bounding box design and small object feature enhancement, while often overlooking the intrinsic geometric properties of remote sensing images, such as rotation invariance and structural symmetry. Many aerial objects appear in multiple orientations and exhibit clear symmetrical patterns, which, if not explicitly modeled, can lead to detection failures and inaccurate localization under geometric variation or partial occlusion. To address this, we propose the Rotation Invariant and Symmetry Aware Pixel Difference Network (RIS-PiDiNet), which introduces a novel convolutional operator called Rotation Invariant and Symmetry Aware Pixel Difference Convolution (RIS-PDC). This operator replaces traditional convolution with a mathematically grounded formulation that encodes rotation group priors and symmetrical constraints. RIS-PDC utilizes pixel differences and symmetry-guided aggregation in the polar harmonic space, enabling the network to infer partially visible structures and deduce occluded symmetrical parts. Besides improving detection accuracy, RIS-PDC enhances model interpretability by embedding geometric principles into the network design. Feature visualizations demonstrate rotation-consistent activations and symmetry-complete responses, revealing how the network captures underlying object structure even under partial visibility or orientation changes. This yields geometrically interpretable detection decisions. To our knowledge, RIS-PiDiNet is the first remote sensing object detection framework that jointly incorporates rotation invariance and symmetry modeling within a unified architecture. Extensive evaluations on standard benchmarks validate its effectiveness, achieving state-of-the-art performance on DOTA-v1.0 (78.53% mAP single-scale, 81.81% multi-scale), HRSC2016 (98.60% mAP), and DIOR-R (67.28% mAP), all with acceptable computational overhead and no increase in parameter count.
Paperid: 3920,   Poster  
Authors: Yuan Gui, Hongchen Luo, JiaoWang JiaoWang, Qu Liqi
Title: Beyond Rule-Based Agents: Active Markov Games for Realistic Multi-Agent Interaction in Autonomous Driving
Abstract: Current research in autonomous driving heavily relies on large-scale driving datasets for model fitting or trial-and-error learning strategies in simulation environments. However, these approaches suffer from limited behavioral diversity and fail to cover complex edge-case interactions. To address these limitations, we model the driving environment as an Active Markov Game (AMG) and introduce a multi-agent co-evolutionary training framework for more realistic interactive learning. The AMG formulation extends traditional Markov games by explicitly making state transitions and rewards dependent on the evolving strategies of the agents, thus capturing the interactive dynamics and strategic coupling between the ego vehicle and surrounding agents. Building on this, our multi-agent co-evolutionary training mechanism jointly optimizes the ego vehicle's policy and a diverse pool of opponent strategies, allowing all agents to adapt to each other's behaviors during training. This game-theoretic approach produces a robust ego agent capable of handling diverse, non-stationary driving strategies, overcoming the "non-responsive opponent" limitation found in prior methods. In CARLA simulations of unsignalized intersections and long-tail scenarios, our method achieves a 98% success rate with a 2% collision rate, significantly outperforming state-of-the-art baselines such as PPO, DDPG, and IPPO in terms of generalization, safety margins, and control smoothness. These results demonstrate that our approach substantially enhances the robustness, safety, and strategic adaptability of autonomous driving in complex multi-agent environments.
Paperid: 3921,   Poster  
Authors: Yunlong Zhao, Xiaoheng Deng, Hongyan Xu, Zhuohua Qiu, Xiaowen Hu, Shan You, Yi Chen, Chang Xu, Xiu Su
Title: FedMOP: Achieving Enhanced Privacy and Performance in Federated Learning via Momentum Orthogonal Projection
Abstract: Federated Learning (FL) faces a fundamental dilemma: existing defenses against gradient leakage attacks (GLAs) invariably sacrifice model performance for privacy protection through noise injection or gradient clipping. We introduce Federated Learning with Momentum-Based Orthogonal Projection (FedMOP), a method that simultaneously achieves strong privacy guarantees and superior model performance. The key insight is to leverage initialization-based offset mechanisms that operate on orthogonal dimensions. For performance enhancement, FedMOP employs gradient orthogonal projection to counteract local drift, effectively offsetting each client's round-training initial model using global statistical context. For privacy protection, it introduces momentum-based trajectory offset hiding, which makes the offset vector inherently unrecoverable by constructing information barriers through private initialization and randomized evolution. These two mechanisms are synergistic rather than antagonistic. Theoretically, we prove convergence preservation and characterize the computationally infeasible inverse problem faced by attackers. Extensive experiments on CIFAR-10/100 and Tiny-ImageNet demonstrate that FedMOP not only defends effectively against state-of-the-art GLAs but also surpasses existing FL methods in both accuracy and convergence speed, validating its ability to jointly enhance privacy and performance in FL.
Paperid: 3922,   Poster  
Authors: Soyeon Yoon, Chang Wook Seo, Hyunjung Shim
Title: SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals
Abstract: Learning dense correspondences across deformable 3D shapes remains a long-standing challenge due to structural variability, non-isometric deformation, and inconsistent topology. Existing methods typically trade off generalization, geometric fidelity, and efficiency. We address this by proposing SGSoft, a unified intrinsic pipeline that (i) constructs a geodesic correspondence field on a canonical template, (ii) learns multimodal dense descriptors guided by pretrained semantic priors and supervised by this geodesic correspondence field, and (iii) retrieves dense correspondences in a single feed-forward pass via nearest-neighbor search in descriptor space. This formulation enables stable and topology-invariant supervision under large pose variation, structural differences, and remeshing. SGSoft achieves state-of-the-art inter-category generalization while offering a better accuracy–efficiency trade-off than prior methods. It also achieves near real-time inference without pre-alignment, pairwise optimization, or post-refinement. Learned descriptors can be transferred effectively to downstream tasks such as semantic segmentation and deformation transfer, establishing a scalable and deployment-ready paradigm for dense 3D correspondence. Code and pretrained models will be released upon acceptance.
Paperid: 3923,   Poster  
Authors: Songpengcheng Xia, Qingyu Zhang, Zhuo Su, Jiarui Yang, Zengyuan Lai, Qi Wu, Ling Pei
Title: FisherPoser: Human Motion Estimation from Sparse Observations with Hierarchical Region-Wise Fisher-Matrix Uncertainty Modeling
Abstract: Full-body motion estimation from sparse VR observations is an inherently under-constrained problem, with only three 6-DoF trackers (HMD and controllers) available to infer a full skeletal pose. To address this ambiguity, we introduce a probabilistic framework that models joint orientations as distributions on SO(3) using the Matrix Fisher distribution. Instead of predicting a single deterministic pose, our network outputs a distribution for each joint, whose mode and concentration directly quantify prediction uncertainty on the rotation manifold. This enables likelihood-based training and principled uncertainty propagation. At the core of our model is a causal Transformer encoder that fuses sparse observations with motion history. We further propose region-wise tokens for the torso, arms, and legs, obtained via attention pooling over local joint features and semantic VR anchors. These tokens guide compact, per-region Fisher regression. To ensure kinematic coherence efficiently, we employ a limb refinement module, where each child joint's Fisher parameters are conditioned on its parent's distribution and the regional context, propagating pose and uncertainty hierarchically. Extensive experiments on standard sparse-VR benchmarks show that our approach achieves state-of-the-art performance, while providing well-calibrated joint-wise uncertainty.
Paperid: 3924,   Poster  
Authors: WENXUAN CHENG, Ming Dai, Huimin Lu, Wankou Yang
Title: DeRVOS: Decoupling Consistent Trajectory Generation and Multimodal Understanding for Referring Video Object Segmentation
Abstract: Referring video object segmentation (RVOS) aims to segment objects within a video according to natural language expressions. Unlike earlier works focusing on static single-object scenarios, recent studies address more complex motion scenes. Previous methods typically adopt a query-based, logically multi-stage pipeline to handle these scenarios. However, this paradigm learns trajectory consistency modeling and multimodal fusion from scratch, which often leads to trajectory inconsistencies and insufficient multimodal understanding. To address these limitations, we propose DeRVOS, a framework that decouples RVOS into two key branches: consistent trajectory generation and multimodal understanding. We extract temporally consistent object representations using a powerful pretrained instance trajectory generation model and perform cross-modal alignment via a unified multimodal encoder, enabling upstream modeling of trajectory consistency and vision-language understanding. This design reduces RVOS to the task of modeling the relationship between referring expressions and instance trajectories. To connect the two branches and enable efficient motion-aware semantic understanding, we introduce the Trajectory Alignment and Implicit Selection (TAIS) module, which progressively performs cross-frame multimodal alignment and motion-guided implicit trajectory selection. Extensive experiments demonstrate that DeRVOS achieves state-of-the-art results on both traditional RVOS benchmarks and the challenging MeViS dataset, surpassing LVLM-based methods by 4.7%.
Paperid: 3925,   Poster  
Authors: Hongrui Cai, Junjie Luo, Zhihong Fu, Shengnan Zhu, Wenjiawei Wenjiawei, Wanquan Feng, Songtao Zhao, Qian HE
Title: Scaling4D: Pushing the Frontier of Video Novel View Synthesis through Large-Scale Monocular Videos
Abstract: Video Novel View Synthesis (VNVS) aims to render arbitrary novel viewpoints of dynamic scenes from a single-view video, but training faces a major challenge: the lack of large-scale multi-view video datasets. Prior methods often train on monocular data by framing it as an inpainting task, which typically leads to a train-inference gap and visual artifacts. While synthetic multi-view data can partially alleviate the data scarcity issue, its high acquisition costs and limited diversity restrict scalability. To address these problems, we propose Scaling4D, a novel strategy that theoretically avoids the train-inference gap while leveraging large-scale monocular videos for training. Specifically, we take a higher-level perspective on the problem, reformulating VNVS into a general correspondence-guided generation task. Furthermore, in conjunction with extensive real-world data, we establish a synthetic data pipeline integrated with our training strategy to enhance precision. Qualitative and quantitative results demonstrate a positive correlation between performance and training data volume, confirming the scalability of our approach.
Paperid: 3926,   Poster  
Authors: Zeyao Liu, Zhendong Zhao, Xiaojun Chen, Xin Zhao, Yuexin Xuan, XIAOSHUANG JI
Title: Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures
Abstract: Existing ViT backdoor attacks based on backbone-overwriting full-tuning are computationally expensive and inflict performance degradation. This has forced adversaries towards the Visual Parameter-Efficient Fine-Tuning (PEFT) paradigm, dominated by adapter-based (e.g., LoRA) and prompt-based (e.g., VPT) approaches. While adapter security has seen initial study, the risks of the burgeoning prompt-based ecosystem remain critically unexplored. We fill this critical gap, exposing how the evolution of VPT towards dynamic, context-aware architectures innately creates a far more dangerous, emergent threat. This vulnerability arises even though these dynamic modules unlock superior benign performance. We propose VIPER, an attack framework built on a lightweight, dynamic Visual Prompt Generator (VPG) that demonstrates this vulnerability. Critically, this dynamic architecture enables Functional Fusion: an emergent phenomenon where malicious logic and benign task utility are inseparably fused into the same sparse, high-magnitude parameter core. This fusion creates an unsolvable "hostage" dilemma, as pruning the attack necessarily destroys the benign performance. Comprehensive evaluations show VIPER resolves the attacker's trilemma: VIPER not only achieves state-of-the-art performance on clean data, but also maintains near-100% ASR even under 90% VPG-module pruning (where LoRA attacks collapse), while adding only an imperceptible 0.06ms (1.16%) of inference latency. VIPER's results, driven by Functional Fusion, expose a new, paradigm-level risk in dynamic prompt architectures.
Paperid: 3927,   Poster  
Authors: Boyuan Wang, Richard Jiang
Title: Neural Differentiation in Deep Networks: A Theoretical Framework for Expressivity and Representational Diversity
Abstract: We begin by developing a mathematical framework of neural differentiation, formulated at the level of individual neurons. This framework formalizes the principle that each neuron should acquire a distinct representational role within the network, thereby avoiding redundancy and maximizing collective expressivity. Differentiation is quantified through the Neural Differentiation Index (NDI), a loss-aware measure that characterizes neuron significance from geometric, informational, and curvature-based perspectives within a unified framework. The NDI enables a rigorous characterization of how strongly a neuron diverges from its peers in both function and importance, and supports theoretical guarantees: we establish formal bounds on the error increase under NDI-guided elimination, thereby providing provable safety margins for network compression. Building on this foundation, we introduce Neural Differentiation Pruning (NDP) as a practical instantiation. NDP leverages NDI to perform adaptive, training-time neuron sparsification, followed by targeted fine-tuning, guiding networks toward compact yet highly differentiated backbones. Although the terminology draws loose intuition from biological differentiation, the framework is fully mathematical and architecture-agnostic. Experiments on modern vision benchmarks and architectures show that NDP achieves substantial structured sparsity while maintaining, or even improving, accuracy and robustness, underscoring the practical impact of the differentiation framework.
Paperid: 3928,   Poster  
Authors: Yikai Huang, Renmin Han, Yuxuan Wang, Youcheng Cai, Ligang Liu
Title: Spatial-SAM: Spatially Consistent 3D Electron Microscopy Segmentation with SDF Memory and Semi-Supervised Learning
Abstract: Segment Anything Model (SAM)-based approaches have demonstrated remarkable potential for biomedical image segmentation. However, these methods often struggle to maintain spatial consistency in 3D electron microscopy (3D-EM) data and require extensive manual annotations. To this end, we propose Spatial-SAM, a spatially consistent and annotation-efficient framework that achieves high precision on 3D-EM data. Our method introduces two key innovations. First, we incorporate a 3D Signed Distance Field (SDF) memory mechanism that replaces the original memory in SAM2 with SDF representations precomputed by a 3D U-Net, providing richer geometric information and improving spatial consistency. Second, by combining the few-shot capability of SAM2 with a dual-track pseudo-label iterative optimization strategy, Spatial-SAM efficiently learns to segment large-scale 3D-EM datasets from minimal annotations. Experiments show that Spatial-SAM significantly outperforms existing semi-supervised methods and achieves performance comparable to state-of-the-art fully supervised approaches on multiple 3D-EM benchmarks, reducing annotation costs while preserving spatial consistency. The code will be publicly released upon acceptance.
Paperid: 3929,   Poster  
Authors: Yan Li, Guiping Cao, Yaguang Song, Ming Tao, Haoran Gong, Jun-Hui Liu, Yaowei Wang, Dongmei Jiang
Title: DuetMerging: Synergizing Dynamic and Static Strategies for Mitigating Task Interference in Model Merging
Abstract: Model merging offers a promising paradigm for consolidating multiple expert models into a single multitask architecture. However, its effectiveness is often hindered by task interference, where conflicting parameter updates from different tasks degrade performance. While dynamic, Mixture-of-Experts based methods have improved adaptability, they are fundamentally limited by constructing their expert pools from task vectors in isolation, failing to resolve underlying structural conflicts across tasks. In this paper, we introduce DuetMerging, a novel framework that synergistically mitigates task interference from both dynamic and static perspectives. Dynamically, we apply Tucker decomposition to a unified tensor of task vectors, creating a harmonized expert pool derived from a shared core tensor that structurally enhances synergies and suppresses conflicts. Statically, we introduce a neuron-based sparsification technique that leverages task-specific neuron activation patterns to create a precise mask. This allows us to selectively preserve critical information from the decomposition's residual while suppressing functionally irrelevant or conflicting parameters. Comprehensive experiments demonstrate that DuetMerging outperforms existing methods, establishing a new state-of-the-art in both task performance and parameter efficiency.
Paperid: 3930,   Poster  
Authors: Minxue Tang, Yangyang Yu, Aolin Ding, MAZIYAR BARAN POUYAN, Taha Belkhouja, Yujia Bao
Title: Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
Abstract: Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI. However, tackling long-tail pattern recognition tasks remains challenging for current pre-trained foundation models such as LLMs and VLMs. While fine-tuning pre-trained models can improve accuracy in recognizing implicit patterns, it is usually infeasible due to a lack of training data and high computational overhead. In this paper, we propose ADAMAB, an efficient embedding calibration framework for few-shot pattern recognition. To minimize computational costs, ADAMAB trains embedder-agnostic lightweight calibrators on top of fixed embedding models without accessing their parameters. To mitigate the need for large-scale training data, we introduce an adaptive data augmentation strategy based on the Multi-Armed Bandit (MAB) mechanism. With a modified upper confidence bound algorithm, ADAMAB diminishes the gradient shifting and offers theoretically guaranteed convergence in few-shot training. Our multi-modal experiments demonstrate the superior performance of ADAMAB, with up to 40% accuracy improvement when training with fewer than 5 initial data samples of each class.
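Illustrative note (not from the paper): the abstract refers to a modified upper confidence bound algorithm but does not describe the modification. The sketch below shows only the textbook UCB1 rule that such methods usually start from. The names `ucb1_select`, `run_ucb1`, and the `pull` callback are hypothetical, and mapping arms to augmentation strategies is an assumption made for illustration.

```python
import math

def ucb1_select(counts, totals, t):
    """Pick an arm at round t (1-based) using the standard UCB1 score."""
    # Play every arm once before the confidence bound is defined.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        totals[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)

def run_ucb1(pull, num_arms, rounds):
    """Run UCB1 for `rounds` steps; `pull(arm)` returns a reward in [0, 1]."""
    counts = [0] * num_arms    # how often each arm was chosen
    totals = [0.0] * num_arms  # summed reward per arm
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, totals, t)
        counts[arm] += 1
        totals[arm] += pull(arm)
    return counts

# Hypothetical usage: arm 1 stands for an augmentation that helps more often.
counts = run_ucb1(lambda arm: [0.2, 0.8][arm], num_arms=2, rounds=200)
```

In this toy run the rewards are deterministic, so the rule still tries both arms but spends most pulls on the arm with the higher mean reward.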
Paperid: 3931,   Poster  
Authors: Pushpak Pati, Bo Li, Abbas Khan, Tomé Albuquerque, Steffen Jaensch, Amina Mollaysa, Walid Hassan, Samantha J Allen, Joke Reumers, Helai Mohammad, Scott Oloff, Tommaso Mansi, Rui Liao, Dmytro Lituiev, Zhoubing Xu
Title: BiGMINT: Biologically-guided Hierarchical Multimodal Integration for Modeling Multiple Compound Activities in Drug Discovery
Abstract: Compound activity modeling is critical for drug discovery, where accurate in silico predictions can significantly reduce reliance on expensive, time-consuming target-specific experimental assays. Traditional machine learning approaches for compound activity modeling typically rely on either chemoproteomics-centric molecular data or phenotype-centric imaging screens, limiting their ability to capture complementary biological signals. While multimodal approaches show promise, they often fail to capture the interplay between molecular mechanisms and cellular responses. In this paper, we present BiGMINT, a Biologically Guided Multimodal framework that hierarchically INTegrates chemoproteomic and high-content imaging (HCI) data, introducing chemoproteomics-guided phenotypic aggregation, task-aware cross-modal fusion, and protein–protein interaction priors for modeling activities. On two large-scale in-house datasets, with 99K and 40K compound–HCI pairs from U2OS and iNeuron, BiGMINT improves mean AUCROC by up to 10.0% and 4.2%, and high-performing task coverage by up to 103% and 56% over best unimodal and multimodal methods. Thorough analysis revealed mechanistic insights, showing that these gains stem from modality complementarity and that protein–protein priors enhance modeling of challenging activities. Code will be released for reproducibility on acceptance of the paper.
Paperid: 3932,   Poster  
Authors: Dong Wei, Huaijiang Sun, Fan Liu, Yuhui Zheng
Title: Progressive Guessing to Fixed Point: Rethinking Human Motion Prediction with Deep Equilibrium Models
Abstract: Many recent human motion prediction methods adopt a multi-stage refinement framework, where each stage produces an initial guess of future poses for the next stage. These guesses are progressively refined towards the target prediction through a sequence of spatial-temporal reasoning stages. However, such a cascaded design incurs large computation and memory overheads that grow at least linearly with network depth, and lacks an explicit stopping criterion. In this paper, we propose MotionDEQ, a deep equilibrium motion predictor that reformulates the progressive guessing paradigm as a fixed point problem within an implicit layer. This formulation is conceptually equivalent to performing infinitely many refinement steps, but requires only O(1) training memory and can be solved efficiently with any black-box solver. We carefully design this implicit refinement process by integrating Euclidean geometric transformations into equilibrium learning, allowing the entire network to be equivariant. We also find that DEQs naturally fit the real-world scenario where motion data arrives as a stream: the converged fixed point can be reused as a warm initial guess, helping to recycle redundant inference computation when making subsequent predictions. Our experiments demonstrate that MotionDEQ achieves state-of-the-art prediction performance with superior memory efficiency, using fewer than 300K parameters with 55.3mm prediction error at 400ms on the Human3.6M dataset.
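Illustrative note (not from the paper): the warm-start idea in the abstract can be seen on a toy scalar problem. The sketch below uses plain fixed-point iteration as the "black-box solver" and a hand-written contraction in place of a learned implicit layer. `solve_fixed_point`, `make_layer`, and the inputs 1.00 and 1.01 are hypothetical stand-ins, not MotionDEQ components.

```python
import math

def solve_fixed_point(f, z0, tol=1e-10, max_iter=1000):
    """Iterate z <- f(z) until successive iterates differ by less than tol.

    Returns the final iterate and the number of iterations used.
    """
    z = z0
    for k in range(1, max_iter + 1):
        z_next = f(z)
        if abs(z_next - z) < tol:
            return z_next, k
        z = z_next
    return z, max_iter

def make_layer(x):
    # Toy contraction in z (slope magnitude at most 0.5), conditioned on input x.
    return lambda z: 0.5 * math.cos(z) + x

# Previous observation in the stream: solve from a cold start at 0.
z_prev, _ = solve_fixed_point(make_layer(1.00), 0.0)

# Next, slightly changed observation: compare cold start and warm start.
z_cold, cold_iters = solve_fixed_point(make_layer(1.01), 0.0)
z_warm, warm_iters = solve_fixed_point(make_layer(1.01), z_prev)
```

Both solves reach the same fixed point, but the warm start begins much closer to it and therefore needs fewer iterations in this example.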
Paperid: 3933,   Poster  
Authors: Zhipeng Sui, Haiqing Hao, Weihua He, Seng-Hong Lee, Wenhui Wang
Title: Adaptive Spatial-Temporal Window: Unlocking the Potential of Event Cameras in Heterogeneous Velocity Scenarios
Abstract: Most event-based algorithms split the event stream into fixed groups (e.g., fixed time or fixed count) for downstream processing, lacking adaptivity to scene dynamics. Several adaptive partitioning strategies have been proposed, but they are unable to cope well with heterogeneous velocity scenarios (HVS) involving both fast- and slow-moving objects. To address this issue, we propose the Adaptive Spatial-Temporal Window (ASTW) strategy, which simultaneously achieves temporal adaptivity and spatial locality in event partitioning. Based on the principle of maximum entropy, we derive a patch-level time window determination criterion and efficiently implement it based on event density and vectorized calculations. Experiments on publicly available event-based object detection and tracking datasets demonstrate that ASTW significantly outperforms existing state-of-the-art partitioning strategies. We also construct HetVel, the first RGB-event dual-modality dataset for HVS, and further highlight the advantages of ASTW on this challenging benchmark. We believe that our ASTW strategy and the constructed HetVel dataset will advance the field of neuromorphic vision.
Paperid: 3934,   Poster  
Authors: Miaoge Li, Dongsheng Wang, Zening Sun, Jinsen Zhang, Wenhan Luo, Jingcai Guo
Title: STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval
Abstract: Training-free zero-shot composed image retrieval models have recently gained increasing research interest due to their generalizability and flexibility in unseen multimodal retrieval. Recent LLM-based advances focus on generating the expected target caption by exploring the compositional ability behind the LLMs. Although efficient, we find that 1) the generated captions tend to introduce unexpected features from the reference image due to the semantic gap between the input image and text modification, where the image contains many more details than the text; 2) the point-to-point alignment during the retrieval stage fails to capture diverse compositions. To address these challenges, we introduce a novel Semantic Transition and Transportation in Collaboration (STiTch) framework for training-free zero-shot CIR tasks. Specifically, given the composed caption inferred by an LLM, we aim to refine it through a transition vector in the embedding space and make it closer to the target image. Combining LLMs with user instruction, the refined caption concentrates more on the core modification intent and thus filters out unnecessary noise. Moreover, to explore diverse alignment during the retrieval stage, we model the caption and image as discrete distributions and reformulate the retrieval task as a set-to-set alignment task. Finally, a bidirectional transportation distance is developed to consider fine-grained alignments across modalities and calculate the retrieval score. Extensive experiments demonstrate that our method can be general, effective, and beneficial for many CIR tasks. The code is attached in the supplementary material.
Paperid: 3935,   Poster  
Authors: Fankang Xu, Lu Jin, Yanpeng Sun, Shiyu Xuan, Zechao Li
Title: Dual-Estimator: Decoupling Global and Local Semantic Shift for Drift Compensation in Class-Incremental Learning
Abstract: Continual Learning (CL) provides an effective paradigm for acquiring new knowledge, and the principle of learning without retaining past samples has led to exemplar-free CL that better matches practical conditions. However, a key challenge is the semantic shift, which requires reliable activation of past class representations to align with the current feature space. While drift compensation acts as the activator, it commonly assumes uniform semantic distributions and shifts, which is unrealistic for random data streams. To this end, we propose the Dual-Estimator (Dual-E) to decouple global and local semantic shifts, addressing both issues of non-uniformity. Specifically, to address intra-task non-uniform semantic distributions that limit effective compensation for low-frequency semantics, Dual-E incorporates a mixture-of-experts estimator comprising multiple networks that model semantic shifts across diverse local representation spaces. For inter-task non-uniformity in semantic shifts, where uniform full-scale compensation potentially overlooks the varying degrees of semantic change across classes, Dual-E employs a low-rank estimator with an embedded low-rank network that prioritizes global semantic trends for classes exhibiting larger shifts. Dual-E leverages analytical solutions to update within a few epochs, enabling efficient plug-in integration with existing exemplar-free methods. Extensive experiments on diverse datasets demonstrate the advantages of Dual-E over state-of-the-art approaches. The code will be released.
Paperid: 3936,   Poster  
Authors: Ganlong Zhao, Zijia Tang, Xingping Chen, Zhanghui Kuang, Ye Tian, Guanbin Li
Title: FLARE: A Failure-Aware Framework for Autonomous Correction and Recovery in Visual-Language Robotic Manipulation
Abstract: Vision-Language-Action Models (VLAs) have demonstrated significant promise in generalizing to complex, long-horizon robotic manipulation tasks. However, their performance remains brittle, as they are typically trained on trajectory-monotonic, failure-free demonstrations. This reliance on "perfect" data leaves them unable to recover from common execution errors, such as a missed grasp, a dropped object, or an unexpected collision. In this paper, we propose FLARE, a novel framework that endows VLAs with robust error recovery capabilities through a "Retry" and "Reset" paradigm. First, we introduce a "Retry" mechanism by injecting perturbation and bridging segments that decouple robot pose from environment state into demonstrations, enabling the policy to autonomously handle execution deviations. Second, to address critical, state-breaking (OOD) failures, we introduce a "Reset" pipeline. We leverage an MLLM for offline failure analysis to automatically identify OOD states from execution videos. This analysis enables the efficient, targeted collection of a small library of object-centric "Reset" skills, which are trained to restore the environment to a task-valid state. Our full framework integrates these learned policies. At inference, an online MLLM monitor arbitrates between task execution and "Reset" skills. Experiments on challenging, contact-rich manipulation tasks show our approach significantly improves task success and robustness.
Paperid: 3937,   Poster  
Authors: Guangzhao Dai, Shuo Wang, Zihan Wang, Guo-Sen Xie, Yang Yang, Jinshan Pan, Qianru Sun, Xiangbo Shu
Title: History to Future: Evolving Agent with Experience and Thought for Zero-shot Vision-and-Language Navigation
Abstract: Vision-and-Language Navigation in Continuous Environment (VLN-CE) requires an agent to follow language instructions to navigate to the target destination. With the advancement of large language models (LLMs), recent efforts have explored adapting them for zero-shot VLN-CE, offering a promising solution to the poor generalization of the training-based paradigm. However, existing LLM-based works primarily perform naive reasoning for decision-making and lack feedback, e.g., reviewing historical errors and predicting future potentials. Consequently, they may fail repeatedly on tasks that begin with an error. In this paper, we rethink LLM-based zero-shot VLN-CE and propose a new paradigm, named EvoNav, to improve the agent's decision-making with future thought and history experience via Future Chain-of-Thought (F-CoT) and History Chain-of-Experience (H-CoE). F-CoT predicts future actions and landmarks as thoughts to assist navigation progress estimation and direction selection, while H-CoE summarizes historical trajectories and scenes as experience to improve navigation decision reliability. Both F-CoT and H-CoE cooperatively evolve the agent’s decision-making. Extensive experiments in both the simulator and real-world environments demonstrate the effectiveness of our EvoNav. Source code will be released.
Paperid: 3938,   Poster  
Authors: Shreyas Dixit, Ashhar Aziz, Shashwat Bajpai, Vasu Sharma, Aman Chadha, Vinija Jain, Amitava Das
Title: PECCAVI: Overcoming the Brittleness of AI Image Watermarking Under Visual Paraphrasing Attacks
Abstract: By 2026, up to 90% of online content may be synthetically generated, raising serious concerns about the spread of AI-generated disinformation. Policymakers and companies alike are responding: California’s Bill AB 321 mandates watermarking of AI-generated media, while firms like Meta and Google are deploying watermarking systems to curb the misuse of generative models. Yet, watermarking techniques remain fragile. In this work, we introduce and analyze a novel vulnerability: the visual paraphrase attack, a generative method capable of stripping both visible and invisible watermarks from AI-generated images. The attack operates in two steps: first, a caption is generated for an image. Then, the image and its caption are passed to a diffusion-based text-to-image system, producing a visually similar but watermark-free image. Our empirical evaluation demonstrates that visual paraphrasing reliably removes watermarks while preserving the original image’s semantic content, revealing a fundamental weakness in current watermarking systems. To address this, we introduce PECCAVI, the first watermarking method explicitly designed to withstand visual paraphrase attacks. PECCAVI embeds robust, distortion-free watermarks within semantically stable regions of the image, which we term Non-Melting Points (NMPs). The method uses multi-channel frequency domain watermarking and incorporates noisy burnishing to obfuscate watermark locations and resist reverse engineering. PECCAVI is model-agnostic and significantly more durable than existing approaches. We release the first visual paraphrase benchmark dataset and open-source all code and resources, offering a foundation for future work on robust watermarking in the age of generative AI.
Paperid: 3939,   Poster  
Authors: Chengyao Qian, Jing Wu, Trung Le, Dinh Phung, Mehrtash Harandi
Title: pH-Strips for Selective Forgetting: A Blunt but Fast Diagnostic Baseline for Machine Unlearning
Abstract: Machine Unlearning (MU), erasing undesirable content from Artificial Intelligence (AI) models, plays an essential role in developing safe and trustworthy AI systems. Despite notable advances, the baseline MU methods rely on retraining from scratch without the data to be removed, which is computationally expensive and financially prohibitive. To address this challenge, we propose a simple yet efficient training-free and retain-set-free MU algorithm designed explicitly as a diagnostic baseline: Machine Unlearning pH-Test (MUpHT). It is designed to serve as a practical evaluation reference for future MU methods. Our method eliminates the low-dimensional subspaces associated with undesirable concepts from the space spanned by the model's weight vectors, thereby rendering the model "blind" to these undesirable contents. Additionally, we extend our retain-aware variant to handle entangled features by leveraging a generalized Rayleigh quotient over the undesirable and retain sets, enabling an efficient tradeoff between preserving retained knowledge and suppressing undesirable knowledge. Our method enables evaluation of MU across diverse visual tasks, including concept erasure for classification, image generation, and multimodal applications. By producing an unlearned model instantly from only a few samples, our method serves as a quick litmus test for MU.
Paperid: 3940,   Poster  
Authors: Zixun Sun, Yubo Dong, Hehe Fan, Yi Yang
Title: Hi-Lo Prune: Look at What You'll Lose before Pruning with Hierarchical Token Selection
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision–language understanding, yet processing long visual token sequences remains computationally expensive. Existing approaches mitigate this cost by reducing image tokens, either by discarding them after the visual encoder or by pruning them in the early Transformer layers of the LLM. While these strategies improve efficiency, they inevitably discard informative visual content and risk degrading downstream reasoning performance. To address this challenge, we introduce Hi-Lo Prune, a training-free pruning strategy that is built on a simple principle: look at what you will lose before you remove it. Instead of directly dropping tokens, Hi-Lo Prune first identifies which tokens are safe to prune through a coarse-to-fine selection process, and then encourages the model to absorb their information before pruning occurs. Specifically, the framework consists of three stages: (1) Hierarchical Pruning Token Selection. After visual encoding, we apply a coarse-to-fine process that identifies tokens to retain and selects a critical set of pruning candidates from redundant ones. (2) Attention-Guided Candidate Token Merge. Before removing selected tokens, an attention mechanism is applied to the early LLM layers, which explicitly transfers information from these candidates to the retained tokens. (3) Low-informative Candidate Token Removal. At a designated Transformer layer, the pruned tokens are removed, reducing computation for all subsequent layers. This design enables aggressive early-layer pruning while preserving critical visual cues. Experiments on Qwen2-VL, Qwen2.5-VL, and Qwen3-VL demonstrate that Hi-Lo Prune consistently outperforms existing pruning methods across multiple benchmarks, achieving strong performance even under high pruning ratios without any fine-tuning. The code has been submitted as supplementary material and will be made publicly available.
Paperid: 3941,   Poster  
Authors: Michael Hubbertz, Qi Han, Tobias Meisen
Title: Failure Modes for Deep Learning–Based Online Mapping: How to Measure and Address Them
Abstract: Deep learning-based online mapping has emerged as a cornerstone of autonomous driving, yet these models frequently fail to generalize beyond familiar environments. We propose a framework to identify and measure the underlying failure modes by disentangling two effects: memorization of input features and overfitting to known map topologies. We propose metrics based on evaluation subsets that control for geographical proximity and topological similarity between training and validation scenes. We introduce Fréchet distance–based reconstruction statistics that capture per-element shape fidelity without threshold tuning, and define complementary failure-mode scores: an input-feature overfitting score quantifying the performance drop when geographic cues disappear, and a topology overfitting score measuring degradation as scenes become topologically novel. Beyond models, we analyze dataset biases and contribute topology-aware diagnostics: a minimum-spanning-tree (MST) diversity metric for training sets and a symmetric coverage metric to quantify topological similarity between splits. Leveraging these, we formulate an MST-based sparsification strategy that reduces redundancy and improves balancing and performance while shrinking training size. Experiments on nuScenes and Argoverse 2 across multiple state-of-the-art models yield more trustworthy assessment of generalization and show that topology-diverse and balanced training sets lead to improved performance. Our results motivate failure-mode-aware protocols and topology-centric dataset design for deployable online mapping.
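Illustrative note (not from the paper): the abstract does not say which Fréchet variant its reconstruction statistics use. As one common choice, the sketch below computes the discrete Fréchet distance between two polylines with the standard dynamic program, which approximates the continuous distance from the vertices alone. The function name `discrete_frechet` and the example polylines are hypothetical.

```python
import math

def discrete_frechet(p, q):
    """Discrete Fréchet distance between polylines p and q (lists of points)."""
    if not p or not q:
        raise ValueError("both polylines need at least one point")
    n, m = len(p), len(q)
    # ca[i][j]: smallest achievable maximum point distance over all
    # monotone couplings of p[:i+1] with q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]

# Hypothetical map elements: a predicted lane line offset by 1 m from the truth.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
offset_distance = discrete_frechet(truth, pred)
```

Because the result is a distance in the units of the input coordinates, it can be reported per element directly, without choosing a match threshold first.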
Paperid: 3942,   Poster  
Authors: xulun ye, Qin Zhang, Kun Zhou
Title: LangRef3DGS: Natural Language-Guided 3D Referential Segmentation from Partial Observations via 3D Gaussian Splatting
Abstract: Language-guided 3D segmentation is crucial for linking 3D perception with semantic understanding, yet it remains vulnerable to the incomplete and occluded views common in real-world RGB-D data. To overcome this, we present a real-time framework that leverages 3D Gaussian Splatting (3DGS) to build a semantically continuous and differentiable embedding field from partial observations. Our approach integrates two key components: a Dirichlet Process (DP) for the adaptive discovery of novel object categories, and a gradient low-rank mechanism that enhances class separability by reducing feature redundancy. This combination enables robust open-vocabulary segmentation guided directly by text prompts. Extensive experiments on challenging benchmarks demonstrate that our method achieves strong performance, exhibiting superior accuracy, robustness to incomplete inputs, and a powerful capacity for novel class discovery.
Paperid: 3943,   Poster  
Authors: David Tschirschwitz, Volker Rodehorst
Title: K$\alpha$LOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks
Abstract: Progress in computer vision relies on the interplay of data, algorithms, and computation. For foundational tasks such as object detection, supervised learning with human-annotated data remains the state-of-the-art approach. However, this "gold-standard" data is notoriously error-prone, which is a fundamental bottleneck that hinders both model training and evaluation. As a result, benchmarking improvements have become negligible or non-existent in the last year. This issue does not stem from algorithms or computation, but from problem specifications and the dataset creation process. This ultimately leads to ill-defined tasks with noisy labels. Although statistical methods for Inter-Annotator Agreement (IAA) exist, they are often applied inconsistently and lack standardization, which makes dataset quality comparisons unreliable. We propose a unified meta-algorithm for dataset quality evaluation called K$\alpha$LOS (Krippendorff's $\alpha$ Localization Object Sensing) that serves as a tool for dataset creation and final assessment. Our framework conceptually incorporates existing methods and extends them. This provides a broader scope, as our method applies to any combined localization and classification task. It provides greater analytical depth than competing methods, enabling downstream tasks such as evaluating intra-annotator-consistency, rater vitality, and localization sensitivity. Crucially, it is modular, flexible, and extensible, allowing components to be interchanged for specific use-cases and enabling comparability across datasets and tasks. Validating such a metric is challenging, as no "real" ground truth exists. Typically, what we evaluate is considered the ground truth and starting point in the modeling process. Prior validation often relies on heuristics or machine-generated labels that fail to capture the complexity of real annotation noise. Therefore, we introduce an experimental validation approach using an empirical noise generator from real, multi-annotated datasets, which also scrutinizes heuristic assumptions about the noise distribution.
Paperid: 3944,   Poster  
Authors: Junyuan Ma, Xunzhi Xiang, Wenbin Li, Qi Fan, Yang Gao
Title: Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation
Abstract: Vision foundation models (VFMs) have achieved strong performance across various vision tasks. However, it remains challenging to apply VFMs to cross-domain few-shot segmentation (CD-FSS), which segments objects of novel classes under domain shifts using only a few labeled exemplars. The challenge is mainly driven by two factors: (1) limited labeled exemplars per novel class relative to the scale of VFM pre-training, which makes retraining prone to overfitting, and (2) target-domain shifts underrepresented during pre-training, inducing cross-domain inconsistency and layerwise sensitivity. To address these issues, we propose Hierarchical Exemplar Representation Adaptation (HERA), a three-stage select-regularize-calibrate VFM-based segmentation framework that learns effectively from limited labels and adapts to novel domains without source-data retraining. We first design Hierarchical Layer Selection (HLS) to adaptively identify the most informative VFM layer using a data-dependent Exemplar Transfer Risk (ETR) computed for each candidate layer. Then, Prior-Guided Regularization (PGR) regularizes interactions on the selected representation, yielding well-structured local signals for the subsequent stage. Furthermore, Pixelwise Adaptive Calibration (PAC) combines the selected representation with the refined interaction maps to calibrate pixelwise predictions, producing consistent masks. Together, these stages form a hierarchical select–regularize–calibrate pipeline that guides frozen VFM features in new domains while fine-tuning less than 2.7% of parameters at test time. Extensive experiments show that HERA surpasses the state-of-the-art by more than 4.1 mIoU across multiple CD-FSS benchmarks.
Paperid: 3945,   Poster  
Authors: David Skuddis, Vincent Ress, Wei Zhang, Vincent Ofosu Nyako, Norbert Haala
Title: BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird’s-Eye View Images
Abstract: We present BEV-SLD, a LiDAR global localization method building on the Scene Landmark Detection (SLD) concept. Unlike scene-agnostic pipelines, our new self-supervised approach leverages bird’s-eye-view (BEV) images to discover scene-specific patterns at a prescribed spatial density and treat them as landmarks. A consistency loss aligns a learnable set of global landmark coordinates with per-frame heatmaps, yielding consistent detection and reliable occurrence across the scene. Across campus, industrial, and forest environments, BEV-SLD delivers robust localization and outperforms state-of-the-art methods. Code and trained models will be released after publication.
Paperid: 3946,   Poster  
Authors: Kanchana Vaishnavi Gandikota, Michael Moeller, Andreas Kolb, Bhaskar Choubey, Paramanand Chandramouli
Title: A Bit is All You Need! Efficient Video Capture via Single Bit Imaging
Abstract: We introduce a fundamentally new paradigm in video sensing, 1-bit computational video, that redefines the limits of imaging efficiency and performance. Instead of the conventional high-bit-depth capture, we show that one-bit measurements captured by time-varying thresholding can be used to reconstruct full-bit-depth videos, eliminating the need for power-hungry, high-precision analog-to-digital conversion at the sensor as well as reducing the energy consumption in data transmission. We propose thresholding strategies to effectively capture spatiotemporal dependencies in video streams. Despite the radical data compression at acquisition, we recover full-bit-depth videos with high fidelity through neural video reconstruction using a transformer-based neural network. Our method unlocks significant gains in memory efficiency, power savings, and data throughput reduction at the sensor, making it ideal for imaging systems with ultra-low-power requirements or high-speed video capture. We validate our framework on the task of recovering both standard and high-speed videos from simulated 1-bit measurements. Our work redefines the camera pipeline, potentially paving the way for gigapixel, kilohertz imaging systems on low-power sensor hardware.
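To make the idea of one-bit capture with time-varying thresholds concrete, here is a toy sketch for a single static pixel. It is a simplification under stated assumptions: intensities are normalized to [0, 1], the thresholds are evenly spaced midpoints, and the reconstruction is a plain average of bits rather than the transformer-based network described in the abstract. All function names are hypothetical.

```python
from typing import List, Sequence


def uniform_thresholds(num_frames: int) -> List[float]:
    """Evenly spaced thresholds in (0, 1), one per captured frame (assumed schedule)."""
    return [(i + 0.5) / num_frames for i in range(num_frames)]


def one_bit_capture(intensity: float, thresholds: Sequence[float]) -> List[int]:
    """Compare one pixel intensity against each frame's threshold, keeping 1 bit per frame."""
    return [1 if intensity > t else 0 for t in thresholds]


def naive_reconstruction(bits: Sequence[int]) -> float:
    """Estimate the intensity as the fraction of frames whose threshold was exceeded."""
    return sum(bits) / len(bits)


if __name__ == "__main__":
    thresholds = uniform_thresholds(8)
    bits = one_bit_capture(0.6, thresholds)
    print(bits, naive_reconstruction(bits))
```

For a static pixel and this threshold schedule, the averaged estimate is within 1/(2T) of the true intensity for T frames. Real video has motion, which is where the learned reconstruction in the abstract comes in.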
Paperid: 3947,   Poster  
Authors: Linghui Fu, Yuhan Liu, Hao Chen, Zhen Yang, Yongjian Deng
Title: One-Shot Flow, Any-Time Frame: A Bidirectional Warping Framework for Event-Based Video Frame Interpolation
Abstract: Video Frame Interpolation (VFI) is a crucial task in video processing. Flow-based methods, despite their success, are constrained by a fundamental dilemma: forward warping is efficient but prone to artifacts, while backward warping yields higher quality at a significant computational cost, especially for multi-frame interpolation. This trade-off is a major bottleneck. To overcome this, we introduce "One-Shot Flow, Any-Time Frame," a novel framework for Event-based VFI (E-VFI) that achieves both high efficiency and superior quality for arbitrary-time interpolation. Our framework uniquely computes a comprehensive motion trajectory representation in a single pass using a Bidirectional Flow Estimation Block (BiFEB), leveraging the high temporal resolution of event data. Subsequently, our Flow Query (FQ) module can instantly retrieve the bidirectional optical flow for any timestamp, enabling the generation of any number of frames without repeated computation. Finally, a novel Bidirectional Warping (BiW) mechanism intelligently fuses the strengths of both warping directions, effectively mitigating artifacts and producing high-fidelity results. Extensive experiments show that our approach consistently surpasses state-of-the-art E-VFI methods in both reconstruction quality and inference efficiency, representing a substantial advance in efficient and high-quality event-based video interpolation. The code will be released after acceptance.
Paperid: 3948,   Poster  
Authors: Yunlong Gao, Wenxin Liang, Guanglu Wang, Senqi Guan, Linlin Zong, Dongyu Zhang, Xinyue Liu
Title: TAPE: Task-Adaptive Prototype Evolution in Audio-Language Models for Fully Few-shot Class-incremental Audio Classification
Abstract: Fully Few-shot Class-incremental Audio Classification (FFCAC) is challenging since the training samples are limited both in the incremental sessions and in the base session. Existing few-shot learning methods suffer from catastrophic forgetting and overfitting when applied to FFCAC. Pre-trained Audio-Language Models (ALMs) have achieved success in many audio learning tasks. However, we find that it is impractical to directly use ALMs on FFCAC, since misalignment between text and audio causes even more severe catastrophic forgetting and overfitting. We propose a Task-Adaptive Prototype Evolution (TAPE) framework to help ALMs tackle the challenges of FFCAC, which consists of two key components: (1) a Task-Adapter that isolates audio features in a metric space to mitigate catastrophic forgetting while preserving knowledge across sessions, and (2) a Prototype Evolution mechanism that dynamically refines class prototypes using query samples during inference, thereby enabling adaptive learning and reducing overfitting. To the best of our knowledge, we are the first to use ALMs on the FFCAC task. We conduct experiments on three audio datasets: NSynth-100 (instrument recognition), FSC-89 (event detection), and LBS-100 (voice recognition). The experimental results show that our proposed approach TAPE significantly surpasses the baselines. Specifically, on average it improves over the second-best method from 54.93% to 82.76% in terms of Average Accuracy (AA ↑), and from 28.74% to 12.56% in terms of Performance Dropping rate (PD ↓).
Paperid: 3949,   Poster  
Authors: Rouyi Zhou, 漾之 吴, Jiajun Wen, Can Gao, Feng Liu, Zhihui Lai, Linlin Shen
Title: Gamba: Mamba-based graph convolutional network with dynamic graph topology learning for action recognition
Abstract: The graph convolutional network has been an important tool for skeleton-based action recognition. However, existing graph models predominantly utilize self-attention mechanisms to model feature correlations between the joints of each sample, which not only neglects dynamic relation dependencies in the temporal dimension but also leads to redundant computation and makes it difficult to establish a unified framework for joint relation representation. To address these problems, this paper develops a Mamba-based graph convolution network (Gamba) with dynamic graph topology learning. Specifically, in order to capture local motion patterns through aggregation of intra-class information, a classification-based Mamba module is developed to categorize motion joints into distinct types. To the best of our knowledge, this is the first work to assign motion joints with label information to facilitate correlation learning. To capture the underlying relation of the joints of different categories, the state space model is introduced to the proposed method to process enhanced temporal features, aiming at learning dynamic adjacency matrices for long-range dependencies of the joints across different categories. The proposed framework not only facilitates an adaptive focus on the spatio-temporal feature modeling, but also has lower computational complexity than traditional self-attention-based approaches. Extensive experiments on the public NTU RGB+D 60/120 and NW-UCLA benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art methods in recognition accuracy. The proposed framework provides new insights into effective and efficient skeleton-based action recognition and can be potentially applied to a variety of real-world applications.
Paperid: 3950,   Poster  
Authors: Chengyin Hu, Xin Wang, Rui Qiu, Zhe Jia, Yingying Zhao, Kai Wang, Xu Kang, Yiwei Wei
Title: Fractal Camouflage: A Bio-Inspired Approach for Multi-Scale Adversarial Attacks in the Infrared Domain
Abstract: Infrared pedestrian detection is crucial in safety-critical systems but remains vulnerable to adversarial attacks. Existing physical attacks often rely on fixed, static patterns. However, they often lack robustness across scales, as their hand-crafted or uniformly generated structures are fundamentally limited by a fixed receptive field and fail to adapt to varying distances and scene contexts. In light of this, we propose AdvFractal, a black-box attack that exploits the innate self-similarity and structural richness of fractal geometry to naturally generate multi-scale, physically realizable adversarial perturbations. By modeling perturbations with H-type fractals and optimizing parameters via Particle Swarm Optimization, AdvFractal seamlessly coordinates attacks across scales, progressively disrupting detector features from local textures to global shapes. Experiments show AdvFractal achieves an attack success rate (ASR) of 97.54% in the physical domain and 99.16% cross-dataset, significantly outperforming state-of-the-art methods. The perturbations are highly effective in the infrared spectrum while remaining stealthy in visible light, offering a novel approach for evaluating and understanding the security of infrared detection systems.
Paperid: 3951,   Poster  
Authors: Hantao Qi, Yan Yan, Junlong Gao, Hanzi Wang
Title: Protect to Adapt: Subspace-Constrained Adaptation with Ranked Negative Prompt Feedback for Few-Shot Action Recognition
Abstract: Adapting Vision–Language Models (VLMs) to few-shot action recognition (FSAR) often trades accuracy for stability: task-specific gains can trigger catastrophic forgetting of domain-general knowledge and reduce inter-class margins. In few-shot episodes, each query is contrasted with only one positive class and a few negatives, so the text encoder sees limited prompt diversity and rarely observes hard counter-examples near decision boundaries. We propose Protect-to-Adapt (P2A), a parameter-efficient fine-tuning method with two complementary modules. Orthogonal Subspace Control (OSC) estimates a principal semantic subspace of the pre-trained backbone and constrains low-rank updates to its orthogonal complement, preserving domain-general semantics while allowing task-specific adaptation. Ranked Negative-prompt Curriculum (RNC) uses a large language model to generate verifier-filtered negative prompts with increasing difficulty. These class-specific hard counter-examples enlarge margins and sharpen decision boundaries under few-shot conditions. With only 2% of backbone parameters trainable, P2A achieves state-of-the-art performance on five FSAR benchmarks and substantially reduces catastrophic forgetting in a cross-dataset continual-learning setting where the model is adapted sequentially to multiple video datasets without replay.
Paperid: 3952,   Poster  
Authors: Han Xue, Nan Min, Xiaotong Liu, Wendi Chen, Fang Yuan, Jun Lv, Cewu Lu, Chuan Wen
Title: Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation
Abstract: The adoption of fisheye cameras in robotic manipulation, driven by their exceptionally wide Field of View (FoV), is rapidly outpacing a systematic understanding of their downstream effects on policy learning. This paper presents the first comprehensive empirical study to bridge this gap, rigorously analyzing the properties of wrist-mounted fisheye cameras for imitation learning. Through extensive experiments in both simulation and the real world, we investigate three critical research questions: spatial localization, scene generalization, and hardware generalization. Our investigation reveals that: (1) The wide FoV significantly enhances spatial localization, but this benefit is critically contingent on the visual complexity of the environment. (2) Fisheye-trained policies, while prone to overfitting in simple scenes, unlock superior scene generalization when trained with sufficient environmental diversity. (3) While naive cross-camera transfer leads to failures, we identify the root cause as scale overfitting and demonstrate that hardware generalization performance can be improved with a simple Random Scale Augmentation (RSA) strategy. Collectively, our findings provide concrete, actionable guidance for the large-scale collection and effective use of fisheye datasets in robotic learning.
Paperid: 3953,   Poster  
Authors: Minhyeok Lee
Title: When Do Models Actually Decide? Mapping the Layer-Wise Decision Timeline in Pretrained Neural Networks
Abstract: Neural networks are often treated as monolithic black boxes that process all inputs uniformly through all layers. However, researchers intuitively wonder: do simple images require all 50 layers of ResNet-50, or is the prediction effectively decided much earlier? We investigate when pretrained models make up their minds during a forward pass by training linear probes at each layer of ResNet variants on ImageNet, without modifying the base model. Our findings reveal substantial computational heterogeneity across architectures: ResNet-50 and ResNet-101 exhibit mean decision depths of 5.5–5.6 layers (k=2 stability), while ResNet-18 requires deeper relative processing at 7.4 layers. We discover pronounced bimodal patterns with distinct populations of early and late deciders, where 39–43% of samples in deeper ResNets achieve stability within the first third of the network, while 39–54% require processing beyond 70% depth. The decision layer is highly sensitive to stability criteria, with mean depths increasing from 2.6–4.1 (k=1) to 9.0–10.0 (k=4). Linear probe accuracy exhibits sharp jumps in final residual stages, reaching 73–75% for ResNet-50/101 and 65% for ResNet-18, indicating that semantic consolidation occurs late. These findings expose computational heterogeneity in standard inference and provide actionable guidance for early exit strategies.
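The k-stability criterion mentioned above can be sketched as follows. The abstract does not spell out the exact definition, so this sketch assumes one plausible reading: the decision depth is the earliest probed layer from which the probe prediction matches the final-layer prediction for k consecutive layers. The function name and the fallback to the last layer are assumptions.

```python
from typing import Sequence


def decision_depth(layer_preds: Sequence[int], k: int = 2) -> int:
    """Return the 1-indexed decision layer under an assumed k-stability rule.

    `layer_preds[i]` is the class predicted by the linear probe at layer i.
    The decision layer is the first layer that starts a run of k consecutive
    probe predictions equal to the final-layer prediction. If no such run
    exists, the last layer is returned.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(layer_preds)
    if n == 0:
        raise ValueError("need at least one layer prediction")
    final = layer_preds[-1]
    for start in range(n - k + 1):
        if all(p == final for p in layer_preds[start:start + k]):
            return start + 1
    return n


if __name__ == "__main__":
    preds = [5, 2, 5, 5, 2, 5, 5, 5]  # hypothetical probe outputs for one image
    for k in (1, 2, 3):
        print(k, decision_depth(preds, k))
```

With the hypothetical predictions above, a stricter k yields a later decision layer, which matches the sensitivity to the stability criterion reported in the abstract.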
Paperid: 3954,   Poster  
Authors: Yongru Chen, Kai Zhang, Zeliang Zong, Yuchen Lu, Wenming Tan, Ye Ren, Jilin Hu
Title: One Layer’s Trash is Another Layer’s Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs
Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success across diverse multimodal tasks, yet their practical deployment remains constrained by the computational burden arising from lengthy visual tokens. While visual token pruning has emerged as a promising solution, existing methods suffer from a fundamental limitation: once tokens are pruned at a specific layer, they become inaccessible to all subsequent layers, leading to premature information loss that can compromise model performance. Through empirical studies, we observe that different layers exhibit distinct visual region focus, indicating a varying optimal token subset across layers. Motivated by this insight, we propose Adaptive Layer-wise Visual Token Selection (ALVTS), a novel framework that breaks away from the conventional static token pruning paradigm. ALVTS incorporates a lightweight token selector to identify and route important tokens for further processing, while allowing less important tokens to skip the layer, thus minimizing computational redundancy. These two streams of tokens are seamlessly reintegrated before being fed into subsequent layers, facilitating adaptive compression across the entire model. Grounded in our importance consistency constrained low-rank approximation, the proposed token selection module closely emulates the full attention mechanism, effectively capturing its essential patterns without requiring model retraining. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL validate the effectiveness of our method. With an 89% token compression ratio, ALVTS retains 96.7% of the original model's accuracy, achieving a superior efficiency-accuracy trade-off for LVLM inference.
Paperid: 3955,   Poster  
Authors: Weijian Su, Songqian Zhang, Yuqi Han, Jian Zhuang, Yongdong Huang, Qiang Zhang
Title: Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization
Abstract: As a key technique in multimodal processing, infrared and visible image fusion (IVIF) plays a crucial role in integrating complementary spectral information for visual enhancement and downstream vision tasks. Despite remarkable progress, existing methods struggle to flexibly accommodate heterogeneous demands. Achieving adaptive fusion that aligns with various preferences from both human and machine vision remains an open and challenging problem. To address this challenge, we propose DPOFusion, a direct preference optimization (DPO) framework integrating the property-aligned latent diffusion model (PALDM) and the preference-controllable latent diffusion model (PCLDM), enabling task-guided, preference-adaptive IVIF for both human and machine vision. The PALDM leverages a latent fusion prior and a joint conditional loss to generate diverse candidate fusion results with various properties. PCLDM is subsequently fine-tuned via instance direct preference optimization (IDPO), enabling direct control of the final fusion results with heterogeneous preference signals. Experimental results demonstrate that our framework not only attains precise preference alignment among humans, vision-language models, and task-driven networks, but also sets a new benchmark for adaptive fusion quality and task-oriented transferability.
Paperid: 3956,   Poster  
Authors: Gyojin Han, Junmo Kim
Title: Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing
Abstract: We address text-based 3D human motion editing, where the goal is to preserve the style and structure of a source motion while applying edits described in natural language. The release of the MotionFix dataset has spurred active research into training-based diffusion models that directly generate an edited motion from a source motion and a text instruction. While previous works have focused primarily on learning when an edit should occur temporally, our goal is to create a model that understands not only this temporal aspect but also which specific joints are responsible for the change. Targeting this, we propose a novel architecture and a complementary auxiliary task to aid its training. Our architecture consists of two axis-anchored transformers, which extract distinct features along the joint and time dimensions respectively, and a cross-axis fusion block that integrates these representations. We further introduce an auxiliary task that trains the joint-anchored transformer to regress the Soft-DTW distance between source and target joint rotations. This objective teaches the module to understand which joints to modify and which to preserve. Through comprehensive experiments on the MotionFix dataset, we demonstrate that our method significantly improves semantic alignment with both the text instruction and the source motion, as well as the overall fidelity of the generated motion, achieving state-of-the-art results.
Paperid: 3957,   Poster  
Authors: Sangwoon Kwak, WEEYOUN KWON, Jun Young Jeong, Geonho Kim, Won-Sik Cheong, Jihyong Oh
Title: MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
Abstract: Recent advances in 4D Gaussian Splatting (4DGS) have extended the high-speed rendering capability of 3D Gaussian Splatting (3DGS) into the temporal domain, enabling real-time rendering of dynamic scenes. However, one of the major remaining challenges lies in modeling long-range motion-contained dynamic videos, where a naïve extension of existing methods leads to severe memory explosion, temporal flickering, and failure to handle appearing or disappearing occlusions over time. To address these challenges, we propose a novel 4DGS framework characterized by an Anchor Relay-based Bidirectional Blending (ARBB) mechanism, named MoRel, which enables temporally consistent and memory-efficient modeling of long-range dynamic scenes. Our method progressively constructs locally canonical anchor spaces at key-frame time indices and models inter-frame deformations at the anchor level, enhancing temporal coherence. By learning bidirectional deformations between KfAs and adaptively blending them through learnable opacity control, our approach mitigates temporal discontinuities and flickering artifacts. We further introduce a Feature-variance-guided Hierarchical Densification (FHD) scheme that effectively densifies KfAs while maintaining rendering quality, based on an assigned level of feature variance. To effectively evaluate our model's capability to handle real-world long-range 4D motion, we compose a new long-range 4D motion-contained dataset, called SelfCap_LR. It has a larger average dynamic motion magnitude and is captured in spatially wider spaces than previous dynamic video datasets. Overall, our MoRel achieves temporally coherent and flicker-free long-range 4D reconstruction while maintaining bounded memory usage, demonstrating both scalability and efficiency in dynamic Gaussian-based representations. The code and project page will be publicly released.
Paperid: 3958,   Poster  
Authors: Jae Yun Lee, Hyeok Nam, Sung In Cho
Title: Measure The Feature Universe: Topology-based Pseudo Labeling and Gravity Consistency for Source-Free Domain Adaptation
Abstract: Source-free domain adaptation (SFDA) adapts a pre-trained source model to an unlabeled target domain using only the model itself, typically relying on pseudo labeling augmented with auxiliary knowledge and consistency regularization (CR) mechanisms to alleviate noise in the generated pseudo labels. However, existing approaches overlook the geometric structure of the target embedding manifold when assigning pseudo labels, resulting in unreliable distance measurements and consequently severe mislabeling. Moreover, their CR is applied solely to output logits, making it insensitive to feature-level reliability. To solve these issues, we propose a novel pseudo labeling scheme based on a geometry-aware universe feature space and a new gravity CR loss. Our pseudo labeling strategy first models the embedding space with virtual features to build a geometry-aware universe feature space. On this space, pseudo labels are generated through feature traversal, which propagates labels only from statistically reliable regions. In addition, the proposed CR jointly encourages logit- and feature-level consistency, aligning predictions for augmented images while preserving the geometric structure of the embedding space. It further modulates the strength of CR for each sample, preventing the confirmation of noisy pseudo labels through a gravity-based force defined between two input embeddings. Experiments on Office-Home, DomainNet-126, and VisDA-C demonstrate consistent improvements over prior SFDA methods, and incorporating gravity CR loss into baselines yields substantial additional gains.
Paperid: 3959,   Poster  
Authors: Naresh Kumar Devulapally, Shruti Agarwal, Vishal Asnani, Vishnu Lokhande
Title: Interpretable Prompts made Edit-Friendly: Token-to-Token Similarity Reduction in dLLMs for Edit-Friendly Hard Prompt Inversion
Abstract: Crafting prompts via Prompt Engineering that steer a model’s internal representations toward specific and predefined outcomes can be time-consuming, often requiring multiple iterations. Hard Prompt Inversion offers a complementary workflow: start from a reference image and generate a prompt that conditions a text-to-image (T2I) model to reconstruct the reference image. Existing inversion methods either yield incoherent text, or produce prompts that are overly sensitive to downstream token edits. We propose a dLLM-based prompt inversion framework that yields prompts that are (i) more interpretable to humans, (ii) better aligned with the reference image, and (iii) designed for downstream token swap and token append operations (aka edit-friendly prompts). The method is plug-and-play, requiring no finetuning of either the T2I model or the dLLM. Experiments across three datasets show a ~10× reduction in inversion time relative to existing prompt-inversion baselines, higher interpretability scores, and significantly higher prompt editability, as measured by TIFA, GPT-V preference scoring, and controlled user studies, all while preserving high-fidelity image generation. By coupling diffusion-time sampling with token-similarity control inside a dLLM decoder, our approach extends prompt inversion beyond reconstruction to downstream token-editing tasks, enabling faster, more transferable prompts that generalize across multiple T2I models.
Paperid: 3960,   Poster  
Authors: Xuewei Zhou, Yajie Meng, Pan Zeng, Xianfang Tang, Feifei Cui, Qiangguo Jin, Jialiang Yang, Junlin Xu
Title: TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis
Abstract: Cardiovascular disease (CVD) diagnosis relies heavily on electrocardiograms (ECGs). However, most existing self-supervised uni-modal methods suffer from limited representational capacity, while multi-modal frameworks are hindered by coarse-grained semantic alignment across modalities, thus restricting their generalizability in clinical settings. To address these limitations, we propose TAMER, a Tri-modal contrastive Alignment and Multi-scale Embedding Refinement framework that jointly models ECG recordings, spectrograms, and diagnostic reports. TAMER is composed of three key components: First, the tri-modal feature encoding and projection (TFEP) module employs modality-specific encoders to extract global and local features from ECG recordings, spectrograms, and diagnostic reports, and projects them into latent spaces. Then, the global-local temporal-spectral alignment (GLTSA) module captures complementary rhythm- and wave-level characteristics via contrastive alignment and attentive interaction between temporal and spectral modalities. Finally, the report-aware alignment and refinement (RAAR) module performs diagnostic-level alignment and wave-level refinement with clinical reports, enabling semantic enrichment of ECG representations. Extensive experiments on three public ECG datasets demonstrate that TAMER achieves state-of-the-art zero-shot classification performance (AUC: 81.2%) and strong cross-domain generalization (AUC: 83.1%), outperforming existing uni-modal and multi-modal baseline methods.
Paperid: 3961,   Poster  
Authors: Zhixin Cheng, Bohao Liao, Jiacheng Deng, Xiaotian Yin, Xinjun Li, Yujia Chen, Baoqun Yin, Tianzhu Zhang
Title: Rethinking 2D-3D Registration: A Novel Network for High-Value Zone Selection and Representation Consistency Alignment
Abstract: Both detection-then-match and detection-free methods have been extensively studied for image-to-point cloud registration, yet they still face significant challenges. The detection-then-match approach emphasizes high-quality correspondences but is limited by the availability of repeatable keypoints, making it susceptible to errors from incorrect matches. In contrast, detection-free methods aim for dense correspondences using a coarse-to-fine strategy to mitigate matching errors. However, non-overlapping regions and low-quality matches still introduce inaccuracies, and the differences between image texture and point cloud structure cause inconsistent region representations, increasing the likelihood of incorrect matches. To address these challenges, we propose two innovative modules: the High-Value Zone Reinforced Selection Module (HZRS) and the Zone Representation Consistency Alignment Module (ZRCA). HZRS employs reinforcement learning to resolve the non-differentiable issue of selecting high-value matching regions, while ZRCA improves region alignment through three stages: understand, coordinate, and accelerate. Extensive experiments and ablation studies on RGB-D Scenes v2 and 7-Scenes demonstrate the superiority of our network, establishing it as the state-of-the-art for image-to-point cloud registration.
Paperid: 3962,   Poster  
Authors: Matthieu Dabrowski, Ouala JEMAA, Benjamin Allaert
Title: HUMAPS-4D: A Multimodal Dataset for HUman Motion Analysis with Physiological and Semantic informations
Abstract: Current advancements in human motion understanding are strongly reliant on video data. Nevertheless, privacy regulations and operational constraints increasingly restrict the use of visual data in real-world scenarios. Inferring posture through wearable sensors, such as instrumented insoles measuring plantar activation, presents itself as a promising alternative. However, the absence of large-scale multimodal datasets hinders the rigorous benchmarking of these methodologies. We introduce HUMAPS-4D, a novel multimodal dataset designed for human motion analysis, effectively bridging computer vision and biomechanics. This dataset integrates synchronized motion capture, multi-view video, IMUs, plantar pressure signals, sEMG activation patterns, and high-level semantic annotations. The data was collected from 32 subjects performing 30 actions over a total duration of 14 hours. Participants demonstrate substantial anthropometric variability (age, body proportions, and morphology), which supports robust generalization across diverse body types. Distinct from existing resources, this collection offers a unique pairing of low-level physiological signals and high-level human motor descriptors. This capability enables the development of generative and inference models conditioned by both physical and semantic constraints, while simultaneously reducing the reliance on personally identifiable visual data. We establish benchmark tasks specifically targeting posture reconstruction from plantar pressure, semantic motion segmentation, physics-informed motricity analysis, and multimodal fusion under privacy-preserving conditions. The dataset, along with its associated annotation tools and visualization utilities, is scheduled for online release soon.
Paperid: 3963,   Poster  
Authors: Ofir Shahar, Gur Elkin, Ohad Ben-Shahar
Title: The Missing GAP: From Solving Square Jigsaw Puzzles To Handling Real World Archaeological Fragments
Abstract: Jigsaw puzzle solving has been an increasingly popular task in the computer vision research community. Recent works have utilized cutting-edge architectures and computational approaches to reassemble groups of pieces into a coherent image, while achieving increasingly good results on well-established datasets. However, most of these approaches share a common, restricting setting: operating solely on strictly square puzzle pieces. In this work, we introduce GAP, a set of novel jigsaw puzzle datasets containing synthetic, heavily eroded pieces of unrestricted shapes, generated by a learned distribution of real-world archaeological fragments. We also introduce PuzzleFlow, a novel ViT and Flow-Matching based framework for jigsaw puzzle solving, capable of handling complex puzzle pieces and demonstrating superior performance on GAP compared to both classic and recent prominent works in this domain.
Paperid: 3964,   Poster  
Authors: jing yang, Sen Yang, Boqiang Duan, Ming Dai, Wei Zhang, Xiao Tan, KunbinChen KunbinChen, Wei He, Jingdong Wang, Hanli Wang
Title: Hugging Visual Prompt and Segmentation Tokens: Consistency Learning for Fine-Grained Visual Understanding in MLLMs
Abstract: Recently, multimodal large language models (MLLMs) have achieved remarkable success in general multimodal tasks. Increasing attention has been given to leveraging MLLMs for fine-grained visual understanding, such as region-level captioning and pixel-level grounding. However, most existing approaches are task-specific, and although some recent unified approaches attempt to handle both types simultaneously, they still fall short of deeply exploring the underlying associations across tasks. To bridge this gap, we propose a multimodal large language model designed to jointly support Fine-grained visual understanding through Consistency Learning (FCLM). The central idea of this work is that pixel-level captioning and grounding are mutually beneficial and complementary tasks, each enhancing the other in achieving a fine-grained understanding of visual content. Specifically, FCLM analyzes the representation features (visual prompt and segmentation tokens) required for the two types of visual tasks, and achieves advanced reasoning and perception through a newly designed consistency learning loss and a two-stage training framework. Moreover, we design a Hybrid Region Extractor to enhance the quality of visual prompt embeddings, thereby obtaining more semantically discriminative representations for detailed caption generation. Additionally, to verify the MLLM’s ability to localize accurate targets from detailed textual descriptions, we introduce a novel task called Detailed Localized Referring Expression Segmentation (DL-RES). We conduct extensive experiments on seven visual understanding tasks, demonstrating the strong performance and generalization ability of FCLM.
Paperid: 3965,   Poster  
Authors: Bo-Yuan Sun, Bo-Wen Yin, Yuan-Ming Li, Xihan Wei, Qibin Hou
Title: See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
Abstract: We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text–visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks.
Paperid: 3966,   Poster  
Authors: Hao Yuan, Jiabin Zhang, Yajing Wu, Ruixuan Pang, Jing Li
Title: HDR-VLM: HDR-Domain Adaptation of VLMs and Preference-Aligned Quality Assessment for HDR Video Color Grading
Abstract: Color grading is central to High Dynamic Range (HDR) video production, shaping the perceptual tone, contrast, and luminance of content across diverse displays. However, evaluating HDR color grading quality is particularly difficult due to its semantic, content-dependent nature and the lack of large-scale annotated data. While pre-trained Vision–Language Models (VLMs) offer strong semantic priors and generalization ability, their exposure is limited to Standard Dynamic Range (SDR) data, making them poorly equipped to handle HDR photometry and perceptual nuances. We propose HDR-VLM, the first method to adapt a VLM to the HDR domain for perceptual quality assessment. Specifically, HDR-VLM employs a two-stage design: it first bridges the domain gap using a unified HLG-based encoding and progressive adaptation; then it aligns model assessments with noisy, multi-scale human preferences via reinforcement learning with curriculum-inspired rewards. Experiments on a real-world, production-sourced HDR dataset show that HDR-VLM not only outperforms existing quality assessment methods but also produces interpretable attribution rationales. These rationales offer actionable guidance for content creators, enhancing the reliability and transparency of automated HDR quality evaluation.
Paperid: 3967,   Poster  
Authors: Yujun Liu, Ruisheng Wang, Xiang Ao, Haoyuan Shen, Kuihao Wang, Kun Zhou, Qingquan Li
Title: Edges Compete for Trust: Group Relative Edge Optimization for Building Reconstruction from Point Clouds
Abstract: Building reconstruction aims to extract compact wireframes from point clouds. Recent edge-based methods achieve impressive results but suffer from sparse supervision from one-to-one matching, which leaves most edge proposals under-optimized. In this paper, we present Group Relative Edge Optimization (GREO), the first attempt to incentivize dense supervision across edge proposals through reinforcement learning-style optimization in wireframe reconstruction. Specifically, GREO computes edge-level rewards based on geometric alignment quality and transforms them into target confidence distributions via group-wise normalization. In addition, we incorporate entropy regularization to maintain distributional stability and prevent confidence collapse. This joint optimization enables dense and discriminative supervision across all edge proposals through cross-entropy minimization. Experiments on the large-scale Building3D dataset demonstrate that our powerful and versatile GREO integrates seamlessly into existing edge-based methods as a plug-and-play training strategy, achieving state-of-the-art performance on both the Entry-level and Tallinn benchmarks while adding no inference overhead.
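The reward-to-target step described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean/standard-deviation normalization, the softmax temperature, and the entropy weight are all assumptions chosen only to show how per-edge rewards within one group could become a target confidence distribution trained with cross-entropy plus an entropy term.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def group_targets(rewards, temperature=1.0, eps=1e-8):
    """Standardize rewards within one group (assumed form of group-wise
    normalization), then turn them into a target distribution."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    normalized = [(r - mean) / (std + eps) for r in rewards]
    return softmax([z / temperature for z in normalized])

def group_relative_loss(logits, rewards, entropy_weight=0.01):
    """Cross-entropy between the reward-derived targets and the predicted
    edge confidences, minus a (hypothetical) entropy bonus that discourages
    the predicted distribution from collapsing onto one edge."""
    target = group_targets(rewards)
    pred = softmax(logits)
    cross_entropy = -sum(t * math.log(p) for t, p in zip(target, pred))
    entropy = -sum(p * math.log(p) for p in pred)
    return cross_entropy - entropy_weight * entropy

# Three edge proposals in one group; a higher reward means better alignment.
rewards = [0.9, 0.5, 0.1]
aligned = group_relative_loss([2.0, 0.0, -2.0], rewards)
reversed_ = group_relative_loss([-2.0, 0.0, 2.0], rewards)
```

In this toy setting every proposal in the group receives a non-zero target, so all three logits get a gradient signal, which is the sense in which the supervision is dense rather than one-to-one.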
Paperid: 3968,   Poster  
Authors: Yizhou Huang, Genze Jiang, Yihua Cheng, Kezhi Wang
Title: FoSS: Modeling Long-Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier–State Space Integration
Abstract: Accurate trajectory prediction is vital for safe autonomous driving, yet existing approaches struggle to balance modeling power and computational efficiency. Attention-based architectures incur quadratic complexity with increasing agents, while recurrent models struggle to capture long-range dependencies and fine-grained local dynamics. Building upon this, we present FoSS, a dual-branch framework that unifies frequency-domain reasoning with linear-time sequence modeling. The frequency-domain branch performs a discrete Fourier transform to decompose trajectories into amplitude components encoding global intent and phase components capturing local variations, followed by a progressive helix reordering module that preserves spectral order; two selective state-space submodules, Coarse2Fine-SSM and SpecEvolve-SSM, refine spectral features with O(N) complexity. In parallel, a time-domain dynamic selective SSM reconstructs self-attention behavior in linear time to retain long-range temporal context. A cross-attention layer fuses temporal and spectral representations, while learnable queries generate multiple candidate trajectories, and a weighted fusion head expresses motion uncertainty. Experiments on Argoverse 1 and Argoverse 2 benchmarks demonstrate that FoSS achieves state-of-the-art accuracy while reducing computation by 22.5% and parameters by over 40%. Comprehensive ablations confirm the necessity of each component.
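The amplitude/phase decomposition mentioned in the abstract above can be sketched for a single trajectory coordinate as follows. This is only an illustration of the discrete Fourier transform step under simple assumptions (a 1D sequence, a naive O(N^2) transform, hypothetical function names); it does not reproduce the paper's reordering module or state-space submodules.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def amplitude_phase(signal):
    """Split the spectrum into per-frequency amplitude and phase."""
    spectrum = dft(signal)
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return amplitude, phase

def reconstruct(amplitude, phase):
    """Inverse transform: recombine amplitude and phase into the sequence."""
    n = len(amplitude)
    spectrum = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]

# x-coordinates of a hypothetical accelerating agent over six timesteps.
xs = [0.0, 1.0, 2.5, 4.5, 7.0, 10.0]
amp, ph = amplitude_phase(xs)
recovered = reconstruct(amp, ph)
```

The decomposition is lossless: recombining the two components recovers the original coordinates, so a model that processes amplitude and phase separately discards no information at this step.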
Paperid: 3969,   Poster  
Authors: Wei Liu, Li Yang, Yufei Wang, Han Xiao, Boyu Cai, Weiming Hu
Title: Temporal Representation Enhancement (TRE): Learning to Forget Dominant Patterns for More Discriminative Spiking Features
Abstract: Spiking Neural Networks (SNNs) naturally process visual inputs across multiple timesteps, offering rich temporal dynamics and energy-efficient computation. However, the temporally invariant supervision commonly used in training tends to reinforce the same dominant response patterns across timesteps, leading to redundant representations and limiting temporal discriminability. To overcome this limitation, we introduce Temporal Representation Enhancement (TRE), a novel learning-to-forget paradigm that encourages more diverse and complementary temporal representations. TRE identifies high-contribution semantic patterns through class-specific contribution estimation and temporal accumulation, and selectively suppresses them using a dynamic modulation strategy. By redirecting the model’s attention toward alternative yet informative semantic cues, TRE promotes the learning of complementary features across timesteps. This approach not only strengthens the temporal discriminative capacity of SNNs but also enables more effective multi-timestep learning by leveraging richer semantic information. Extensive experiments on both static image datasets and dynamic neuromorphic datasets demonstrate that TRE consistently improves classification accuracy and feature diversity across different SNN backbones.
Paperid: 3970,   Poster  
Authors: QIU QI, Xuan Wu, Jiawei Peng, Yuan Miao, Xu Yang, Yanlong Du
Title: TVHighlights: LLM-Guided Human-Free Collaborative Training for Video Highlight Detection in Movies and TV Dramas
Abstract: Video highlight detection aims to identify the most engaging segments in long-form videos, supporting content editing and recommendation, especially for movies and TV dramas. However, existing methods are ill-suited to cinematic content due to its narrative complexity, while the scarcity of annotated data and the high cost of manual labeling further hinder progress. To bridge this gap, we introduce TVHighlights, the first large-scale dataset tailored for video highlight detection in movies and TV dramas, with 1,721 carefully curated videos covering diverse genres. Built on community-driven behaviors, it provides realistic and diverse annotations without human labeling. Based on TVHighlights, we propose LTV-HD, an LLM-guided, human-free collaborative training framework for video highlight detection in cinematic content. LTV-HD operates in two stages: (1) weakly supervised pre-training of a lightweight model using video-level labels, followed by (2) iterative refinement through collaboration between large language models (LLMs) and the lightweight model. LLMs generate noisy clip-level pseudo-labels, which the lightweight model learns from under a noise-robust strategy, and its high-confidence predictions are then fed back to guide the LLM in distilling genre-specific highlight patterns through a self-improving loop. Experiments demonstrate that LTV-HD achieves state-of-the-art performance on TVHighlights, validating its effectiveness in real-world, annotation-free scenarios.
Paperid: 3971,   Poster  
Authors: Jiawen Li, Jiali Hu, Xitong Ling, Renao Yan, Yuxuan Chen, Tian Guan, Yonghong He
Title: Turning Pre-Trained Vision Transformers into End-to-End Histopathology Whole Slide Image Models for Survival Prediction
Abstract: Conventional whole slide image (WSI) analysis pipelines follow a two-stage process. First, an image encoder, such as a vision transformer (ViT), is used to perform batched offline feature extraction on a series of tiles cropped from the WSI. Second, a multiple instance learning (MIL) model is trained with slide-level labels to obtain task-specific slide embeddings. However, several limitations exist: strong reliance on pre-trained weights of the tile encoder, the absence of receptive fields from the original image, and a lack of task-independent WSI representations. An ideal improvement would be to develop an end-to-end pre-trained WSI model, but training it from scratch will face challenges such as high training costs and computational complexity. In this work, we deconstruct the key steps of ViT-based pathology image representation and propose a conversion strategy called E2E-ViT, which transforms a vanilla ViT into an end-to-end pre-trained WSI model without introducing additional parameters. E2E-ViT directly inputs the entire tissue region in WSIs to efficiently feed image sequences into the transformer backbone, achieving information interaction from the original receptive fields and generating slide features. Through multiple survival prediction tasks, we demonstrate that transformed pre-trained ViTs outperform two-stage MIL models and slide foundation models (SFM). Our work presents a new end-to-end learning paradigm that provides a promising direction for the next generation of computational pathology models.
Paperid: 3972,   Poster  
Authors: Imanol Estepa, Jesús M Rodríguez-de-Vera, Bhalaji Nagarajan, Petia Radeva
Title: Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation
Abstract: Discriminative and generative vision models excel in their respective domains but remain semantically misaligned, hindering progress toward unified visual learning. We introduce LEASE (LEArning from SEmantic Dictionaries), a self-supervised framework that bridges this gap using a paired generative–discriminative codebook design. LEASE operates entirely in a discrete token space produced through a one-time precomputation step, enabling efficient training without data augmentations, teacher models, or online tokenizers. LEASE integrates two complementary objectives: a masked token reconstruction loss that captures fine-grained generative detail, and a codebook contrast loss that aligns encoder features with discriminative semantics via adaptive centroid weighting. This dual supervision yields a unified latent space that supports both high-quality generation and strong representation learning. On ImageNet-1K, LEASE achieves state-of-the-art unified performance, outperforming prior VQGAN-based methods such as MAGE and Sorcen across linear probing (up to +1.7%), unconditional generation (-1.26 FID and +10.19 IS w.r.t. MAGE), few-shot learning (+0.56% on average against Sorcen), transfer (+0.75% average improvement against MAGE and Sorcen), and robustness benchmarks (+5.86% and +4.25% average improvement against MAGE and Sorcen, respectively). It also competes favorably with domain-specialized contrastive and generative models while surpassing previous MIM methods. The unsupervised LEASE model can also be extended to conditional generation by building upon its learned representations, proving competitive with specialized baselines. Overall, LEASE provides an efficient and effective step toward general-purpose vision models that jointly understand and generate visual content. Code will be released upon acceptance.
Paperid: 3973,   Poster  
Authors: Gu Zhang, Qicheng Xu, Haozhe Zhang, Jianhan Ma, Long He, Yiming Bao, Zeyu Ping, Zhecheng Yuan, Chenhao Lu, Chengbo Yuan, Tianhai Liang, Xiaoyu Tian, Maanping Shao, Feihong Zhang, Mingyu Ding, Yang Gao, Hao Zhao, Hang Zhao, Huazhe Xu
Title: UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos
Abstract: Dexterous manipulation remains challenging due to the cost of collecting real-robot teleoperation data, the heterogeneity of hand embodiments, and the high dimensionality of control. We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision–language–action (VLA) policy and a practical human-data capture setup for universal dexterous hand control. First, we construct UniDex-Dataset, a robot-centric dataset of 10M paired image–pointcloud–action frames and over 50K trajectories across eight dexterous hands (6–24 DoFs), derived from egocentric human video datasets. To transform human data into robot-executable trajectories, we employ a human-in-the-loop retargeting procedure to align fingertip trajectories while preserving plausible hand–object contacts, and we operate on explicit 3D pointclouds with human hands masked to narrow kinematic and visual gaps. Second, we introduce the Function–Actuator–Aligned Space (FAAS), a unified action space that maps functionally similar actuators to shared coordinates, enabling cross-hand transfer. Leveraging FAAS as the action parameterization, we train UniDex-VLA, a 3D VLA policy pretrained on UniDex-Dataset and finetuned with task demonstrations. In addition, we build UniDex-Cap, a simple portable capture setup that records synchronized RGB-D streams and human hand poses and converts them into robot-executable trajectories to enable human–robot data co-training that reduces reliance on costly robot demonstrations. On challenging tool-use tasks across two different hands, UniDex-VLA achieves 81% average task progress and outperforms prior VLA baselines by a large margin, while exhibiting strong spatial, object, and zero-shot cross-hand generalization. Together, UniDex-Dataset, UniDex-VLA, and UniDex-Cap provide a scalable foundation suite for universal dexterous manipulation.
Paperid: 3974,   Poster  
Authors: Linhua Cong, Bingrui Sima, Kun He
Title: Anchoring the Mind of Multimodal Reasoners: Cognitive Bias as a Vector for Jailbreak Attacks
Abstract: Multimodal Large Reasoning Models (MLRMs) exhibit remarkable performance on complex tasks by incorporating explicit multi-step reasoning. However, this capability also introduces new security vulnerabilities. Existing jailbreak studies largely overlook cognitive-level weaknesses embedded in the reasoning process itself. In this work, we uncover a critical cognitive bias in MLRMs: the anchoring effect, where safety judgments are disproportionately influenced by the first piece of information received (the anchor). Building on this finding, we propose the Reasoning-chain Anchoring Attack (RA-Attack), a novel jailbreak framework that fully exploits this vulnerability. RA-Attack employs a cross-modal safe anchor, whose core component is a structured visual mind map. This structured format provides the model with a pre-established, safety-biased reasoning chain that subtly induces it to rationalize and execute subsequent harmful intent. Extensive experiments across seven leading closed- and open-source MLRMs demonstrate the effectiveness of RA-Attack, achieving state-of-the-art jailbreak success rates: 92% on Gemini-2.5-Pro and 82% on GPT-4o. Our findings reveal that cognitive biases can be systematically exploited to manipulate multimodal reasoning chains, establishing cognitive security as a critical and underexplored frontier in AI safety research. Warning: This paper contains unsafe examples.
Paperid: 3975,   Poster  
Authors: Tianchen Deng, Xuefeng Chen, Yi Chen, Qu Chen, Yuyao Xu, Lijin Yang, Le Xu, Yu Zhang, Bo Zhang, Wuxiong.Huang Wuxiong.Huang, Hesheng Wang
Title: GaussianDWM: Driving World Model using Language-aligned 3D Gaussians for Scene Understanding and Multi-modal Generation
Abstract: Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches that represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multimodal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided token sampling strategy that removes redundant 3D information and injects accurate and compact 3D tokens into textual understanding. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes, OmniDrive-nuScenes, and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub.
Paperid: 3976,   Poster  
Authors: Jinrong Zhang, Zhaoyang Xu, Xusheng He, Xinrui Xinrui, Na Zheng, Jianlong Wu
Title: Breaking the Regional Perception Bottleneck of Multimodal Large Language Models via External Reasoning Framework
Abstract: High-quality pixel-level responses remain a major bottleneck for multimodal large language models (MLLMs) in regional perception. Existing approaches generally attach regression decoders to MLLM features, achieving strong grounding performance but compromising end-to-end design and increasing training costs. Researchers have applied parameter and data scaling to improve pure MLLMs’ ability to generate pixel coordinates in natural language, yet the performance gains on grounding tasks remain markedly weaker than those in standard QA tasks. Our analysis shows the primary bottleneck is that conventional scaling fails to effectively enhance the key reasoning stage required for pixel-level regional perception. To address this, we propose R-Ground, a reasoning framework for MLLM-based grounding built upon a multimodal Monte Carlo Tree Search algorithm. R-Ground leverages structured reasoning actions, multimodal feature alignment scoring, and regional feature weighted voting to perform scaling at the designated reasoning stage. Extensive experiments demonstrate that R-Ground achieves effective reasoning scaling, enabling a 7B MLLM to match or even surpass a 72B model on the grounding task. The code will be released upon acceptance.
Paperid: 3977,   Poster  
Authors: yusheng dai, Zehua Chen, Yuxuan Jiang, Qiuhong Ke, Jianfei Cai, Jun Zhu
Title: Omni2Sound: A Fundamental Study on Dataset, Base Model, and Benchmark for Unified Video-Text-to-Audio Generation
Abstract: Training a unified model for the generation of video-to-audio (V2A), text-to-audio (T2A) and joint video-text-to-audio (VT2A) offers significant flexibility, but is hindered by critical and unexplored challenges. We identify two foundational problems: (1) the scarcity of high-quality audio captions that feature a tight A-V-T alignment, leading to severe semantic conflict in multimodal training data, and (2) cross-task and intra-task competition during joint multi-task training, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce SoundAtlas, the first large-scale, human-expert-level audio caption dataset, augmenting VGGSound and AudioSet with semantically rich and temporally detailed captions. Powered by a novel, multi-turn agentic annotation pipeline (using advanced foundation models) that operates cost-effectively, SoundAtlas features a tight A-V-T alignment and a much lower hallucination rate than existing datasets. Second, we propose Omni2Sound, a diffusion-based unified VT2A model that supports flexible modality combinations. To address cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and off-screen audio generation faithfulness. Finally, we construct VGGSound-Omni, a comprehensive benchmark for unified evaluation of VT2A, V2A and T2A, including challenging off-screen tracks. As a result, with a vanilla DiT backbone, Omni2Sound achieves unified state-of-the-art performance in all three tasks within a single model. It also demonstrates strong generalization across multiple benchmarks with different caption and video styles. Demonstrations are provided in the Appendix.
Paperid: 3978,   Poster  
Authors: Wenjing Tang, Chuanguang Yang, Zhulin An, Libo Huang, boyu diao, Yongjun Xu
Title: EfficientVPR: Toward Efficient Visual Place Recognition via Scene-Aware Prompt Tuning and Adaptive Feature Enhancement
Abstract: Visual place recognition (VPR) faces critical challenges in handling extreme environmental variations while meeting the computational constraints of practical applications. Current methods predominantly address these challenges by either scaling up model capacity or employing computationally intensive reranking stages, creating a significant efficiency bottleneck. To overcome this limitation, we propose EfficientVPR, a lightweight one-stage framework that achieves unprecedented speed-accuracy trade-offs through two key innovations: i) a scene-aware visual prompt tuning method which adapts pretrained features with fewer parameters while dynamically adjusting to sample-specific characteristics, and ii) an instance-dependent key local feature enhancement module that further reinforces discriminative regions. Comprehensive evaluations on Pitts250k, MSLS, Eynsham, AmsterTime and SVOX demonstrate that our method establishes a new SOTA for DINOv2-small models by outperforming all same-scale competitors, and delivers a 73× speedup with 60% lower-dimensional features while remaining competitive (within a 2.5% average R@1 gap) with the SOTA DINOv2-large-based two-stage method. The code is available in Supplementary Material.
Paperid: 3979,   Poster  
Authors: Hong Gao, Xiangkai Xu, Bin Zhong, Junjie Yin, Fangyu Kang, Yutong Xu, Xiugang Dong, Xurui Gao, Min-Ling Zhang
Title: SARL-STG: A Spatially Aware Reinforcement Learning Framework for Refining MLLMs in Spatio-Temporal Video Grounding
Abstract: Spatio-Temporal Video Grounding (STVG) requires models to localize objects both spatially and temporally. Despite recent progress, existing methods struggle with complex and fine-grained spatial semantics in language descriptions, leading to error propagation from temporal to spatial grounding stages. We identify that this fundamental limitation arises from the absence of iterative refinement between temporal and spatial predictions. To address these challenges, we propose SARL-STG, the first RL-based framework for STVG. It progressively refines spatio-temporal grounding through multi-stage optimization, leveraging reinforcement learning to enable dynamic interaction between temporal and spatial modules, where spatial grounding quality serves as feedback to improve temporal localization. Specifically, SARL-STG contains: (1) a unified architecture that seamlessly integrates a pretrained MLLM for temporal reasoning with an open-vocabulary detector for spatial localization, (2) a hierarchical RL training strategy that progresses from coarse temporal to fine-grained spatio-temporal optimization, and (3) a spatial knowledge-injected reward mechanism that uses spatial grounding confidence as discriminative signals for temporal refinement. To facilitate training at scale, we also construct STVG-Wild, a large-scale dataset with diverse spatio-temporal annotations. Experiments demonstrate that our method achieves state-of-the-art performance on multiple benchmarks (HCSTVG, VidSTG, Charades-STA, etc.), significantly reducing error accumulation and enhancing both temporal and spatial grounding accuracy.
Paperid: 3980,   Poster  
Authors: Lakshmikar Reddy Polamreddy, Ming Ma
Title: CG-Reasoner: Centroid-Guided Positional Reasoning Segmentation for Medical Imaging with a Robust Visual-Text Consistency Metric
Abstract: Accurate and interpretable medical image segmentation remains a major challenge, as existing deep learning models primarily optimize pixel-level accuracy while overlooking positional reasoning, an essential component for automated report generation and clinical interpretability. We introduce CG-Reasoner, a novel centroid-guided cross-modal framework that jointly performs medical image segmentation and positional reasoning. CG-Reasoner integrates a multimodal large language model (LLM), a newly designed lightweight encoder–decoder architecture, and a Text2Centroid module that predicts lesion centroids from reasoning embeddings, enabling the model to produce both accurate segmentation masks and spatially coherent, clinically meaningful reasoning explanations. Furthermore, we propose PRScore (Positional-Reasoning Score), a robust evaluation metric that jointly measures the spatial and semantic alignment between generated reasoning text and segmentation masks. Experiments on six medical datasets across different imaging modalities demonstrate that CG-Reasoner achieves state-of-the-art performance, offering precise segmentation, spatially coherent reasoning, and clinically interpretable visual-textual explanations within a unified framework. The source code is available at https://github.com/lpmm2025/CG-Reasoner.
Paperid: 3981,   Poster  
Authors: Sha Tao, Jiao PAN, Yu Guo, Chao Yao
Title: Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities
Abstract: Multimodal magnetic resonance imaging (MRI) is crucial for brain tumor segmentation, with many methods leveraging its four key modalities to capture complementary information for effective subregion analysis. However, the absence of several modalities is very common in practice, leading to severe performance degradation in existing full-modality segmentation methods. Limited by the structured data model, recent works often adopt a multi-stage training strategy for full-modality and missing-modality scenarios, which increases training costs and inadequately addresses the interference caused by missing modalities. In this work, we propose a graph-based one-stage framework for robust brain tumor segmentation with missing modalities. Specifically, we introduce modality-specific virtual nodes that serve as supplementary information sources to compensate for missing modalities. To enhance model robustness against arbitrary modality combinations, we leverage the inherent flexibility of graph networks to devise a dynamic connection strategy. This mechanism dynamically adjusts the adjacency matrix based on modality availability, preserving beneficial information flow while mitigating interference effects caused by missing modalities. Furthermore, we strengthen the graph network with heterogeneous weight matrices, enhancing its adaptability to multimodal scenarios. Extensive experiments on the BRATS-2018 and BRATS-2020 datasets demonstrate that our method outperforms the state-of-the-art methods on almost all subsets of incomplete modalities.
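One way to picture the availability-dependent adjacency matrix described in the abstract above is the following minimal sketch. It is an assumption-laden illustration, not the paper's connection strategy: the node naming, the rule that a missing modality is replaced by its virtual node, and the fully connected wiring among active nodes are all hypothetical. The four modality names are the standard BraTS MRI sequences.

```python
MODALITIES = ["T1", "T1ce", "T2", "FLAIR"]

def build_adjacency(available):
    """Return (node_names, adjacency) for one input case.

    Each modality has a real node and a virtual node. For a given case,
    a modality is represented by its real node if it is available and by
    its virtual node otherwise (an assumed rule). Active nodes are fully
    connected, including self-loops; inactive nodes are isolated.
    """
    nodes = MODALITIES + [f"virtual_{m}" for m in MODALITIES]
    index = {name: i for i, name in enumerate(nodes)}
    active = [m if m in available else f"virtual_{m}" for m in MODALITIES]
    adjacency = [[0] * len(nodes) for _ in nodes]
    for a in active:
        for b in active:
            adjacency[index[a]][index[b]] = 1
    return nodes, adjacency

# A case where only T1 and T2 were acquired.
nodes, adj = build_adjacency({"T1", "T2"})
idx = {name: i for i, name in enumerate(nodes)}
```

Because the matrix is rebuilt per case from the set of available modalities, a single model can handle any modality combination without a separate training stage for each one.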
Paperid: 3982,   Poster  
Authors: Ray Zhang, Carl Greiff, Thomas Lew, John Subosits
Title: Generalized-CVO: Fast and Correspondence-Free Point Cloud Registration in RKHS with Second Order Riemannian Optimization
Abstract: We propose a fast and correspondence-free point cloud registration method that leverages local geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The proposed method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order methods used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR registration task in the driving domain, we achieve a reduction of >55% in both translational and rotational drift in challenging feature-sparse environments.
Paperid: 3983,   Poster  
Authors: Yuyang Hong, Jiaqi Gu, Yujing Lou, Lubin Fan, Qi Yang, Ying Wang, Kun Ding, Yue Wu, Shiming Xiang, Jieping Ye
Title: CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering
Abstract: Knowledge-based visual question answering (KB-VQA) demonstrates significant potential for handling knowledge-intensive tasks. However, conflicts arise between the static parametric knowledge that vision-language models (VLMs) acquire during pre-training and dynamically retrieved information. The outputs either ignore retrieved contexts or exhibit inconsistent integration with parametric knowledge, posing substantial challenges for KB-VQA. Current knowledge conflict mitigation methods are primarily adapted from language-based approaches, focusing on context-level conflicts through engineered prompting strategies or context-aware decoding mechanisms. However, these methods neglect the critical role of visual information in conflicts and suffer from redundant retrieved contexts, which impair accurate conflict identification and effective mitigation. To address these limitations, we propose CC-VQA: a novel training-free, conflict- and correlation-aware method for KB-VQA. Our method comprises two core components: (1) Vision-Centric Contextual Conflict Reasoning, which performs visual-semantic conflict analysis across internal and external knowledge contexts; and (2) Correlation-Guided Encoding and Decoding, featuring positional encoding compression for low-correlation statements and adaptive decoding using correlation-weighted conflict scoring. Extensive evaluations on E-VQA, InfoSeek, and OK-VQA benchmarks demonstrate that CC-VQA achieves state-of-the-art performance, yielding absolute accuracy improvements of 3.3% to 6.4% compared to existing methods. Code will be released upon acceptance.
Paperid: 3984,   Poster  
Authors: Yongkang Zhang, Dongyu She, Baiyu Ji, Qichuan Geng, Zhong Zhou, Yan Wang
Title: VMD-FACT: A New Video Dataset and MLLM-based method for Detecting Realistic AI-Generated Video Misinformation
Abstract: The rapid evolution of generative AI, exemplified by models such as Sora, has intensified the threat of video misinformation. A critical challenge in detecting such AI-generated video misinformation lies in a fundamental disconnect between existing datasets and practical deception tactics. Current datasets often disrupt cross-modal consistency through editing techniques, resulting in unrealistic and easily detectable artifacts. In stark contrast, generative video misinformation strives for semantic consistency across modalities to remain realistic. To address this gap, we introduce RAVM: the first Realistic AI-Generated Video Misinformation Detection Dataset. RAVM contains authentic claim-video pairs, as well as Realistic AI-Generated claim-video pairs. More importantly, unlike existing Video Misinformation Detection (VMD) datasets that are limited to single-source manipulations, RAVM encompasses multiple manipulation sources (Claim, Video, Audio, and Cross-Modal Manipulation), each of which includes multiple manipulation techniques to generate realistic AI-generated video misinformation. Thus, we introduce an AI-generative framework for producing realistic AI-generated video misinformation. Furthermore, we propose the IEEG model, which represents multimodal evidence, fact-checking results, and their dependencies as an evidence graph for interpretable AI-generated VMD. Extensive experiments on RAVM demonstrate the vulnerability of general Multimodal Large Language Models (MLLMs) in detecting generative video misinformation, while our IEEG achieves state-of-the-art performance on RAVM.
Paperid: 3985,   Poster  
Authors: Tianhao Han, HaoYang ZHANG, Liang Xie, Haochen Chang, Kun Gao, Yuan Cheng, Pengfei Ren, Erwei Yin
Title: UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation
Abstract: Manually annotating accurate 3D hand poses is extremely time-consuming and labor-intensive. Existing self-supervised hand pose estimation methods leverage the discrepancy between input images and rendered outputs, or multiview consistency constraints, as the driving force to optimize networks and progressively refine pose accuracy. However, these methods are highly susceptible to noisy pseudo-labels and overlook the importance of fully exploiting fine-grained spatial correlations, which undermines the stability of model training. To address these issues, we propose UST-Hand, a self-supervised learning framework that estimates the uncertainty distribution of hand pose and constructs a probabilistic point cloud feature space, which enables complex spatiotemporal relationship modeling. UST-Hand employs a conditional normalizing flow model to capture hand pose distributions and sample diverse hypotheses, facilitating robust learning under noisy pseudo-label supervision with enhanced stability. These multiple hypotheses are mapped to a unified probabilistic 3D point cloud space for multiview and temporal feature interaction, comprehensively exploring hand motion patterns and fine-grained spatial correlations. Extensive experiments on three challenging datasets demonstrate that UST-Hand achieves state-of-the-art performance, outperforming existing self-supervised methods by up to 37.8% in Mean Per Vertex Position Error (MPVPE).
Paperid: 3986,   Poster  
Authors: Xingru Huang, Shuanghua Ye, Zhao Huang, Wenwen Tang, Huiyu Zhou, Zhiwen Zheng, Jin Liu, Xiaoshuai Zhang
Title: CROWn: A Unified Framework for Anti‑Aliased Downsampling and Phase‑Calibrated Fusion in 3D Medical Segmentation
Abstract: Precise 3D medical image segmentation is a clinical cornerstone for diagnosis, therapy planning, and longitudinal monitoring. However, routine acquisition with anisotropic voxel spacing and heterogeneous reconstruction induces downsampling aliasing and cross-scale misalignment that blur boundaries, fragment topology, and undermine reliability. Existing U-shaped CNN or Transformer designs neither control alias injection at decimation nor explicitly align high-resolution evidence before decoder fusion, leading to unstable interfaces under device and protocol variability. We introduce the Coset-fibRated micrO-local co-attention Network (CROWn), a general segmentation framework that couples sampling theory with representation learning to jointly suppress aliasing and calibrate cross-scale fusion. CROWn comprises two complementary components. The Microlocal Polyphase Co-Attentive Decimator (μPCAD) performs axis-aware polyphase analysis with pooled–subband co-attention and explicit anti-alias low-pass filtering, routing boundary-relevant high-frequency evidence while attenuating spurious phase components during downsampling. The Octaphase Coset Fibration (OCF) anti-aliases high-resolution skips, restructures them via 3D space-to-depth into cosets, and applies phase attention with edge-gated modulation to deliver compact, phase-aligned, boundary-aware features to the decoder. Extensive evaluations across 15 publicly available datasets spanning CT, MRI, and OCT demonstrate that CROWn achieves state-of-the-art performance against 17 recent leading methods, improving overlap and topological consistency and consistently reducing boundary errors while maintaining controlled training and inference cost. The code is publicly available.
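The 3D space-to-depth step mentioned above can be sketched generically. The version below is the textbook operation on nested lists, assuming a stride of 2 along each axis (which yields eight cosets, consistent with the name "Octaphase" but not confirmed by the abstract); the function name and data layout are hypothetical, and the anti-aliasing and phase-attention stages are not modeled.

```python
def space_to_depth_3d(volume, stride=2):
    """Split a D x H x W volume (nested lists) into stride**3 cosets.

    The coset keyed by (dz, dy, dx) holds the voxels whose coordinates are
    congruent to (dz, dy, dx) modulo `stride`, so each coset has shape
    (D/stride) x (H/stride) x (W/stride) and no voxel is discarded.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    if depth % stride or height % stride or width % stride:
        raise ValueError("each dimension must be divisible by the stride")
    cosets = {}
    for dz in range(stride):
        for dy in range(stride):
            for dx in range(stride):
                cosets[(dz, dy, dx)] = [
                    [
                        [volume[z][y][x] for x in range(dx, width, stride)]
                        for y in range(dy, height, stride)
                    ]
                    for z in range(dz, depth, stride)
                ]
    return cosets


if __name__ == "__main__":
    # A 4 x 4 x 4 volume whose voxel value encodes its own coordinates.
    vol = [[[16 * z + 4 * y + x for x in range(4)] for y in range(4)] for z in range(4)]
    parts = space_to_depth_3d(vol)
    print(len(parts), parts[(0, 0, 0)])
```

Unlike strided subsampling, this rearrangement is lossless: the eight cosets together contain every voxel of the input exactly once.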
Paperid: 3987,   Poster  
Authors: Xinyi Wang, Pengfei Ren, HaoYang ZHANG, Hanling Zhan, Yingxi Li, Liang Xie, Yue Gao, Erwei Yin
Title: MGDHand: Multi-Granularity Prior-to-Inertial Distillation Framework for Sequential 3D Hand Pose Estimation from Sparse IMUs
Abstract: 3D hand pose estimation (HPE) from sparse inertial measurement units (IMUs) has shown great potential in human-computer interaction. However, due to the significant semantic gap between sparse local motion information and structured global pose information, estimating hand poses from sparse IMU signals is ambiguous and challenging. Knowledge distillation can transfer rich knowledge from a stronger teacher to the student, improving the student's performance. Existing approaches distill morphological priors into the IMU-based student model, effectively improving its accuracy in complex scenarios. Nevertheless, overlooking the inherent visual-inertial semantic mismatch and the difference in information density makes it difficult for the student to learn coupled priors. In this paper, we propose a Multi-Granularity Prior-to-Inertial Distillation Framework for Sequential 3D HPE from Sparse IMUs (MGDHand). We first pre-train a MANO-IMU fusion model as a teacher to encode a static geometric morphology prior, a dynamic kinematic prior, and a temporal motion prior. Then, a Multi-Granularity Decoupled Distillation (MGDistill) scheme is proposed to bridge the semantic gap. MGDistill includes a Static Shape Distillation module to transfer time-invariant hand shape priors, and a Dynamic Pose Distillation module to transfer complex joint kinematics and dense pose priors. Additionally, a Temporal Motion Distillation module transfers the fast-changing motion priors (velocity and acceleration). Extensive experiments on a public dataset demonstrate that our method outperforms state-of-the-art approaches under sparse IMU configurations.
Paperid: 3988,   Poster  
Authors: Xianglin Qiu, Jian Wang, Xiaolei Wang, Zhen Zhang, Jimin Xiao
Title: Beyond Text: Visual Description Assembly by Probabilistic Model for CLIP-based Weakly Supervised Semantic Segmentation
Abstract: Contrastive Language-Image Pre-training (CLIP) offers a new paradigm for Weakly Supervised Semantic Segmentation (WSSS) by generating Class Activation Maps (CAMs) from text-image alignment. Existing methods primarily rely on hand-crafted templates or general attribute descriptions generated by a large language model to construct text prototypes for querying visual features. However, these strategies face two major limitations: the inherent modality gap in CLIP prevents text prototypes from achieving tight alignment with visual features; and their static text prototypes cannot adaptively respond to target instances that exhibit diverse visual attributes. To address these challenges, our key insight is to directly construct an instance-specific visual description prototype as the query, thereby bypassing the suboptimal static text description optimization. To this end, we propose the Visual Description Assembly (VDA) framework. It employs a probabilistic model to map complex CLIP visual features into a structured latent space. This latent space allows us to explicitly disentangle and aggregate varied visual attributes, and then dynamically assemble them into instance-specific visual prototypes. Furthermore, to enhance the robustness of this prototype, we adaptively incorporate the semantically stable text prototype into it as the final query for generating superior CAMs. Experimental results show our method outperforms existing baselines, achieving state-of-the-art performance on WSSS benchmarks. Code will be released.
Paperid: 3989,   Poster  
Authors: Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen
Title: SJD++: Accelerating Speculative Jacobi Decoding for Text-to-Image Models via Multi-Drafting and Enhanced Rejection Stability
Abstract: Speculative Jacobi Decoding (SJD) provides a compelling, draft-model-free approach to accelerating autoregressive text-to-image synthesis. However, the high-entropy nature of image generation leads to low acceptance rates of draft tokens in high-complexity image regions, where rejections occur frequently. This bottleneck restricts SJD’s practical efficiency and limits overall throughput. To address the bottleneck, we introduce SJD++, an enhanced speculative Jacobi decoding framework. First, SJD++ integrates a well-designed multi-drafting strategy to improve local acceptance rates when generating these challenging regions. Furthermore, we propose an adaptive rejection mechanism that enhances sequence stability by continuing validation instead of reverting to full resampling after an initial mismatch. These key optimizations work in tandem to significantly increase the average accepted token length, boosting overall inference speed while strictly preserving the integrity of the target output distribution. Experiments on text-to-image benchmarks demonstrate that SJD++ achieves 3.8× acceleration with lossless image quality.
Paperid: 3990,   Poster  
Authors: Tuan Nguyen, Minh Khoi Ho, Qi Chen, Yutong Xie, Cam-Tu Nguyen, Minh Khoi Nguyen, Dang Nguyen, Anton van den Hengel, Johan Verjans, Le Nguyen, Vu Minh Hieu Phan
Title: Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations
Abstract: Large vision-language models (LVLMs) achieve strong performance on visual reasoning tasks but remain highly susceptible to hallucination. Existing detection methods predominantly rely on coarse, whole-image measures of how an object token relates to the input image. This global strategy is limited: hallucinated tokens may exhibit weak but widely scattered correlations across many local regions, which aggregate into deceptively high overall relevance, thus evading the current global hallucination detectors. We begin with a simple yet critical observation: a faithful object token must be strongly grounded in a specific image region. Building on this insight, we introduce a patch-level hallucination detection framework that examines fine-grained token-level interactions across model layers. Our analysis uncovers two characteristic signatures of hallucinated tokens: (i) they yield diffuse, non-localized attention patterns, in contrast to the compact, well-focused attention of faithful tokens, and (ii) they fail to exhibit meaningful semantic alignment with any visual region. Guided by these findings, we develop a lightweight and interpretable detection method that leverages patch-level statistical features, combined with hidden-layer representations. Our approach achieves up to 90% accuracy in token-level hallucination detection, demonstrating the superiority of fine-grained structural analysis for detecting hallucinations.
Paperid: 3991,   Poster  
Authors: Viktor Zaytsev, Olena Vynokurova, Pavlo Tytarchuk, Dmytro Kozii, Vitalii Pohribnyi, Olga Radyvonenko, Artem Shcherbina
Title: PIX-TAB: Efficient PIXel-Precise TABle Structure Recognition Approach with Speculative Decoding and Region-Based Image Segmentation
Abstract: Table structure recognition in document AI faces significant challenges due to layout inconsistencies, merged cells, and complex nested structures, which are further exacerbated by the scarcity of large, diverse annotated datasets. In this paper, we present PIX-TAB (Efficient PIXel-Precise TABle Structure Recognition Approach), which provides exact, pixel-level structure using a small, lightweight model that can run on-device. The approach is language-agnostic, as it allows adding support for new languages simply by replacing the Optical Character Recognition (OCR) model without modifying the core structure recognition model. Key innovations include: position-aware pixel-precise tokens for deterministic cell reconstruction; speculative decoding for faster sequence generation; training-only box supervision to stabilize spatial grounding; and region-based image segmentation. To mitigate data scarcity, we propose a pipeline for generating a large synthetic table dataset. Experimental results validate each component. To address the limitations of existing evaluation methods, we introduce the TEDS_struct100 and TEDS_100 metrics. The speculative decoding approach significantly improves recognition speed while maintaining accuracy. Finally, the combined techniques enable a mobile-optimized model that is more than three times faster than the full-size version.
Paperid: 3992,   Poster  
Authors: Mukhiddin Toshpulatov, Wookey Lee, Suan Lee, Geehyuk Lee
Title: Real-Time Multimodal Fingertip Contact Detection via Depth and Motion Fusion for Vision-Based Human–Computer Interaction
Abstract: Precise fingertip contact detection is a fundamental challenge for natural and immersive virtual reality (VR) interaction. However, existing vision-based methods suffer from insufficient accuracy, with typical depth errors (12-25 mm) being too large to reliably distinguish between hovering and true contact (<3 mm). While commercial motion capture systems provide sub-millimeter accuracy, their prohibitive cost limits widespread adoption. This paper addresses this critical gap by developing a highly accurate and cost-effective system for fingertip contact detection. We introduce a novel, specialized dataset of 53,300 RGB-depth pairs capturing millimeter-scale, hand-table typing interactions. By systematically fine-tuning six state-of-the-art depth estimation architectures on this dataset, we reduce the mean absolute error (MAE) by 68%, from 12.3 mm to a state-of-the-art 3.8 mm. Our complete VR keyboard system, TapBoard-X, achieves 95.96% contact detection accuracy and enables typing speeds of 45.6 WPM with a low 3.1% character error rate, rivaling physical keyboards. This performance is achieved at over a 90% cost reduction compared to commercial systems, democratizing high-precision hand tracking for the broader research community and paving the way for the next generation of tactile VR experiences.
Paperid: 3993,   Poster  
Authors: Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma, Joseph Tighe, Fanyi Xiao
Title: Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
Abstract: Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM’s transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs’ internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks, including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
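A contrastive loss over point correspondences is commonly written in the InfoNCE form, sketched below in plain Python. This is an assumption about the general family of loss, not the GASP objective itself: the function name, the temperature value, and the use of in-batch negatives are illustrative choices, and the depth consistency term is not modeled.

```python
import math


def info_nce(feats_a, feats_b, temperature=0.07):
    """Mean InfoNCE loss where feats_a[i] and feats_b[i] are a matched pair.

    Each feature is L2-normalized, similarities are cosine similarities
    divided by the temperature, and every non-matching feature in feats_b
    acts as a negative for feats_a[i].
    """

    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    a = [normalize(v) for v in feats_a]
    b = [normalize(v) for v in feats_b]
    total = 0.0
    for i, va in enumerate(a):
        logits = [sum(x * y for x, y in zip(va, vb)) / temperature for vb in b]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


if __name__ == "__main__":
    view_1 = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce(view_1, [[1.0, 0.0], [0.0, 1.0]]))  # correct correspondences
    print(info_nce(view_1, [[0.0, 1.0], [1.0, 0.0]]))  # swapped correspondences
```

The loss is small when each point's feature is closest to its true match in the other view and grows when the matches are swapped, which is the property that pushes features toward view-invariance.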
Paperid: 3994,   Poster  
Authors: Haojie Yan, Zehao Chen, Yan Liu, Shi Gu, Peng Lin, De Ma, Huajin Tang, Qian Zheng, Gang Pan
Title: eRetinexGS: Retinex Modeling for Low-Light Scene Enhancement via Event Streams and 3D Gaussian Splatting
Abstract: Perception under low illumination remains a major challenge for computer vision systems, as RGB sensors often fail to capture sufficient structural and color information in extremely dark environments. Event cameras, with their high dynamic range and temporal resolution, provide complementary cues that are well suited for such conditions. In this work, we present eRetinexGS, a novel framework that jointly leverages event streams and low-light frames through 3D Gaussian Splatting for scene-level enhancement and reconstruction. Unlike previous approaches that operate on individual frames, eRetinexGS enforces geometric and photometric consistency across multiple views, bridging the gap between degraded images and noisy event signals. By introducing an event-assisted Retinex decomposition and a reflectance–illumination representation within the 3DGS pipeline, our method reconstructs normal-light radiance fields with fine-grained details and accurate color. Extensive experiments on both synthetic and real datasets demonstrate that eRetinexGS achieves state-of-the-art performance in low-light scene enhancement while maintaining real-time rendering capability. The code and dataset will be released upon publication.
Paperid: 3995,   Poster  
Authors: Hyeonseo Jang, Jaebyeong Jeon, Joong-won Hwang, Kibok Lee
Title: Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining
Abstract: Test-time prompt tuning (TPT) has emerged as a promising technique for enhancing the adaptability of vision-language models by optimizing textual prompts using unlabeled test data. However, prior studies have revealed that TPT often produces poorly calibrated models, raising concerns about the reliability of their predictions. Recent works address this issue by incorporating additional regularization terms that constrain model outputs, which improve calibration but often degrade performance. In this work, we reveal that these regularization strategies implicitly encourage optimization toward flatter minima, and that the sharpness of the loss landscape around adapted prompts is a key factor governing calibration quality. Motivated by this observation, we introduce Flatness-aware Prompt Pretraining (FPP), a simple yet effective pretraining framework for TPT that initializes prompts within flatter regions of the loss landscape prior to adaptation. We show that simply replacing the initialization in existing TPT pipelines, without modifying any other components, is sufficient to improve both calibration and performance. Notably, FPP requires no labeled data and avoids any additional computational costs during test-time tuning, making it highly practical for real-world deployment. The code will be released.
Paperid: 3996,   Poster  
Authors: Zhiheng Wu, Tong Wang, Shuning Wang, Naiming Liu, Yumeng Zhang
Title: See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
Abstract: Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and of effective visual feedback. To address these problems, this paper proposes ForeSight, a unified multimodal interleaved reasoning framework that enables VLMs to See Further with low-level visual cues and Think Deeper with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is designed to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the performance of the proposed framework, we construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.
Paperid: 3997,   Poster  
Authors: Han Ling, Quansen Sun, Yinghua Yao, Ivor Tsang, Yinghui Sun
Title: SEA-Flow3D: Simplified, Efficient, and Accurate Scene Flow via Spatial Vector Sampling and Multi-scale Refinement
Abstract: Although depth-assisted scene flow estimation has advanced rapidly, mainstream dense frameworks (e.g., RAFT-3D) still rely primarily on 2D feature correlations to optimize 3D motion fields, which hinders their ability to exploit 3D structural priors effectively and consequently limits robustness in complex scenes. We present SEA-Flow3D, a simple, efficient, and accurate framework for dense scene flow estimation. At its core lies a Spatial Vector Sampling (SVS) module that jointly samples 3D coordinates and correlation volumes within the local neighborhood of matched points, producing a direction-aware correlation representation with explicit spatial vectors and providing strong geometric guidance for subsequent optimization. Following the simplicity-and-efficiency principle, SEA-Flow3D adopts a RAFT-style multi-scale recurrent refinement architecture, integrating an RNN-based optimizer with context-guided upsampling to achieve higher accuracy with fewer iterations. Extensive experiments on KITTI and Sintel demonstrate that SEA-Flow3D achieves state-of-the-art performance while maintaining remarkable efficiency and a lightweight design.
Paperid: 3998,   Poster  
Authors: Zhiyu Li, Dianmo Sheng, Qi Chu, Shilong Chen, Tao Gong, Zhou Wei, Nenghai Yu
Title: CDICS: Delving Into Fine-Grained Attribute for In-Context Segmentation via Compositional Prompts and Phased Decoupling
Abstract: In-Context Learning (ICL) has shown great effectiveness in developing generalist image segmentation models. Its significant advantage over text-based descriptions is the ability to convey intricate visual appearance details through simple reference images. However, finding a perfectly matching single example for rare and complex real-world concepts is difficult, and existing methods are largely confined to semantic or instance-level understanding of the reference image, struggling to express more precise segmentation needs through the input. To address this, we propose CDICS, a novel framework that leverages Compositional prompts and phased task Decoupling to achieve compositional prompt-controlled In-Context Segmentation. Our method introduces compositional prompts derived from reference prompts, combining semantic, part, and color images to dynamically define segmentation targets. To effectively fuse this control information, ensure synergy while suppressing interference, and mitigate feature coupling risks, we adopt a decoupled two-stage architecture that first performs coarse-grained semantic localization and then refines the result using compositional appearance prompts to precisely match the specified attributes. This design extends traditional in-context segmentation, enabling it to support compositional prompts. Additionally, we reconstruct two datasets and their benchmarks to acquire data with part-color-specific attributes. Our method demonstrates superior performance on the compositional prompt-controlled in-context segmentation task. It also extends the capabilities of existing in-context segmentation and makes an attempt toward real-world fine-grained segmentation.
Paperid: 3999,   Poster  
Authors: TingJia Zhang, Bo Chen, Shengzhong Liu, Fan Wu, Guihai Chen
Title: CaT-GS: Efficient 3DGS Rendering for Large Scale Scenes via Inter-frame Caching and Tile Scheduling
Abstract: Recent breakthroughs in 3D Gaussian Splatting (3DGS) have advanced neural rendering with high fidelity and speed. However, its performance degrades significantly in large-scale scenes due to the computational burden of tile-based rasterization. Existing optimization efforts either require costly scene re-training or focus on narrow aspects of the pipeline, overlooking critical inefficiencies in real-world deployments. Through a comprehensive analysis, we identify three primary sources of redundancy and low GPU utilization: redundant inter-frame pre-processing, viewpoint-based occlusion redundancy, and severe tile-level load imbalance. To address these issues, we propose CaT-GS, a novel and efficient 3DGS rendering pipeline. CaT-GS introduces a speculative multi-frame preprocessing method to eliminate redundant computations across consecutive frames, and an inter-frame caching mechanism to eliminate viewpoint-redundant rendering stages. Furthermore, it refactors rasterization tasks with a dedicated kernel to mitigate tile load imbalance, significantly boosting GPU utilization. Extensive experiments demonstrate that CaT-GS achieves a speedup of up to 10× over the original 3DGS and up to 70% over previous state-of-the-art methods, establishing a new benchmark for high-fidelity, real-time rendering of large-scale scenes.
Paperid: 4000,   Poster  
Authors: Jinkai Zheng, jiaqing wei, Xinxiang Jin, Yaoqi Sun, Xichun Sheng, Ming Li, Liangqiong Qu, Xinchen Liu, Wu Liu
Title: HyperGait: Unleashing the Power of Parsing for Gait Recognition in the Wild via Hypergraph
Abstract: In recent years, the gait parsing sequence has become increasingly popular due to its higher information entropy than the binary silhouette and the keypoint-based skeleton. However, existing parsing-based gait recognition methods have not fully explored the complex, non-linear relationships between features at the positional, semantic, and temporal-dynamics levels, i.e., higher-order correlations. To unleash the power of parsing between human body parts and temporal dynamics, this paper proposes a novel hypergraph-based gait recognition framework, named HyperGait. HyperGait contains a global head and two elaborately designed modules. In particular, the Spatial Hypergraph Convolutional Module (SHCM) and the Temporal Hypergraph Convolutional Module (THCM) are designed to explore the high-order spatial-level and temporal-level features, respectively. The SHCM extracts fine-grained relationships between human body parts through the hypergraph. The THCM captures high-order temporal information between temporally related human body parts. Comprehensive experiments on two large-scale gait datasets, i.e., Gait3D and SUSTech1K, show the superior performance of our proposed HyperGait. In highly challenging real-world scenarios, with only parsing as input, HyperGait achieves a Rank-1 accuracy of 80.5% on the Gait3D dataset.
Paperid: 4001,   Poster  
Authors: Ruonan Zhao, Zheng Wang, Debin Liu, shijie lv, Laurence Yang
Title: FedARA: Resource-adaptive Low-rank Personalized Federated Learning via Anchor-driven Representation Alignment on Heterogeneous Edge Devices
Abstract: Personalized Federated Learning (PFL) has gained significant attention for enabling participating clients to train personalized models on non-IID local data. However, current PFL methods mainly suffer from two limitations: 1) Only the personalized part supports heterogeneous design, while the shared part must remain homogeneous. 2) The semantic representations of models generated on different clients with non-IID data characteristics inevitably tend to be inconsistent, negatively impacting model performance. To overcome them, this paper proposes a novel resource-adaptive personalized Federated Learning via Anchor-driven Representation Alignment (FedARA). Concretely, we design a low-rank decomposition and reconstruction fusion scheme for shared feature extractors based on matrix decomposition, where each client can autonomously set the rank value based on its locally available resources, controlling the complexity of extractors and naturally reducing communication and computational costs. Moreover, to address the inconsistency of feature spaces across clients, an anchor-driven representation consistency learning mechanism is developed, which can guide client models to learn unified feature representations and alleviate global knowledge forgetting, thereby improving personalized model performance. Extensive experimental results demonstrate that our method significantly outperforms seventeen state-of-the-art baselines in diverse heterogeneous scenarios with lower communication and computational costs.
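Why a client-chosen rank controls communication cost follows from simple parameter counting, sketched below. The sketch assumes the standard factorization of an m x n weight matrix into an m x r factor and an r x n factor; the function names and the layer size are hypothetical, and FedARA's actual decomposition and reconstruction fusion scheme is not specified in the abstract.

```python
def full_params(rows, cols):
    """Number of parameters in a dense rows x cols weight matrix."""
    return rows * cols


def low_rank_params(rows, cols, rank):
    """Number of parameters when the matrix is stored as (rows x rank) @ (rank x cols)."""
    if not 1 <= rank <= min(rows, cols):
        raise ValueError("rank must lie between 1 and min(rows, cols)")
    return rank * (rows + cols)


if __name__ == "__main__":
    rows, cols = 512, 512  # hypothetical layer size
    for rank in (4, 8, 32):  # hypothetical ranks chosen by clients with different resources
        ratio = low_rank_params(rows, cols, rank) / full_params(rows, cols)
        print(f"rank={rank}: {low_rank_params(rows, cols, rank)} params, {ratio:.2%} of dense")
```

Because the count grows linearly in the rank, a resource-constrained client that picks a small rank stores and transmits proportionally fewer parameters than one that picks a large rank.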
Paperid: 4002,   Poster  
Authors: Hang Shi, Ruocheng Yang, Wenjie You, huangzhilin huangzhilin, Daoqiang Zhang, WEI SHAO
Title: Bulk RNA-seq Guided Multi-modal Detection of Anomalous Regions in Human Cancer via Spatial Transcriptomics
Abstract: Spatial transcriptomics (ST) has emerged as a revolutionary approach in the field of tissue analysis that can offer spatially resolved molecular insights for the identification of anomalous regions (AR) in human cancers. Current ST-based methods for detecting AR focus narrowly on the molecular features of local tissue spots, overlooking the matched bulk RNA-seq data that contains crucial diagnostic information. This oversight limits their effectiveness in identifying subtle or heterogeneous tumors, where accurate detection depends on broader genetic context. Besides the genomic signatures, the pathological images can also provide rich visual information to reflect the morphology of AR. To utilize the patient-level diagnostic knowledge and harness complementary information from both histology images and ST, we develop a Bulk RNA-seq Guided Multi-modal Anomalous Regions Detection method (BRGMAR) for the identification of AR from human tissues. Specifically, to effectively model the dependencies in ST, we introduce a Dynamic Multi-Relational Graph Learning (DMRGL) module to adaptively capture complex relationships in ST, including both spatial proximity and gene expression similarity. Then, we design an Optimal Transportation-based Gene Module Alignment (OTGMA) approach to align ST data with patient-level bulk RNA-seq data by matching the compositional and functional similarities of their corresponding gene modules. Finally, we combine the learned genomic features with pathological image representations for accurate AR detection. We evaluate our method on three publicly available ST datasets for the purpose of identifying cancerous regions from normal tissues, and the experimental results demonstrate the advantage of our method in comparison with the existing studies.
Paperid: 4003,   Poster  
Authors: Canyu Mo, Yongxiang Liu, Jiehua Zhang, Zilong Yu, Zhen Liu, Tianpeng Liu, Li Liu
Title: ORSATR-X: A Foundation Model based on Differential-and-Excitation Networks for Optical Remote Sensing Object Recognition
Abstract: Recent advances in Remote Sensing Foundation Models (RSFMs) have demonstrated considerable potential for Earth Observation (EO) tasks. While adopting natural image foundation models (e.g., DINO) provides a data-efficient strategy for building RSFMs, their strong generalization capability does not fully transfer to complex remote sensing (RS) scenarios due to severe background interference, notably in perceiving challenging targets like low-contrast objects. To this end, we propose ORSATR-X, a novel RSFM that effectively integrates the generalizable representations of DINOv3 with a dedicated mechanism for exciting local contrast information. ORSATR-X comprises two core components: (1) a DINOv3 encoder, which provides rich feature representation under limited RS pretraining data, and (2) a carefully designed side network incorporating a Weber Local Adapter (WLA) and a Multi-scale Aggregation Module (MSAM). The WLA enhances discriminability of low-contrast boundaries in complex scenes through center-surround contrast and directional gradient information enhancement, while the MSAM handles inherent object scale variations in RS imagery by adaptive aggregation of features across multiple scales. Furthermore, we pretrain the side network using an efficient self-supervised distillation strategy. Extensive experiments on scene classification, object detection, and semantic segmentation show that ORSATR-X achieves state-of-the-art performance among existing RSFMs, demonstrating the effectiveness of our design.
Paperid: 4004,   Poster  
Authors: Yifeng Bai, Zhirong Chen, Bo Song, Erkang Cheng, Haibin Ling
Title: TopoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations
Abstract: Topology reasoning is crucial for autonomous driving. Current methods primarily focus on instance-level learning for centerline detection, followed by a sequential module for topology reasoning that relies on simplified MLP layers. Moreover, these approaches often neglect the importance of point-to-instance (P2I) relationships in topology reasoning. To address these limitations, we present TopoHR (Topological Hierarchical Representation), a novel end-to-end framework that establishes cyclic interaction between centerline detection and topology reasoning, allowing them to iteratively enhance each other. Specifically, we introduce a hierarchical centerline representation including point queries, instance queries, and semantic representations. These multi-level features are seamlessly integrated and fused within a hierarchical centerline decoder. Furthermore, we design a hierarchical topology reasoning module that captures both fine-grained P2I relationships and global instance-to-instance (I2I) connections within a unified architecture. With these novel components, TopoHR ensures accurate and robust topology reasoning. On the OpenLane-V2 benchmark, TopoHR refreshes state-of-the-art performance with significant improvements. Notably, compared with previous best results, TopoHR achieves +3.8 in $\mathrm{DET}_{\text{l}}$, +5.4 in $\mathrm{TOP}_{\text{ll}}$ on subset_A and +11.0 in $\mathrm{DET}_{\text{l}}$, +7.9 in $\mathrm{TOP}_{\text{ll}}$ on subset_B, validating the effectiveness of the proposed components. The code will be shared publicly upon paper acceptance.
Paperid: 4005,   Poster  
Authors: Ke Li, Bolin Song, Hongbo Liu
Title: Seeing Depth Through Frequency and Motion: A Progressive Training Paradigm for Monocular Depth Estimation
Abstract: Self-supervised monocular depth estimation has achieved remarkable progress in recent years, yet frequency aliasing and the lack of fine-grained cross-frame motion modeling still lead to blurred depth boundaries and suboptimal camera motion estimation. To address these challenges, we propose a progressive self-supervised framework that integrates a Frequency-Guided Depth Network (FGDepth) and a PoseQuery Network (PQNet). FGDepth incorporates a plug-and-play Frequency-Guided Sampling module that explicitly enhances high-frequency details and suppresses aliasing artifacts, producing depth maps with sharper boundaries. PQNet employs channel-aligned attention to model fine-grained cross-frame motion features, enabling more accurate and robust camera motion estimation. Furthermore, we design a progressive three-stage decoupled training strategy that effectively leverages the complementarity between depth and pose estimation, further improving overall performance. Extensive experiments on the KITTI benchmark demonstrate state-of-the-art performance, achieving a 4.1% reduction in Sq Rel over strong baselines, and our method also exhibits excellent cross-dataset generalization on Make3D. Ablation studies further validate the effectiveness of each proposed component.
Paperid: 4006,   Poster  
Authors: Haodong Jing, Dongyao Jiang, Jixin Wang, Junhao Jia, Yanshu Li, Yongqiang Ma, Nanning Zheng
Title: More Natural, More Real: Object-aware Gaussian Splatting for 3D Visual Decoding from Human Brain
Abstract: Exploring human visual perception and understanding of the stereoscopic world represents a significant topic in computational neuroscience. Recent studies have provided rich Brain3D datasets and conducted preliminary explorations into 3D visual reconstruction. However, existing research struggles to capture the differences in dynamic changes of 3D stimulus views, and there remains room for improvement in high-fidelity reconstruction and rendering. 3D Gaussian Splatting (3DGS) has recently achieved significant progress in stereoscopic view synthesis. Inspired by it, we propose BrainGS, an innovative framework for decoding more realistic 3D objects from the brain. BrainGS incorporates a Fusion Time-Spatial Network to achieve comprehensive encoding of the brain. Combined with the Multi-Attribute Controller (MAC), it decouples features using visual, semantic, and color as anchors, effectively learning the feature distribution of Brain-3D and providing initial control for 3DGS. The Multi-View Stabilizer (MVS) overcomes the challenge of capturing multi-view changes of 3D objects, creating more robust viewpoint representations. Comprehensive experiments and discussions on fMRI/EEG show the SOTA performance (2.936 FPD, 0.202 LPIPS) of BrainGS, providing reliable neural interpretations and offering new insights into brain stereovision understanding.
Paperid: 4007,   Poster  
Authors: Pengfei Yue, Xingran Zhao, Juntao Chen, Peng Hou, Wang Longchao, Jianghang Lin, Shengchuan Zhang, Anxiang Zeng, Liujuan Cao
Title: SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia
Abstract: Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question–answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive progress in document and scene text understanding from a global perspective.
Paperid: 4008,   Poster  
Authors: Chade Li, Haida Feng, Pengju Zhang, Yihong Wu
Title: MORE-STEM: Long-Short MemOry REcall and Spatio-TEmporal Consistency Model for Query-Driven 3D/4D Point Cloud Segmentation
Abstract: Current query-driven 3D understanding methods are constrained to static point clouds, limiting their ability to reason about dynamic scenes. To bridge this gap, we propose MORE-STEM, a unified framework for Long-Short MemOry REcall and Spatio-TEmporal Consistency Model in Query-Driven 3D/4D Point Cloud Segmentation. The framework first introduces a Cross-Frame Text-Visual Alignment module that establishes fine-grained, time-aware correspondences between linguistic queries and dynamic 3D features. Building on this, a Spatio-Temporal Consistency Model module enforces motion-aware coherence across consecutive frames, ensuring stable and temporally consistent segmentation. A Long-Short Memory Recall module further enhances cross-scene reasoning through hierarchical memory that balances long-term semantic recall and short-term adaptation. We also construct a new outdoor benchmark for both 3D and 4D instruction segmentation with temporally aligned, motion-centric text annotations. Experiments demonstrate that MORE-STEM achieves state-of-the-art performance across multiple 3D and 4D understanding tasks.
Paperid: 4009,   Poster  
Authors: Rongchao Zhang, Chengxin Li, Yiwei Lou, Yuling Shi, Hanpin Wang, Yu Huang
Title: Steering Where to Diffuse: Generative Modeling of Phenotypic Response Simulation with Steered Diffusion Bridge
Abstract: Phenotypic Response Simulation (PRS) has long been a fundamental task in quantitative biology and high-throughput screening, with the potential to accelerate therapeutic development and elucidate disease mechanisms beyond empirical clinical practice. However, the vast perturbation space poses challenges to the discriminative formulation, and existing generative approaches tend to concentrate on the same trajectory subspace, making their generated paths prone to drift. To fill these gaps, we propose a novel Steered Diffusion Bridge approach, named SimuSDB, to define deterministic probabilistic trajectories between two distinct state domains for cell response generation. SimuSDB consists of two iterative processes: i) extending the diffusion bridge paradigm to maintain stochasticity and diversity in interpolation trajectories by introducing Brownian bridges and ii) generating cell morphologies that comply with phenotypic constraints, while allowing the latter to explicitly guide the generative process. For the challenging second stage, which involves incorporating diverse morphological constraints and phenotype rules, we formalize the rule-guided sample generation task as an optimal control problem within a stochastic dynamical system. In this way, the generative model can achieve analytically tractable optimal control strategies and steered generation without collapsing toward the trajectory of the same data subspace. Comprehensive experiments demonstrate the superior performance of SimuSDB.
Paperid: 4010,   Poster  
Authors: Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, Spencer Whitehead
Title: WebGym: Scaling Training Environments for Long-Horizon Visual Web Agents with A Million Realistic Tasks
Abstract: We present WebGym, the largest open-source environment for training realistic visual web agents to date. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 1 million tasks with rubric-based evaluations across diverse, real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) algorithm, REINFORCE, which trains on the agent's own interaction traces (rollouts), using task rewards as feedback to guide learning. To speed up sampling rollouts for RL training, we develop a high-throughput asynchronous rollout system, designed specifically for web agents, that achieves a 4-5x rollout speedup compared to naive implementations, enabling us to train at scale on a diverse set of tasks. With this setup, we fine-tune strong vision-language models, such as Qwen-3-VL-8B-Instruct, on the training tasks from WebGym, which results in an improvement in success rate on an out-of-distribution test set from 21.8% to 28.5%, outperforming a "proprietary" GPT-4o-based agent and closing the gap to a GPT-5-Thinking agent that achieves 31.8%. This improvement is significant because our test set consists only of tasks on websites never seen during training, demonstrating generalization for web agents. We provide both the task breadth and system throughput for large-scale RL on web agents.
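As an illustration of the REINFORCE update named in the abstract above, the following is a toy sketch rather than WebGym's training code. It assumes a softmax policy over a handful of discrete choices, treats each rollout as a single (action, reward) pair, and uses no baseline; a real web agent acts over long multi-step trajectories. All names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, rollouts, lr=0.1):
    """One REINFORCE step: theta += lr * mean(reward * grad log pi(action)).

    rollouts is a list of (action_index, reward) pairs sampled from the
    current policy. For a softmax policy, the gradient of log pi(a) with
    respect to logit k is 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action, reward in rollouts:
        for k in range(len(logits)):
            indicator = 1.0 if k == action else 0.0
            grad[k] += reward * (indicator - probs[k])
    n = len(rollouts)
    return [theta + lr * g / n for theta, g in zip(logits, grad)]


# Toy usage: only action 0 earned a task reward, so its probability rises.
old_logits = [0.0, 0.0, 0.0]
new_logits = reinforce_update(old_logits, [(0, 1.0), (1, 0.0), (2, 0.0)])
```

Rollouts with zero reward contribute nothing to this gradient, which is one reason sampling throughput matters when successes are rare.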
Paperid: 4011,   Poster  
Authors: Mijeong Kim, Jungtaek Kim, Bohyung Han
Title: GP-4DGS: Probabilistic Analysis of 4D Gaussian Splattings for Monocular Video Reconstruction via Variational Gaussian Processes
Abstract: We present GP-4DGS, a probabilistic framework for monocular video reconstruction that models the motion of 4D Gaussian Splatting (GS) primitives using variational Gaussian Processes (GPs). In contrast to prior approaches that depend on manually designed motion priors, our kernel-based probabilistic formulation enables flexible, data-adaptive motion modeling while implicitly providing appropriate priors for unobserved regions. GP-4DGS employs variational GPs with spatial kernels to capture geometric correlations and periodic kernels to characterize temporal dynamics, achieving efficient scalability to large sets of primitives compared to standard GPs. To train GP-4DGS, we introduce an optimization strategy that jointly optimizes GS primitive parameters as well as GP hyperparameters, establishing a complementary relationship between probabilistic and geometric modeling. Beyond improved reconstruction quality, our variational GP formulation naturally supports uncertainty quantification and temporal extrapolation beyond the input sequence. Experiments on challenging dynamic scenes demonstrate that GP-4DGS delivers high-quality reconstructions, robustly handles severe occlusions and extreme viewpoints, and provides principled uncertainty estimation and extrapolation.
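The pairing of spatial and periodic kernels mentioned above can be sketched with textbook kernel forms. This is an assumption-laden illustration, not the paper's formulation: the squared-exponential spatial kernel, the standard periodic kernel, and their combination by a product are all choices made here, and the function names and hyperparameters are hypothetical.

```python
import math


def rbf_kernel(x, y, lengthscale=1.0):
    """Squared-exponential kernel on spatial positions (sequences of floats)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq_dist / lengthscale ** 2)


def periodic_kernel(t, s, period=1.0, lengthscale=1.0):
    """Standard periodic kernel on scalar timestamps."""
    phase = math.pi * abs(t - s) / period
    return math.exp(-2.0 * math.sin(phase) ** 2 / lengthscale ** 2)


def spatio_temporal_kernel(x, t, y, s, period=1.0):
    """Product kernel: high only when points are near in space AND in phase."""
    return rbf_kernel(x, y) * periodic_kernel(t, s, period=period)


# Two timestamps exactly one period apart are maximally correlated in time.
same_phase = periodic_kernel(0.0, 2.0, period=2.0)
half_phase = periodic_kernel(0.0, 1.0, period=2.0)
```

Because the periodic kernel depends only on the phase difference, a GP built on it can carry correlations to timestamps beyond the observed range, which is the intuition behind kernel-based temporal extrapolation.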
Paperid: 4012,   Poster  
Authors: Xianbing Zhao, Lan Luo, Heng-yang Lu, Buzhou Tang
Title: Prototype-as-Prompt: Multimodal Sentiment Prototypes Endowing Large Language Models the Capability to Perform Multimodal Sentiment Analysis
Abstract: Multimodal Sentiment Analysis (MSA) aims to integrate textual, acoustic, and visual information to predict sentiment polarity. With the emergence of Large Language Models (LLMs), existing studies commonly employ learnable queries to compress audio–visual representations and feed them as soft prompts into LLMs for MSA. However, because these queries are learned implicitly, they lack explicit guidance regarding how each query encodes sentiment semantics. To address this issue, we propose a prototype-as-prompt (PaP) framework that maps audio–visual representations into a fixed set of multimodal sentiment prototypes. These prototypes are then used as soft prompts to guide the LLM in performing MSA. Concretely, we first compress both textual and non-textual features into multimodal prototypes using a resampling-based strategy. We further introduce sentiment-aware prototype learning that explicitly binds multimodal prototypes with sentiment semantics. To ensure both cross-modal consistency and intra-modal diversity of multimodal sentiment prototypes, we design a cross-modal prototype alignment constraint and a distance-weighted prototype diversity constraint. Extensive experiments across three LLMs and four benchmark datasets show that PaP achieves superior performance with only 0.09%–0.26% of trainable parameters, highlighting its effectiveness and parameter efficiency.
Paperid: 4013,   Poster  
Authors: Muhammad Kamran Janjua, Hugo Luis Andrade Silva, Di Niu, Bahador Rashidi
Title: Don’t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
Abstract: Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs; it is how tool outputs are represented. We introduce Perception Programs (P^2), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P^2 consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P^2 raises its accuracy from 41.35% to 86.47% on multi-view reasoning, from 52.42% to 81.45% on relative depth, and achieves a 22% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 15–40% absolute gains from P^2, surpassing prior agentic, supervised, and RL-based tool-use methods, without any training or model modifications.
Paperid: 4014,   Poster  
Authors: Honglin Xiong, Chenjie Zhu, Jianbiao Ding, Zixuan Ni, Wei Li, Zhenpeng Mi, Qian Wang
Title: Beyond Sequential Tools: A Unified VLM Agent System for Photographic Post-Processing via Dynamic Multi-Expert Fusion
Abstract: Real-world photographic post-processing is a formidable challenge due to the frequent co-occurrence of multiple, coupled image degradations. Current paradigms, such as monolithic "all-in-one" models, often face generalization bottlenecks, while recent agent-based systems suffer from time-consuming, sequential tool invocation and suboptimal coordination of isolated, single-task tools. To overcome these limitations, we propose a novel and efficient framework: a vision-language agent system for universal photographic post-processing. Our system employs a powerful Vision-Language Model (VLM) as an orchestrator agent to perform nuanced user intent understanding and in-depth degradation analysis. Based on its assessment, the VLM generates a structured plan, dynamically allocating weights to a suite of specialized expert LoRA modules. These experts, which adapt only the Key (K) and Value (V) matrices for enhanced composability, are then simultaneously merged into a pretrained diffusion backbone to execute a tailored restoration. To ensure perceptually optimal weights, we introduce a lightweight allocation branch trained on the VLM's features using Direct Preference Optimization (DPO) from human feedback. This dynamic fusion paradigm enables a synergistic, context-aware restoration in a single, efficient forward pass. Our method demonstrates state-of-the-art performance across a wide range of synthetic and real-world datasets with diverse degradations. Crucially, it exhibits remarkable zero-shot generalization, achieving excellent results on real-world data. Our code and weights will be made publicly available.
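The idea of merging several weighted LoRA experts into one projection matrix, as described above, can be sketched as follows. This is a generic illustration under stated assumptions, not the paper's implementation: it assumes the merged weight is the base matrix plus a weighted sum of low-rank products B_i A_i, and the function names, shapes, and weights are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def merge_experts(base, experts, weights):
    """Return base + sum_i weights[i] * (B_i @ A_i).

    base:    d_out x d_in matrix (e.g. a key or value projection)
    experts: list of (B, A) pairs with B of shape d_out x r, A of shape r x d_in
    weights: one scalar per expert (here assumed to come from a planner)
    """
    merged = [row[:] for row in base]  # copy so the base weights stay intact
    for (b, a), w in zip(experts, weights):
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += w * value
    return merged


# Toy usage with two rank-1 experts on a 2 x 2 projection.
base = [[1.0, 0.0], [0.0, 1.0]]
expert_a = ([[1.0], [0.0]], [[0.0, 2.0]])
expert_b = ([[0.0], [1.0]], [[4.0, 0.0]])
merged = merge_experts(base, [expert_a, expert_b], [0.5, 0.25])
```

Since all expert deltas are folded into a single matrix before inference, the merged layer costs the same as the original one at run time, which matches the single-forward-pass claim in the abstract.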
Paperid: 4015,   Poster  
Authors: Le Zhang, Jihan Yang, Soundarya Krishnan, Jimit Majmudar, Xiou Ge, Prasoon Puri, Prathamesh Saraf, Shruti Bhargava, Dhivya Piraviperumal, Yinan Ling, Cindy Pan, Hong Yu, Aishwarya Agrawal, Bo-Hsiang Tseng
Title: From Where Things Are to What They Are For: Benchmarking Spatial–Functional Intelligence in Multimodal LLMs
Abstract: Human-level agentic intelligence transcends low-level geometric perception, evolving from knowing where things are to understanding what they are for. While existing benchmarks effectively evaluate these foundational geometric perception capabilities of multimodal LLMs, they fall short of probing the higher-order cognitive abilities essential for grounded intelligence. To bridge this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1500 expert-annotated questions derived from diverse, egocentric indoor video scans. SFI-Bench is designed to systematically evaluate two complementary dimensions of advanced reasoning: 1) Structured Spatial Reasoning, understanding complex layouts and forming coherent spatial representations, and 2) Functional Reasoning, inferring object affordances and context-dependent utility. Its tasks, including conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting, directly challenge a model's ability to integrate perception, memory, and inference. Our experiments reveal that current MLLMs consistently struggle to integrate spatial memory with functional and external knowledge, highlighting a critical bottleneck. SFI-Bench thus provides an essential tool for measuring and driving progress towards more cognitively capable and truly grounded multimodal agents.
Paperid: 4016,   Poster  
Authors: JangHyeon Lee, Philipe Ambrozio Dias, Yao-Yi Chiang, Dalton Lunga
Title: Beyond What's Shared: Recovering Lost Unique Information from Intermediate Layers to Boost Multimodal Geo-Foundation Models
Abstract: Learning general-purpose representations of geographic locations has become essential to geospatial tasks such as population estimation and environmental monitoring. To obtain such representations, multimodal geo-foundation models often use contrastive learning (CL) to align satellite imagery with geo-coordinates, implicitly assuming that cross-modal (shared) information suffices for downstream tasks. However, not all task-relevant information is shared between modalities, and retaining modality-specific (unique) features can improve task performance. Prior methods retain unique information through extra training objectives or databases, increasing training complexity and computation. Motivated by the conventional wisdom that earlier layers capture general input features while later layers become task-specific, we hypothesize that early layers in CL models contain unique information that is lost toward the final layer. Through a comprehensive layerwise analysis of modality gap, representation similarity, and mutual information, we confirm this trend and find that fusing intermediate (more unique) and final (more shared) representations outperforms state-of-the-art models across diverse geospatial benchmarks. Our findings reveal underutilized information diversity in CL models and show that simple layerwise fusion is an efficient path to richer geo-embeddings.
Paperid: 4017,   Poster  
Authors: Vladislav Pyatov, Gleb Bobrovskikh, Saveliy Galochkin, Nikita Boldyrev, Oleg Voynov, Alexander Filippov, Gonzalo Ferrer, Peter Wonka, Evgeny Burnaev
Title: CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models
Abstract: We introduce CADFS, a data-centric framework that enables large vision-language models to generate complex CAD design histories. Existing generative CAD systems are restricted to sketch-and-extrude operations due to simplified representations and limited datasets. We address this by introducing a FeatureScript-based representation and constructing a dataset of 450k real-world CAD models spanning 15 modeling operations, obtained via a new pipeline that reconstructs clean, executable FeatureScript programs and provides multimodal annotations. Fine-tuning a VLM on this representation yields state-of-the-art results in text-conditioned CAD generation and image-based reconstruction, producing more accurate, diverse, and feature-rich designs than prior frameworks. Ablations show that FeatureScript, the expanded operation set, and representation-aligned textual descriptions all significantly improve performance. Our framework substantially broadens the complexity and realism achievable in generative CAD.
Paperid: 4018,   Poster  
Authors: Sudong Cai, Shuai Yuan, Bingzhi Chen, Rui Mao, Bing Wang
Title: Selection-as-Nonlinearity: Bridging Attention and Activation via a Joint Game–Decision Lens for Interpretable, Discriminative Visual Representations
Abstract: Self-attention with separate pre- and post-projections can be a universal approximator (on compact domains) under mild conditions. Yet we observe a striking gap: an attention-only Transformer (w/o FFN layers) exhibits a marked accuracy drop relative to its standard interleaved attention-FFN baseline. We term this the weak-independence challenge of attention. We study this through a new conceptual lens, Selection-as-Nonlinearity (SaN), which interprets effective nonlinearity as directed, cost-constrained selection, offering a coherent account of attention as context-gated activation. In this joint game–decision view, attention performs a resource-constrained cooperative allocation over values: each query distributes a unit-mass weight budget over shared values to optimize representational utility, under a normalizer (e.g., $\mathrm{softmax}$), and guided by context-derived scores (e.g., q-k similarities). SaN interprets weak-independence as a structural tension: the value weights can hardly attain the decoupled per-query (row-wise) and per-value (column-wise) optima simultaneously under shared budgets, thereby limiting attention's stand-alone capacity. Guided by SaN, we introduce CSaN, an interpretable, efficient attention compensation paradigm with two key insights: 1) hierarchical budget calibration, which re-allocates row budgets via inter-query correction signals; and 2) public-private cooperation, which enhances the public attention pathway with a per-token private value pathway to decouple conflicting demands. CSaN is evaluated on various vision benchmarks and demonstrates level-jump gains across popular Transformer families (Swin, ViT, Hiera), enabling models to rival much heavier same-family counterparts ~2× as large.
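The "unit-mass weight budget" per query can be checked numerically with ordinary softmax attention weights. The sketch below only illustrates that row-wise constraint and the absence of any matching column-wise constraint; it is not CSaN, and the score matrix is a made-up example.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores):
    """Row-wise softmax: each query (row) spends a budget of exactly 1."""
    return [softmax(row) for row in scores]


# Hypothetical scores: all three queries prefer the first value.
scores = [[2.0, 0.0, 0.0],
          [2.0, 0.0, 0.0],
          [2.0, 0.0, 0.0]]
weights = attention_weights(scores)

row_sums = [sum(row) for row in weights]            # per-query budgets
col_sums = [sum(weights[i][j] for i in range(3))    # mass each value receives
            for j in range(3)]
```

Every row sums to one by construction, while the mass received by each value depends on what the other queries do: here the first value collects most of the total and the other two are left with little.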
Paperid: 4019,   Poster  
Authors: Lizhou Lin, Songpengcheng Xia, Zengyuan Lai, LanSun LanSun, Jiarui Yang, Ling Pei
Title: IMU-HOI: A Symbiotic Framework for Coherent Human-Object Interaction and Motion Capture via Contact-Conscious Inertial Fusion
Abstract: Capturing full-body human motion with object interactions is crucial for AR/VR and robotics applications, yet it remains challenging for conventional vision-based methods due to occlusions and constrained capture volumes. Inertial measurement units (IMUs) offer a compelling alternative without line-of-sight requirements, but existing IMU-based motion capture assumes an isolated human and ignores object contacts and dynamics. To bridge this gap, we present IMU-HOI, a novel framework that jointly recovers full-body human pose and 6-DoF object trajectory from sparse IMUs on the body and object, explicitly modeling human-object interaction. Our approach first infers probabilistic hand–object contacts directly from IMU streams and uses them as a high-level signal to route between kinematic and inertial reasoning. These contact cues drive a three-stage fusion pipeline that refines human pose and root translation, and fuses hand-based forward kinematics with object-IMU integration for object motion, yielding coherent, drift-resilient trajectories for both human and object. Experiments on challenging human-object interaction scenarios demonstrate substantial accuracy gains over prior inertial motion capture methods. Moreover, IMU-HOI can be plugged into existing sparse-IMU mocap backbones with minimal changes, effectively extending the scope of purely inertial motion capture from isolated humans to full human–object interaction and joint motion estimation.
Paperid: 4020,   Poster  
Authors: Nicolas Stalder, Benjamin F Grewe, Matteo Saponati, Pau Vilimelis Aceituno
Title: A combination of noise and bilateral filters achieve supralinear and scalable adversarial robustness in CNNs
Abstract: The vulnerability of deep neural networks to adversarial examples poses a significant challenge for real-world deployment. Existing techniques to enhance deep network robustness rely on adversarial training, an approach that is powerful but computationally intensive and typically tailored to specific attack types. To address these limitations, existing works have explored techniques such as adding Gaussian noise or filtering images, both of which can boost the network robustness to various adversarial attacks, albeit modestly. Here, we theoretically demonstrate that these two approaches enhance robustness against adversarial attacks through complementary mechanisms, resulting in supralinear robustness when combined. Building on this insight, we experimentally show that a simple preprocessor combining Gaussian noise and bilateral filtering yields supralinear improvements in adversarial robustness with minimal computational cost. Next, we combine our preprocessor with adversarial training and test on RobustBench to assess its supralinear improvement over state-of-the-art defenses. First, this combination ranks second on AutoAttack and third overall, while using only ~35% of the training FLOPs, a model with 50% fewer parameters, ~33% of the epochs, and ~15% of the data compared to state-of-the-art defenses. Second, our method scales efficiently, matching the accuracy of competing models with roughly 2–8× less total compute across ~3 orders of magnitude. Overall, our approach provides a principled and easily integrable framework for enhancing adversarial robustness, offering negligible computational overhead and a simple yet theoretically grounded design.
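A preprocessor of the kind described above can be sketched in a few lines. This is a simplified illustration, not the authors' pipeline: it works on a 1D signal instead of an image, applies noise before filtering (the order is an assumption), and all function names and parameter values are hypothetical.

```python
import math
import random


def add_gaussian_noise(signal, sigma, seed=0):
    """Add zero-mean Gaussian noise; seeded so the sketch is reproducible."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def bilateral_filter_1d(signal, radius=2, sigma_space=1.0, sigma_range=0.1):
    """Edge-preserving smoothing: neighbours count less when they are far
    away (spatial weight) or differ strongly in value (range weight)."""
    n = len(signal)
    out = []
    for i in range(n):
        num, den = 0.0, 0.0
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            w_space = math.exp(-0.5 * ((i - j) / sigma_space) ** 2)
            w_range = math.exp(-0.5 * ((signal[i] - signal[j]) / sigma_range) ** 2)
            num += w_space * w_range * signal[j]
            den += w_space * w_range
        out.append(num / den)  # den >= 1 because the j == i term has weight 1
    return out


def preprocess(signal, noise_sigma=0.05):
    """Assumed order: inject noise first, then apply the bilateral filter."""
    return bilateral_filter_1d(add_gaussian_noise(signal, noise_sigma))


# A clean step edge survives filtering almost unchanged.
step = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
filtered_step = bilateral_filter_1d(step)
```

The range weight is what separates this from a plain Gaussian blur: samples across a large intensity jump get almost no weight, so edges stay sharp while small fluctuations are averaged out.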
Paperid: 4021,   Poster  
Authors: Krzysztof Adamkiewicz, Brian Bernhard Moser, Stanislav Frolov, Tobias Christian Nauen, Federico Raue, Andreas Dengel
Title: When Pretty Isn’t Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators
Abstract: Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following. But do they perform well as synthetic vision data generators? In this work, we revisit the promise of synthetic data as a scalable substitute for real training sets and uncover a surprising performance regression. We generate large-scale synthetic datasets using state-of-the-art T2I models released between 2022 and 2025, train standard classifiers solely on this synthetic data, and evaluate them on real test data. Despite observable advances in visual fidelity and prompt adherence, classification accuracy on real test data consistently declines with newer T2I models as training data generators. Our analysis reveals a hidden trend: These models collapse to a narrow, aesthetic-centric distribution that undermines diversity and label-image alignment. Overall, our findings challenge a growing assumption in vision research, namely that progress in generative realism implies progress in data realism. We thus highlight an urgent need to rethink the capabilities of modern T2I models as reliable training data generators.
Paperid: 4022,   Poster  
Authors: Chenyang Gu, Jiaming Liu, Hao Chen, Runzhong Huang, Qingpo Wuwu, Xiaoqi Li, Zhuoyang Liu, Ying Li, Renrui Zhang, Peng Jia, Pheng-Ann Heng, Shanghang Zhang
Title: From Manuals to Actions: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation
Abstract: Vision–Language–Action (VLA) models have recently emerged, demonstrating strong generalization in robotic scene understanding and manipulation. However, when confronted with long-horizon tasks that require defined goal states, such as LEGO assembly or object rearrangement, existing VLA models still face challenges in coordinating long-horizon planning with precise manipulation. Therefore, we aim to endow a VLA model with the capability to infer the “how” process from the “what” outcomes, transforming goal states into executable procedures. In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution. Unlike prior VLA models that directly map sensory inputs to actions, we first equip ManualVLA with a planning expert that generates intermediate manuals consisting of images, visual prompts, and textual instructions. Building upon these multimodal manuals, we design a Manual Chain-of-Thought (ManualCoT) reasoning process that feeds them into the action expert, where each manual step provides explicit control conditions, while its latent representation offers implicit guidance for accurate manipulation. To alleviate the burden of data collection, we develop a high-fidelity digital-twin toolkit based on 3D Gaussian Splatting, which automatically generates manual data for planning expert training. ManualVLA demonstrates strong real-world performance, achieving an average success rate 32% higher than the previous hierarchical SOTA baseline on LEGO assembly and object rearrangement tasks.
Paperid: 4023,   Poster  
Authors: Yu Qi, Hongyu Li, Shaofei Huang, Tianrui Hui, Yaxiong Wang, Lechao Cheng, Zhun Zhong, Si Liu, Meng Wang
Title: Parse, Search, and Confirmation: Training-Free Aerial Vision-and-Dialog Navigation with Chain-of-Thought Reasoning and Structured Spatial Memory
Abstract: In this paper, we present PSC-AVDN, a training-free framework for Aerial Vision-and-Dialog Navigation that integrates a three-stage Parsing-Search-Confirmation reasoning pipeline with a Structured Spatial Memory (SSM) module. The parsing stage converts ambiguous instructions into stable geometric cues, Search-CoT conducts stepwise high-altitude target exploration, and Confirmation-CoT performs fine-grained verification to resolve visual ambiguity and confirm the final target. Meanwhile, SSM integrates multi-scale visual observation, spatial visual memory, and structured geometric memory to provide global spatial context and long-horizon consistency. Extensive experiments on the AVDH and AVDH-Full datasets show that PSC-AVDN sets new state-of-the-art performance in the training-free setting, matching or surpassing several fine-tuned methods. We believe this framework offers a principled way to combine explicit CoT-style reasoning with structured spatial memory for scalable and generalizable aerial embodied navigation in the future.
Paperid: 4024,   Poster  
Authors: Puria Azadi Moghadam, Ali Khajegili Mirabadi, Behnam Maneshgar, Hossein Farahani, Ali Bashashati
Title: Factorized Context Aggregation for Robust Cancer Risk Estimation via Soft Re-Ranked Retrieval and Hierarchical Anchors
Abstract: Accurate cancer risk assessment is critical for personalized treatment planning. While multimodal models that integrate histopathology with complementary data modalities (e.g., genomics or clinical reports) exhibit superior prognostic capability, they typically assume full data availability, an unrealistic expectation in real-world clinical settings. In contrast, histopathology slides are routinely collected, universally accessible, and information-rich, making them a practical anchor for robust survival prediction. In this study, we propose a novel framework that leverages histopathology as a basis for outcome prediction, while using other data modalities when training the models. Extensive experiments across eight cancer types and scenarios, including various data modalities, demonstrate that our model outperforms all baselines, with up to 8% gains over methods that solely use histopathology at training time, and a 1.4% gap compared to models that utilize all data modalities. Our model also stratifies patients into meaningful risk groups in 67% of risk stratification scenarios (vs. 50% for best SOTA), generalizes well under varying modality missingness, and matches the best SOTA even with a 40% higher rate of missing data during training. It also preserves semantic alignment in zero-shot settings. These results highlight the practical utility and robustness of our approach for real-world cancer risk prediction in resource-limited or modality-incomplete settings.
Paperid: 4025,   Poster  
Authors: Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham, Arjun Chandra, Joey Huang, Pengyue Zhu, Helen Chen, David Li, Jeffrey Li, Shawn Li, Andrew Zagula, Amy Zhao, Andrew Zhu, Sayaka Nakamura, Yuki Yamamoto, Jerry Yokono, Aaron Mueller, Bryan A. Plummer, Kate Saenko, Venkatesh Saligrama, Boqing Gong
Title: BabyVLM v2: Toward Developmentally Grounded Vision–Language Models with Real Infant-View Data and Cognitive Evaluation Benchmarks
Abstract: Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision–language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.
Paperid: 4026,   Poster  
Authors: Haohong Kuang, Yang Xiao, Changlong Jiang, Jinghong Zheng, Hang Xu, Ran Wang, Zhiguo Cao, Joey Tianyi Zhou
Title: JUMP-Hand: Learning Joint-wise Uncertainty to Gate Mixture of View Experts for Multi-View 3D Hand Reconstruction
Abstract: In this paper, JUMP-Hand is proposed as a novel method for multi-view 3D hand reconstruction, which is the first to introduce probabilistic joint-wise uncertainty as an explicit gating mechanism to fuse multi-view information. Existing approaches usually fuse multi-view information by naïve pooling or implicit attention. However, they overlook that each hand joint exhibits varying visibility and reliability across views, which may degrade performance by indiscriminately aggregating noisy or unreliable information. For instance, one joint may be clearly visible in one view, while another joint is occluded in that view but visible in a different view. In contrast, JUMP-Hand addresses this by introducing the core insight of Mixture of Experts (MoE) and regarding each 2D view as an expert. The key idea is that the reliability of each view expert is quantified through joint-wise uncertainty modeling, serving as an explicit gating signal to route experts' partial yet complementary clues for each joint in a coarse-to-fine reconstruction paradigm. In this design, uncertainty not only guides the uncertainty-aware triangulation for reliable 3D hand initialization during the coarse stage, but also acts as a gating signal during the refinement stage to adaptively aggregate multi-scale features from different view experts on a joint-wise basis, enabling robust 3D hand reconstruction. Extensive experiments on DexYCB-MV, HO3D-MV, and OakInk-MV demonstrate that our method achieves state-of-the-art results, validating the effectiveness of the proposed method with joint-wise uncertainty gating for reliable 3D hand reconstruction. The code will be released upon acceptance.
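The notion of using per-joint uncertainty as a gating signal over view experts can be illustrated with a small sketch. This is a generic softmax-gated weighted average, not the JUMP-Hand architecture: the function names, the temperature parameter, and the choice of a softmax over negative uncertainties are all assumptions made for illustration.

```python
import math


def gate_weights(uncertainties, temperature=1.0):
    """Softmax over negative uncertainties: lower uncertainty -> larger weight."""
    logits = [-u / temperature for u in uncertainties]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_joint(view_estimates, uncertainties):
    """Fuse per-view 3D estimates of one joint with uncertainty-gated weights."""
    weights = gate_weights(uncertainties)
    return tuple(
        sum(w * estimate[axis] for w, estimate in zip(weights, view_estimates))
        for axis in range(3)
    )
```

Calling `fuse_joint` once per joint, each with its own uncertainty vector, lets a view that sees a joint clearly dominate that joint's estimate while contributing little to joints it sees poorly.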
Paperid: 4027,   Poster  
Authors: Yang Zhang, Zhixiang Chi, Xudong Yan, Yang Wang, Songhe Feng
Title: Bridging the Modality Gap in Compositional Zero-Shot Learning via Sparse Alignment and Unimodal Memory Bank
Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute-object compositions with learned primitives (attribute and object) knowledge from seen compositions. While previous approaches gain their notable performance through the powerful cross-modal alignment of CLIP, they often overlook the modality gap, an inherent constraint stemming from information-imbalanced training data. In this work, we propose SAM, a novel Sparse Alignment and Unimodal Memory Bank method that effectively bridges the modality gap for CZSL. Specifically, we conduct sparse alignment that links textual representations directly to their semantically pertinent visual patches. This direct linking serves to prune redundant visual data and counter the information imbalance in image-text pairs. Subsequently, with the sparsely aligned visual information as its guidance, the visual adaptive condensation module adaptively fuses these critical cues into a unified representation. Finally, we introduce a dynamically updated memory bank that stores samples from both seen and unseen compositions. This bank serves a dual purpose: it bypasses the modality gap through visual-only classification and concurrently strengthens generalization to unseen compositions. Experiments on three benchmarks demonstrate that our method gains significant improvements over CLIP-based methods under closed-world and open-world settings.
Paperid: 4028,   Poster  
Authors: Xiaokai Bai, Chenxu Zhou, Lianqing Zheng, Jianan Liu, Siyuan Cao, Xiaohan Zhang, Yiming Li, Zhengzhuang Zhang, Hui-Liang Shen
Title: RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cue for 3D Object Detection
Abstract: 4D millimeter-wave radar is a promising sensing modality for autonomous driving, yet effective 3D object detection from 4D radar and monocular images remains challenging. Existing fusion approaches either rely on instance proposals lacking global context or dense BEV grids constrained by rigid structures, lacking a flexible and adaptive representation for diverse scenes. To address this, we propose RaGS, the first framework that leverages 3D Gaussian Splatting (GS) to fuse 4D radar and monocular cues for 3D object detection. 3D GS models the scene as a continuous field of Gaussians, enabling dynamic resource allocation to foreground objects while maintaining flexibility and efficiency. Moreover, the velocity dimension of 4D radar provides motion cues that help anchor and refine the spatial distribution of Gaussians. Specifically, RaGS adopts a cascaded pipeline to construct and progressively refine the Gaussian field. It begins with Frustum-based Localization Initiation (FLI), which unprojects foreground pixels to initialize coarse Gaussian centers. Then, Iterative Multimodal Aggregation (IMA) explicitly exploits image semantics and implicitly integrates 4D radar velocity geometry to refine the Gaussians within regions of interest. Finally, Multi-level Gaussian Fusion (MGF) renders the Gaussian field into hierarchical BEV features for 3D object detection. By dynamically focusing on sparse and informative regions, RaGS achieves object-centric precision and comprehensive scene perception. Extensive experiments on View-of-Delft, TJ4DRadSet, and OmniHD-Scenes demonstrate its state-of-the-art performance. Code will be released.
Paperid: 4029,   Poster  
Authors: Ri Su, Zhao CHEN, Caleb Chen Cao, Lei Chen
Title: URICA: A Uniformity Region Affine Identifier Capture Algorithm for Arbitrary Region Retrieval in Pathology Images
Abstract: Whole slide image (WSI) region retrieval remains an open challenge in computational pathology, as existing methods struggle to represent and preserve information of all possible regions. Current approaches that rely on fixed-size patches or slide-level retrieval are misaligned with real clinical workflows, where pathologists often examine WSI regions of arbitrary orientations and sizes rather than predefined patches or slides. In this work, we redefine WSI retrieval as a semantically optimal matching problem between arbitrary regions under spatial transformations, which necessitates a region-level representation that maintains semantic consistency. To fulfill this requirement, we introduce semantic tessellation, which organizes patch units into flexible, geometry-aware region descriptors. Building on this representation, we develop the affine identifier, a semantic signature that enables rotation- and scale-consistent region matching. We further derive theoretical bounds between the tessellation-derived descriptors and the ideal pixel-level semantic mask objective, showing that they reliably approximate mask-based region similarity. Together, these components form URICA, a theoretically grounded algorithm for robust WSI region retrieval. Experiments on large public datasets demonstrate that URICA achieves strong and consistent performance across diverse WSI retrieval tasks.
Paperid: 4030,   Poster  
Authors: Kai Wang, Tao Zhou, jiayi lei, Jing Wang, Jinman Zhao, Weiguo Pian, Yuan Cheng, Yapeng Tian, Peng Gao, Bin Fu, Yihao Liu, Dimitrios Hatzinakos, Yuewen Cao
Title: Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization
Abstract: Generating high-fidelity audio that is both semantically meaningful and temporally synchronized with silent videos remains a challenging problem in video-to-audio generation. Existing approaches often fail to capture fine-grained temporal correspondence between visual events and audio dynamics, leading to unrealistic or desynchronized outputs. To address these limitations, we propose VisioSonic, a Video-Aligned Sound generation framework that unifies flow-matching diffusion and preference-guided alignment. VisioSonic introduces a multimodal conditioning module that jointly leverages video frames and textual cues to provide semantic and frame-level temporal guidance. A co-attention diffusion transformer efficiently fuses visual and audio representations, enabling content-aware sound synthesis with minimal computation costs. To further enhance alignment beyond supervised training, we introduce Semantic-Temporal Alignment Ranked Direct Preference Optimization (STAR-DPO), a novel preference-learning paradigm that automatically generates audio candidates, ranks them based on both semantic and temporal alignment, and subsequently fine-tunes the diffusion model using the derived preference pairs. Extensive experiments on various benchmarks demonstrate that VisioSonic achieves state-of-the-art audio-video synchronization and audio fidelity while using the fewest trainable parameters among competing approaches.
Paperid: 4031,   Poster  
Authors: Ziyang Wang, Yue Zhang, Mingdao Wang, Yasen Zhang, Teer Song, Yu Tian, Xueming LI
Title: RADAR: VQ-VAE decoder of VAR is a good student for Restoring Against Degradation by Acceleration
Abstract: Visual Autoregressive Modeling (VAR) has recently emerged as a powerful paradigm for image generation that surpasses diffusion models in efficiency and quality. However, accelerating attention computation in VAR is still challenging because attention patterns across scales exhibit strong and complex semantic biases: early coarse-scale tokens dominate global structure, while fine-scale tokens mainly refine local details. Existing acceleration methods rely on heuristic token pruning or fixed attention masks, lacking a principled way to balance acceleration and semantic fidelity. In this work, we propose a two-stage acceleration framework for VAR. First, we introduce a semantic-cost-aware masking strategy (SCA-Mask) that quantifies the importance of each attention tile and formulates mask shape design as a cost-constrained optimization problem. This enables adaptive pruning under a given compute budget while preserving essential semantic context. Second, we present Post-Acceleration Adaptation (PAA), a decoder-side fine-tuning scheme that employs internal knowledge distillation to restore image quality from pruned latents. PAA does not require external data and uses a lightweight LoRA-based adaptation, providing a highly efficient alternative to retraining the autoregressive transformer. Comprehensive experiments across multiple VAR tasks demonstrate that our method achieves a decent speedup with negligible loss of visual fidelity, yielding a principled and effective pathway toward fast and high-quality visual autoregressive generation.
Paperid: 4032,   Poster  
Authors: Junda Xu, Yanmeng Liu, Xiangqiang Zeng, Jinrong Wu, Ying Qu, Libao Zhang
Title: Revisiting the Necessity of Full Accuracy: Weakly Supervised Object-Level Offset Correction for Misaligned Building Labels
Abstract: Google Earth imagery, combined with building footprint databases, offers an efficient way to construct localized building datasets. However, the lack of orthorectification in these images leads to spatial misalignments between annotations and their corresponding roof locations. Adopting such misaligned data directly for model training can severely degrade segmentation performance. To address this challenge, we propose an Object-based Multi-stage Alignment Framework (OMAF) that generates high-quality corrected labels with minimal manual intervention. OMAF first employs a prior-regularized self-alignment method to produce high-confidence, object-level offset pseudo-labels, which are then used to train an instance-level offset regression model for label refinement. Experimental results on the challenging Islahiye and Antakya datasets demonstrate that OMAF effectively corrects misalignments and consistently boosts the mIoU of all baseline models by up to 40.6%. Ablation experiments further demonstrate that each module in OMAF improves the final alignment performance. Among them, the self-alignment algorithm contributes 9.22% to the mIoU metric, demonstrating the strong effectiveness of this unsupervised alignment method. This work provides a practical and cost-effective solution for large-scale dataset construction and domain adaptation.
Paperid: 4033,   Poster  
Authors: Chao Zhang, Fang Liu, Shuo Li, Yang Liu, Jiahao Wang, Xinyan Huang, Lingling Li, Puhua Chen, Xu Liu, Wenping Ma, Siqi Yu
Title: VDFE: Difference-Aware 3D Scene Editing with Non-Intrusive Video Diffusion Priors for Multi-View Consistency and Efficiency
Abstract: Text-driven 3D editing, enabled by advancements in 3D reconstruction techniques such as NeRF and 3D Gaussian Splatting, aims to provide intuitive scene customization. However, existing methods frequently exhibit limitations in controllability and consistency. To address these shortcomings, we propose VDFE, a difference-aware 3D scene editing method based on non-intrusive utilization of pre-trained video diffusion priors, which integrates Optimal Control Guided Flow Editing (FlowOCE), Decoupled Flow Difference (DFD), and Difference-Aware Gaussians Editing (DAGE). Specifically, FlowOCE treats the editing process as an optimal control problem, optimizing a noise-free editing trajectory to minimize unintended modifications in non-target regions; DFD precisely locates the editing region by analyzing flow differences, which supplies priors for the subsequent optimization process; and DAGE leverages precise localization to selectively update 3D Gaussians for efficient and precise refinement. Extensive experiments demonstrate that our method significantly outperforms existing methods in both qualitative and quantitative evaluations, achieving state-of-the-art (SOTA) performance.
Paperid: 4034,   Poster  
Authors: Dominik Hollidt, Tommaso Bendinelli, Christian Holz
Title: Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking from Sparse Inertial Sensors and Ranging-based Between-sensor Distances
Abstract: Methods using inertial measurement units (IMUs) provide a wearable alternative to camera-based motion capture. To mitigate drift from inertial signals, recent sparse inertial pose estimators integrate inter-sensor distances measured by ultra-wideband (UWB) ranging. So far, UWB distances have only been used as an additional input feature, ignoring the physical constraints they impose on sensor positions. However, these distances can also be used to reconstruct the underlying 3D sensor layout, which in turn provides more informative input for pose reconstruction. We propose Ultra Diffusion Poser, a diffusion model that explicitly models these geometric constraints. It includes a Spatial Layout Module that analytically reconstructs the 3D sensor positions from UWB measurements. These sensor positions are used alongside IMU signals and UWB distances as a conditioning signal during diffusion. Still, network predictions can violate inter-sensor distance measurements. To address this, we introduce UWB-Diffusion Guidance, which encourages alignment between predicted poses and measured distances during diffusion sampling. Together, these contributions enable our model to achieve state-of-the-art performance, reducing joint position error by up to 22% over prior work. Code will be released upon acceptance.
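As a rough illustration of what encouraging agreement between predicted positions and measured inter-sensor distances can look like, the sketch below takes one gradient-descent step on a squared distance error. It is a generic stand-in and not the paper's UWB-Diffusion Guidance: the function names, the quadratic error, and the step size are assumptions, and no diffusion sampler is modeled here.

```python
import math


def distance_error(positions, measured):
    """Sum of squared differences between predicted and measured distances."""
    total = 0.0
    for (i, j), target in measured.items():
        total += (math.dist(positions[i], positions[j]) - target) ** 2
    return total


def guidance_step(positions, measured, step_size=0.1):
    """One gradient-descent step that pulls positions toward the measured distances."""
    grads = [[0.0, 0.0, 0.0] for _ in positions]
    for (i, j), target in measured.items():
        current = math.dist(positions[i], positions[j])
        if current == 0.0:
            continue  # direction is undefined for coincident points
        # d/dp_i of (current - target)^2 = 2 * (current - target) * (p_i - p_j) / current
        scale = 2.0 * (current - target) / current
        for axis in range(3):
            delta = positions[i][axis] - positions[j][axis]
            grads[i][axis] += scale * delta
            grads[j][axis] -= scale * delta
    return [
        tuple(p[axis] - step_size * g[axis] for axis in range(3))
        for p, g in zip(positions, grads)
    ]
```

Here `measured` maps a sensor index pair `(i, j)` to its ranged distance. In a guided sampler, a correction of this kind would typically be applied to the intermediate prediction at each sampling step.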
Paperid: 4035,   Poster  
Authors: Hakan Emre Gedik, Shashank Gupta, Alan Bovik
Title: Learning Where to Look and How to Judge: Resolution-agnostic Image Quality Assessment with Quality-aware Saliency
Abstract: No-reference image quality assessment (NR IQA) has recently benefited from deep and multimodal models, yet many SOTA systems still violate at least one basic requirement: they either discard critical quality cues via aggressive resizing, fail to generalize across resolutions, cannot be jointly trained on heterogeneous IQA datasets with mismatched MOS scales, or require prohibitive computation. We present ReLIQS, a model for Resolution-agnostic Learning for Image Quality with Saliency, which is resolution-agnostic, preserves original-resolution quality cues, learns from multiple subjective studies, and remains computationally efficient and budget-adaptive. ReLIQS is a CLIP-based multiscale patch-driven architecture that learns both where to look and how to judge quality. Fixed-size patches are sampled across multiple resolutions, including the original resolution, and encoded with a CLIP vision backbone. A lightweight Perceptual Importance Estimator then predicts IQA-specific importance maps to select a small set of informative patches, and a Quality Aspect Module aggregates their embeddings into a single image-level score. Across authentic, synthetic, and AIGC benchmarks spanning diverse resolutions and distortions, ReLIQS generalizes better than strong CNN-, CLIP-, and MLLM-based baselines with matching or reduced computational cost.
Paperid: 4036,   Poster  
Authors: Haifeng Zhong, Wenshuo Han, Zhouyu Wang, Runyang Feng, Fan Tang, Tong-yee Lee, zipei fan, Ruihai Wu, Yuran Wang, Hao Dong, Hechang Chen, Hyung Jin Chang, Yixing Gao
Title: GraspALL: Adaptive Structural Compensation from Luminance Variation for Robotic Garment Grasping in Any Low-Light Conditions
Abstract: Achieving accurate garment grasping under dynamically changing illumination is crucial for all-day operation of service robots. However, the reduced illumination in low-light scenes severely degrades garment structural features, leading to a significant drop in grasping robustness. Existing methods typically enhance RGB features by exploiting the illumination-invariant properties of non-RGB modalities, yet they overlook the varying dependence on non-RGB features under varying lighting conditions, which can introduce misaligned non-RGB cues and thereby weaken the model’s adaptability to illumination changes. To address this problem, we propose GraspALL, an illumination-structure interactive compensation model. The innovation of GraspALL lies in encoding continuous illumination changes into quantitative references to guide adaptive feature compensation between RGB and non-RGB modalities, thereby generating illumination-consistent grasping representations. Experiments on the self-built multimodal garment grasping (MIGG) dataset demonstrate that GraspALL improves grasping accuracy by 32-44% over baseline methods under diverse illumination conditions.
Paperid: 4037,   Poster  
Authors: HyunsooHan HyunsooHan, Sangyeop Yeo, Jaejun Yoo
Title: LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models
Abstract: We demonstrate that in knowledge distillation for diffusion models, the teacher network’s highly complex denoising process, stemming from its substantially larger capacity, poses a significant challenge for the student model to faithfully mimic. To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTting-based distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a “coarse” alignment and a “fine” refinement. The student is then trained on coarse alignment before proceeding to hard refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance. Our comprehensive experimental results demonstrate that our framework outperforms previous knowledge distillation methods for diffusion models based on both U-Net and DiT architectures. Furthermore, as compression rates become exceedingly high, conventional knowledge distillation fails to provide sufficient guidance, thereby preventing lightweight diffusion models from achieving stable training. In contrast, our method demonstrates stable convergence even under such extreme compression ratios.
Paperid: 4038,   Poster  
Authors: Renye Yan, Jikang Cheng, Shikun Sun, Yi Sun, You Wu, Wei Peng, Zongwei Wang, Ling Liang, Junliang Xing, Yimao Cai
Title: Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?
Abstract: Diffusion models have achieved outstanding success in image generation, yet their objectives are often limited to reconstruction, making it difficult to align with human preferences directly. Reinforcement learning (RL) offers a promising approach to address this by optimizing models using explicit reward signals. However, most studies apply RL across the entire denoising process, which is both computationally expensive and tends to weaken preference alignment, i.e., doing more but achieving less. We observe that the impact of RL fine-tuning varies significantly across denoising stages. In the early stage, image structures are unstable and distant from the final reward signal. Applying RL at this stage leads to delayed rewards and action–reward mismatching, resulting in high variance and inefficient updates. Conversely, in the later stage, reward gains saturate, and continued training tends to overfit local details, intensifying reward hacking. To tackle these challenges, we propose an RL-enhanced plug-in that improves generation quality while reducing computational cost. Specifically, our method adaptively identifies the optimal timing for RL by perceiving the structural evolution and semantic consistency during denoising, and dynamically terminates training once the denoising converges and reward gains saturate. As a result, it achieves a rare “dual benefit”: a reduction in computational costs alongside a significant performance improvement. Theoretical analysis from an entropy perspective and extensive experiments verify our claims: compared with state-of-the-art methods, our method improves performance by xx% while cutting computational cost by xx%.
Paperid: 4039,   Poster  
Authors: Yuwei Zhou, Guoyu Lu
Title: Underground Plant Exploration: Non-Destructive 3D Root Assessment with GPR Based on Point Graph Neural Network
Abstract: This paper introduces a novel application of machine learning in agriculture for non-destructive 3D root structure reconstruction. Plant roots are critical for providing resources for the entire plant. Ground Penetrating Radar (GPR) is a key tool for identifying subterranean objects with simple and obvious shapes, such as large pipes, but assessing the 3D shapes of roots with it remains challenging. In our study, we introduce a novel approach specifically designed based on GPR signal shape priors to detect target signals and perform curve parameter regression based on multiple B-scans from GPR. This process enables the derivation of a precise curve from the detection and regression outcomes. To achieve the reconstruction of a comprehensive 3D root structure, we have developed a shape reconstruction network that processes sparse sliced 3D points through a dedicated point graph network and an upsampling network module. Our method has been rigorously trained and validated using synthetic 3D root datasets and GPR data simulated by gprMax, as well as real GPR data.
Paperid: 4040,   Poster  
Authors: Xiaolong Li, Lan Yang, Ruyang Li, Shan Fang, Yang Liu, Xiangmo Zhao
Title: EE-RL: Vision Language Guided Reinforcement Learning with Explorer and Expert model for End-to-End Autonomous Driving
Abstract: End-to-end driving frameworks that directly map raw sensor data to vehicle control commands have shown remarkable potential. However, their performance often deteriorates in sparse-critical scenarios, where rare but safety-sensitive events occur. To address this problem, we propose Explorer-Expert Reinforcement Learning (EE-RL), a novel end-to-end framework that integrates an RL-based explorer, a fine-tuned vision-language model (VLM)-based expert, and a dual replay buffer. EE-RL adopts a collaborative learning strategy in which the explorer and experts jointly generate experiences from regular driving scenarios to guide policy learning. As training progresses, a dedicated VLM expert focuses on reasoning about sparse-critical scenarios, thereby enhancing learning efficiency and policy optimization in both scenarios. Additionally, the StateHash algorithm is designed to measure RGB-image and kinematic-data similarity, thereby skipping unnecessary VLM reasoning and enabling denser, more effective expert experience generation. Extensive experiments on the CARLA Leaderboard demonstrate that EE-RL significantly outperforms state-of-the-art (SOTA) baselines, achieving +19.82% and +20.98% improvements in driving and infraction scores on Town03, respectively. EE-RL further achieves 0% accident probability in red-light running scenarios and attains an average driving score of 80.09 in the generalization towns (Town05–06), demonstrating its strong capability in addressing sparse-critical scenarios as well as its robustness and generalization.
Paperid: 4041,   Poster  
Authors: Keyao Wang, Shuai Liu, Hengda Shi, Lukui Shi, Chenhaiyong Chenhaiyong
Title: Beyond Duality: A Hybrid Framework of Leveraging Shared and Private Features for RGB-Event Object Detection
Abstract: RGB-Event object detection can capture clear and detailed features of the target while maintaining high-speed information collection. It is suitable for highly dynamic or harsh environments and has become a research hotspot in recent years. Existing RGB-Event object detectors all focus on fully exploiting the fused features of the two modalities but ignore the independent role of single-modality features. To fully tap into the potential of single-modality features, we propose SPFD, a frequency-domain coherence-based Shared and Private Features Decoupling network for RGB-Event object detection. First, we design an FCFS module to separate shared and private features by exploring the spectral energy distribution differences between the two modalities. Then, we design a TriAdapt Encoder to process the shared and private features, selectively emphasizing texture-rich RGB features in static regions and motion-sensitive event features in dynamic regions, thereby achieving a robust balance between spatial detail and temporal awareness. Finally, a TriInject Decoder is proposed to dynamically emphasize the most discriminative modality features. Experimental results on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that our model achieves competitive performance with state-of-the-art methods.
Paperid: 4042,   Poster  
Authors: Zhanbo Huang, Dingqiang Ye, Xiaoming Liu, Yu Kong
Title: Unlocking Motion from Large Vision Models with a Semantic and Kinematic Duality for Gait Recognition
Abstract: Existing set-based gait recognition methods achieve remarkable performance by capturing global semantic context. However, their order-invariant nature prevents them from modeling the fine-grained kinematic patterns that unfold over time. To unify the global and process-level representations, we propose GaitMax, a framework that captures both semantic context and kinematic motion. GaitMax leverages attention-based spatiotemporal modeling to dynamically represent detailed part-level trajectories. While this detailed representation is more powerful, it also captures more nuisance factors (e.g., clothing, viewpoint), leading to potential shortcuts. To mitigate this, we introduce CDLoss, a Conditional Decorrelation Loss that explicitly disentangles the gait embeddings from nuisance factors using vision-language supervision. This loss requires high-quality nuisance descriptions. We therefore construct GCaption, a new resource that provides natural language annotations for multiple gait datasets, moving beyond simple categorical labels. GCaption not only enables CDLoss but also serves as a foundation for future context-aware gait analysis. The superiority of GaitMax is validated through extensive experiments on multiple large-scale gait benchmarks. Models, code, and resources will be released upon publication.
Paperid: 4043,   Poster  
Authors: Xuzeng Li, Tao Zhang, Xiangyun Tang, JIACHENG WANG, Jian Wang, Jiawen Kang, Jiqiang Liu, Zhen Han, Dusit Niyato, Dong In Kim
Title: Eliminate Distance Differences Induced by Backdoor Attacks: Layer-Selective Training and Clipping to Mask Backdoor Models
Abstract: Federated learning (FL) enables a central server to collaboratively train a global model with multiple clients while preserving data privacy. However, the distributed nature of FL makes the paradigm vulnerable to backdoor attacks, as shown by numerous recent studies. Although existing studies improve the effectiveness of backdoor attacks through optimized triggers, they have two limitations: (1) they ignore the heterogeneous contribution of individual model layers to the success of a backdoor; (2) they induce conspicuous differences between backdoor and clean models in the early stages of poisoning. These limitations cause backdoor models to exhibit significant discrepancies from clean models, making them easily detectable. To fill these gaps, we propose LaySelFL, a novel layer-selective method that eliminates the distance differences induced by the backdoor in order to conceal attacks in FL. Our central insight is that different layers contribute unequally to backdoor attacks: by localizing poisoning to the layers that are most sensitive to backdoor objectives, an attacker can substantially reduce the differences between the backdoor and clean models. Concretely, LaySelFL identifies sensitive layers via both dynamic and static evaluations of parameter differences between backdoor and benign models, and then applies a targeted training protocol and a regularized loss that constrains differences from the global model in each round. Finally, LaySelFL performs clipping on non-poisoning layers to further mask residual differences introduced by the attack. This strategy yields a more covert and resilient backdoor attack. Extensive experiments show that LaySelFL increases the effectiveness of attacks by 25% and reduces the effectiveness of defense methods to 4%.
Paperid: 4044,   Poster  
Authors: Fei Ni, Zhuo Chen, Yifu Yuan, Zibin Dong, Xianze Yao, Shan Luo, Jianye Hao, Jiankang Deng, Stefanos Zafeiriou
Title: SemanticVLA: Towards Semantic Reasoning over Action Memorization via Synergistic Explicit Trace and Latent Action Planning
Abstract: Vision-Language-Action (VLA) models have emerged as a promising paradigm where pretrained Vision-Language Models (VLMs) serve as System 2 for high-level reasoning, connected to action experts as System 1 for low-level motor control. However, current works fail to genuinely leverage VLM capabilities: VLMs produce latent embeddings that lack semantic interpretability, providing ambiguous and unstable guidance to downstream policies, while action-only supervision further causes VLMs to degenerate into parameter-heavy fusion encoders that memorize action patterns rather than perform generalized reasoning. To bridge this gap, we introduce SemanticVLA, which leverages VLM reasoning through a synergistic dual-path design. Explicit trace reasoning generates interpretable spatial waypoints as textual coordinate sequences through the VLM's native language interface, directly reusing its pretrained spatial grounding to provide a "thinking process" for task planning. Latent action tokens complement trace reasoning by learning compact visuomotor primitives grounded in visual observations, providing more fine-grained action representations beyond pure coordinate prediction. This synergy enables trace reasoning to leverage the VLM's multimodal understanding for refining latent token prediction, while latent tokens provide stable and grounded guidance that compensates for the trace's numerical sensitivity. SemanticVLA achieves a 97.0% average success rate on LIBERO and 65.1% on SimplerEnv WidowX, substantially outperforming strong baselines. More importantly, SemanticVLA maintains significantly more stable performance under instruction rephrasing in both simulation suites, and demonstrates strong advantages on real-world long-horizon and reasoning-intensive tasks. By bridging VLM reasoning and the action expert through semantically explicit traces and visually grounded latent action tokens, our approach enables genuine reasoning rather than action memorization.
Paperid: 4045,   Poster  
Authors: Hui Tang, Yifan He, Zhong Jin
Title: D$^3$FER: Dual Channel and Dual Branch Network for Robust Facial Expression Recognition under Dual Noise
Abstract: Facial expression recognition (FER) in the wild remains a challenging task due to the coexistence of data noise and label noise. While existing methods often address one type of noise in isolation, they struggle to achieve robust performance under the compound effects of both. To this end, we propose D$^3$FER (Dual Channel and Dual Branch Network for Robust Facial Expression Recognition under Dual Noise), a unified framework that simultaneously tackles data and label noise in a single architecture. D$^3$FER introduces a dual-channel augmentation strategy, pairing weakly and strongly augmented views, to facilitate reliable pseudo-label generation and noise-aware training. Coupled with a dynamic queue mechanism, it adaptively estimates a noise threshold based on historical prediction confidence, enabling automatic identification and correction of label noise. Furthermore, inspired by contrastive learning, we design a momentum-updated Query-Key dual-branch structure that enhances intra-class compactness and inter-class separability, thereby improving robustness to data noise. At inference time, the stable Key branch parameters are leveraged to ensure consistent and generalized predictions. Extensive experiments on major in-the-wild benchmarks demonstrate that D$^3$FER outperforms state-of-the-art methods, setting new records in both accuracy and robustness under realistic, noisy conditions. The source code is available at https://github.com/D3FER/D3FER.
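Note: the abstract above does not spell out how the Key branch is updated. A minimal sketch, assuming the exponential-moving-average rule commonly used in momentum-contrast methods (the function name `momentum_update` and the momentum value `m` are hypothetical, not taken from the paper):

```python
def momentum_update(key_params, query_params, m=0.999):
    """One EMA step (assumed rule): k <- m * k + (1 - m) * q, per parameter."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


# Toy parameters: the Key branch drifts slowly toward the Query branch.
key = [0.0, 1.0]
query = [1.0, 1.0]
for _ in range(3):
    key = momentum_update(key, query, m=0.5)  # m=0.5 only to make the drift visible
print(key)
```

With a momentum close to 1, the Key branch changes far more slowly than the Query branch, which is one plausible reading of why the abstract calls its parameters "stable" at inference time.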
Paperid: 4046,   Poster  
Authors: Guangyang Wu, Youran Ding, Xinyu Che, BENYUAN SUN, Yi Yang, Xiaohong Liu
Title: Matching Every Pair to Track Every Point: PairFormer for All-Pairs Tracking and Video Trajectory Fields
Abstract: Tracking-any-point (TAP) answers query-conditioned correspondence but leaves the dense, all-pairs structure of a video implicit. We formulate All-Pairs Tracking (APT): given a video, predict dense displacement and visibility for every source-target frame pair, from which per-pixel trajectories can be read out. To this end, we propose PairFormer, a feed-forward transformer that addresses APT in a single pass. A spatio-temporal patch encoder computes temporally conditioned features for all frames. CorrBank constructs a learnable correlation memory for each frame pair to obtain pairwise motion tokens. A broadcast motion mixer aggregates information across space and time and refines these tokens with global context. A trajectory head then predicts full-resolution displacement for each pair, yielding a coherent all-pairs trajectory field. To support APT at scale, we develop PAIRender, a data platform that synthesizes photo-realistic dynamic scenes with dense annotations. From PAIRender we derive a training set ($\pi$-R10K) and a benchmark (APT-Bench) with an all-to-all evaluation protocol. Experiments show that PairFormer achieves strong performance on APT-Bench and competitive results on standard TAP benchmarks. Code and dataset will be released upon publication.
Paperid: 4047,   Poster  
Authors: Zichuan Wang, Songlin Yang, Bo Peng, Zhenchen Tang, Yang Li, BeibeiDong BeibeiDong, Jing Dong
Title: Same Attention, Different Truths: Put Logit-Lens over Visual Attention to Detect and Mitigate LVLM Object Hallucination
Abstract: Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating objects that are absent from the image. Prior work largely attributes this to insufficient visual attention. However, in this work, we are surprised to find that both real and hallucinated objects receive equally strong visual attention in the model's mid-to-late layers. This indicates that the key issue may not be how much the model attends, but what it attends to and why. To this end, we decode the visual features of high-attention regions using Logit Lens, and observe that high-attention regions corresponding to real objects can be correctly decoded to the target object token, whereas those for hallucinated objects cannot. Building on this, we identify two distinct hallucination mechanisms: (i) visual uncertainty, triggered by semantically similar or confusable regions, where masking these regions eliminates the hallucination; and (ii) contextual prior, triggered by strong co-occurrence priors, where the hallucination persists even when the initially attended region is masked and attention drifts to other regions. Based on these findings, we propose a simple yet effective training-free Detect–Mitigate framework comprising a Logit-Lens Consistency Check to detect hallucination and targeted remedies: High-Attention Regions Masking (HARM) for visual uncertainty hallucination, and Visual Evidence Enhanced Decoding (VEED) for contextual prior hallucination, which leverages genuine visual evidence to suppress erroneous priors. Our approach achieves state-of-the-art results on multiple hallucination benchmarks.
Paperid: 4048,   Poster  
Authors: Ci Zhang, Zhaojun Ding, Chence Yang, Jun Liu, Xiaoming Zhai, Shaoyi Huang, Beiwen Li, Xiaolong Ma, Jin Lu, Geng Yuan
Title: Roots Beneath the Cut: Uncovering the Risk of Concept Recovery in Pruning-Based Unlearning for Diffusion Models
Abstract: Pruning-based unlearning has recently emerged as a fast, training-free, and data-independent approach to remove undesired concepts from diffusion models. It promises high efficiency and robustness, offering an attractive alternative to traditional fine-tuning or editing-based unlearning. However, in this paper we uncover a hidden danger behind this promising paradigm. We find that the locations of pruned weights, typically set to zero during unlearning, can act as side-channel signals that leak critical information about the erased concepts. To verify this vulnerability, we design a novel attack framework capable of reviving erased concepts from pruned diffusion models in a fully data-free and training-free manner. Our experiments confirm that pruning-based unlearning is not inherently secure, as erased concepts can be effectively revived without any additional data or retraining. Finally, we explore potential defense strategies and advocate safer pruning mechanisms that conceal pruning locations while preserving unlearning effectiveness, providing practical insights for designing more secure pruning-based unlearning frameworks.
Paperid: 4049,   Poster  
Authors: Zihan Huan, Xipeng Pan, Hualong Zhang, Siyang Feng, Rushi Lan, Huadeng Wang, Haoxiang Lu, Zhenbing Liu
Title: Bridging RGB and Hematoxylin Components: An Interleaved Guidance and Fusion Framework for Point Supervised Nuclei Segmentation
Abstract: Nuclei instance segmentation in histopathology images is essential for diagnostic accuracy and downstream computational tasks, yet this task relies heavily on expensive pixel-level annotations. Although point-level annotations substantially reduce the annotation burden for pathologists, many existing methods utilize only a single type of image and overlook the complementary information contained in alternative representations. To address this limitation, we propose DFGNet, a weakly supervised framework that utilizes dual-representation complementary fusion and interleaved guidance learning by jointly modeling RGB images and their corresponding Hematoxylin components. From the complementary fusion perspective, we propose a Reciprocal Cross-scale Dynamic Fusion Module (RCDF) and an Entropy Confidence Aggregation Unit (ECAU) to integrate multi-scale complementary cues and adaptively combine the outputs of the dual branches. In terms of interleaved guidance, we also propose an Interleaved point-Guided Attention (IGA) that enables bidirectional refinement between the segmentation task and the kernel prediction task. Extensive experiments on three benchmark datasets show that DFGNet achieves state-of-the-art performance across multiple metrics and significantly outperforms existing approaches. DFGNet also demonstrates strong generalization ability across different tissue types and exhibits remarkable robustness to annotation shifts, providing a low-cost and scalable solution for practical clinical applications.
Paperid: 4050,   Poster  
Authors: Guanghui Ye, Huan Zhao, Zhixue Zhao, Tengfei Ma, Kehan Wang, Steffen Eger, Zhihua Jiang
Title: SCIEval: Evaluating and Benchmarking the Faithfulness of Scientific Image Generation and Interpretation with Large Multimodal Models
Abstract: Scientific images often require accurate numerical representations and correct object attributes, making them differ significantly from real-life images. However, existing faithfulness metrics for image generation or interpretation with large multimodal models (LMMs) focus mostly on real-life images, which makes them ill-suited for scientific image evaluations. To this end, we propose SCIEval, a novel and unified faithfulness metric specifically designed for SCientific Image Evaluations. First, to fully capture faithfulness, we introduce three key aspects: (i) Relevance (R), which measures the overall text-image correspondence, (ii) Accuracy (A), which examines the details of scientific objects, and (iii) Explainability (E), which reveals unfaithful elements in the generated content. We then generate aspect-aware scientific text-image data to train three sub-evaluators (SCIEval-R/A/E). Specifically, to train SCIEval-R and SCIEval-A, we propose a new SciCLIP framework, where we improve the scientific image perception of CLIP text and visual encoders via intra- and cross-modal contrastive learning. Meanwhile, to train SCIEval-E, we finetune a strong LMM using supervised rationale samples. Moreover, we present SCIEval-Bench, a human-annotated evaluation benchmark across 8 scientific domains, consisting of 3,000 scientific text-to-image samples from 4 LMMs (for image generation) and 3,000 scientific image captioning samples from 4 LMMs (for image interpretation). Experiments on SCIEval-Bench demonstrate that SCIEval is more reliable and better correlated with human ratings than 24 competitors, including GPT-4o.
Paperid: 4051,   Poster  
Authors: Zian Cao, Wei Wei, QINGSHAN GAO, Yuanyuanfu Yuanyuanfu
Title: RDF-MIG: A Robust Diffusion Framework for Masked Image Generation to Augment Semantic Segmentation and Change Detection
Abstract: Change detection and semantic segmentation are key techniques for satellite image analysis in remote sensing. However, acquiring high-quality labeled data is costly and time-consuming. Although recent studies have explored generative models to ease data scarcity, a unified framework supporting both tasks is still lacking, and most methods overlook noise accumulation and cannot generate multispectral images. To address this, we propose the robust diffusion framework for masked image generation (RDF-MIG). RDF-MIG generates bi-temporal change-labeled and single-temporal segmentation-labeled images to enhance downstream change detection and semantic segmentation tasks. Furthermore, to address noise accumulation and improve the quality of generated image–mask pairs, we reformulate the diffusion model training objective by proposing the Maximum Correntropy Robust Diffusion (MCRD) loss, and further design an MSE-consistency calibration that analytically aligns small-error gradients with the MSE objective while preserving robustness to outliers. Experiments indicate that the proposed RDF-MIG framework can generate multispectral image–mask pairs to improve downstream performance, while MCRD loss further enhances the quality of the synthesized data.
Paperid: 4052,   Poster  
Authors: Yujuan Zhang, Qing Li, Ziyu Li, Xiuxing Li, Zhuo Wang, Mengrui Xu, Xia Wu
Title: Active Perceptual Inference: A Corticothalamic-Inspired Dynamic Nested Recurrent Network for Multimodal Sentiment Analysis with Incomplete Data
Abstract: Random frame-level missing data is a critical challenge in multimodal sentiment analysis. Existing methods are largely limited to passive completion via single-pass feedforward connections and static cross-modal fusion, which struggle to generate high-quality completed features. However, the brain is not a passive recipient of external information but a dynamic system for active perceptual inference. Its core lies in the dynamic nested recurrent loops formed by intra-cortical recurrent completion mechanisms and corticothalamic circuits, which iteratively perform perceptual inference. Inspired by this, we propose the Dynamic Nested Recurrent Network (DNRNet). It is the first to introduce recurrent inference into the data completion task, achieving a paradigm shift from passive completion to active perceptual inference. Its local recurrent loop simulates intra-cortical recurrent pattern completion to perform perceptual inference and generate local correction features. The global recurrent loop simulates the modulatory function of the thalamus, calculating modality confidence to dynamically weight and integrate cross-modal information, generating global correction features. The local and global correction features are fused to obtain the completion signal, which is then combined with the input features of the current iteration to serve as the input for the next iteration. Experiments on the MOSI, MOSEI, and SIMS datasets demonstrate that DNRNet achieves an average accuracy improvement of 1.5%–2.0% over baseline models across all missing rates, validating the superiority of the brain-inspired approach in complex missing data scenarios.
Paperid: 4053,   Poster  
Authors: Zhipeng Liu, Guilian Chen, Zheng Jiang, Huisi Wu, Jing Qin
Title: VesMamba: 3D Pulmonary Vessel Segmentation from CT images via Mamba with Structural Perception and Scale-aware Filtering
Abstract: Automated 3D pulmonary vessel segmentation from CT images is crucial for improving early screening and assessment of pulmonary-vessel-related diseases. However, it remains an extremely challenging task due to the complex, tree-like structures of vessels, large scale variation, and the existence of highly similar tissues in the background. Existing segmentation models either cannot sufficiently capture long-range structural dependencies, which are of great importance in vessel segmentation, or are constrained by insufficient computational resources in clinical settings. In this paper, we propose VesMamba, a novel model for 3D pulmonary vessel segmentation that comprehensively addresses these challenges. Specifically, we first devise a spatial-gated structural perception (SSP) module, which employs Mamba to efficiently capture long-range dependencies. In SSP, we design dynamic spatial attention convolutions (DSAC) for dynamically learning the tree-like 3D vessel structures, providing Mamba with the spatial perception capability to better track the complicated topologies of vessels. Second, we propose an innovative bidirectional scale-aware filter (BSF) module to strengthen the representation capability of the encoder, helping our model focus on vessels of different scales under noise. Moreover, we apply a mask-constrained decoder to further improve segmentation consistency and accuracy, which constrains the inference of adjacent low-layer decoders directly by high-layer masks. Extensive experiments on the public dataset Parse22 and the internal dataset Lung79 demonstrate that our method achieves better performance than state-of-the-art methods. Code will be released upon publication.
Paperid: 4054,   Poster  
Authors: Dachuan Zhao, Weiyue Li, Zhenda Shen, Yushu Qiu, Bowen Xu, Haoyu Chen, Yongchao Chen
Title: Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post‑hoc Debiasing in Vision-Language Models
Abstract: Vision-Language Models (VLMs) have become indispensable for multimodal reasoning, yet their representations often encode and amplify demographic biases, resulting in biased associations and misaligned predictions in downstream tasks. Such behavior undermines fairness and distorts the intended alignment between vision and language. Recent post-hoc approaches attempt to mitigate bias by replacing the most attribute-correlated embedding coordinates with neutral values. However, our systematic analysis reveals three critical failures of this coordinate-wise approach: feature entanglement, poor cross-dataset generalization, and incomplete bias removal. We find that bias is not localized to a few coordinates but is instead distributed across a few linear subspaces. To address these limitations, we propose Subspace Projection Debiasing (SPD), a geometrically principled framework that identifies and removes the entire subspace of linearly decodable bias while reinserting a neutral mean component to preserve semantic fidelity. Extensive experiments across zero-shot classification, text-to-image retrieval, and image generation validate the effectiveness of SPD: our method achieves more robust debiasing with an average improvement of 18.5% across four fairness metrics, while maintaining minimal loss in task performance compared to the best debiasing baseline.
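Note: the projection step described in the abstract above can be illustrated with a small sketch. It assumes an orthonormal basis for the bias subspace is already given (the abstract does not say how the subspace is identified) and takes the "neutral mean component" to be the dataset-mean coefficient along each basis direction. The names `dot` and `project_out_with_mean` are hypothetical and this is not the authors' implementation:

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out_with_mean(embeddings, basis):
    """Remove each embedding's component in span(basis), then add back the
    dataset-mean component in that subspace. `basis` must be orthonormal."""
    n = len(embeddings)
    # Mean coefficient along each bias direction (the assumed "neutral" value).
    mean_coef = [sum(dot(x, b) for x in embeddings) / n for b in basis]
    debiased_vectors = []
    for x in embeddings:
        y = list(x)
        for b, mu in zip(basis, mean_coef):
            c = dot(x, b)  # this embedding's own coefficient along b
            # Subtract the own coefficient and reinsert the shared mean.
            y = [yi + (mu - c) * bi for yi, bi in zip(y, b)]
        debiased_vectors.append(y)
    return debiased_vectors


# Toy example: the first coordinate axis plays the role of the bias subspace.
embeddings = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
basis = [[1.0, 0.0, 0.0]]
debiased = project_out_with_mean(embeddings, basis)
print(debiased)
```

After the projection, every embedding has the same coefficient inside the bias subspace, so a linear probe restricted to that subspace can no longer separate them, while components orthogonal to the subspace are left unchanged.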
Paperid: 4055,   Poster  
Authors: Chengyu Zheng, Hanzhang Lu, Jie Nie, Shan Du
Title: TriSim: Tri-Dimensional Similarity Modeling with Extreme Value Theory for False-Negative Mitigation in Remote Sensing Image-Text Retrieval
Abstract: In remote sensing (RS) cross-modal retrieval, most existing methods employ contrastive learning as their primary optimization objective, aligning anchors with positive counterparts and distinguishing them from negative samples. To improve negative sampling, these approaches typically set thresholds on cross-modal similarity scores, designating negatives that exceed the threshold as false negative samples (FNS). However, dependence on a single cross-modal similarity threshold is fragile because it fails to account for cross-modal semantic overlaps and gaps. To address these challenges, we introduce TriSim, a novel image-text retrieval framework that constructs a tri-dimensional negative similarity space to mitigate the influence of FNS. Specifically, considering that FNS appear as anomalies in this space, Extreme Value Theory (EVT) is applied to model the statistical behavior of the tail distribution for FNS selection. Two complementary tail selection strategies are developed: one identifies samples distant from the dense ellipsoidal center, and the other targets upper-right high-similarity extremes. The selected tail samples are regarded as FNS and modeled using a generalized Pareto distribution, with probabilistic weights assigned in the triplet loss. To further refine the selected FNS, intra-modal saliency differences are computed to generate masks that guide the learning of a gain matrix, which amplifies highly discriminative regions and suppresses ambiguous ones. Extensive experiments on two benchmarks demonstrate the superiority of the proposed TriSim framework in mitigating the influence of false negatives in RS image-text retrieval.
Paperid: 4056,   Poster  
Authors: Pascal Chang, Kai Lascheit, Jingwei Tang, Markus Gross, Vinicius C. Azevedo
Title: What Is It Like to Be a Noise? An Entropy-based Gaussian Noise Regularization for Diffusion Models
Abstract: Optimizing noise latents in diffusion models is powerful for controllable generation, reward-guided sampling, and latent inversion, but the process is notoriously unstable. Without a principled regularizer, optimized latents drift away from the Gaussian prior, collapsing out of the typical set and producing severe artifacts. Existing constraints like norm-matching or simple KL divergence losses are often insufficient, as they fail to capture the full statistical properties of true Gaussian noise. We propose a principled, differentiable regularizer that correctly targets the high-mass typical set rather than the high-probability mode. Our energy function tractably approximates the KL divergence by matching low-order statistics. It combines a 1D marginal term to match the pixel-value histogram and a 2D spatial term to enforce decorrelation. By applying this in a multi-scale pyramid, our method penalizes correlations at all ranges, effectively projecting samples closer onto the true Gaussian typical set. We demonstrate its effectiveness for robust, artifact-free reward-guided generation and model-free latent inversion.
Paperid: 4057,   Poster  
Authors: Sunoh Kim, Daeho Um
Title: When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness
Abstract: Vision-language models such as CLIP have achieved remarkable zero-shot recognition capabilities, yet their robustness against adversarial perturbations remains limited. Test-time counterattack (TTC) was recently proposed to improve CLIP's robustness by perturbing an input image to steer it away from a corrupted state during inference. However, TTC remains fragile under strong attacks because its counterattack relies on a directly corrupted original view and employs a noise-driven hard-gating scheme that cannot adapt to varying corruption severity. To address these limitations, we introduce Multi-view guided Adaptive Counterattack (MAC), which performs counterattacks across multiple views with corruption-aware soft weighting. Specifically, MAC first constructs augmented views of an input image to obtain diverse embeddings. It then performs counterattacks to refine the corrupted embeddings of these views. Next, MAC adaptively scales the counterattack intensity for each view based on its estimated corruption degree. Finally, the adaptively counterattacked views are aggregated to yield a robust final prediction. Extensive experiments across 20 datasets and diverse attack scenarios demonstrate that MAC substantially improves robustness while preserving high inference speed and memory efficiency with its tuning-free design. For reproducibility, we provide our code in the Supplementary Material and will publicly release it on GitHub.
Paperid: 4058,   Poster  
Authors: Yi Yu, Libing Wu, Zhuangzhuang Zhang, Jing Qiu, Lijuan Huo, Jiaqi Feng
Title: All Vehicles Can Lie: Efficient Adversarial Defense in Fully Untrusted-Vehicle Collaborative Perception via Pseudo-Random Bayesian Inference
Abstract: Collaborative perception (CP) enables multiple vehicles to augment their individual perception capacities through the exchange of feature-level sensory data. However, this fusion mechanism is inherently vulnerable to adversarial attacks, especially in fully untrusted-vehicle environments. Existing defense approaches often assume a trusted ego vehicle as a reference or incorporate additional binary classifiers. These assumptions limit their practicality in real-world deployments due to the questionable trustworthiness of ego vehicles, the requirement for real-time detection, and the need for generalizability across diverse scenarios. To address these challenges, we propose a novel Pseudo-Random Bayesian Inference (PRBI) framework, the first efficient defense method tailored for fully untrusted-vehicle CP. PRBI detects adversarial behavior by leveraging temporal perceptual discrepancies, using the reliable perception from the preceding frame as a dynamic reference. Additionally, it employs a pseudo-random grouping strategy that requires only two verifications per frame, while applying Bayesian inference to estimate both the number and identities of malicious vehicles. Theoretical analysis proves the convergence and stability of the proposed PRBI framework. Extensive experiments show that PRBI requires only 2.5 verifications per frame on average, outperforming existing methods significantly, and restores detection precision to between 79.4% and 86.9% of pre-attack levels.
Paperid: 4059,   Poster  
Authors: Yincheng Yao, Enze Shi, Shu Zhang
Title: Decoding 3D Perception via BrainSSD: Synergistic Fusion of EEG Representations from Static and Dynamic Visual Streams
Abstract: Understanding how the brain constructs coherent 3D visual percepts from multifaceted experiences remains a pivotal yet underexplored challenge. To investigate this, we introduce BrainSSD, a novel framework for decoding 3D representations from electroencephalography (EEG) signals. The core of BrainSSD is a neuro-inspired fusion architecture, Hierarchical Phase-Amplitude Coupling guided Fusion (HPACF), which synergistically integrates EEG from two distinct viewing paradigms: brief presentations of a static 3D object view, and sustained observation of the object undergoing full rotation. HPACF embodies two key principles of neural computation, namely hierarchical processing realized through multi-level cross-attention, and neural synchrony actualized by using a differentiable estimator of Phase-Amplitude Coupling (PAC) to dynamically guide the integration. The resulting fused representations are subsequently mapped to the visual domain via a multi-level alignment loss. Our framework establishes a new state-of-the-art across a range of EEG decoding tasks, achieving superior discriminative power and exceptional generative fidelity. Furthermore, our static-dynamic dominance analysis provides the first direct visual evidence for a functional specialization in the brain's 3D perception, revealing that neural responses to static object views primarily underpin the object's holistic structure and form, while responses to rotational observation are indispensable for resolving its fine-grained geometric details. Our work presents an advanced framework for probing EEG-based visual decoding and offers computational insights into the brain's synergistic strategies for 3D perception.
Paperid: 4060,   Poster  
Authors: Jiali Chen, Yuqi Xue, Xusen Hei, DingBa Fu, wei yuancheng, Jiayuan Xie, Yi Cai
Title: EduDiag: A Benchmark for Educational Diagnostic Reasoning with Error Tracing and Correction on Large Multimodal Models
Abstract: Large multimodal models (LMMs) have achieved impressive performance on multimodal reasoning, becoming a crucial technology for the advancement of intelligent question-answering systems. In real-world educational scenarios, effective teaching extends far beyond providing answers. Experienced teachers analyze students' incorrect answers to trace underlying errors and provide corrective feedback, termed educational diagnostic reasoning, a capability that remains under-explored in existing LMMs. To bridge this research gap, we introduce the EduDiag benchmark, which requires LMMs to reconstruct erroneous reasoning chains from incorrect answers and generate corrective feedback. Through an AI-assisted annotation pipeline with rigorous human verification, we create 8K erroneous reasoning chains and corresponding feedback, spanning three representative educational domains: commonsense, science, and mathematics. Extensive evaluation across 28 leading LMMs highlights EduDiag as a challenging testbed, where even leading proprietary LMMs struggle and supervised fine-tuning (SFT) of open-source LMMs achieves only marginal performance gains. Moreover, we conduct analysis experiments and identify three critical insights for educational diagnostic reasoning: (i) Effective error tracing remains the primary bottleneck, while SFT models still fail to reversely identify errors that commonly occur. (ii) Group relative policy optimization (GRPO) mitigates this bottleneck and boosts performance. (iii) LMMs optimized with GRPO can generate plausible yet challenging distractors for multiple-choice questions based on their self-constructed erroneous reasoning chains. We believe EduDiag provides a new direction for evaluating advanced LMMs.
Paperid: 4061,   Poster  
Authors: Jinheng Ji, Jiahui Qu, Wenqian Dong, Yunsong Li
Title: CF-IPT: Cross-Modal Fusion Interactive Prompt Tuning of Vision-Language Pre-Trained Model for Multisource Remote Sensing Data Classification
Abstract: Fine-tuning Vision-Language Models (VLMs) trained on large-scale datasets of natural image-text pairs has demonstrated impressive performance for various downstream tasks. However, their fine-tuning for remote sensing (RS) tasks faces dual barriers: (1) a data-level barrier caused by the fundamental modality gap between natural imagery and RS data, and (2) a task-level barrier stemming from the requirement for multi-source interaction modeling capabilities. This paper proposes a Cross-modal Fusion Interactive Prompt Tuning (CF-IPT) method to fine-tune CLIP for multi-source RS image classification tasks. It aims to leverage the prompt learning framework to shift the alignment target of the text branch from natural images to multi-source RS images. Specifically, we design a Multi-source Interactive Fusion–guided Spectral-Spatial Prompt Generation (MFPG) module, which enables cross-modal feature interaction to generate a prompt matrix that preserves the original spectral and spatial information while performing adaptive multi-scale fusion to address the multi-source image adaptation problem. Subsequently, a Spectral–Spatial Prompt–guided Visual–Text Prompt Interaction (V-TPI) Strategy is proposed, which leverages spectral–spatial prompt matrices to guide visual–textual prompt interaction and inject RS–specific information into both branches of CLIP, ultimately enabling multi-source RS image–text representation alignment. The proposed approach performs the downstream task of multi-source RS image classification with merely 0.76% of CLIP’s parameters. It is evaluated on several widely used datasets, demonstrating the effectiveness of the proposed approach.
Paperid: 4062,   Poster  
Authors: Jiayi Wang, Zhihong Tan, Hongchen Wei, Daiqin Yang, Zhenzhong Chen
Title: See What We Cannot See: A Geo-guided Reasoning Benchmark for Object Counting under Adverse Earth Observation Conditions
Abstract: Object counting in remote sensing imagery becomes challenging when visual cues are obscured by clouds, fog, shadows, or low-light conditions. Yet earth observation inherently provides complementary geo-modalities, including land use and maps, which offer stable structural and contextual priors that remain available when appearance cues fail. In this paper, we introduce GROC, the first large-scale dataset for Geo-guided Reasoning in Object Counting under adverse earth observation conditions. GROC contains 1.2 million point annotations over 14K images, each aligned with 3 modalities that preserve original geospatial information. We also provide a data engine to collect a large-scale object counting dataset with multiple geo-modalities, realistic degradations, and reliable annotations. We further present a counting agent that adaptively leverages geo-modalities to produce reliable estimates. Extensive experiments show that existing models struggle to “see” through adverse conditions, whereas geo-modalities improve robustness. GROC establishes the first benchmark that explicitly challenges models to see what they cannot see, charting a new direction for geo-guided amodal reasoning in earth observation.
Paperid: 4063,   Poster  
Authors: Tao Li, Xingran LIAO, Mingliang Zhou
Title: DPGF-Net: Dual-Prior Guided Fusion Network for Joint Assessment of Perceptual Quality and Semantic Consistency in AI-Generated Images
Abstract: The development of AI-generated technology requires effective image quality assessment (AGIQA) methods to jointly evaluate visual quality and text-content alignment, ensuring that the generated content is both visually appealing and faithful to the user's instructions. Nevertheless, visual degradation and text-content misalignment often coincide, and it is difficult to tell whether a bad subjective evaluation arises from prompt noncompliance or rendering artifacts. As such, disentangling image content and rendering distortions is vital. To address this issue, we propose the dual-prior guided fusion network (DPGF-Net), which leverages image-side priors to disentangle distortions from content and combines them with text-side prompt templates to simulate their interactions. DPGF-Net employs a local text-conditioned aggregation branch to highlight semantically relevant and quality-sensitive regions in conjunction with a global modulation branch that captures holistic perceptual characteristics. Finally, adaptive fusion produces a single score. Experiments on three AGIQA datasets demonstrate that our method is highly correlated with human judgments, with lower prediction error and stable evaluation behavior. The code will be released upon acceptance.
Paperid: 4064,   Poster  
Authors: Bo Sun, Junxi Chen, Zhe Wu, Feng Gao, Fan Yang, Li Su, Yaowei Wang
Title: Joint Learning of General and Diverse Patterns with Mixture of Memory Experts for Weakly-Supervised Video Anomaly Detection
Abstract: Weakly-supervised Video Anomaly Detection (wVAD) aims to detect abnormal events using only binary labels, making it challenging to capture both the diversity of anomalies and their shared semantic cues. Existing methods either focus on a generic anomaly pattern, achieving strong generalization but weak discrimination, or rely on class-level diversity modeling, which ignores shared semantics and suffers from limited generalization. To overcome these limitations, we propose the Mixture of Memory Experts (MoME), a unified framework that jointly learns general and diverse patterns. Each expert in MoME possesses an internal memory for fine-grained specialization and shares an external memory for general knowledge aggregation. To enhance semantic diversity and improve generalization beyond coarse class-level supervision, we introduce an Anomaly Prototype Router that leverages large language models to construct generalized anomaly prototypes for semantically guided expert routing. Moreover, the regularization loss for APR ensures balanced routing, the distinctiveness loss for experts encourages diversity, and reconstruction together with memory tasks enhance pattern discriminability. Extensive experiments on UCF-Crime and XD-Violence demonstrate that our approach achieves state-of-the-art performance, validating the effectiveness of jointly modeling generality and diversity for robust anomaly detection under weak supervision.
Paperid: 4065,   Poster  
Authors: Xiaowen Liu, Jing Li, Hongtao Huo, Haozhe Cao, Renhua Wang, Xu Dong
Title: More Than Meets the Eye: A Unified Image Fusion Framework via Semantic-Pixel Entropy Trade-off for Zero-Shot Generalization
Abstract: Existing image fusion methods face difficulties in adapting to unseen fusion tasks and have limitations in balancing semantic information with pixel-level details. This limitation can be attributed to three key challenges: (1) the lack of a unified, task-agnostic optimization objective; (2) the inherent difficulty in balancing semantic fidelity and pixel-level richness; and (3) an over-reliance on supervised learning, which limits transferability across tasks. To overcome these issues, this work proposes a unified fusion framework that generalizes to diverse fusion tasks even when trained solely on infrared–visible image pairs. Specifically, inspired by the free-energy principle, we introduce a fusion paradigm that combines high pixel-entropy expectation with low semantic-entropy expectation, and we design a frequency-aware feature decoupling mechanism to balance semantic content and pixel detail. Furthermore, an unsupervised dual-path trade-off strategy provides collaborative constraints at both semantic and pixel levels. Experiments show that our method significantly outperforms existing state-of-the-art methods in visual quality and downstream-task performance. It not only handles trained tasks efficiently but also generalizes well to unseen fusion tasks, while featuring lightweight model parameters and strong practical applicability. Code and data will be made publicly available.
Paperid: 4066,   Poster  
Authors: Jiahua Bao, Siyao Cheng, Jiaxing Du, Qingtao Xia, Changjiang He, Zeming Lang, Jie Liu
Title: Twin-T & TwintVQA: A Reliable Structure–Detail Separating VLM and a Comprehensive Benchmark for Chart and Table Tasks
Abstract: With the rapid development of Vision-Language Models (VLMs), there is a growing demand for automatic analysis of structured visual data. Charts and tables are primary carriers of quantitative information, with regular layouts and explicit numbers. However, current general VLMs and expert models make limited use of these chart-table features during training and inference. Another challenge is cross-format conversion in realistic settings, as chart and table outputs span Python and LaTeX and most VLMs struggle to handle this breadth reliably. These gaps often lead to analysis mistakes and unreliable generated text. To overcome these limitations, we propose Twin-T, a two-stage expert VLM for comprehensive chart-table tasks across Image, LaTeX, and Python. In stage 1, we propose a novel dual-head image encoder that can separate structural cues and fine details from input images. In stage 2, we propose MINT, a preference learning method that emphasizes number and keyword fidelity and vision–text matching. Furthermore, we introduce a comprehensive TwintVQA benchmark with 17 chart types, 11 task types, 3 data formats and short / medium / long QA settings. Our model narrows the gap between open-source and closed-source models on mainstream chart–table benchmarks, outperforming open-source models and GLM-4.5V-106B while even remaining competitive with GPT-4o and Gemini-2.5-Pro. Our code and additional details are available in the Appendix.
Paperid: 4067,   Poster  
Authors: Zain Shabeeb, Daniel Saeedi, Darin Tsui, Vida Jamali, Amirali Aghazadeh
Title: cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold
Abstract: Cryo-electron microscopy (cryo-EM) enables the atomic-resolution visualization of biomolecules; however, modern direct detectors generate data volumes that far exceed the available storage and transfer bandwidth, thereby constraining practical throughput. We introduce cryoSENSE, the computational realization of a hardware-software co-designed framework for compressive cryo-EM sensing and acquisition. We show that cryo-EM images of proteins lie on low-dimensional manifolds that can be independently represented using sparse priors in predefined bases and generative priors captured by a denoising diffusion model. cryoSENSE leverages these low-dimensional manifolds to enable faithful image reconstruction from spatial and Fourier-domain undersampled measurements while preserving downstream structural resolution. In experiments, cryoSENSE increases acquisition throughput by up to 2.5× while retaining the original 3D resolution, offering controllable trade-offs between the number of masked measurements and the level of downsampling. Sparse priors favor faithful reconstruction from Fourier-domain measurements and moderate compression, whereas generative diffusion priors achieve accurate recovery from pixel-domain measurements and more severe undersampling.
Paperid: 4068,   Poster  
Authors: Si-Sheng Yang, Chia-Hsiang Lin
Title: Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization: From Multispectral Satellite Data to NASA Hyperspectral Image
Abstract: The European Space Agency's Sentinel-2 satellite provides global multispectral coverage for remote sensing (RS) applications. However, its limited spectral resolution (12 bands) and non-unified spatial resolution (60/20/10 m) restrict its practicality. In contrast, high spectral-spatial resolution sensors (e.g., NASA's AVIRIS-NG) cover only the American region due to practical considerations. This raises a fundamental question: "Can global hyperspectral coverage be achieved by reconstructing Sentinel-2 data into NASA hyperspectral images?" This study aims to achieve spectral super-resolution from 12 to 186 bands and to unify the spatial resolution of Sentinel-2 data to 5 m. To enable a reliable and efficient reconstruction, we formulate a novel deep unfolding framework regularized by a data-driven spectrum prior from PriorNet, instead of relying on implicit deep priors as conventional deep unfolding does. Moreover, an adversarial term is integrated into the unfolded architecture, enabling the discriminator to guide the reconstruction in both the training and testing phases; we term this novel concept unfolding adversarial learning (UAL). Experiments show that our UALNet outperforms the next-best Transformer in PSNR, SSIM, and SAM, while requiring only 15% MACs and 20 times fewer parameters.
Paperid: 4069,   Poster  
Authors: linkun fan, Jiahao Zhang, JTzhang JTzhang, Lei Zhang, Fazhi He, Daojun Han
Title: Good Can Sometimes be Bad: A Unified Attack against 3D Point Cloud Classifier by a Flexible Isotropic Resampling
Abstract: To ensure the robustness of 3D point cloud deep neural networks (3D DNNs), adversarial attacks targeting the inference stage and backdoor attacks targeting the training stage have been well studied. The success of either attack usually requires specific permissions that the attacker must hold. However, the obtainable permissions are uncertain because the deployment environment changes in practical scenarios. This renders existing, separately designed adversarial or backdoor attacks ineffective. To solve this issue, this paper proposes a unified attack, named UAtt3D, that can serve as both a 3D point cloud backdoor attack and an adversarial attack. Furthermore, we observe that existing attacks ensure stealthiness by limiting the undesirable perturbation. This strategy requires moving point positions as little as possible, which restricts the attack intensity and is not suitable for our unified attack. Moreover, this strategy inevitably degrades the quality of the 3D point cloud because of the remaining malicious perturbation. Therefore, UAtt3D explores a new avenue to guarantee attack stealthiness, one that improves the quality of the attacked 3D point cloud rather than decreasing it. In detail, to simultaneously account for the feature movement of adversarial attacks and the backdoor feature learning of backdoor attacks, a flexible isotropic resampling is designed. It realigns the positions of most points based on surface approximation and ray sampling. By fine-tuning the resampled point cloud, adversarial and backdoored point clouds are obtained. Several experiments suggest that the proposed UAtt3D achieves outstanding stealthiness compared with existing adversarial and backdoor attacks from both subjective and objective perspectives, while its attack efficiency remains competitive.
Paperid: 4070,   Poster  
Authors: Yusen Cai, Qing Lin, BHARGAVA SATYA NUNNA, Mengmi Zhang
Title: Learning to See Through a Baby’s Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines
Abstract: Newborns perceive the world with low-acuity, color-degraded, and temporally continuous vision, which gradually sharpens as infants develop. To explore the ecological advantages of such staged "visual diets", we train self-supervised learning (SSL) models on object-centric videos under constraints that simulate infant vision: grayscale-to-color (C), blur-to-sharp (A), and preserved temporal continuity (T), collectively termed CATDiet. For evaluation, we establish a comprehensive benchmark across ten datasets, covering clean and corrupted image recognition, texture–shape cue conflict tests, silhouette recognition, depth-order classification, and the visual cliff paradigm. All CATDiet variants demonstrate enhanced robustness in object recognition, despite being trained solely on object-centric videos. Remarkably, models also exhibit biologically aligned developmental patterns, including neural plasticity changes mirroring synaptic density in macaque V1 and behaviors resembling infants’ visual cliff responses. Building on these insights, CombDiet initializes SSL with CATDiet before standard training while preserving temporal continuity. Trained on object-centric or head-mounted infant videos, CombDiet outperforms standard SSL on both in-domain and out-of-domain object recognition and depth perception. Together, these results suggest that the developmental progression of early infant visual experience offers a powerful reverse-engineering framework for understanding the emergence of robust visual intelligence in machines. All code, data, and models will be publicly released.
Paperid: 4071,   Poster  
Authors: Xingsong Ye, Yongkun Du, Jiaxin Zhang, Chen Li, Jing LYU, Zhineng Chen
Title: What’s Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution
Abstract: Large-scale and category-balanced text data is essential for training effective Scene Text Recognition (STR) models, but is hard to obtain when collecting real data. Synthetic data offers a cost-effective and perfectly labeled alternative. However, its performance often lags behind, revealing a significant domain gap between real and current synthetic data. In this work, we systematically analyze mainstream rendering-based synthetic datasets and identify their key limitations: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. To address these issues, we introduce UnionST, a strong data engine that synthesizes text covering a union of challenging samples and better aligns with the complexity observed in the wild. We then construct UnionST-S, a large-scale synthetic dataset with improved simulations in challenging scenarios. Furthermore, we develop a self-evolution learning (SEL) framework for effective real data annotation. Experiments show that models trained on UnionST-S achieve significant improvements over existing synthetic datasets. They even surpass real-data performance in certain scenarios. Moreover, when using SEL, the trained models achieve competitive performance by only seeing 9% of real data labels. Code is provided in the supplementary.
Paperid: 4072,   Poster  
Authors: Arnav Devalapally, Poornima Jain, Kartik Srinivas, Vineeth Balasubramanian
Title: $\oslash$ Source Models Leak What They Shouldn’t $\nrightarrow$ : Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization
Abstract: The increasing adaptation of vision models across domains, such as satellite imagery and medical scans, has raised an emerging privacy risk: models may inadvertently retain and leak sensitive source-domain-specific information in the target domain. This creates a compelling use case for machine unlearning to protect the privacy of sensitive source-domain data. Among adaptation techniques, source-free domain adaptation (SFDA) most urgently calls for machine unlearning (MU): the source data itself is protected, yet the source model exposed during adaptation encodes its influence. Our experiments reveal that existing SFDA methods exhibit strong zero-shot performance on source-exclusive classes in the target domain, indicating they inadvertently leak knowledge of these classes into the target domain, even when they are not represented in the target data. We identify and address this risk by proposing an MU setting called SCADA-UL: Unlearning Source-exclusive ClAsses in Domain Adaptation. Existing MU methods do not address this setting as they are not designed to handle data distribution shifts. We propose a new unlearning method, where an adversarially generated forget class sample is unlearned by the model during the domain adaptation process using a novel rescaled labeling strategy and adversarial optimization. We also extend our study to two variants: a continual version of this problem setting and one where the specific source classes to be forgotten may be unknown. Alongside theoretical interpretations, our comprehensive empirical results show that our method consistently outperforms baselines in the proposed setting while achieving retraining-level unlearning performance on benchmark datasets.
Paperid: 4073,   Poster  
Authors: Guillaume Duret, Danylo Mazurak, Florence Zara, Jan Peters, Liming Chen
Title: Breaking the 3D Dataset Bottleneck: Fast Scalable Generation of Aligned 3D Assets from Scratch for Category 6D Pose Estimation and Robotic Grasping
Abstract: While 2D vision has been revolutionized by large-scale datasets like ImageNet, 3D vision remains constrained by the scarcity of high-quality, canonically aligned data. We introduce the first scalable, automated framework that generates complete category-level 6D pose datasets directly from text prompts, bypassing the need for existing 3D assets. Our method overcomes key challenges by: (1) ensuring reliable, scalable asset generation via a controlled text-to-image-to-3D pipeline; (2) enforcing built-in canonical alignment through depth-conditioned generation, achieving a 96% pose consistency rate; and (3) enabling large-scale 6D annotation via mixed reality rendering. The pipeline produces high-quality, aligned 3D meshes in under 3 minutes per object, a 5–20× speedup over traditional scanning. We generate over 1,000 instances for each of the 153 categories in the Omni6Dpose benchmark, culminating in 153,000 aligned meshes, a >40× increase in instances per category over previous aligned real-world datasets. Extensive evaluation demonstrates competitive zero-shot sim2real transfer on the NOCS 6D pose benchmark and superior robotic grasping performance in both simulation and real-world zero-shot transfer, where aligned meshes prove essential for success. We release the largest publicly available aligned 3D mesh dataset, largest category-level 6D pose dataset, grasping simulation environments, and open-source pipeline, providing a critical step toward foundation models for 3D understanding and enabling efficient, unlimited generation of task-specific 3D data from scratch.